To extract reads from Illumina FASTQ files, first enter your sample information into the Sample Description File. The template file already contains all the indexing information, organized by index plate position. You only need to add the sample names, save the file as tab-delimited text, and then start the program.
Convert Sequencing Data to FASTQ Format
To use the DriverMap Sample Extraction Software, the Illumina sequencing data must be converted to FASTQ format. To do this, run Illumina’s bcl2fastq program on a computer running Red Hat Enterprise Linux 6 or CentOS 6 (see Illumina for full installation requirements).
Note: Do not use the bcl2fastq program to demultiplex the run data. The DriverMap Data Alignment Software demultiplexes the run when it performs the alignment. If you demultiplex the run with bcl2fastq, you will generate a separate FASTQ file for each indexed sample and will then need to run the DriverMap Alignment Program separately on each one.
- Use the following command line to convert the Illumina base call (BCL) data to FASTQ sequence data:
$ bcl2fastq --runfolder-dir $folder --create-fastq-for-index-reads --ignore-missing-bcls --minimum-trimmed-read-length 0 --mask-short-adapter-reads 0
- Make sure to use the “--create-fastq-for-index-reads” option to generate both the I1 and I2 FASTQ files, which Cellecta’s DriverMap alignment software requires.
- The “$folder” parameter is the path to the folder containing the data for the NGS run. At the end of the bcl2fastq conversion, you should find in $folder a set of non-demultiplexed FASTQ files that will be read by the Alignment Software.
Note: The SampleSheet.csv file should not be in the destination folder for the FASTQ files. The bcl2fastq files and FASTQ files must be in separate folders to ensure the Alignment Program runs correctly.
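The conversion step and the folder requirements above can be scripted. The following is a minimal Python sketch, not part of the DriverMap software: the function names `build_bcl2fastq_command` and `check_fastq_folder` are hypothetical, and the check assumes the FASTQ files carry the `Undetermined_S0_` prefix described later in this guide.

```python
from pathlib import Path

def build_bcl2fastq_command(run_folder):
    """Return the bcl2fastq argument list shown in the command line above."""
    return [
        "bcl2fastq",
        "--runfolder-dir", str(run_folder),
        "--create-fastq-for-index-reads",
        "--ignore-missing-bcls",
        "--minimum-trimmed-read-length", "0",
        "--mask-short-adapter-reads", "0",
    ]

def check_fastq_folder(fastq_folder):
    """Return a list of problems found in the FASTQ folder (empty if none)."""
    folder = Path(fastq_folder)
    problems = []
    # The SampleSheet.csv file should not sit next to the FASTQ files.
    if (folder / "SampleSheet.csv").exists():
        problems.append("SampleSheet.csv is present in the FASTQ folder")
    # Assumption: non-demultiplexed files are named Undetermined_S0_*.
    names = [p.name for p in folder.glob("Undetermined_S0_*.fastq*")]
    for read in ("I1", "I2"):
        if not any(f"_{read}_" in name for name in names):
            problems.append(f"no {read} index-read FASTQ file found")
    return problems
```

On a machine with bcl2fastq installed, the argument list could be passed to `subprocess.run(build_bcl2fastq_command(folder), check=True)`; `check_fastq_folder` can then be called on the folder that holds the resulting FASTQ files.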
Set Up the Sample Description File
- Open the template file “DriverMap-Sample-Description-Form.xlsm”. You will enter your sample names into this template, and then save the file with a name descriptive of your experiment.
Note: Please make sure to enable macros in Excel, if prompted.
- The Sample Description Form has four columns. Enter the experimental sample names in the second column based on their positions in the index plate (as noted in the first column). We recommend using standard alphanumeric characters. If you would like to include sample numbers as well as descriptions, add these to the sample names. The software only extracts sequencing data for rows that have a sample name; rows with the sample field left blank are ignored, and sequences corresponding to their indexes are not extracted.
Note: If you used only a fraction of the wells in the index plate (e.g., to run fewer than 96 samples), be sure to enter the sample information into the fields that reference the well locations relative to the whole plate. For example, if you cut the plate in half, ran columns 1-6 in an earlier experiment, and ran the remaining six columns (columns 7-12) in this experiment, enter the sample names in the fields for the well positions of the second half of the plate (i.e., 7A, 7B, 7C, ..., 12G, 12H).
Note: If you choose to generate individual demultiplexed FASTQ files for each sample (see below), the plate coordinates (column 1) will be used as the file names for the individual FASTQ files.
Note: If you do not have Microsoft Excel on your computer, you can enter the information directly into the included “sample-description.txt” file using another suitable application. Take care not to alter the table structure or change the indexes.
- Click the button to save the sample table in tab-delimited text format, and choose an appropriate name for the file.
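Before running the alignment, it can help to confirm which rows of the saved file actually carry sample names. The sketch below is a hypothetical helper, not part of the DriverMap software. It assumes the saved file is tab-delimited with the plate position in the first column and the sample name in the second, as described above. Whether the file starts with a header row is not stated in this guide, so that is exposed as a parameter. Treating underscores and hyphens as acceptable characters is also an assumption.

```python
import csv
import re

# Assumption: letters, digits, underscore, and hyphen count as "standard" characters.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")

def read_named_samples(path, has_header=True):
    """Return (named, warnings); named lists (plate position, sample name) pairs."""
    named, warnings = [], []
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        rows = csv.reader(handle, delimiter="\t")
        if has_header:
            next(rows, None)  # skip the assumed header row
        first_line = 2 if has_header else 1
        for line_no, row in enumerate(rows, start=first_line):
            if len(row) < 2:
                continue
            position, name = row[0].strip(), row[1].strip()
            if not name:
                continue  # blank sample field: the program ignores this row
            if not SAFE_NAME.match(name):
                warnings.append(f"line {line_no}: unusual characters in {name!r}")
            named.append((position, name))
    return named, warnings
```

Calling `read_named_samples("my-experiment.txt")` (a placeholder file name) returns the samples that will be extracted, along with warnings for any names that stray from plain alphanumeric characters.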
Run the Data Alignment Program
To run the program, select the DriverMap Assay Kit you are using, the Sample Description File (.txt format, as saved above), and the folder containing the FASTQ files. All of these selections are made in the program interface.
- Select the DriverMap Assay Kit used in the experiment. The Program defaults to the Human DriverMap Genome-Wide Assay Kit. If the data was generated using this assay, leave the selection as-is. Otherwise, select the file corresponding to the DriverMap Assay used.
- Click the button to select the Sample Description text file created above.
- Select the folder with the FASTQ files for the experiment.
Note: All FASTQ files for the experiment must be decompressed and in the same folder. The Program does not accept compressed (.fastq.gz) files. Decompress them to .fastq. The Program identifies each file using the Illumina NextSeq naming conventions: four files per lane, labeled Undetermined_S0_L001_I1_001.fastq, Undetermined_S0_L001_I2_001.fastq, etc.
- At this point, you can optionally have the program generate separate demultiplexed FASTQ files for each of the individual samples. These files require disk space roughly equal to the size of the FASTQ read files from the sequencer, and generating them increases the program's processing time.
Note: If the option to generate FASTQ files is enabled, ensure there is sufficient local disk space available to store the resulting output. The size of the individual FASTQ files output will be approximately equal to the input FASTQ.
- You can also change the sequence length scored and the Hamming distance used to identify targets. Change these parameters only if sequencing was done with a kit other than the standard 75-cycle Illumina sequencing kit. Normally, they should not be altered.
- Click the Start Alignment button to begin scoring the sequence data. Processing may take an hour or more. The program displays a progress indicator with information regarding the number of records processed while running. When complete, a message will appear on the screen.
Note: Clicking the Exit button or closing the program window will end the program prematurely. A warning message will appear to alert you that the program will close.
The Program outputs two tab-delimited text files. These files can be opened in Excel or other spreadsheet software. Both file names begin with the name of the Sample Description File, followed by the suffix shown below:
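The folder requirements described in the steps above (decompressed .fastq files, and enough disk space when per-sample FASTQ output is enabled) can be checked before clicking Start Alignment. This is a minimal Python sketch with hypothetical function names, not a feature of the DriverMap program. The disk-space estimate follows the guide's statement that the per-sample output is approximately the size of the input FASTQ files.

```python
import gzip
import shutil
from pathlib import Path

def decompress_fastq_files(folder):
    """Decompress every *.fastq.gz in folder to *.fastq; return the new paths."""
    created = []
    for gz_path in sorted(Path(folder).glob("*.fastq.gz")):
        out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        if out_path.exists():
            continue  # already decompressed; leave the existing file alone
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        created.append(out_path)
    return created

def enough_space_for_demultiplexing(fastq_folder, output_dir):
    """True if free space at output_dir covers the total size of the input .fastq files."""
    needed = sum(p.stat().st_size for p in Path(fastq_folder).glob("*.fastq"))
    free = shutil.disk_usage(output_dir).free
    return free >= needed
```

Because output files are saved in the same directory as the Sample Description File, that directory is the natural choice for `output_dir`. The original .fastq.gz files are left in place, so remove or move them afterward if the folder should contain only decompressed files.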
- The “sample-description-filename”_Statistics.txt file contains the table of total aligned and unaligned (i.e., background) counts for each sample.
- The “sample-description-filename”_Alignment.txt file has a table of aligned counts for each sample (in columns) and target (in rows).
Optionally, as indicated above, separate FASTQ files can be generated for each of the input samples. These may be used by other analysis software.
All output files will be saved in the same directory as the sample description file.
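As an alternative to a spreadsheet, the alignment table can be read with a short script. The sketch below sums the aligned counts per sample. It assumes a layout this guide does not spell out: a header row holding the sample names, the target identifier in the first column, and integer counts in the remaining cells. The function name is hypothetical, and the code should be adjusted if the actual file differs.

```python
import csv

def total_counts_per_sample(alignment_path):
    """Sum the counts in each sample column of a tab-delimited alignment table."""
    with open(alignment_path, newline="", encoding="utf-8", errors="replace") as handle:
        rows = csv.reader(handle, delimiter="\t")
        header = next(rows)
        samples = header[1:]  # assumption: the first column holds the target name
        totals = {sample: 0 for sample in samples}
        for row in rows:
            for sample, cell in zip(samples, row[1:]):
                if cell.strip():
                    totals[sample] += int(cell)  # assumption: integer counts
    return totals
```

The per-sample totals can then be compared with the aligned counts reported in the Statistics file as a quick consistency check.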