Strange and Mysterious File Types (you might encounter)
Sequence Read Archive format (SRA): This is an NCBI-specific file format used because of its ability to compress read sequence information. Raw reads from many Illumina sequencing runs are deposited in this format.
FASTQ file: SRA files can be converted to FASTQ files. These are similar to FASTA files and contain a header, the associated sequence, and a quality score for each base of the sequence. The quality scores are encoded as ASCII characters (Phred scores) and are read by quality control algorithms. FASTQ files therefore contain your raw sequence information.
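To make the four-line record layout concrete, here is a minimal Python sketch that reads FASTQ text and decodes the quality string. It assumes the common Phred+33 encoding and unwrapped four-line records; the names `FastqRecord` and `parse_fastq` and the example read are made up for illustration.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (assumes no line wrapping)."""
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes a Phred score Q as chr(Q + offset).
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Hypothetical record: header, sequence, separator, quality string.
example = ["@read_1", "GATTACA", "+", "IIII#5?"]
for record in parse_fastq(example):
    print(record.name, record.sequence, record.qualities)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the `I` characters above (Q = 40) are high-confidence bases while `#` (Q = 2) is close to a guess.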
BAM and SAM files: These are your alignment files, where SAM stands for Sequence Alignment/Map and BAM is its binary form. While BAM is unreadable by humans, it is often used because it is more compact and quicker for programs to read. These files are often produced by Bowtie or TopHat after your FASTQ reads have been aligned to your reference genome.
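Because SAM is plain tab-delimited text, a single alignment line can be pulled apart with a few lines of Python. The sketch below names the eleven mandatory SAM columns and checks two bits of the FLAG field; the function name and the example line are invented for illustration, and real files also carry header lines (starting with `@`) and optional tag columns that are ignored here.

```python
from typing import Dict, Union

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Union[str, int, bool]]:
    """Split one SAM alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, Union[str, int, bool]] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # FLAG is a bit field: 0x4 marks an unmapped read, 0x10 the reverse strand.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


# Hypothetical alignment of a 7-base read to the reverse strand of chr1.
sam_line = "read_1\t16\tchr1\t100\t42\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
parsed = parse_sam_line(sam_line)
print(parsed["RNAME"], parsed["POS"], parsed["is_reverse"])
```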
GTF/GFF: These are often used as a reference for counting how many reads map to genomic regions. They are tab-delimited files containing the chromosome, start, stop, and strand information along with the name and type of each genomic region (such as gene, transposon, mRNA).
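As a rough illustration of the layout, the following Python sketch parses one GTF line into its coordinates and attributes. It assumes the nine-column GTF form with `key "value";` attributes (GFF3 writes attributes as `key=value` instead); the `GtfFeature` type, the function name, and the example line are hypothetical.

```python
import re
from typing import Dict, NamedTuple


class GtfFeature(NamedTuple):
    chromosome: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


# Matches GTF-style attributes such as: gene_id "geneA";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line: str) -> GtfFeature:
    """Parse one nine-column GTF line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GTF line has exactly 9 tab-separated columns")
    seqname, _source, feature, start, end, _score, strand, _frame, attrs = columns
    attributes = dict(ATTRIBUTE_PATTERN.findall(attrs))
    return GtfFeature(seqname, feature, int(start), int(end), strand, attributes)


# Hypothetical exon record.
gtf_line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
feature = parse_gtf_line(gtf_line)
print(feature.chromosome, feature.start, feature.end, feature.attributes["gene_id"])
```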
RNASeq process in flow chart on top. Bottom indicates the different file formats and types of analysis for each step.
Pre-processing:
If files are obtained from NCBI’s Gene Expression Omnibus (GEO), they are most likely in .SRA format and need to be converted to FASTQ using fastq-dump from the SRA Toolkit.
Quality control:
Before the data are processed, reads must first be trimmed of adapter sequences, and reads of bad quality should be filtered out. A great tool for easily viewing the distribution of quality scores is FastQC. I use trimreads to both filter low-quality reads and trim adapter sequences. For more detailed analysis, others have used many components of the FASTX-Toolkit.
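The two operations described above, adapter trimming and quality filtering, can be sketched in a few lines of Python. This is not how any particular trimming tool works internally: it assumes Phred+33 qualities, matches the adapter exactly (real tools tolerate mismatches and partial adapters), and the function names, thresholds, and adapter string are chosen for illustration only.

```python
from typing import Optional, Tuple


def trim_adapter(sequence: str, quality: str, adapter: str) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of a 3' adapter."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a non-empty quality string."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def clean_read(sequence: str, quality: str, adapter: str,
               min_mean_quality: float = 20.0,
               min_length: int = 15) -> Optional[Tuple[str, str]]:
    """Return the trimmed read, or None if it should be discarded."""
    sequence, quality = trim_adapter(sequence, quality, adapter)
    # Reads that are too short after trimming are dropped (also avoids
    # averaging over an empty quality string).
    if len(sequence) < max(min_length, 1):
        return None
    if mean_phred(quality) < min_mean_quality:
        return None
    return sequence, quality


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence
read = "ACGTACGTACGTACGTACGT" + ADAPTER
print(clean_read(read, "I" * len(read), ADAPTER))
```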
Alignment to reference genome:
Your FASTQ file contains your short RNA sequence reads. These need to be aligned back to a reference. The reference depends on the model organism you are using and can contain coding sequences only, microRNAs only, or the entire genome. Bowtie excels in both speed and memory efficiency when aligning short sequence reads to a long reference sequence. To obtain this efficiency, the reference must first be indexed. Indexing builds a compressed, searchable representation of the reference (Bowtie uses a Burrows-Wheeler index) so that each read can be looked up without scanning the whole sequence. An index is created simply by:
bowtie-build reference_genome name_for_reference_genome
TopHat or Bowtie can then be used to align the FASTQ reads to the indexed reference. TopHat is often preferred in RNASeq analysis because, while it is built on top of Bowtie, it is also able to identify splice junctions between exons.
tophat --bowtie1 name_for_reference_genome fastq_reads
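To see why indexing the reference first pays off, consider the toy Python sketch below. It is deliberately not Bowtie's data structure (Bowtie uses a compressed Burrows-Wheeler index); a plain dictionary of k-mers stands in for it to show the general idea of building the index once and then looking up each read without scanning the whole reference. The function names and sequences are made up.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position in range(len(reference) - k + 1):
        index[reference[position:position + k]].append(position)
    return index


def find_exact_matches(read: str, reference: str,
                       index: Dict[str, List[int]], k: int) -> List[int]:
    """Return 0-based reference positions where the read matches exactly."""
    if len(read) < k:
        return []
    hits = []
    # Only positions sharing the read's first k-mer need to be checked.
    for start in index.get(read[:k], []):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits


reference = "ACGTACGTTAGCACGTTAGG"  # toy reference sequence
index = build_kmer_index(reference, k=4)
print(find_exact_matches("ACGTTAG", reference, index, k=4))
```

The index is built once per reference and reused for every read, which is the same reason the bowtie-build step is run once before any alignment.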
Determining differentially expressed genes:
Once reads have been aligned to a reference genome, count data can be obtained.
Count Data: These are counts of RNASeq reads that align to a gene (or other genomic regions indicated by the GTF file). The resulting tab-delimited file contains gene_ids and their corresponding read counts from your data.
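The counting step itself can be sketched in Python as an interval-overlap check. This is a simplification under stated assumptions: genes are single intervals, coordinates are 1-based and inclusive, strand is ignored, and reads that overlap no gene or more than one gene are skipped. Dedicated counting tools handle exons, strand, and multi-mapping reads more carefully. All names and coordinates below are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Gene = Tuple[str, str, int, int]   # (gene_id, chromosome, start, end)
Alignment = Tuple[str, int, int]   # (chromosome, start, end)


def count_reads_per_gene(genes: List[Gene],
                         alignments: List[Alignment]) -> Counter:
    """Count aligned reads that overlap exactly one gene interval."""
    counts = Counter({gene_id: 0 for gene_id, *_ in genes})
    for chrom, read_start, read_end in alignments:
        hits = [gene_id for gene_id, gene_chrom, gene_start, gene_end in genes
                if gene_chrom == chrom
                and read_start <= gene_end and read_end >= gene_start]
        if len(hits) == 1:  # skip unassigned and ambiguous reads
            counts[hits[0]] += 1
    return counts


genes = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 300, 400)]
alignments = [("chr1", 150, 180), ("chr1", 190, 210), ("chr1", 250, 260),
              ("chr1", 350, 380), ("chr2", 150, 180)]
counts = count_reads_per_gene(genes, alignments)

# Write the result in the tab-delimited gene_id / count layout.
for gene_id, _chrom, _start, _end in genes:
    print(f"{gene_id}\t{counts[gene_id]}")
```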
To analyze for differential expression, two programs are commonly used: DESeq and edgeR. You can read up on comparisons of these two here:
- A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.
- Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing
- DESeq vs edgeR Comparison
Both programs assume that the number of reads mapping to a gene in a sample can be modeled using a negative binomial distribution.
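A short numerical sketch may help show what the negative binomial assumption buys. One common parameterization uses a mean and a dispersion, giving variance = mean + dispersion * mean^2, so counts are allowed to vary more than a Poisson model (variance = mean) would permit. The Python below evaluates that distribution with the standard library only; the parameter values are arbitrary, and the two programs differ in how they estimate dispersion from real data.

```python
import math


def negative_binomial_pmf(k: int, mean: float, dispersion: float) -> float:
    """P(count = k) for a negative binomial with the given mean and dispersion."""
    r = 1.0 / dispersion   # "size" parameter
    p = r / (r + mean)     # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


mean, dispersion = 50.0, 0.1  # arbitrary illustrative values
probabilities = [negative_binomial_pmf(k, mean, dispersion) for k in range(2000)]

# Recover the mean and variance numerically from the probabilities.
estimated_mean = sum(k * p for k, p in enumerate(probabilities))
estimated_variance = sum((k - estimated_mean) ** 2 * p
                         for k, p in enumerate(probabilities))

print(f"mean = {estimated_mean:.2f}")
print(f"variance = {estimated_variance:.2f}")
print(f"mean + dispersion * mean^2 = {mean + dispersion * mean ** 2:.2f}")
```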
Additional Resources: