Wednesday, May 1, 2013

RNASeq Analysis: The Basics


Strange and Mysterious File Types (you might encounter)


Sequence Read Archive format (SRA): This is an NCBI-specific file format that stores read sequence information in compressed form. Sequencing reads deposited at NCBI, including those from Illumina runs, are typically distributed in this format.


Fastq file: SRA files can be converted to Fastq files. These are similar to Fasta files and contain, for each read, a header, the read sequence, and a quality score for each base of that sequence. The quality scores are encoded as ASCII characters (Phred scores) and need to be decoded by quality control tools. Fastq files therefore contain your raw sequence information.
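
To make the four-line Fastq layout concrete, here is a minimal sketch in Python of reading one record and decoding its quality string. It assumes the common Phred+33 encoding (older Illumina files used a different offset), and the function name and example read are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line Fastq record into (read_id, sequence, phred_scores)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed Fastq record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so score = ASCII code of the character - 33.
    phred_scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, phred_scores


# Hypothetical record, invented for this example.
example_record = [
    "@read_1\n",
    "GATTACA\n",
    "+\n",
    "IIIII#5\n",
]
read_id, sequence, phred_scores = parse_fastq_record(example_record)
print(read_id, sequence, phred_scores)
```

Here the character I decodes to a score of 40 (a very confident base call) and # decodes to 2 (essentially unusable), which is the kind of difference quality control tools act on.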


BAM and SAM Files: These are your alignment files. SAM stands for Sequence Alignment/Map, and BAM is the binary form of the same format. While BAM is not human-readable, it is often used because it is more compact and quicker for programs to read. These files are often produced by Bowtie or TopHat after your Fastq reads have been aligned to your reference genome.
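
Since SAM is plain tab-delimited text, a single alignment line can be picked apart by hand. The sketch below is only an illustration: it reads the eleven mandatory SAM columns and two of the FLAG bits, ignores optional tags and header lines, and uses an invented alignment line. For real work a dedicated library or samtools is the safer route.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one SAM alignment line (not a header line) into a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # FLAG is a bit field: 0x4 means unmapped, 0x10 means reverse strand.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# Hypothetical alignment line, invented for this example.
example_line = "read_1\t16\tchr1\t1042\t255\t7M\t*\t0\t0\tTGTAATC\tIIIII#5"
alignment = parse_sam_line(example_line)
print(alignment["rname"], alignment["pos"], alignment["is_reverse"])
```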


GTF/GFF: These are often used as a reference for counting how many reads map to genomic regions. They are tab-delimited files containing the start, stop, chromosome, and strand information along with the name of each genomic region (such as gene, transposon, mRNA).
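
As a rough illustration of what one annotation line holds, the following Python sketch splits a GTF line into its nine columns and unpacks the attribute column. The example line, its source name, and its gene_id are invented, and the attribute parsing assumes the GTF style of key "value"; pairs (GFF3 uses key=value instead).

```python
def parse_gtf_line(line):
    """Parse one GTF feature line into a dict (comment lines are not handled)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GTF feature line has exactly 9 tab-delimited columns")
    seqname, source, feature, start, end, score, strand, frame, attribute_text = columns
    attributes = {}
    # Assumption: GTF-style attributes such as: gene_id "geneA"; transcript_id "geneA.1";
    for item in attribute_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "chromosome": seqname,
        "feature": feature,
        "start": int(start),   # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical annotation line, invented for this example.
example_gtf = 'chr1\texample_source\texon\t1000\t1500\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
region = parse_gtf_line(example_gtf)
print(region["chromosome"], region["start"], region["end"], region["attributes"]["gene_id"])
```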



RNASeq process as a flow chart (top). The bottom row indicates the file formats and types of analysis for each step.


Pre-processing:

If files are obtained from NCBI's Gene Expression Omnibus (GEO), they are most likely in .SRA format and need to be converted to fastq using the fastq-dump tool from the SRA Toolkit.


Quality control:



Before processing the data, reads must first be trimmed of adapter sequences, and reads of bad quality should be filtered out. A great tool for easily viewing the distribution of quality scores is FastQC. I use trimreads to both filter low-quality reads and trim adapter sequences. For more detailed analysis, others have used components of the FASTX-Toolkit.
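
To show what trimming and filtering mean in practice, here is a deliberately simple Python sketch. It is not how trimreads or the FASTX-Toolkit work internally: it only cuts at an exact adapter match and filters on mean quality, and the adapter string, thresholds, and reads are all assumptions chosen for the example.

```python
def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quality
    return sequence[:index], quality[:index]


def mean_phred(quality):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(char) - 33 for char in quality) / len(quality)


def passes_filter(sequence, quality, min_mean_quality=20, min_length=5):
    """Keep reads that are long enough and good enough on average."""
    return len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality


# Hypothetical adapter and reads, invented for this example.
adapter = "AGATCGG"
trimmed_seq, trimmed_qual = trim_adapter("GATTACAGGAGATCGG", "IIIIIIIII#######", adapter)
keep_trimmed = passes_filter(trimmed_seq, trimmed_qual)
keep_low_quality = passes_filter("ACGTACGT", "########")
print(trimmed_seq, keep_trimmed, keep_low_quality)
```

Real trimmers also tolerate mismatches in the adapter and partial adapters at the end of a read, which this sketch does not attempt.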


Alignment to reference genome:

Your fastq file contains your short RNA sequence reads. These need to be aligned back to a reference. The reference depends on the model organism you are using and can be a sequence containing coding sequences only, microRNAs only, or the entire genome. Bowtie excels in both speed and memory efficiency when aligning short sequence reads to a long reference sequence. To obtain this efficiency, the reference must first be indexed. Bowtie builds a compressed index of the reference based on the Burrows-Wheeler transform, which lets it look up reads quickly without scanning the whole sequence. An index is created simply by:
bowtie-build reference_genome name_for_reference_genome
TopHat/Bowtie can then be used to align the fastq reads to the indexed reference. TopHat is often preferred in RNASeq analysis because, while it is built on top of Bowtie, it is also able to identify splice junctions between exons.
tophat --bowtie1 name_for_reference_genome fastq_reads
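
For the curious, the indexing step rests on the Burrows-Wheeler transform, which Bowtie uses to build its index. The Python sketch below is a toy version only: it sorts all rotations of a tiny made-up sequence, which would be far too slow and memory-hungry for a real genome, and it leaves out everything else a real index stores. It is meant to show the idea, not what bowtie-build does internally.

```python
def burrows_wheeler_transform(text, terminator="$"):
    """Naive BWT: sort all rotations of text + terminator, keep the last column."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_burrows_wheeler_transform(bwt, terminator="$"):
    """Naive inverse: repeatedly prepend the BWT column and re-sort."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


# Tiny made-up reference sequence.
reference = "ACAACG"
transformed = burrows_wheeler_transform(reference)
recovered = inverse_burrows_wheeler_transform(transformed)
print(transformed, recovered)
```

The transform is reversible, so nothing about the reference is lost, and it tends to group identical letters together, which is part of why the resulting index compresses well.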
Determining differentially expressed genes:

Once reads have been aligned to a reference genome, count data can be obtained.

Count Data: These are counts of RNASeq reads that align to each gene (or other genomic region indicated by the GTF file). The result is a tab-delimited file containing gene_ids and their corresponding read counts from your data.
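
A toy version of the counting step might look like the Python sketch below. The gene coordinates and alignments are invented, and the rule used here (a read counts for a gene only if it lies entirely inside it) is a simplifying assumption; dedicated counting tools also deal with strand, spliced reads, and reads that overlap more than one gene.

```python
from collections import Counter


def count_reads_per_gene(genes, alignments):
    """Count alignments that fall entirely within each gene's interval.

    genes: dict of gene_id -> (chromosome, start, end)
    alignments: iterable of (chromosome, start, end)
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in alignments:
        for gene_id, (gene_chrom, gene_start, gene_end) in genes.items():
            if chrom == gene_chrom and start >= gene_start and end <= gene_end:
                counts[gene_id] += 1
    return counts


def format_count_table(counts):
    """Render the counts as tab-delimited gene_id / count lines."""
    return "\n".join(f"{gene_id}\t{count}" for gene_id, count in sorted(counts.items()))


# Hypothetical annotation and alignments, invented for this example.
genes = {"geneA": ("chr1", 1000, 2000), "geneB": ("chr1", 5000, 6000)}
alignments = [
    ("chr1", 1042, 1048),
    ("chr1", 1500, 1506),
    ("chr1", 5990, 5996),
    ("chr2", 1042, 1048),   # different chromosome, not counted
    ("chr1", 3000, 3006),   # between genes, not counted
]
counts = count_reads_per_gene(genes, alignments)
count_table = format_count_table(counts)
print(count_table)
```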

To analyze for differential expression, two programs are commonly used: DESeq and edgeR. You can read up on comparisons of these two here:



Both programs assume that the number of reads mapping to each gene can be modeled using a negative binomial distribution.
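
The reason a negative binomial is used rather than a Poisson is that it allows the variance of the counts to be larger than the mean. The Python sketch below uses one common parameterization, with a mean and a dispersion parameter so that variance = mean + dispersion * mean^2. This is an illustration of the distribution only; the function names and numbers are made up, and the two packages estimate dispersion in their own ways.

```python
import math


def negative_binomial_pmf(count, mean, dispersion):
    """P(count) for a negative binomial with the given mean and dispersion."""
    size = 1.0 / dispersion        # often called r
    prob = size / (size + mean)    # success probability p
    log_pmf = (
        math.lgamma(count + size) - math.lgamma(size) - math.lgamma(count + 1)
        + size * math.log(prob) + count * math.log(1.0 - prob)
    )
    return math.exp(log_pmf)


def negative_binomial_variance(mean, dispersion):
    """Variance under this parameterization: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical gene with a mean count of 10 and a dispersion of 0.5.
mean_count, dispersion = 10.0, 0.5
probabilities = [negative_binomial_pmf(k, mean_count, dispersion) for k in range(2000)]
total = sum(probabilities)
empirical_mean = sum(k * p for k, p in enumerate(probabilities))
empirical_variance = sum((k - empirical_mean) ** 2 * p for k, p in enumerate(probabilities))
print(total, empirical_mean, empirical_variance)
```

With these numbers the variance comes out at 60, six times the mean, whereas a Poisson model would force the variance to equal the mean of 10.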






Additional Resources:

