Transcript Assembly and Quantification from RNASeq Data


  1. Transcript Assembly and Quantification from RNASeq Data Angelika Merkel & David Gonzales-Knowles Centre de Regulacio Genomica (CRG), Barcelona, Spain COST RNASeq workshop, Uppsala (May 2012) Tuesday, May 22, 2012

  2. Why RNASeq...
     • it's amazing!
     • represents steady-state RNA abundance over a wide dynamic range
     • quantifies alternative splicing
     • de novo splice junction and element detection

  3. RNASeq Data
     • illustrate the “volume”:
     • give some specs on current RNAseq datasets, e.g. CLL, ENCODE, Illumina...
     • amount: number of sets in the lab

  4. Software
     • Mapping: TopHat, GEM
     • Transcript assembly: Cufflinks
     • Transcript quantification: FluxCapacitor

  5. Mapping I - TopHat
     • TopHat identifies splice sites ab initio by large-scale mapping of RNA-Seq reads. It maps reads to splice sites in a mammalian genome at a rate of ∼2.2 million reads per CPU hour.
     • TopHat first maps non-junction reads (those contained within exons) using Bowtie (http://bowtie-bio.sourceforge.net), an ultra-fast short-read mapping program (Langmead et al., 2009) [2 mismatches, up to 10 multimaps]. Reads that do not map are kept as initially unmapped (IUM) reads.
     • The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.
     Trapnell, Pachter & Salzberg (2009)

  6. TopHat seeding
     The seed-and-extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
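The seed-and-extend idea on this slide can be illustrated with a minimal sketch. This is not TopHat's implementation: the function names, the seed half-length `k`, the toy genome, and the linear scan over reads (TopHat queries an index of the unmapped reads) are all assumptions made for illustration.

```python
def make_seed(genome, donor_end, acceptor_start, k=4):
    """Join k bases upstream of the donor with k bases downstream of the acceptor.

    donor_end is the index just past the last base of the upstream exon;
    acceptor_start is the index of the first base of the downstream exon.
    """
    return genome[donor_end - k:donor_end] + genome[acceptor_start:acceptor_start + k]


def spliced_hits(genome, donor_end, acceptor_start, reads, k=4, max_mismatches=0):
    """Return (read, left_len) for reads that align across the candidate junction.

    left_len is the number of read bases placed on the upstream exon.
    Hypothetical helper: scans reads linearly instead of using a read index.
    """
    seed = make_seed(genome, donor_end, acceptor_start, k)
    hits = []
    for read in reads:
        pos = read.find(seed)
        while pos != -1:
            left_len = pos + k                 # read bases before the junction
            right_len = len(read) - left_len   # read bases after the junction
            if left_len <= donor_end:
                left_ref = genome[donor_end - left_len:donor_end]
                right_ref = genome[acceptor_start:acceptor_start + right_len]
                if len(right_ref) == right_len:
                    ref = left_ref + right_ref
                    # The seed matches exactly, so mismatches can only
                    # occur in the flanking (extended) part of the read.
                    mismatches = sum(a != b for a, b in zip(read, ref))
                    if mismatches <= max_mismatches:
                        hits.append((read, left_len))
            pos = read.find(seed, pos + 1)
    return hits


# Toy genome: exon, intron (GT...AG), exon.
genome = "ACGTACGTTG" + "GTAAGTCCCCCTTTTCAG" + "CCATGGATCC"
print(spliced_hits(genome, 10, 28, ["ACGTTGCCATGG", "GTAAGTCCCCCT"]))
```

The `max_mismatches` parameter plays the role of the user-specified mismatch allowance in the extended (light gray) region.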

  7. • latest release: TopHat 2.0.0 (released 4/09/2012)
     • TopHat homepage: http://tophat.cbcb.umd.edu/

  8. GEM
     • pipeline for RNAseq:
     • initial mapping with GEM short read mapper (genome, transcriptome)
     • unaligned reads mapped with GEM split-mapper
     • recursive mapping with trimmed reads and increased no. of mismatches to improve

  9. GEM split-mapper
     • short description + picture (hopefully Paolo is sending me something)

  10. • latest release: ?
      • GEM homepage: http://sourceforge.net/apps/mediawiki/gemlibrary/

  11. Transcript assembly + quantification
      • Genes sometimes have multiple alternative splicing events, and there may be many possible reconstructions of the gene model that explain the sequencing data. In fact, it is often not obvious how many splice variants of the gene may be present. Thus, Cufflinks reports a parsimonious transcriptome assembly of the data. The algorithm reports as few full-length transcript fragments or 'transfrags' as are needed to 'explain' all the splicing event outcomes in the input data.
      • Issues are the same for the FluxCapacitor.

  12. Cufflinks
      • Overview of Cufflinks. The algorithm takes as input cDNA fragment sequences that have been (a) aligned to the genome by software capable of producing spliced alignments, such as TopHat. With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping ‘bundles’ of fragment alignments (b-c) separately, which reduces running time and memory use because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d-e).
      (b) The first step in fragment assembly is to identify pairs of ‘incompatible’ fragments that must have originated from distinct spliced mRNA isoforms. Fragments are connected in an ‘overlap graph’ when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue, and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three.
      (c) Assembling isoforms from the overlap graph. Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally ‘covered’ by three paths, each representing a different isoform. Dilworth's Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to “explain” all the fragments. Cufflinks implements a proof of Dilworth's Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform.
      (d) Estimating transcript abundance. Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks can incorporate the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to Cufflinks' model, if it were to come from the red isoform instead of the blue isoform.
      (e) The program then numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms (γ1, γ2, γ3), producing the abundances that best explain the observed fragments, shown as a pie chart.
      Reference: Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Trapnell (2010)
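The assembly step in (b-c) can be sketched as a minimum chain cover computed through bipartite matching, which is the standard constructive route to Dilworth's Theorem. The code below is a simplified illustration, not Cufflinks' code: it assumes the "compatible and to the left of" relation is supplied as a transitively closed set of pairs, and the fragment names (`g1`, `g2` for two gray fragments) are hypothetical.

```python
def min_chain_cover(nodes, precedes):
    """Cover all nodes with the fewest chains of mutually compatible fragments.

    nodes: fragments, ordered left to right along the genome.
    precedes: set of (a, b) pairs meaning "a lies left of b and the two are
    compatible"; assumed to be transitively closed (an assumption of this sketch).
    """
    successors = {u: [v for v in nodes if (u, v) in precedes] for u in nodes}
    matched_pred = {}  # v -> u, meaning u directly precedes v in some chain

    def try_augment(u, seen):
        # Kuhn's augmenting-path search for bipartite matching.
        for v in successors[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_pred or try_augment(matched_pred[v], seen):
                matched_pred[v] = u
                return True
        return False

    for u in nodes:
        try_augment(u, set())

    # Each unmatched node starts a chain; follow matched edges to extend it.
    next_of = {u: v for v, u in matched_pred.items()}
    chains = []
    for head in (u for u in nodes if u not in matched_pred):
        chain = [head]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        chains.append(chain)
    return chains


# Toy version of the figure: yellow, blue and red are mutually incompatible;
# the gray fragments g1 (left) and g2 (right) are compatible with everything.
nodes = ["g1", "yellow", "blue", "red", "g2"]
precedes = {("g1", c) for c in ("yellow", "blue", "red", "g2")}
precedes |= {(c, "g2") for c in ("yellow", "blue", "red")}
chains = min_chain_cover(nodes, precedes)
print(len(chains), chains)
```

The number of chains equals the number of fragments minus the size of the maximum matching. In this sketch each fragment lands in exactly one chain, whereas the paths in the figure may share the gray fragments.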

  13. • latest release: Cufflinks 2.0.0 (released 5/4/2012; includes cuffmerge, cuffcompare, cuffdiff)
      • Cufflinks homepage: http://cufflinks.cbcb.umd.edu/
      • Workshop at Berkeley, June 30th (Cufflinks and eXpress), online viewing $50: http://qb3.berkeley.edu/qb3/starseq/

  14. FluxCapacitor
      The basic problem addressed by the FLUX CAPACITOR. The exonic structure of two spliceforms (labeled "SF A" and "SF B") is shown, with aligned reads produced by RNAseq methods (top). Those reads mapped to the edges of a splicing graph (bottom) represent a signal, measured as the FLUX: the relative coverage along an exonic stretch. Where transcripts overlap in exons, their respective flux is combined. Given the information from all edges in a locus, signal separation is achieved by decomposition across a flow network.
      Reference: Montgomery et al. 2010
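The signal-separation idea can be illustrated with a toy model in which the flux observed on each edge is the sum of the fluxes of the spliceforms that contain it. The sketch below solves that system by ordinary least squares; this is a stand-in chosen for brevity, not the Flux Capacitor's own optimization, and the function name, the membership matrix, and the flux values are made up for illustration.

```python
def decompose_flux(edge_membership, observed_flux):
    """Least-squares estimate of per-spliceform flux.

    edge_membership[i][j] is 1 if edge i belongs to spliceform j, else 0.
    observed_flux[i] is the coverage measured on edge i.
    Solves the normal equations (A^T A) x = A^T b by Gauss-Jordan elimination.
    Non-negativity is not enforced in this sketch.
    """
    n_edges = len(edge_membership)
    n_tx = len(edge_membership[0])
    ata = [[sum(edge_membership[i][r] * edge_membership[i][c] for i in range(n_edges))
            for c in range(n_tx)] for r in range(n_tx)]
    atb = [sum(edge_membership[i][r] * observed_flux[i] for i in range(n_edges))
           for r in range(n_tx)]
    aug = [row + [rhs] for row, rhs in zip(ata, atb)]
    for col in range(n_tx):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n_tx), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("spliceforms are not separable from these edges")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_tx):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n_tx] / aug[r][r] for r in range(n_tx)]


# Columns: SF A, SF B. The second edge is shared, so its flux is combined.
membership = [
    [1, 0],  # edge only in SF A
    [1, 1],  # edge shared by SF A and SF B
    [1, 0],  # edge only in SF A
    [0, 1],  # edge only in SF B
]
flux_a, flux_b = decompose_flux(membership, [10.0, 14.0, 10.0, 4.0])
print(round(flux_a, 6), round(flux_b, 6))
```

With noisy coverage the system is generally inconsistent, and the least-squares solution then spreads the disagreement over all edges in the locus.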

  15. The assignment of reads, after having mapped them to genomic locations, is not straightforward. The Flux Capacitor follows a conservative annotation assignment, i.e., reads are assigned uniquely to genomic regions (“segments” or “junctions”). These regions are defined given the exon-intron structure of each locus; an example is shown in Fig. 1.
      Fig. 1: An example locus with two transcripts I and II (names to the left) that overlap in segments of their exons (green boxes denoted by letters A through E, indices indicate segments of overlapping exons). The Flux Capacitor further distinguishes 5 non-exonic areas. 19 sequencing reads (arrows with heart labels) have been mapped in the area of the locus as shown.
      http://fluxcapacitor.wikidot.com/capacitor
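A rough sketch of how a locus can be cut into segments and how reads can be assigned uniquely to them is given below. It is an illustration under stated assumptions, not the Flux Capacitor's procedure: coordinates are half-open, only contiguous (unspliced) reads are handled, reads that cross a segment boundary are simply left unassigned, and the function names and coordinates are hypothetical.

```python
def build_segments(transcripts):
    """Split exonic sequence at every exon boundary found in the locus.

    transcripts maps a transcript name to a list of (start, end) exons,
    with end exclusive. Returns (start, end, owners) tuples, where owners
    is the frozenset of transcripts whose exons cover that segment.
    """
    cuts = sorted({pos for exons in transcripts.values() for exon in exons for pos in exon})
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        owners = frozenset(
            name for name, exons in transcripts.items()
            if any(s <= start and end <= e for s, e in exons)
        )
        if owners:  # stretches covered by no exon are not exonic segments
            segments.append((start, end, owners))
    return segments


def count_reads(segments, reads):
    """Count reads that fall entirely inside exactly one segment."""
    counts = {seg: 0 for seg in segments}
    for r_start, r_end in reads:
        for seg in segments:
            if seg[0] <= r_start and r_end <= seg[1]:
                counts[seg] += 1
                break
    return counts


# Two transcripts that overlap in parts of their exons, as in Fig. 1.
locus = {
    "I": [(100, 200), (300, 400)],
    "II": [(150, 200), (300, 350)],
}
segments = build_segments(locus)
reads = [(110, 140), (160, 190), (140, 170), (310, 340), (360, 390), (220, 250)]
for seg, n in count_reads(segments, reads).items():
    print(seg[0], seg[1], sorted(seg[2]), n)
```

Segments owned by a single transcript carry unambiguous signal, while shared segments carry the combined signal of both transcripts.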

  16. • latest release: ?
      • FluxCapacitor homepage: http://flux.sammeth.net/capacitor.html
      • video: http://www.scivee.tv/node/10013
