Transcriptomics 101. Nicole Cloonan. Winter School, 5th July 2011
Transcriptional Complexity: Mutations, Allelic Expression, RNA Editing.
[Figure: gene-model diagram. Labels: TSS (transcription start site), ATG (translation start site), pA (polyadenylation signal), AAA (polyadenylation), PASR, TASR, tiRNA, miRNA. Legend: genomic DNA, microRNAs, spliced intron, protein coding regions, non-coding regions.]
Presentation Outline
Introduction: Transcriptional complexity; Surveying transcriptional complexity with microarrays; RNA-seq
RNAseq: RNAseq Mapping; Novel exon-junction discovery
RNAseq post-mapping analysis: Uniqueome; How to measure a transcript; Transcript assembly
miRNAseq: isomiRs; Information content; Expression Thresholding
Conclusions: Things to consider; Take home messages
Tag sequencing: SAGE, CAGE, MPSS, PET. [Figure: gene-model diagram with TSS, ATG, pA and AAA labels.]
Microarrays: microarray, exon arrays, exon-junction arrays. [Figure: gene-model diagram with TSS, ATG, pA and AAA labels.]
RNAseq. [Figure: gene-model diagram with TSS, ATG, pA and AAA labels.] Cloonan et al. Nat Methods 2008; 5:613-619
RNAseq Mapping: The fastest alignment methods are ungapped … but what about junctions? [Figure: gene-model diagram.]
Novel exon-junction discovery (systematic). Pros: Computationally easy. Cons: Does not find all novel splicing. [Figure: gene-model diagram.]
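One way to read the systematic approach is as a junction library: join the end of each known exon to the start of every downstream exon, then match reads against those synthetic sequences with a fast ungapped search. The sketch below assumes that reading. The exon sequences, the `flank` and `min_overhang` parameters and the function names are all hypothetical, and exact substring search stands in for a real aligner.

```python
from itertools import combinations


def build_junction_library(exons, flank):
    """Join every upstream exon end to every downstream exon start.

    exons: exon sequences of one gene, in genomic order (hypothetical input).
    Returns {(i, j): (junction_sequence, junction_position)}.
    """
    library = {}
    for (i, donor), (j, acceptor) in combinations(enumerate(exons), 2):
        left = donor[-flank:]      # last `flank` bases of the upstream exon
        right = acceptor[:flank]   # first `flank` bases of the downstream exon
        library[(i, j)] = (left + right, len(left))
    return library


def find_junction_hits(read, library, min_overhang=2):
    """Return exon pairs whose junction the read spans without gaps."""
    hits = []
    for pair, (seq, junction) in library.items():
        pos = seq.find(read)  # exact match, first occurrence only
        if pos == -1:
            continue
        # Require at least `min_overhang` bases on each side of the junction.
        if pos + min_overhang <= junction <= pos + len(read) - min_overhang:
            hits.append(pair)
    return hits


exons = ["AAAACCCC", "GGGGTTTT", "CACACAGT"]   # toy sequences, not real exons
library = build_junction_library(exons, flank=4)
print(find_junction_hits("CCCACA", library))   # read that skips the middle exon
print(find_junction_hits("CCGATT", library))   # read using an unannotated exon
```

The second read finds nothing because its exon is not in the library, which is the "does not find all novel splicing" limitation in miniature.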
Novel exon-junction discovery (Paired End). Pros: Very sensitive. Cons: Reasonable coverage required; Accuracy dependent on insert size distribution; Sequencing twice as expensive. [Figure: gene-model diagram.]
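The dependence on insert size can be illustrated with a simple rule, offered here as an assumption about how paired-end evidence is typically used rather than as the method on the slide: if two mates map much further apart on the genome than the library's insert sizes allow, an intron probably lies between them. The function name, the `z_cutoff` parameter and all numbers below are hypothetical.

```python
from statistics import mean, stdev


def flag_candidate_spliced_pairs(genomic_spans, library_insert_sizes, z_cutoff=3.0):
    """Return indices of read pairs whose genomic span is unexpectedly large.

    genomic_spans: outer distance between mates on the genome, one per pair.
    library_insert_sizes: insert sizes observed for pairs known to be unspliced.
    """
    mu = mean(library_insert_sizes)
    sd = stdev(library_insert_sizes)
    threshold = mu + z_cutoff * sd   # a broad distribution raises this threshold
    return [i for i, span in enumerate(genomic_spans) if span > threshold]


insert_sizes = [190, 200, 210, 195, 205]   # hypothetical tight library
spans = [198, 1450, 215, 230]              # hypothetical mapped pairs
print(flag_candidate_spliced_pairs(spans, insert_sizes))
```

With a tight library even a span of 230 stands out. With a broad insert size distribution the threshold rises and short introns can no longer be told apart from ordinary fragments.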
Novel exon-junction discovery (de novo). Steps: take non-matching tags; remove adaptor sequence; align reads to each other; create consensus read; Blat against genome. Pros: De novo. Cons: Requires high coverage.
[Figure: overlapping reads, each with occasional mismatched bases, aligned into the consensus read ACGATATTACACGTACAGTCAAGTCGTTCGGAACCT.]
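The "create consensus read" step can be sketched as a per-column majority vote over reads that have already been placed relative to one another. This is a minimal illustration under that assumption, not the pipeline used in the talk. The reads and offsets are toy values, and the function name is hypothetical.

```python
from collections import Counter


def consensus_from_aligned_reads(aligned_reads):
    """Majority-vote consensus from (offset, sequence) pairs.

    Assumes the reads cover every position between the first and last base;
    uncovered positions would simply be skipped.
    """
    columns = {}
    for offset, seq in aligned_reads:
        for k, base in enumerate(seq):
            columns.setdefault(offset + k, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


toy_reads = [
    (0, "ACGTTGCA"),
    (1, "CGTTGCAT"),
    (2, "GATGCATT"),   # contains one sequencing error at its second base
    (4, "TGCATTAG"),
]
print(consensus_from_aligned_reads(toy_reads))
```

The single error is outvoted because three reads cover that column, which is why the approach needs high coverage: a column covered by one read has nothing to correct it.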
Novel exon-junction discovery (TopHat). http://tophat.cbcb.umd.edu Pros: Very sensitive. Cons: Relies on reference. [Figure: gene-model diagram.]
Look at your data! [Figure: browser view of Gene Symbol GRB7 showing known gene structure, single nucleotide resolution coverage plot, exon-exon junction usage, alternative splicing, and novel exons or novel transcripts (exons and introns).]
Different aligners give different results. The patterns are largely the same so don’t panic … unless you’re doing RNAseq. Koehler et al. Bioinformatics 2011; 27(2):272-274
Uniqueome affects quantitation of RNAseq. Correction for unique content improves correlation to microarrays.
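A rough way to see why unique content matters: if only uniquely mapping reads are counted, bases that cannot be uniquely mapped can never collect reads, so dividing by the full annotated length underestimates expression. The sketch below uses the standard RPKM formula and swaps in a uniquely mappable length. The gene, its lengths and the read counts are hypothetical, and this is an assumed form of the correction, not necessarily the exact one behind the slide.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical gene: half of its 2000 bp cannot be uniquely mapped.
reads_on_gene = 500
annotated_length = 2000
uniquely_mappable_length = 1000
total_mapped = 10_000_000

uncorrected = rpkm(reads_on_gene, annotated_length, total_mapped)
corrected = rpkm(reads_on_gene, uniquely_mappable_length, total_mapped)
print(uncorrected, corrected)
```

For this toy gene the corrected value is double the uncorrected one, and the size of the adjustment differs from gene to gene with the fraction of unique sequence.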
RNAseq. [Figure: gene-model diagram with TSS, ATG, pA and AAA labels.] Cloonan et al. Nat Methods 2008; 5:613-619
How to detect a transcript? [Figure: four transcripts labelled A to D, each polyadenylated (AAA). Legend: AAA polyadenylation, spliced intron, protein coding regions, non-coding regions.]
How to detect a transcript? Accuracy relies on the quality of the gene models used. Different gene models will give different results from the same data. ~80% 92.6% known transcripts have diagnostic features (covers 99.8% of loci): 217127 diagnostic features covering 160156 individual transcripts from 65254 loci. [Figure: four transcripts labelled A to D, each polyadenylated (AAA).]
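A diagnostic feature can be read as an exon or exon-exon junction that occurs in exactly one transcript of a locus, so reads on it identify that transcript unambiguously. The sketch below assumes that definition. The transcripts and feature names are hypothetical and the function name is illustrative.

```python
from collections import Counter


def diagnostic_features(transcripts):
    """Map each transcript to the features found in no other transcript.

    transcripts: {transcript_name: iterable of feature labels} for one locus.
    """
    counts = Counter(f for feats in transcripts.values() for f in set(feats))
    return {
        name: sorted(f for f in set(feats) if counts[f] == 1)
        for name, feats in transcripts.items()
    }


locus = {   # hypothetical gene models; e = exon, j = junction
    "A": ["e1", "e2", "e3", "j1-2", "j2-3"],
    "B": ["e1", "e3", "j1-3"],
    "C": ["e1", "e2", "j1-2"],
}
print(diagnostic_features(locus))
```

Transcript C has no diagnostic feature under these gene models, so its expression cannot be separated from A's. Changing the gene models changes the answer, which is the dependence the slide warns about.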
Reference assisted transcript assembly: Scripture, Cufflinks. Guttman et al., Nat Biotech 2010; 28(5):503-10
Reference free alignment - de novo assembly: Trinity, Oases, Abyss. [Figure: example loci, Gene Symbol: MGAT5 and Gene Symbol: RAN.] Cloonan et al., Unpublished
miRNAs. [Figure: miRNA biogenesis and function. pri-miRNA; Drosha processing; pre-miRNA; Dicer processing; miRNA duplex; asymmetrical unwinding; RNA-Induced Silencing Complex (RISC); RISC-mRNA interactions; outcomes: translational inhibition, mRNA sequestration, mRNA degradation.] Most interactions thought to occur in the 3’ UTR.
MicroRNAs are small and closely related
[Figure: histogram of Proportion of miRNAs (%) against Length of miRNAs (nt), 15 to 25 nt.]
miR-17-5p : CAAAGUGCUUACAGUGCAGGUAGU
miR-20    : UAAAGUGCUUAUAGUGCAGGUAG-
miR-106a  : AAAAGUGCUUACAGUGCAGGUAGC
miR-106b  : UAAAGUGCUGACAGUGCAGAU---
miR-93    : -AAAGUGCUGUUCGUGCAGGUAG-
miR-18    : UAAGGUGCAUCUAGUGCAGAUA--
            AAaGUGCu aGUGCAG Ua
miR-19a   : UGUGCAAAUCUAUGCAAAACUGA-
miR-19b-1 : UGUGCAAAUCCAUGCAAAACUGA-
miR-19b-2 : UGUGCAAAUCCAUGCAAAACUGA-
            UGUGCAAAUCcAUGCAAAACUGA
Information content in short tags. Map to a subset of the genome instead.
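A back-of-envelope estimate shows why short tags carry too little information for a whole genome. Under the simplifying assumption of random sequence with equal base frequencies, a tag of length L is expected to occur by chance about 2G / 4^L times in a search space of G bases, counting both strands. The genome size and the size of the "subset" below are hypothetical round numbers, and real genomes are repetitive, so true ambiguity is worse than this estimate.

```python
def expected_random_hits(tag_length, search_space_bp):
    """Expected chance matches of one tag on both strands of random sequence."""
    return 2 * search_space_bp / 4 ** tag_length


WHOLE_GENOME = 3_000_000_000   # roughly human-sized, for illustration
MIRNA_SUBSET = 100_000         # hypothetical set of miRNA precursor sequences

for length in (15, 18, 22):
    print(length,
          expected_random_hits(length, WHOLE_GENOME),
          expected_random_hits(length, MIRNA_SUBSET))
```

At 15 nt the whole genome is expected to contain several chance matches per tag, while the small subset contains essentially none.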
Not allowing mismatches does not solve the problem
miR A : tagcgggatctctcgagagctcgcgat
miR B : tagcgggatctctcgacagctcgcgat
Tag tctctcgagagct: 0 MM to miR A, 1 MM to miR B
Tag tctctcgacagct: 1 MM to miR A, 0 MM to miR B
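The mismatch counts on this slide can be checked with a small ungapped comparison. The sequences are taken from the slide; the function name is illustrative and the brute-force scan stands in for a real aligner.

```python
def min_mismatches(read, reference):
    """Fewest mismatches over all ungapped placements of read on reference."""
    best = len(read)
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        best = min(best, sum(a != b for a, b in zip(read, window)))
    return best


MIR_A = "tagcgggatctctcgagagctcgcgat"
MIR_B = "tagcgggatctctcgacagctcgcgat"

for tag in ("tctctcgagagct", "tctctcgacagct"):
    print(tag, "miR A:", min_mismatches(tag, MIR_A), "miR B:", min_mismatches(tag, MIR_B))
```

A tag from miR A with a single sequencing error at the distinguishing base becomes a perfect match to miR B, so a zero-mismatch policy assigns it confidently to the wrong miRNA instead of discarding it.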
IsomiRs are common and functional. [Figure: pre-miRNA, 5’ to 3’.] Cloonan et al. Genome Biol 2011; 12(12):R126
Expression Thresholding. Cloonan et al. Genome Biol 2011; 12(12):R126
Things to consider
Check your data! Visualization strategies:
  IGV (brilliant for individual read resolution)
  UCSC (brilliant for genomic context of expression)
  Heatmaps, etc. (brilliant for quantification)
Check your mapping statistics: % mapped? % mapped at what length? Redundancy, etc.
Make sure the controls are doing what they should be.
Remember the limitations and parameters of your alignment strategy - be careful with interpretation!
  E.g. variable alignment strategies that trim starts and ends of tags will overestimate the relative complexity of your library.
  E.g. discarding all tags that map to multiple regions will limit your ability to detect closely related gene families, or sequence motifs in repetitive/low complexity areas.
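The mapping statistics listed above can be computed in a few lines once each read is known to have mapped or not. This is a minimal sketch: the input format, the function name and the definition of redundancy (the share of reads that duplicate an already seen sequence) are assumptions made for illustration.

```python
from collections import defaultdict


def mapping_summary(reads):
    """Summarise (sequence, mapped) pairs into basic mapping statistics."""
    total = len(reads)
    mapped = sum(1 for _, is_mapped in reads if is_mapped)
    by_length = defaultdict(lambda: [0, 0])   # length -> [mapped, total]
    for seq, is_mapped in reads:
        by_length[len(seq)][0] += int(is_mapped)
        by_length[len(seq)][1] += 1
    distinct = len({seq for seq, _ in reads})
    return {
        "pct_mapped": 100.0 * mapped / total,
        "pct_mapped_by_length": {
            n: 100.0 * m / t for n, (m, t) in sorted(by_length.items())
        },
        "pct_redundant": 100.0 * (total - distinct) / total,
    }


toy_reads = [("ACGT", True), ("ACGT", True), ("ACGTA", False), ("GGCCA", True)]
print(mapping_summary(toy_reads))
```

A sharp drop in percent mapped at particular lengths, or unusually high redundancy, is the kind of signal worth investigating before any downstream analysis.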
Conclusions
RNAseq and miRNAseq both require special attention to mapping strategies.
Choose an alignment strategy that will answer your biological question first and foremost, and then consider available resources.
If your strategy won’t work, it’s better to know BEFORE sequencing rather than afterwards.
Check your mapped data: better to find errors before extensive analysis and validation.
Be careful in your interpretation of the data.