RNA-seq read mapping, Pär Engström, SciLifeLab

RNA-seq read mapping
Pär Engström
SciLifeLab RNA-seq workshop
April 2016

Input: sequence reads (FASTQ format)

@D2BJQQN1:142:C03FNACXX:5:1101:12935:2174 1:N:0:GCCAAT
ATGCTGTGCAGGGCCTTGAGAACATGCGGGGGAATACATGTGGGTTTTTGG
+
=??DDDDD2,4<D2+4AE#################################
@D2BJQQN1:142:C03FNACXX:5:1101:13035:2084 1:N:0:GCCAAT
TTTCTATAGTTCGTTACTAGAGAAGTTTTTTTGATTGTGTGGGGGTCCGGG
+
=??DD2=23AD,3CE223,3+3+33<CF+4)0?)*19?DD###########
@D2BJQQN1:142:C03FNACXX:5:1101:13322:2013 1:Y:0:GCCAAT
CGTTCCCGTGGTGGGATTTTTGGGTGGCAGGGGACTTCGGTTGGGGGATTT
+
:>>A2+20<AC@CC)+4@2<+1?AAA#########################
@D2BJQQN1:142:C03FNACXX:5:1101:13460:2061 1:Y:0:GCCAAT
CCGCGGTCGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGTGGGGTTTTTG
+
=??DDDD)<C):?######################################
…
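As a minimal sketch of how four-line FASTQ records like the ones above can be read, the following Python generator yields one (name, sequence, qualities) tuple per record. It assumes unwrapped four-line records, no trailing blank lines, and Phred+33 quality encoding. The function name read_fastq and the tiny example record are illustrative and are not part of the slides.

```python
import io
from itertools import islice


def read_fastq(handle, phred_offset=33):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream.

    Assumption: qualities are Phred scores encoded as ASCII with the given offset.
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading '@' and decode each quality character to an integer.
        yield header[1:], seq, [ord(c) - phred_offset for c in qual]


# Hypothetical four-base record, used only to show the call pattern.
example = io.StringIO("@read1\nACGT\n+\n=?D#\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
```

Under Phred+33, the long runs of '#' at the ends of the reads above decode to a quality of 2, the lowest value in those records.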

Goal: reads mapped to genome
[Figure: genome browser view of hg19 chr2:136,872,000-136,876,000 (scale bar 2 kb) at the CXCR4 locus. Tracks: Gm12878 RNA-seq reads; Gm12878 RNA-seq coverage, minus strand; RefSeq Genes; 100 vertebrates Basewise Conservation by PhyloP; Simple Nucleotide Polymorphisms (dbSNP 144) Found in >= 1% of Samples; Repeating Elements by RepeatMasker.]

Spliced alignment
Garber et al., Nature Methods 2011

Introns can be very large!
[Figure: cumulative proportion (0.0 to 1.0) of human introns (Ensembl) against intron size (bp), from 10 bp to 10M bp.]

Limited sequence signals at splice sites
[Figure: 5' splice site (5'ss), branch point (BP) and 3' splice site (3'ss) signals for H. sapiens, D. rerio, D. melanogaster, P. chrysosporium, S. cerevisiae, P. tricornutum and A. thaliana.]
• GT…AG 98.6%
• GC…AG 1.3%
• AT…AC 0.1%
Iwata and Gotoh, BMC Genomics 2011
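To illustrate how little sequence the splice site dinucleotides provide, here is a small Python sketch that reads the terminal dinucleotides of an intron and counts how often a dinucleotide occurs in a sequence. The function names, the 0-based half-open coordinates and the toy sequences are assumptions made for this example, not something the slides specify.

```python
# Intron boundary classes listed on the slide.
KNOWN_MOTIFS = {"GT-AG", "GC-AG", "AT-AC"}


def intron_motif(genome, donor, acceptor):
    """Return the terminal dinucleotides of the intron genome[donor:acceptor].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    intron = genome[donor:acceptor].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have two dinucleotides")
    return f"{intron[:2]}-{intron[-2:]}"


def count_sites(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


# Toy sequence: three exonic bases, an eight-base "intron", three exonic bases.
toy = "AAAGTCCCCAGTTT"
motif = intron_motif(toy, 3, 11)
print(motif, motif in KNOWN_MOTIFS)

# A two-base signal matches at many positions that are not splice sites.
print(count_sites("GTGTAGGT", "GT"))
```

In uniformly random sequence, a given dinucleotide is expected about once every 16 positions, so the GT and AG signals alone cannot locate an intron.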

Multi-mapping reads and pseudogenes
[Figure: a read aligned to a functional gene (correct read alignment: identical, spliced) and to a processed pseudogene (incorrect read alignment: mismatches, not spliced).]
Note:
• An aligner may report both alignments or either
• Some search strategies and scoring schemes give preference to unspliced alignments
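The effect of a scoring scheme on this choice can be sketched with a toy example. The scoring function, the penalties and the match counts below are all hypothetical and do not correspond to any particular aligner. They only show how a penalty on splicing can make the mismatched, unspliced pseudogene alignment outscore the identical, spliced one.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    locus: str
    matches: int
    mismatches: int
    introns: int


def score(c, mismatch_penalty=2, splice_penalty=0):
    """Toy alignment score: reward matches, penalize mismatches and introns."""
    return c.matches - mismatch_penalty * c.mismatches - splice_penalty * c.introns


def best(candidates, **penalties):
    """Return the highest-scoring candidate under the given penalties."""
    return max(candidates, key=lambda c: score(c, **penalties))


# Hypothetical 100 bp read: identical but spliced at the gene,
# unspliced with three mismatches at the processed pseudogene.
gene = Candidate("functional gene", matches=100, mismatches=0, introns=1)
pseudogene = Candidate("processed pseudogene", matches=97, mismatches=3, introns=0)

print(best([gene, pseudogene], splice_penalty=0).locus)
print(best([gene, pseudogene], splice_penalty=12).locus)
```

With no splice penalty the gene alignment scores 100 against 91; with a splice penalty of 12 it drops to 88 and the pseudogene alignment is preferred.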

Current RNA-seq aligners
• TopHat2: Kim et al., Genome Biology 2013
• HISAT & HISAT2: Kim et al., Nature Methods 2015
• STAR: Dobin et al., Bioinformatics 2013
• GSNAP: Wu and Nacu, Bioinformatics 2010
• OLego: Wu et al., Nucleic Acids Research 2013
• HPG aligner: Medina et al., DNA Research 2016
• MapSplice2: http://www.netlab.uky.edu/p/bioinfo/MapSplice2

The predecessor: BLAT
"In the process of assembling and annotating the human genome, I was faced with two very large-scale alignment problems: aligning three million ESTs and aligning 13 million mouse whole-genome random reads against the human genome. These alignments needed to be done in less than two weeks' time on a moderate-sized (90 CPU) Linux cluster in order to have time to process an updated genome every month or two. To achieve this I developed a very-high-speed mRNA/DNA and translated protein alignment algorithm."
(Kent, Genome Research 2002)

Innovations in RNA-seq alignment software
• Read pair alignment
• Consider base call quality scores
• Sophisticated indexing to decrease CPU and memory usage
• Map to genetic variants
• Resolve multi-mappers using regional read coverage
• Consider junction annotation
• Two-step approach (junction discovery & final alignment)

Two-step RNA-seq read mapping
• 1st run of HISAT to discover splice sites
• 2nd run of HISAT to align reads by making use of the list of splice sites collected above
[Figure: mapped and unmapped reads over exons e1, e2 and e3 in each run. Legend: read, exon, intron, global search, local search, extension, junction extension.]
Kim et al., Nature Methods 2015
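The two-step idea can be sketched on a toy genome. This is not HISAT's implementation: it uses exact string matching, requires GT…AG intron boundaries, and all names, sequences and thresholds are invented for illustration. The first pass only accepts junctions supported by long anchors on both sides. The second pass aligns reads against the junctions collected in the first pass, so a read with a very short anchor can still be placed.

```python
def discover_junction(read, genome, min_anchor=8, min_intron=10):
    """Pass 1: find a GT...AG junction with >= min_anchor read bases on each side.

    Returns (donor, acceptor) as 0-based intron start and first base after the
    intron, or None. Only the first downstream match of the right anchor is tried.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        start = genome.find(left)
        while start != -1:
            donor = start + split
            if genome[donor:donor + 2] == "GT":
                acceptor = genome.find(right, donor + min_intron)
                if acceptor != -1 and genome[acceptor - 2:acceptor] == "AG":
                    return donor, acceptor
            start = genome.find(left, start + 1)
    return None


def align_with_junctions(read, genome, junctions):
    """Pass 2: align a read across a known junction, allowing anchors of any length.

    Returns (read_start, donor, acceptor) or None.
    """
    for donor, acceptor in junctions:
        for split in range(1, len(read)):
            left, right = read[:split], read[split:]
            start = donor - split
            if start < 0:
                break  # longer left parts would start before the genome
            if (genome[start:donor] == left
                    and genome[acceptor:acceptor + len(right)] == right):
                return start, donor, acceptor
    return None


# Toy genome: two 20-base exons separated by a 20-base GT...AG intron.
exon1 = "ACGTTGCAAGCTTGACCATG"
intron = "GTAAGTCCTTTTCCTTTCAG"
exon2 = "GATTACAGGCCTAGCTAGGA"
genome = exon1 + intron + exon2

long_anchor_read = exon1[-10:] + exon2[:10]   # 10 bases on each side
short_anchor_read = exon1[-17:] + exon2[:3]   # only 3 bases in the second exon

reads = [long_anchor_read, short_anchor_read]
found = (discover_junction(r, genome) for r in reads)
junctions = {j for j in found if j is not None}
print(junctions)
print([align_with_junctions(r, genome, junctions) for r in reads])
```

The short-anchor read yields no junction in the first pass, because three bases are too few to anchor a search, but it aligns in the second pass using the junction discovered from the long-anchor read.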

Mapping accuracy
[Figure: percentage of reads (0 to 100) per aligner. Aligners: OLego, GSNAP, STAR, HISAT, TopHat2, HISAT×1, STAR×2, HISAT×2. Categories: correctly and uniquely mapped; correctly mapped (multimapped); incorrectly mapped; unmapped.]
Bar labels, in extraction order (which may differ from the aligner order listed above), with the sum of the two correctly mapped categories in parentheses: 92.1 (93.5), 91.6 (91.6), 92.3 (93.8), 88.9 (90.5), 96.7 (99), 97.6 (99.2), 97.6 (99.3), 94.4 (97.4)
Accuracy for 20 million simulated human 100 bp reads with 0.5% mismatch rate
Kim et al., Nature Methods 2015

Mapping accuracy for reads with small anchors
[Figure: percentage of reads (0 to 100) per aligner, in two panels (2M_8_15 and 2M_1_7). Aligners: OLego, GSNAP, STAR, HISAT, TopHat2, HISAT×1, STAR×2, HISAT×2. Categories: correctly and uniquely mapped; correctly mapped (multimapped); incorrectly mapped; unmapped.]
Read categories: 62.4% (M), 3.1% (gt_2M), 4.2% (2M_1_7), 5.1% (2M_8_15), 25.1% (2M_gt_15)
Bar labels, in extraction order (which may differ from the aligner order listed above), with the sum of the two correctly mapped categories in parentheses:
• 2M_8_15: 88.2 (89.4), 88.5 (88.5), 90.4 (91.7), 51.1 (52.2), 94.4 (96.4), 97.2 (98.9), 97.2 (99), 93.5 (95.5)
• 2M_1_7: 7.6 (7.7), 0 (0), 9.2 (9.4), 0 (0), 79.8 (92.6), 94.4 (97.3), 95.4 (98.3), 77.8 (96)
Kim et al., Nature Methods 2015

Mapping accuracy for spliced RNA-seq reads
[Figure: percent of simulated spliced reads per protocol, for Simulation 1 and Simulation 2. Categories: perfectly mapped; part correctly mapped; mapped, no base correct; no base correctly mapped but intersecting correct location. Protocols: BAGET ann, GEM ann, GEM cons, GEM cons ann, GSNAP, GSNAP ann, GSTRUCT, GSTRUCT ann, MapSplice, MapSplice ann, PALMapper, PALMapper ann, PALMapper cons, PALMapper cons ann, PASS, PASS cons, ReadsMap, SMALT, STAR 1-pass, STAR 1-pass ann, STAR 2-pass, STAR 2-pass ann, TopHat1, TopHat1 ann, TopHat2, TopHat2 ann.]
• High accuracy at mapping to correct locus: GSNAP, GSTRUCT, MapSplice, STAR
• High rate of perfect spliced alignments: ReadsMap, TopHat2 ann
Engström et al., Nature Methods 2013

Major differences in indel frequencies
[Figure: size distribution of insertions (%) and deletions (%) per protocol. Indel size (bases): 1, 2, 3, 4, 5–7, 8+.]

Protocol            Insertions   Deletions
BAGET ann                14.46       13.07
GEM ann                  83.32       29.12
GEM cons                 85.76       29.51
GEM cons ann             84.91       29.39
GSNAP                     5.80       10.25
GSNAP ann                 4.84        8.97
GSTRUCT                   4.94        9.16
GSTRUCT ann               4.90        9.12
MapSplice                 1.65        4.98
MapSplice ann             1.65        5.00
PALMapper                31.54       61.71
PALMapper cons            0.68        0.30
PASS                      2.44        4.95
PASS cons                 2.38        4.77
ReadsMap                  2.70        4.48
SMALT                     8.91        9.92
STAR 1-pass               2.00        4.14
STAR 1-pass ann           2.03        4.50
STAR 2-pass               2.02        4.37
STAR 2-pass ann           2.02        4.50
TopHat1                   2.05        7.29
TopHat1 ann               2.05        7.33
TopHat2                   6.71        6.09
TopHat2 ann               5.86        6.94

Indel frequencies are tabulated (number of indels per thousand sequenced reads). Data set: K562 (mean).
Engström et al., Nature Methods 2013
