Mapping short RNA-Seq by comparing tree Work in progress Possibly - PowerPoint PPT Presentation

Mapping short RNA-Seq by comparing tree Work in progress Possibly useless Matthias Zytnicki INRAE, MIAT DSB 2020 1 / 20

RNA-Seq Griffiths et al., PLOS Comp. Biol., 2015 2 / 20

Mapping Definition Prediction of the locus which produced the RNA read . Read Genome ACGT CATCAGTCTAGACGTTCACAACCA ⇒ chr1:12–15 Tricky situations • Reads may be slightly different from the genome sequence. Read Genome ACGT CATCAGTCTAGACGGTCACAACCA • Corresponding loci are repeated. Read Genome ACGT ACATACGTTCACACGTCGAT 3 / 20

Our question Particularities of sRNA-Seq • A population of different classes of small RNAs: miRNAs, tRFs, siRNAs, piRNAs, etc. • They are short (about 22–24bp, after trimming). • Sequences are highly duplicated ( ∼ 5% the exact same read). • Most mismatches happen at the ends of the reads. from miRBase 4 / 20

Our question — Cont. Observation • Most mapping tool developments are dedicated to long reads. • There is no dedicated tool for sRNAs. Usual (biological) query For each read, get me all the regions with minimum number of mismatches n , with n ≤ k . 5 / 20

Data Reads • Stored in a tree. • Counts, and best quality is kept. ε @read1 @read4 A CGA 1 A C + + H H HHI @read2 @read5 2 G T CG CGC II + + HI IIH @read3 @read6 A C CG CT + + IH II 6 / 20

Data Genome Suffix tree • Stored in a suffix array. ε • Using BWA implementation. Example A B N BANANA N A A A N N N A A 7 / 20

Data List of suffixes Genome BANANA • Stored in a suffix array. ANANA • Using BWA implementation. NANA ANA Example NA A BANANA Suffix array 5 A 3 ANA 1 ANANA 0 BANANA 4 NA 2 NANA 8 / 20

Main idea Aim • For each accepting “read node,” compute the all the “genome nodes” with minimum distance not greater than k . • For each “reads node,” compute recursively the all the “genome nodes” with distance not greater than k . ε ε A C A C G T A T A C A C G Note: The genome tree here is not an actual suffix tree. It is just presented as an illustration. 9 / 20

Implementation . . . . . . A A 1 err. T T match . . . 10 / 20

Optimization 1 Expect a 0-error mapping first • Map with no error first. • In case of error at depth d , add an error up to depth d . ε A A C . . . A 11 / 20

Optimization 2 Map the unbranched regions “the usual way” When a read unbranched terminal path is found, gather all the corresponding genome sequences, and apply a banded Smith-Waterman up to the leaves. ε ε A A . . . A C G . . . . . . . . . 12 / 20

Optimization 3 The genome tree is a vector of 4 8 trees • The first tree is labelled AAAAAAAA . • The second tree is labelled AAAAAAAC . • etc. • Each tree starts at depth 8. A AAAAAAAA A C A AAAAAAAC 13 / 20

Other optimizations Remove low complexity reads ACACACACACA Use radix tree instead of standard tree for the reads tree A C G ACG ⇒ Can process several reads files A file 1: 20 C file 2: 43 14 / 20

Results Test case • 15,492,953 reads of size 15–101. • Genome: A. thaliana . • BWA aln: 14min, 221kB. • srnaMapper: 6min, 1.6GB. Bottleneck % cumulative self time seconds seconds calls name 47.53 161.27 161.27 bwt_2occ 26.24 250.28 89.01 bwt_occ 9.92 283.93 33.65 43390524 mapWithoutError 15 / 20

Problem # states increase Compare to dynamic programming A A A ε A 0 1 2 → → ε 1 err. ↓ ց ց A match A 1 0 1 → 1 err. insert. Bottom line • You do not want all the mappings. • How to implement a good # states vs states elimination balance? 16 / 20

Implementation details — Reads First pass • Edges contain the nucleotides (and the size), and the address to the following node. • No predefined order. • Each node contains 4 edges, the read counts, and the qualities. Second pass • Nodes are sorted in a depth-first fashion. • Read counts and qualities are stored in a parallel vector. 17 / 20

Implementation details — Rest Genome • Tree: the BWA structure. • Buffer: last children intervals are kept in memory. Smith-Waterman A (stupid) read length × (2 k + 1) matrix. 18 / 20

Next • Clever way to reduce the number of states. • Bug fixes (read mapping at the ends of a chromosome. . . ). • Other optimizations (branch sequences in an external string?). • Use several processors. • Available at https://github.com/mzytnicki/srnaMapper (branch sw ). 19 / 20

That’s all, folks! Thank you for your attention! 20 / 20

Mapping short RNA-Seq by comparing tree Work in progress Possibly - PowerPoint PPT Presentation

Mapping short RNA-Seq by comparing tree Work in progress Possibly useless Matthias Zytnicki INRAE, MIAT DSB 2020 1 / 20 RNA-Seq Griffiths et al., PLOS Comp. Biol., 2015 2 / 20 Mapping Definition Prediction of the locus which produced the

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi < lg

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop QC and

Jen Grenier Director, TREx Facility Announcements New and Improved Project Submission Form

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Winter School, 2 July 2012 Why do RNA-seq? Differential expression analysis of Discover new

Visualization of results Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Introduc)on to the Analysis of RNA-seq Data Lecture

Reducing technical variability and bias in RNA-seq data Francesca Finotello NETTAB 2012

RNA-seq Data Analysis Introduction to RNA-seq data analysis September, 2018 1 Guillermo Parada

Overview of the DE analysis Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

RNA-seq read mapping Pr Engstrm SciLifeLab

Differential expression analysis for sequencing count data Simon Anders RNA-Seq Count data in

What is single-cell RNA-Seq, and why is it useful? S IN GLE-CELL RN A-S EQ W ORK F LOW S IN R

CPSC 121: Models of Computation Translate back and forth between simple natural language

Sorting Parity Encodings by Reusing Variables Leroy Chew and Marijn J.H. Heule 23rd International

Log-gases on a quadratic lattice via discrete loop equations Alisa Knizel Columbia University

Genericity, Infinitary Interpretations, and Automorphism Groups of Structures Russell Miller

Evolution Life Through Time J. D. Price Evolution remains at the center of controversy in

Algorithms in Bioinformatics: A Practical Introduction Database Search Biological databases

Create Centered Stacked Bar Charts V0A 12/11/2016 for Even-Choice Ordinal Data using Excel 2013

Create Centered Stacked Bar Charts V0A 12/11/2016 for Odd-Choice Ordinal Data using Excel 2013

Mapping short RNA-Seq by comparing tree Work in progress Possibly - PowerPoint PPT Presentation

Mapping short RNA-Seq by comparing tree Work in progress Possibly useless Matthias Zytnicki INRAE, MIAT DSB 2020 1 / 20 RNA-Seq Griffiths et al., PLOS Comp. Biol., 2015 2 / 20 Mapping Definition Prediction of the locus which produced the

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi &lt; lg

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop QC and

Jen Grenier Director, TREx Facility Announcements New and Improved Project Submission Form

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Winter School, 2 July 2012 Why do RNA-seq? Differential expression analysis of Discover new

Visualization of results Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Introduc)on to the Analysis of RNA-seq Data Lecture

Reducing technical variability and bias in RNA-seq data Francesca Finotello NETTAB 2012

RNA-seq Data Analysis Introduction to RNA-seq data analysis September, 2018 1 Guillermo Parada

Overview of the DE analysis Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

RNA-seq read mapping Pr Engstrm SciLifeLab

Differential expression analysis for sequencing count data Simon Anders RNA-Seq Count data in

What is single-cell RNA-Seq, and why is it useful? S IN GLE-CELL RN A-S EQ W ORK F LOW S IN R

CPSC 121: Models of Computation Translate back and forth between simple natural language

Sorting Parity Encodings by Reusing Variables Leroy Chew and Marijn J.H. Heule 23rd International

Log-gases on a quadratic lattice via discrete loop equations Alisa Knizel Columbia University

Genericity, Infinitary Interpretations, and Automorphism Groups of Structures Russell Miller

Evolution Life Through Time J. D. Price Evolution remains at the center of controversy in

Algorithms in Bioinformatics: A Practical Introduction Database Search Biological databases

Create Centered Stacked Bar Charts V0A 12/11/2016 for Even-Choice Ordinal Data using Excel 2013

Create Centered Stacked Bar Charts V0A 12/11/2016 for Odd-Choice Ordinal Data using Excel 2013

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi < lg