Fishing in a sample to discard irrelevant RNA-Seq reads Paola - PowerPoint PPT Presentation

Shark 🦉 Fishing in a sample to discard irrelevant RNA-Seq reads Paola Bonizzoni, Tamara Ceccato, Gianluca Della Vedova, Luca Denti, Yuri Pirola , Marco Previtali and Raffaella Rizzi DISCo - Univ. degli Studi di Milano-Bicocca DSB 2020 / Rennes / Feb. 4+5, 2020

The problem Sequencing technologies produce a lot of data Sequencing datasets are piling up in public repositories What if I want to analyze only a subset of genes (RNA- Seq studies)? Notice: aligning RNA-Seq reads to the genome is not an easy task! 2

Outline The Gene Assignment Problem The Algorithm Experimental results: Synthetic dataset Reproduction of real-world analyses 3

The Gene Assignment Problem Given: a set of RNA-Seq reads S a set of genes (genomic sequences) G two parameters and k ∈ N + τ ∈ [0, 1] Compute s.t. for each : { S 1 , … , S | G | } s ∈ S i ⊆ S the set of bases of "covered" by at least a -mer shared with has s k g i cardinality at least τ ⋅ | s | such a cardinality is maximum (w.r.t. other ). g j ∈ G 4

The Gene Assignment Problem - Goals We want to be: alignment-free (reduce potential alignment biases) highly sensitive (almost no reads are lost) as speci�c as possible (the dataset is reduced as much as possible) very fast use a modest amount of memory 5

The algorithm High-level idea: 1. Index the -mers of gene sequences in k G 2. Assign each read to its set of genes (if any) s ∈ S 6

The algorithm - Indexing genes Index : ⟨ BF , I , P ⟩ a Bloom �lter storing -mers of BF k G (with a single hash function ) H an integer vector storing indices of genes I associated to each -mer k a bit-vector providing a mapping between P BF and by "tagging" the boundaries among the I different subsets of genes in I 7

The algorithm - Indexing genes 8

The algorithm - Indexing genes 9

The algorithm - How to build the index? 1. Scan each -mer of the gene sequences in and k G store them in (using a single hash function ) BF H 2. Associate an empty list to each 1 in L v BF 3. Re-scan each -mer of each and add to the k e g i ∈ G i list associated to the 1 at position in L v H ( e ) BF 4. Copy (preserving the order) all the lists into while L v I tagging the boundaries between lists with a 1 in P 10

The algorithm - Assigning reads To assign each read s ∈ S 1. For each -mer : k e ∈ s 1. Query the index to �nd the genes containing e (FP are possible due to ) BF 2. For each gene, compute how many bases are covered by but not by previous -mers e k 2. Output the IDs of genes that cover the largest number of bases of if bases s ≥ τ ⋅ | s | 11

Implementation Shark is available at https://github.com/AlgoLab/shark Binaries through Bioconda conda config � � add channels bioconda conda install shark Libraries: sdsl � lite for DS Intel TBB for multi-threading 12

Does it work? 13

Experimental results How and affect sensitivity and speci�city? k τ → synthetic datasets Is Shark useful? Does it change the results? Does it make analyses faster and/or less memory hungry? → replication of "real-world" analyses 14

Synthetic datasets - Data Input data: RNA-Seq sample of 10M 100bp-long reads on chr1, 17, 21 k ∈ {13, 17, 23, 27, 31} τ ∈ {0.2, 0.4, 0.6, 0.8} -mers covering bases with PHRED quality score have been k < q discarded, with ( means no �lter) q ∈ {0, 10, 20} q = 0 10 different instances selecting random subsets of 100 genes each Accuracy measures: Recall = TP/(TP + FN) Precision = TP/(TP + FP) 15

Synthetic datasets - Results 16

Synthetic datasets - Results Good compromise: → . k = 17, τ = 0.6, q = 10 R = 99.46%, P = 28.8% 17

Synthetic datasets - Results Memory usage was always below 2.1GB 18

Replication of real-world analyses 19

Replication of real-world analyses Aim: Differential analysis of AS events Input data: 6 PE RNA-Seq samples (~180M 101bp-long reads) 82 distinct genes with 83 exp. validated exon skipping events Pipelines: SplAdder rMATS SUPPA2 20

Replication - Transcript quanti�cation 21

Replication - Diff. expressed events Validated events Pipeline Time [min] Memory [GB] All p-value < 0.05 rMATS 78 63 308 33.9 Shark + rMATS 78 63 142 33.9 SplAdder 56 NA 796 33.9 Shark + SplAdder 56 NA 295 33.9 SUPPA2 66 37 65 4.3 Shark + SUPPA2 66 43 34 4.4 Thanks to Shark we can also decrease the max memory consumption to less than 16GB w/o affecting running times (data not shown). 22

Speeding-up assembly-�rst analyses 23

Speeding-up assembly-�rst analyses Preliminary results: Almost no changes in stat. signif. results Signi�cant speed-up (from ~25h to ~3h) Signi�cantly less memory (from ~12.5GB to ~5.5GB) Ongoing work: Manual inspection of results 24

Conclusions Shark speeds up analyses by focusing on reads likely sequenced from genes of interest Highly sensitive (almost no reads are lost) Good precision (dataset is substantially reduced) For more details → preprint on Ongoing work: Algorithmic solution is simple (but effective). Can we do better? Clearly not suitable for all kind of analyses (i.e., transcriptome-wide analyses). But, if we want to focus on speci�c genes, do results change? 25

Thanks! 26

Synthetic datasets - Results 28

Fishing in a sample to discard irrelevant RNA-Seq reads Paola - PowerPoint PPT Presentation

Shark Fishing in a sample to discard irrelevant RNA-Seq reads Paola Bonizzoni, Tamara Ceccato, Gianluca Della Vedova, Luca Denti, Yuri Pirola , Marco Previtali and Raffaella Rizzi DISCo - Univ. degli Studi di Milano-Bicocca DSB 2020 /

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi < lg

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop QC and

PRIMARY SECTOR FISHING FISHING Definition Types of fishing: deep-sea and shallow-see

Jen Grenier Director, TREx Facility Announcements New and Improved Project Submission Form

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Winter School, 2 July 2012 Why do RNA-seq? Differential expression analysis of Discover new

Visualization of results Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Introduc)on to the Analysis of RNA-seq Data Lecture

Reducing technical variability and bias in RNA-seq data Francesca Finotello NETTAB 2012

RNA-seq Data Analysis Introduction to RNA-seq data analysis September, 2018 1 Guillermo Parada

Overview of the DE analysis Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Strategies for Bulk RNA-seq Analysis Genome Transcriptome Assembly Mapping Mapping Reads

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Learning Concept Taxonomies from Multi-modal Data Hao Zhang Zhiting Hu, Yuntian Deng, Mrinmaya

Digital Literary Stylis.cs Anne BANDRY-SCUBBI Womens Novels 1750s-1830s and the Company They

Venues of Excellence Spring Members Forum Friday 5 th April 2019 West Court, Je Jesus Colle

Learning to Learn Kernels with Variational Random Features Presenter : Haoliang Sun Xiantong

Sound and Sharks, Investigating Detection from Different Directions: Detecting sharks

Wireshark Tutorial Chris Neasbitt UGA Dept. of Computer Science Contents Introduction

Ftrace Linux Kernel Tracing Steven Rostedt srostedt@redhat.com rostedt@goodmis.org

CS 378: Autonomous Intelligent Robotics Instructor: Jivko Sinapov

Fishing in a sample to discard irrelevant RNA-Seq reads Paola - PowerPoint PPT Presentation

Shark Fishing in a sample to discard irrelevant RNA-Seq reads Paola Bonizzoni, Tamara Ceccato, Gianluca Della Vedova, Luca Denti, Yuri Pirola , Marco Previtali and Raffaella Rizzi DISCo - Univ. degli Studi di Milano-Bicocca DSB 2020 /

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi &lt; lg

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop QC and

PRIMARY SECTOR FISHING FISHING Definition Types of fishing: deep-sea and shallow-see

Jen Grenier Director, TREx Facility Announcements New and Improved Project Submission Form

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Winter School, 2 July 2012 Why do RNA-seq? Differential expression analysis of Discover new

Visualization of results Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Introduc)on to the Analysis of RNA-seq Data Lecture

Reducing technical variability and bias in RNA-seq data Francesca Finotello NETTAB 2012

RNA-seq Data Analysis Introduction to RNA-seq data analysis September, 2018 1 Guillermo Parada

Overview of the DE analysis Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Strategies for Bulk RNA-seq Analysis Genome Transcriptome Assembly Mapping Mapping Reads

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Learning Concept Taxonomies from Multi-modal Data Hao Zhang Zhiting Hu, Yuntian Deng, Mrinmaya

Digital Literary Stylis.cs Anne BANDRY-SCUBBI Womens Novels 1750s-1830s and the Company They

Venues of Excellence Spring Members Forum Friday 5 th April 2019 West Court, Je Jesus Colle

Learning to Learn Kernels with Variational Random Features Presenter : Haoliang Sun Xiantong

Sound and Sharks, Investigating Detection from Different Directions: Detecting sharks

Wireshark Tutorial Chris Neasbitt UGA Dept. of Computer Science Contents Introduction

Ftrace Linux Kernel Tracing Steven Rostedt srostedt@redhat.com rostedt@goodmis.org

CS 378: Autonomous Intelligent Robotics Instructor: Jivko Sinapov

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi < lg