Using k-mers with errors for Nanopore read analysis
Quentin Bonenfant, Laurent Noé, Hélène Touzet
{quentin.bonenfant, laurent.noe, helene.touzet}@univ-lille.fr
Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) – UMR CNRS 9189 – BONSAI team
Overview
1) K-mers
2) Long read sequencing
3) K-mers with errors
4) Use case: Nanopore adapters
5) Results
6) Conclusion
Quentin Bonenfant - SeqBio 2018
K-mers
● Substring of size k
k = 8: ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT
● Used for:
– Assembly (SPAdes → De Bruijn graph)
– Mapping (Bowtie2 → Burrows-Wheeler Transform)
– Overlapping (Minimap2 → Minimizers)
– …
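As a small illustration of the definition above, the sketch below lists every k-mer of the example sequence with k = 8. The function name `kmers` is only illustrative and does not come from the talk.

```python
def kmers(seq, k):
    """Return every substring of length k of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Example sequence shown on the slide.
read = "ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT"
words = kmers(read, 8)

# A sequence of length n holds n - k + 1 overlapping k-mers.
print(len(words), words[0])
```

A sequence shorter than k simply yields an empty list, because the range is empty.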
Long read sequencing
10-15% of errors
ATCAGTCAGCGGGGTATCTACTC---CACCTATCGAGCTTTTTTATCT
|||||||||||   |||| ||||   |||||||||||||||  |||||
ATCAGTCAGCG---TATCGACTCTAGCACCTATCGAGCTTT--TATCT
Figure labels: k-mer, insertion, deletion, substitution
Long read sequencing
How to account for sequencing errors?
→ k-mers with errors
→ d: maximum number of errors
K-mers with errors?
k = 8, d = 1
AATTCCGG (k-mer)
ACTTCCGG (substitution)
AATTC-GG (deletion)
AATTTCCGG (insertion)
...
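To make the d = 1 case concrete, here is a minimal sketch that enumerates every string at edit distance 1 from a k-mer, assuming the DNA alphabet and single substitutions, insertions or deletions. The name `neighbours_1` is hypothetical; the talk does not describe such a function.

```python
def neighbours_1(kmer, alphabet="ACGT"):
    """Return the set of strings at edit distance exactly 1 from kmer."""
    out = set()
    for i in range(len(kmer) + 1):
        for c in alphabet:
            out.add(kmer[:i] + c + kmer[i:])  # insertion before position i
        if i < len(kmer):
            out.add(kmer[:i] + kmer[i + 1:])  # deletion of position i
            for c in alphabet:
                out.add(kmer[:i] + c + kmer[i + 1:])  # substitution at position i
    out.discard(kmer)  # substituting a base by itself gives the k-mer back
    return out


n = neighbours_1("AATTCCGG")
# The three variants shown on the slide all belong to the neighbourhood.
print("ACTTCCGG" in n, "AATTCGG" in n, "AATTTCCGG" in n)
```

The neighbourhood grows quickly with k and d, which is why indexing all neighbours becomes expensive.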
How?
● Using dynamic programming → Large computational cost
● Indexing all neighbours → Memory expensive / long to compute
● Searching with errors in an index → 01*0 seeds
01*0 seeds
● Approximate seeds
● Lossless
● Principle:
– Choose a value for d
– Split the k-mer into d + 2 blocks
– Search the blocks in the index
01*0 seeds
Pigeonhole principle
4 pigeons (d), 6 holes (d + 2)
→ At least 2 holes are empty
(with d errors spread over d + 2 blocks, at least 2 blocks are error-free)
01*0 seeds
Example: finding “AUCAGUGCAAAUGCUCAAGA” with d = 3, k = 20
→ Split into 5 blocks of size 4
01*0 seeds
1)
pattern     AUCA GUGC AAAU GCUC AAGA
alignment   ||||  ||| | || || | ||||
occurrence  AUCA AUGC A-AU GCGC AAGA
seed          0    1    1    1    0
2)
pattern     AUCA GUGC AAAU GCUC -AAGA
alignment   |||  |||| | || ||||  ||||
occurrence  AUC- GUGC AUAU GCUC AAAGA
seed               0    1    0
3)
pattern     AUCA GUGC AAAU GCUC AAGA
alignment   |||| | |  |||| || | ||||
occurrence  AUCA GAGA AAAU GC-C AAGA
seed                    0    1    0
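The block splitting and the three examples above can be sketched as follows. This is only an illustration: it splits a pattern into d + 2 blocks and, given the number of errors that fall in each block, looks for a run of the form 0 1* 0 (two exact blocks separated by blocks holding exactly one error each). The names `split_blocks` and `find_seed` are hypothetical, and the real implementations search the blocks in an index rather than inspecting known error counts.

```python
import re


def split_blocks(pattern, d):
    """Split pattern into d + 2 consecutive blocks of near-equal length."""
    n = d + 2
    q, r = divmod(len(pattern), n)
    blocks, start = [], 0
    for i in range(n):
        size = q + (1 if i < r else 0)  # the first r blocks get one extra base
        blocks.append(pattern[start:start + size])
        start += size
    return blocks


def find_seed(errors_per_block):
    """Return (first, last) block indices of the leftmost 0 1* 0 run, or None."""
    # Encode each block as one character; 2 stands for "two or more errors".
    code = "".join(str(min(e, 2)) for e in errors_per_block)
    m = re.search(r"01*0", code)
    return (m.start(), m.end() - 1) if m else None


print(split_blocks("AUCAGUGCAAAUGCUCAAGA", 3))
print(find_seed([0, 1, 1, 1, 0]))  # example 1
print(find_seed([1, 0, 1, 0, 1]))  # example 2
print(find_seed([0, 2, 0, 1, 0]))  # example 3
```

Whenever at most d errors are spread over d + 2 blocks, such a run exists, which is what makes the seed lossless.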
01*0 seeds
● First implementation
– BWOLO (2014)
– BWT
Vroland C, Salson M, Bini S, Touzet H. Approximate search of short patterns with high error rates using the 01*0 lossless seeds. Journal of Discrete Algorithms 37, 2016
● SeqAn implementation
– Optimum Search Scheme (2018)
– Bidirectional BWT
Kianfar K, Pockrandt C, Torkamandi B, Luo H, Reinert K. FAMOUS: Fast Approximate String Matching Using OptimUm Search Schemes. RECOMB-Seq 2018
Use case: Motif inference for Nanopore adapters
Nanopore adapters sequence
Nanopore adapters sequence
● Sequencing adapters sequence
● Porechop
– https://github.com/rrwick/Porechop/
– Adapter trimming
– Known adapters database
● Can we guess the adapter sequence from the reads?
Our method
● Identify k-mers composing the adapter
– Higher frequency at the start / end of reads
● Reconstruct the adapter from the k-mers
Frequency of k-mers (k = 16)
Counting k-mers with errors
● Select the 500 most frequent 16-mers
● Count all occurrences with d = 2 errors
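The two steps above can be sketched in plain Python. This is not the authors' implementation, which relies on an index and 01*0 seeds: here a simple dynamic programming scan stands in for the approximate search. The sketch also counts, for each error level e ≤ d, the number of read prefixes that contain the k-mer with at most e errors, which is a simplifying assumption since the slides do not say how occurrences are counted. All function names are hypothetical.

```python
from collections import Counter


def top_exact_kmers(prefixes, k, n):
    """Return the n most frequent exact k-mers as (k-mer, count) pairs."""
    counts = Counter(p[i:i + k] for p in prefixes for i in range(len(p) - k + 1))
    return counts.most_common(n)


def min_errors(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    col = list(range(len(pattern) + 1))  # column for the empty text prefix
    best = col[-1]
    for ct in text:
        new = [0]  # an occurrence may start at any text position
        for i, cp in enumerate(pattern, 1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (cp != ct)))
        col = new
        best = min(best, col[-1])
    return best


def recount(kmer, prefixes, d):
    """counts[e] = number of prefixes holding kmer with at most e errors."""
    counts = [0] * (d + 1)
    for p in prefixes:
        errors = min_errors(kmer, p)
        for e in range(errors, d + 1):
            counts[e] += 1
    return counts


toy = ["GGAATTCCGGTT", "GGAATTCGGTTT", "GGACTTCCGGTT", "TTTTTTTTTTTT"]
print(recount("AATTCCGG", toy, 2))
```

On real data the same two calls would be made with k = 16, n = 500 and d = 2 on the read prefixes, at a much higher cost than an index-based search.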
Counting result example

Exact k-mers (k = 16, d = 0)
K-mer             Count
TTCAGTTACGTATTGC  2761
TCAGTTACGTATTGCT  2716
CTATCTTCGGCGTCTG  2628
TCTATCTTCGGCGTCT  2628
CTTCGTTCAGTTACGT  2626
CGTTCAGTTACGTATT  2612
GTTCAGTTACGTATTG  2567
CTCTATCTTCGGCGTC  2563
GCTCTATCTTCGGCGT  2509
CTGTCGCTCTATCTTC  2491

K-mers with errors (k = 16, d = 2)
K-mer             0 err  1 err  2 err
TTCAGTTACGTATTGC  2761   4403   5844
CTTCGTTCAGTTACGT  2626   4324   6002
CGTTCAGTTACGTATT  2612   4420   5905
TCAGTTACGTATTGCT  2716   4361   5813
GTTCAGTTACGTATTG  2567   4423   5837
TCGTTCAGTTACGTAT  2359   4276   5895
TTCGTTCAGTTACGTA  2447   4048   5591
CTATCTTCGGCGTCTG  2628   3999   4775
TCTATCTTCGGCGTCT  2628   3934   4748
CTCTATCTTCGGCGTC  2563   3900   4649
k-mer ranks plot
How adapter sequence is built
Rank  K-mer (k = 16)
1     TTCAGTTACGTATTGC
2     CTTCAGTTACGTATTG
3     CGTTCAGTTACGTATT
4     TCAGTTACGTATTGCT
5     GTTCAGTTACGTATTG
6     GTTACGTATTGCTGTT
7     TTACGTATTGCTGTTC
8     CAGTTACGTATTGCTG
9     TACGTATTGCTGTTCT
10    CTCTATCTTCGGCGTC
11    AGTTACGTATTGCTGT
Overlapping k-mers:
CTTCAGTTACGTATTG.......
.TTCAGTTACGTATTGC......
..TCAGTTACGTATTGCT.....
...CAGTTACGTATTGCTG....
....AGTTACGTATTGCTGT...
.....GTTACGTATTGCTGTT..
......TTACGTATTGCTGTTC.
.......TACGTATTGCTGTTCT
Resulting sequence:
CTTCAGTTACGTATTGCTGTTCT
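The slide does not spell out the reconstruction rule. One plausible reading, sketched below as an assumption, is a greedy walk: start from the top-ranked k-mer and extend it to the right, then to the left, with the best-ranked unused k-mer that overlaps the current end by k - 1 bases. The function name `assemble` is hypothetical.

```python
def assemble(ranked_kmers):
    """Greedily extend the top-ranked k-mer through (k-1)-base overlaps."""
    contig = ranked_kmers[0]
    k = len(contig)
    unused = list(ranked_kmers[1:])  # kept in rank order, best first

    # Extend to the right: the next k-mer must start with the last k-1 bases.
    while True:
        tail = contig[-(k - 1):]
        nxt = next((km for km in unused if km[:-1] == tail), None)
        if nxt is None:
            break
        unused.remove(nxt)
        contig += nxt[-1]

    # Extend to the left: the previous k-mer must end with the first k-1 bases.
    while True:
        head = contig[:k - 1]
        prv = next((km for km in unused if km[1:] == head), None)
        if prv is None:
            break
        unused.remove(prv)
        contig = prv[0] + contig

    return contig


# K-mers in the order they are listed on the slide.
slide_kmers = [
    "TTCAGTTACGTATTGC", "CTTCAGTTACGTATTG", "CGTTCAGTTACGTATT",
    "TCAGTTACGTATTGCT", "GTTCAGTTACGTATTG", "GTTACGTATTGCTGTT",
    "TTACGTATTGCTGTTC", "CAGTTACGTATTGCTG", "TACGTATTGCTGTTCT",
    "CTCTATCTTCGGCGTC", "AGTTACGTATTGCTGT",
]
print(assemble(slide_kmers))
```

When two k-mers could extend the same end, as ranks 2 and 5 do on the left here, this sketch keeps the better-ranked one; other tie-breaking rules are possible.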
Dataset 1
● Consortium ANR ASTER
– Algorithms and software for third generation sequencing
● Prep and sequencing: Genoscope
– Species: Mouse (Mus musculus)
– Tissue: brain
– Sample type: 1D cDNA
– Flowcell: R9.4
– ENA/SRA: PRJEB25574
Experiment
● Sample size
– 10,000 reads, first 100 bases
– k = 16, d = 2
● Run the workflow on 100 samples
● Compare results for both counting methods (k-mers with and without errors)
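The sampling protocol above could be sketched as follows, assuming reads are drawn uniformly without replacement inside each sample and that samples are drawn independently. The slides do not state either point, and `draw_samples` is a hypothetical helper.

```python
import random


def draw_samples(reads, n_samples=100, sample_size=10_000, window=100, seed=0):
    """Yield n_samples lists holding the first `window` bases of sampled reads."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    for _ in range(n_samples):
        # rng.sample draws without replacement and needs sample_size <= len(reads)
        yield [read[:window] for read in rng.sample(reads, sample_size)]


# Toy usage with much smaller numbers than in the experiment.
toy_reads = ["ACGT" * 5 + "N" * i for i in range(50)]
samples = list(draw_samples(toy_reads, n_samples=3, sample_size=10, window=8, seed=1))
print(len(samples), len(samples[0]), samples[0][0])
```

Each sample can then be passed to either counting method so that the two sets of inferred adapters can be compared.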
Results with exact k-mers
Results with approximate k-mers
The MEME approach
● MEME: Multiple EM for Motif Elicitation
Bailey and Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, ISMB 1994.
● Experiment
– 1000 random reads, first 100 bases
– Repeated on 5 samples
MEME results
Results (figure panels: Exact k-mers, Approximate k-mers, MEME)
Dataset 2
● Nanopore WGS consortium
– Oxford Nanopore human reference datasets https://github.com/nanopore-wgs-consortium/NA12878
● Data from RNA project
– https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md
– Sample type: 1D cDNA
– Cell line: GM12878 human cell line (Ceph/Utah pedigree)
– Kit: SQK-PCS108
– Flowcell: R9.4
– File: Bham_Run1_20171115_1D.pass.depup.fastq
● Same experiment
Exact k-mers results
67 different adapters found
Approximate k-mers results (WGS)
PCR adapters 3 (start) ← T-ACTTGCCTGTCGCTCTATCTTC
Approximate k-mers results (WGS)
(figure panels: Approximate k-mers, Exact k-mers)
Implementation
● C++ with SeqAn library (optimal search schemes)
● Computation time for 10k reads
– Exact k-mers: < 1 second
– K-mers with errors: 10-20 seconds
– MEME: > 180 h
Porechop wrapper
● Integration in Porechop (Python)
A custom wrapper* allows easy integration into the Porechop workflow by adding the inferred adapters to the adapter database
● Two case studies:
→ discovering the adapter sequence if it is unknown
→ checking that a known adapter is present and correctly sequenced (quality check)
● C++ and Python code available on demand
Conclusion
● Our goal was to test the efficiency of 01*0 seeds with a k-mer approach on noisy reads
● Our experiment showed
– More consistent results with approximate k-mers
– Practical running time for real data
→ k-mers with errors can improve results at low cost
● Could they be used as an alternative to minimizers?
End
Thank you for your attention