
Alignment of High-Throughput Sequencing Data Inside In-Memory Databases
D. FIRNKORN, P. KNAUP, J. LORENZO BERMEJO, M. GANZINGER
Institute of Medical Biometry and Informatics, Heidelberg University, Germany

Motivation
Terabytes of data produced by NGS platforms each day
• Adequate analysis of high-throughput data
• DNA alignment, variant calling and annotation more time-consuming than DNA sequencing
[Figure: DNA sequencing ~2-7 hours; analysis ~1-2 days]
Introduction | Methods | Results | Discussion

In-Memory Computing
• Data, procedures, etc. are kept in main memory
• Computing operations within the database itself
• No IO between application and database layer
[Figure: working unit and IO unit, hard disk drive versus main memory]
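The idea of keeping data in main memory and computing inside the database can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in here (the slides use SAP HANA and MySQL), and the table name `reads` and its contents are made up for illustration.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM.
# SQLite is a stand-in here, not one of the systems evaluated in the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO reads (seq) VALUES (?)",
    [("ACGT",), ("GGCATT",), ("TTAGGCA",)],  # hypothetical example reads
)

# The aggregation runs inside the database engine; only the single
# result row crosses over to the application.
total_bases, longest = conn.execute(
    "SELECT SUM(LENGTH(seq)), MAX(LENGTH(seq)) FROM reads"
).fetchone()
print(total_bases, longest)
```

The point of the sketch is the division of labour: the application sends one statement and receives one row, rather than pulling every read out of the database to process it.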

Objective
• Investigation of in-memory databases for DNA alignment
  – SAP HANA appliance
  – MySQL with in-memory engine
• Development of stored procedures for alignment
  – Test case: Burrows-Wheeler Aligner (BWA)
  – Performance of both systems evaluated

Data Transformation and Bulk Load Process
[Figure: data transformation and bulk load process]

Methods and Tools
• Reference genome precalculations for BWA:
  – Construction of suffix array (SA)
  – Construction of Burrows-Wheeler transform (BWT)
• Development of stored procedures for alignment:
  – First within MySQL for testing purposes
  – Porting to SAP HANA, syntax adaptation
• System information:
  – Amazon EC2 cloud, m2.xlarge (17 GB main memory)
  – SAP HANA and MySQL running on the same system
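The two reference precalculations named above can be sketched in a few lines of Python. This is a naive illustration on a toy string, not the authors' pipeline: sorting whole suffixes is far too slow and memory-hungry for a real genome, and the function names and the `$` sentinel are assumptions made for the example.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: list[int]) -> str:
    """BWT: the character preceding each sorted suffix.

    For the suffix starting at 0, index -1 wraps around to the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


# Toy reference; "$" is a sentinel that sorts before A, C, G and T.
reference = "GATTACA$"
sa = suffix_array(reference)
bwt = bwt_from_sa(reference, sa)
print(sa)   # [7, 6, 4, 1, 5, 0, 3, 2]
print(bwt)  # ACTGA$TA
```

Both results depend only on the reference, which is why they can be computed once and bulk-loaded into database tables before any read is aligned.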

Process
[Figure: process overview]

Exact Matching: Performance Comparison
[Figure: performance comparison, axis "Time in Seconds", annotations "24.6 fold" and "29.8 fold"]
After ~2.5 hours: execution error in MySQL
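For readers unfamiliar with BWT-based exact matching, the following Python sketch shows the standard backward-search idea on a toy reference. It is not the stored-procedure code that was benchmarked; the function names, the dense `occ` table and the naive index construction are simplifications assumed for the example.

```python
from collections import Counter


def build_index(text: str):
    """Build suffix array, C table and occurrence table for backward search."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    # C[c]: number of characters in the text that are strictly smaller than c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def exact_match(read: str, sa, C, occ) -> list[int]:
    """Return all start positions of read in the indexed text."""
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    for ch in reversed(read):  # extend the match one character to the left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:  # interval became empty: no occurrence
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_index("GATTACA$")
print(exact_match("TA", sa, C, occ))  # [3]
print(exact_match("A", sa, C, occ))   # [1, 4, 6]
print(exact_match("GG", sa, C, occ))  # []
```

Each read character costs two table lookups, so the work per read grows with read length rather than reference length. In a database setting the `C`, `occ` and suffix-array tables would be the preloaded data that a stored procedure queries.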

Memory Allocation
a) HANA installation alone: 76 %
b) Including reference genome: 99.5 %
MySQL: main memory fully allocable with data

Comparison

MySQL
+ Open source
+ Recursive procedure calls
– No data compression
± MEMORY engine, only data in main memory

SAP HANA
+ Compression techniques
– No recursive procedure call
– Expensive licensing
± Column store engine, everything in main memory

Conclusion
• Proof of concept: DNA alignment inside in-memory databases
• Implementation and comparison of stored procedures for exact DNA read matching
  – SAP HANA technology faster
  – Installation without data needs much memory
  – Inexact matching only in MySQL

Outlook
• Algorithm optimization
  – Iterative BWA
  – Scores for match, mismatch and gaps
  – Seeding
• SA generation as stored procedure
• Examine other free in-memory databases

DNA Sequencing Cost and Speed
[Figure: DNA sequencing cost and speed]

Column-Store Tables
• SAP HANA consists of row and column engines
• Tables have been created within the column engine
• Faster read operations due to compression and better data access
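Why a column layout compresses well can be illustrated with two generic techniques, dictionary encoding and run-length encoding. The slides do not say which compression schemes were applied to the alignment tables, so the Python sketch below shows the general principle only; the function names and sample column are invented for the example.

```python
def dictionary_encode(column):
    """Replace each value by a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in column]


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


# Hypothetical column holding one character per row.
column = ["A", "A", "A", "C", "C", "G", "G", "G", "G", "T"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary)  # ['A', 'C', 'G', 'T']
print(runs)        # [(0, 3), (1, 2), (2, 4), (3, 1)]
```

Because all values of one column are stored together, repeated values sit next to each other and collapse into a few runs, and a scan reads less memory than it would over row-wise records.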

Suffix-Array Computation
[Figure: suffix-array computation]
