
High Performance Computing for DNA Sequence Alignment and Assembly

Michael C. Schatz, May 18, 2010, Stone Ridge Technology


  1. High Performance Computing for DNA Sequence Alignment and Assembly
     Michael C. Schatz, May 18, 2010, Stone Ridge Technology

  2. Outline
     1. Sequence Analysis by Analogy
     2. DNA Sequencing and Genomics
     3. High Performance Sequence Analysis
        1. Read Mapping
        2. Mapping & Genotyping
        3. Genome Assembly

  3. Shredded Book Reconstruction
     • Dickens accidentally shreds the first printing of A Tale of Two Cities
       – Text printed on 5 long spools
     [Figure: five copies of the opening line, "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …", each cut into short fragments at different positions]
     • How can he reconstruct the text?
       – 5 copies x 138,656 words / 5 words per fragment = 138k fragments
       – The short fragments from every copy are mixed together
       – Some fragments are identical
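
As a rough illustration of the setup on this slide, the shredding can be simulated in a few lines of Python. The function name `shred`, the random position of the first cut in each copy, and the fixed seed are assumptions made for this sketch, not details from the talk.

```python
import random

def shred(text, copies=5, words_per_fragment=5, seed=0):
    """Cut several copies of `text` into fragments of at most
    `words_per_fragment` words and mix the fragments together."""
    rng = random.Random(seed)
    words = text.split()
    fragments = []
    for _ in range(copies):
        # Assumption: each copy is cut at a different starting position.
        first_cut = rng.randrange(1, words_per_fragment + 1)
        cuts = [0, *range(first_cut, len(words), words_per_fragment), len(words)]
        fragments.extend(" ".join(words[a:b]) for a, b in zip(cuts, cuts[1:]))
    rng.shuffle(fragments)  # fragments from every copy are mixed together
    return fragments

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,")
fragments = shred(opening)

# The slide's estimate: 5 copies x 138,656 words / 5 words per fragment.
estimated_fragments = 5 * 138_656 // 5
```

Because the copies are cut at different positions, fragments from different copies overlap, which is what makes reconstruction possible at all.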

  4. Greedy Reconstruction
     [Figure: the 5-word fragments in sorted order, next to a greedy chain of overlapping fragments: "It was the best of", "was the best of times,", "the best of times, it", "best of times, it was", "of times, it was the", followed by either "times, it was the worst" or "times, it was the age"]
     The repeated sequences make the correct reconstruction ambiguous
     • It was the best of times, it was the [worst/age]
     Model sequence reconstruction as a graph problem.
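
The ambiguity on this slide can be reproduced with a simplified greedy extension. The sketch below assumes the fragments are every 5-word window of the opening line and extends the text one word at a time while exactly one next word is possible; `five_word_windows` and `greedy_extend` are hypothetical helper names, and this is not necessarily the procedure used in the talk.

```python
def five_word_windows(text, size=5):
    """Every run of `size` consecutive words, as a tuple of words."""
    words = text.split()
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

def greedy_extend(fragments, start):
    """Extend `start` word by word using fragments that overlap the current
    end by all but one word. Stop when the next word is missing or ambiguous.
    Returns (words so far, candidate next words)."""
    overlap = len(start) - 1
    words = list(start)
    # The length guard prevents endless looping on cyclic inputs.
    while len(words) < len(fragments) + len(start):
        tail = tuple(words[-overlap:])
        next_words = sorted({f[-1] for f in fragments if f[:overlap] == tail})
        if len(next_words) != 1:
            return words, next_words
        words.append(next_words[0])
    return words, []

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,")
fragments = five_word_windows(opening)
words, choices = greedy_extend(fragments, fragments[0])
```

Here `words` stops at "It was the best of times, it was the" and `choices` holds both "age" and "worst", the same [worst/age] fork shown on the slide.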

  5. de Bruijn Graph Construction
     • D_k = (V, E)
     • V = All length-k subfragments (k < l)
     • E = Directed edges between consecutive subfragments
     • Nodes overlap by k-1 words
     [Figure: the original fragment "It was the best of" becomes the directed edge "It was the best" -> "was the best of"]
     • Locally constructed graph reveals the global sequence structure
     • Overlaps between sequences implicitly computed
     (de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001)
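
A minimal sketch of this construction over word fragments, assuming k = 4 as in the slide's example. The function name `de_bruijn_edges` and the adjacency-set representation are choices made for illustration.

```python
from collections import defaultdict

def de_bruijn_edges(fragments, k):
    """Map each k-word node to the set of nodes that directly follow it
    inside some fragment. Consecutive nodes overlap by k-1 words."""
    graph = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        for i in range(len(words) - k):
            src = tuple(words[i:i + k])
            dst = tuple(words[i + 1:i + k + 1])
            graph[src].add(dst)
    return graph

graph = de_bruijn_edges(
    ["It was the best of",
     "times, it was the worst",
     "times, it was the age"],
    k=4,
)
```

Each fragment is processed on its own, yet the node "times, it was the" ends up with two successors ("it was the worst" and "it was the age"). No pairwise overlap computation was needed to discover that branch.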

  6. de Bruijn Graph Assembly
     [Figure: de Bruijn graph with 4-word nodes: "It was the best", "was the best of", "the best of times,", "best of times, it", "of times, it was", "times, it was the", "it was the worst", "was the worst of", "the worst of times,", "worst of times, it", "it was the age", "was the age of", "the age of wisdom,", "age of wisdom, it", "of wisdom, it was", "wisdom, it was the", "the age of foolishness"]
     A unique Eulerian tour of the graph reconstructs the original text.
     If a unique tour does not exist, try to simplify the graph as much as possible.

  7. de Bruijn Graph Assembly
     [Figure: simplified graph with merged nodes: "It was the best of times, it", "of times, it was the", "it was the worst of times, it", "it was the age of", "the age of wisdom, it was the", "the age of foolishness"]
     A unique Eulerian tour of the graph reconstructs the original text.
     If a unique tour does not exist, try to simplify the graph as much as possible.
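
One common simplification is to merge chains of nodes that have no branching. The sketch below builds the 4-word graph from the opening line and merges node u into its successor v whenever v is the only successor of u and u is the only predecessor of v. The names `build_graph` and `compress` are hypothetical, and the talk does not say that this exact procedure produced the slide.

```python
def build_graph(words, k):
    """Successor and predecessor sets for every k-word node of the text."""
    succ, pred = {}, {}
    for i in range(len(words) - k):
        src = tuple(words[i:i + k])
        dst = tuple(words[i + 1:i + k + 1])
        succ.setdefault(src, set()).add(dst)
        pred.setdefault(dst, set()).add(src)
    return succ, pred

def compress(succ, pred):
    """Merge non-branching chains of nodes into single longer nodes.
    Limitation of this sketch: an isolated cycle with no branch is skipped."""
    def mergeable(u, v):
        return succ.get(u, set()) == {v} and pred.get(v, set()) == {u}

    merged = []
    for start in set(succ) | set(pred):
        before = pred.get(start, set())
        if len(before) == 1 and mergeable(next(iter(before)), start):
            continue  # interior of a chain, reached from its first node
        chain, node = list(start), start
        while len(succ.get(node, ())) == 1:
            nxt = next(iter(succ[node]))
            if not mergeable(node, nxt):
                break
            chain.append(nxt[-1])  # the next node adds exactly one new word
            node = nxt
        merged.append(" ".join(chain))
    return sorted(merged)

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,")
succ, pred = build_graph(opening.split(), k=4)
simplified = compress(succ, pred)
```

On this input the 17 original nodes collapse into the six merged nodes shown in the figure; the remaining branches at "of times, it was the" and "it was the age of" are the genuine ambiguities.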

  8. Shredded Book Mapping
     • Dickens searches for misprints in the shredded copies
       – Find the best match for each fragment
       – Has to account for random and systematic variations
     [Figure: the reference line "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …" aligned against five shredded copies. All five copies read "wurst" and "folishness"; single copies read "wissdom", "bist", "tines" and "ige". Labels: Confirmed, Confirmed, Mismatch, Deletion]
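
Finding the best match for a fragment while tolerating misprints can be sketched with a standard edit-distance dynamic program in which the match may start anywhere in the reference. This is a simple quadratic-time illustration, not one of the indexed read mappers the talk goes on to discuss; the function name `best_match` is hypothetical.

```python
def best_match(fragment, reference):
    """Return (edit_distance, end_index) of the best approximate occurrence
    of `fragment` inside `reference`, counting mismatches, insertions and
    deletions of single characters."""
    # prev[j]: fewest edits to align the fragment prefix so that it ends at
    # reference[:j]. Row 0 is all zeros: a match may start anywhere.
    prev = [0] * (len(reference) + 1)
    for i, fc in enumerate(fragment, start=1):
        curr = [i]
        for j, rc in enumerate(reference, start=1):
            curr.append(min(
                prev[j - 1] + (fc != rc),  # match or mismatch
                prev[j] + 1,               # extra character in the fragment
                curr[j - 1] + 1,           # character missing from the fragment
            ))
        prev = curr
    distance = min(prev)
    return distance, prev.index(distance)

reference = "it was the worst of times"
exact = best_match("worst", reference)
misprint = best_match("the wurst of", reference)
deletion = best_match("age of folishness", "it was the age of foolishness")
```

A misprint seen in only one copy is probably a random error, while one that appears in every copy covering that position (such as "wurst" in the figure) points to a systematic difference.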

  9. Genomics and Evolution
     Your genome influences (almost) all aspects of your life
       – Anatomy & Physiology: 10 fingers & 10 toes, organs, neurons
       – Diseases: Sickle Cell Anemia, Down Syndrome, Cancer
       – Psychological: Intelligence, Personality, Bad Driving
       – Genome as a recipe, not a blueprint
     Like Dickens, we can only sequence small fragments of the genome

  10. DNA Sequencing
      The genome of an organism encodes the genetic information in a long sequence of 4 DNA nucleotides: ACGT
        – Bacteria: ~3 million bp
        – Humans: ~3 billion bp
      Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads
        – Per-base error rate estimated at 1-2% (Simpson et al., 2009)
        – Sequences originate from random positions of the genome
        – Base calling transforms raw images into DNA sequences
      Recent studies of entire human genomes analyzed 3.3B (Wang et al., 2008) & 4.0B (Bentley et al., 2008) 36bp reads
        – ~100 GB of compressed sequence data
      [Figure: example reads ATCTGATAAGTCCCAGGACTTCAGT, GCAAGGCAAACCCGAGCCCAGTTT, TCCAGTTCTAGAGTTTCACATGATC, GGAGTTAGTAAAAGTCCACATTGAG]
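
The read counts on this slide translate into total sequenced bases and an average depth over the genome with simple arithmetic. The sketch below uses the slide's round figure of ~3 billion bp for the human genome; "fold coverage" is the usual name for the resulting ratio and does not appear on the slide, and `fold_coverage` is a hypothetical helper.

```python
HUMAN_GENOME_BP = 3_000_000_000  # the slide's approximate figure
READ_LENGTH_BP = 36

def fold_coverage(num_reads, read_length_bp=READ_LENGTH_BP,
                  genome_size_bp=HUMAN_GENOME_BP):
    """Average number of reads covering each genome position, assuming the
    reads are spread uniformly over the genome."""
    return num_reads * read_length_bp / genome_size_bp

wang_total_bp = 3_300_000_000 * READ_LENGTH_BP      # 3.3B reads
bentley_total_bp = 4_000_000_000 * READ_LENGTH_BP   # 4.0B reads
wang_coverage = fold_coverage(3_300_000_000)
bentley_coverage = fold_coverage(4_000_000_000)
```

Under these assumptions the two studies sequenced roughly 119 and 144 billion bases, or about 40x and 48x the length of the genome.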

  11. The Evolution of DNA Sequencing
      Year  Genome            Technology         Cost
      2001  Venter et al.     Sanger (ABI)       $300,000,000
      2007  Levy et al.       Sanger (ABI)       $10,000,000
      2008  Wheeler et al.    Roche (454)        $2,000,000
      2008  Ley et al.        Illumina           $1,000,000
      2008  Bentley et al.    Illumina           $250,000
      2009  Pushkarev et al.  Helicos            $48,000
      2009  Drmanac et al.    Complete Genomics  $4,400
      (Pushkarev et al., 2009)
      Critical Computational Challenges: Alignment and Assembly of Huge Datasets

  12. Why HPC?
      • Moore’s Law is valid in 2010
        – But CPU speed is flat
        – Vendors adopting parallel solutions instead
      • Parallel Environments
        – Many cores, including GPUs
        – Many computers
        – Many disks
      • Why parallel
        – Need results faster
        – Doesn’t fit on one machine
      The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
      Herb Sutter, http://www.gotw.ca/publications/concurrency-ddj.htm

  13. Hadoop MapReduce
      • MapReduce is the parallel distributed framework invented by Google for large data computations.
        – Data and computations are spread over thousands of computers, processing petabytes of data each day (Dean and Ghemawat, 2004)
        – Indexing the Internet, PageRank, Machine Learning, etc…
        – Hadoop is the leading open source implementation
      • Benefits
        – Scalable, Efficient, Reliable
        – Easy to Program
        – Runs on commodity computers
      • Challenges
        – Redesigning / Retooling applications
        – Not Condor, Not MPI
        – Everything in MapReduce

  14. K-mer Counting
      • Application developers focus on 2 (+1 internal) functions
        – Map: input -> key:value pairs
        – Shuffle: Group together pairs with same key
        – Reduce: key, value-lists -> output
      Map, Shuffle & Reduce All Run in Parallel
      [Figure: counting 3-mers in three reads]
      map:
        ATGAACCTTA -> (ATG:1) (TGA:1) (GAA:1) (AAC:1) (ACC:1) (CCT:1) (CTT:1) (TTA:1)
        GAACAACTTA -> (GAA:1) (AAC:1) (ACA:1) (CAA:1) (AAC:1) (ACT:1) (CTT:1) (TTA:1)
        TTTAGGCAAC -> (TTT:1) (TTA:1) (TAG:1) (AGG:1) (GGC:1) (GCA:1) (CAA:1) (AAC:1)
      shuffle:
        ACA -> 1; ATG -> 1; CAA -> 1,1; GCA -> 1; TGA -> 1; TTA -> 1,1,1; ACT -> 1; AGG -> 1; CCT -> 1; GGC -> 1; TTT -> 1
        AAC -> 1,1,1,1; ACC -> 1; CTT -> 1,1; GAA -> 1,1; TAG -> 1
      reduce:
        ACA:1 ATG:1 CAA:2 GCA:1 TGA:1 TTA:3 ACT:1 AGG:1 CCT:1 GGC:1 TTT:1
        AAC:4 ACC:1 CTT:2 GAA:2 TAG:1
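
The three phases can be mimicked in a single Python process to show the data flow. This is only an illustration of the programming model, not Hadoop's API; the function names `map_kmers`, `shuffle` and `reduce_counts` are made up for the sketch, and a real job would run each phase in parallel across many machines.

```python
from collections import defaultdict

def map_kmers(read, k=3):
    """Map: one read -> a (k-mer, 1) pair for every position in the read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Shuffle: group together the values of all pairs with the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: key and its list of values -> total count for that k-mer."""
    return key, sum(values)

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
pairs = [pair for read in reads for pair in map_kmers(read)]
counts = dict(reduce_counts(kmer, ones) for kmer, ones in shuffle(pairs).items())
```

Map and reduce are the two functions an application developer writes; the shuffle is the internal step the framework provides. Since each map call sees one read and each reduce call sees one key, both phases can be split across workers without coordination.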
