

  1. Using Cloud Technologies for Bioinformatics Applications. MTAGS Workshop, SC09, Portland, Oregon, November 16, 2009. Judy Qiu, xqiu@indiana.edu, http://salsaweb/salsa. Community Grids Laboratory, Pervasive Technology Institute, Indiana University.

  2. Collaborators in SALSA Project
     • Indiana University (SALSA Technology Team, Community Grids Lab): Geoffrey Fox, Judy Qiu, Scott Beason, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Saliya Ekanayake
     • Microsoft Research (Technology Collaboration): Azure (Clouds): Dennis Gannon, Roger Barga; Dryad (Parallel Runtime): Christophe Poulain; CCR (Threading): George Chrysanthakopoulos; DSS (Services): Henrik Frystyk Nielsen
     • Applications: Bioinformatics, CGB and UITS RT – PTI: Haixu Tang, Mina Rho, Peter Cherbas, Qunfeng Dong; IU Medical School: Gilbert Liu; Demographics (Polis Center): Neil Devadasan; Cheminformatics: David Wild, Qian Zhu; Physics: CMS group at Caltech (Julian Bunn)

  3. Convergence is Happening
     • Data intensive applications involve three basic activities: capture, curation, and analysis (visualization)
     • Data intensive paradigms converge on clouds (cloud infrastructure and runtimes) and on multicore (parallel threading and processes)

  4. MapReduce "File/Data Repository" Parallelism
     • Map = (data parallel) computation reading and writing data
     • Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram
     • Communication via messages/files
     • [Diagram: instruments and portals/users feed data on disks; Map 1, Map 2, Map 3 run on computers/disks and feed a Reduce stage]
     (A minimal sketch of the map/reduce histogram example follows below.)
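To make the map/reduce split concrete, here is a minimal, self-contained Python sketch, not tied to Hadoop or Dryad, of the histogram example above: data-parallel map tasks emit partial counts, and a reduce/consolidation phase forms the global sums. The function names are illustrative only.

    from collections import Counter
    from multiprocessing import Pool

    def map_partition(lines):
        # Map: data-parallel computation over one partition of the input,
        # emitting partial (key, count) sums.
        counts = Counter()
        for line in lines:
            for word in line.split():
                counts[word] += 1
        return counts

    def reduce_counts(partials):
        # Reduce: collective/consolidation phase forming global sums,
        # the "multiple global sums as in histogram" case from the slide.
        total = Counter()
        for part in partials:
            total.update(part)
        return total

    if __name__ == "__main__":
        data = ["a b a", "b c", "a c c"]        # stand-in for files on disk
        partitions = [data[0:2], data[2:3]]      # Map 1, Map 2, ...
        with Pool(2) as pool:
            partials = pool.map(map_partition, partitions)
        print(reduce_counts(partials))           # Counter({'a': 3, 'c': 3, 'b': 2})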

  5. Cluster Configurations
     Feature | GCB-K18 @ MSR | iDataplex @ IU | Tempest @ IU
     CPU | Intel Xeon L5420, 2.50 GHz | Intel Xeon L5420, 2.50 GHz | Intel Xeon E7450, 2.40 GHz
     # CPUs / # cores per node | 2 / 8 | 2 / 8 | 4 / 24
     Memory | 16 GB | 32 GB | 48 GB
     # Disks | 2 | 1 | 2
     Network | Gigabit Ethernet | Gigabit Ethernet | Gigabit Ethernet / 20 Gbps Infiniband
     Operating system | Windows Server Enterprise, 64-bit | Red Hat Enterprise Linux Server, 64-bit | Windows Server Enterprise, 64-bit
     # Nodes used | 32 | 32 | 32
     Total CPU cores used | 256 | 256 | 768
     Runtimes (Hadoop / Dryad / MPI) | DryadLINQ | Hadoop | DryadLINQ / MPI

  6. Dynamic Virtual Cluster Architecture
     • Applications (using DryadLINQ): Smith Waterman dissimilarities, CAP-3 gene assembly, PhyloD, High Energy Physics, clustering, multidimensional scaling, generative topographic mapping
     • Runtimes: Apache Hadoop / MapReduce++ / MPI, Microsoft DryadLINQ / MPI
     • Infrastructure software: Linux bare-system, Linux virtual machines (Xen virtualization), Windows Server 2008 HPC bare-system, Windows Server 2008 HPC on Xen virtualization; XCAT infrastructure
     • Hardware: iDataplex bare-metal nodes
     • Dynamic virtual cluster provisioning via XCAT; supports both stateful and stateless OS images

  7. Cloud Computing: Infrastructure and Runtimes
     • Cloud infrastructure: outsourcing of servers, computing, data, file space, etc.
       – Handled through Web services that control virtual machine lifecycles.
     • Cloud runtimes: tools (for using clouds) to do data-parallel computations.
       – Apache Hadoop, Google MapReduce, Microsoft Dryad, and others
       – Designed for information retrieval, but excellent for a wide range of science data analysis applications
       – Can also do much traditional parallel computing for data-mining if extended to support iterative operations
       – Not usually run on virtual machines

  8. Alu and Sequencing Workflow
     • Data is a collection of N sequences, 100's of characters long
       – These cannot be thought of as vectors because there are missing characters
       – "Multiple Sequence Alignment" (creating vectors of characters) doesn't seem to work if N is larger than O(100)
     • Can calculate N^2 dissimilarities (distances) between sequences (all pairs)
     • Find families by clustering (much better methods than K-means); as there are no vectors, use vector-free O(N^2) methods
     • Map to 3D for visualization using Multidimensional Scaling (MDS), also O(N^2)
     • N = 50,000 runs in 10 hours (all of the above) on 768 cores
     • Our collaborators just gave us 170,000 sequences and want to look at 1.5 million; we will develop new algorithms!
     • MapReduce++ will do all steps, as MDS and clustering just need MPI Broadcast/Reduce
     (A small-scale sketch of this pairwise-distance + MDS pipeline follows below.)
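As a small-scale, hedged illustration of the pipeline above, the sketch below computes an all-pairs dissimilarity matrix for toy sequences, using a normalized edit distance as a stand-in for the alignment-based dissimilarity, and embeds it in 3D with scikit-learn's MDS on the precomputed matrix. It mirrors only the shape of the workflow, not the SALSA implementation or its scale.

    import numpy as np
    from sklearn.manifold import MDS

    def edit_distance(a, b):
        # Classic dynamic-programming Levenshtein distance (stand-in for an
        # alignment-based dissimilarity such as Smith-Waterman-Gotoh).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def dissimilarity_matrix(seqs):
        # O(N^2) all-pairs distances; only the upper triangle is computed.
        n = len(seqs)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dij = edit_distance(seqs[i], seqs[j]) / max(len(seqs[i]), len(seqs[j]))
                d[i, j] = d[j, i] = dij
        return d

    seqs = ["ACGTACGT", "ACGTTCGT", "TTGACCAA", "TTGACGAA", "GGGGCCCC"]
    d = dissimilarity_matrix(seqs)
    coords3d = MDS(n_components=3, dissimilarity="precomputed",
                   random_state=0).fit_transform(d)   # 3D points for visualization
    print(coords3d.shape)   # (5, 3)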

  9. Pairwise Distances: ALU Sequences
     • 125 million distances computed in 4 hours and 46 minutes
     • Calculate pairwise distances for a collection of genes (used for clustering, MDS)
     • O(N^2) problem
     • "Doubly data parallel" at the Dryad stage
     • Performance close to MPI
     • Performed on 768 cores (Tempest cluster)
     • Processes work better than threads when used inside vertices: 100% utilization vs. 70%
     • [Chart: execution time for DryadLINQ vs. MPI at 35,339 and 50,000 sequences]
     (A block-decomposition sketch of the "doubly data parallel" structure follows below.)
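"Doubly data parallel" here means the symmetric N x N distance matrix is partitioned into rectangular index blocks, and each block on or above the diagonal is an independent coarse-grained task. The sketch below shows that decomposition with Python processes, echoing the observation that processes beat threads inside vertices; pair_distance is a placeholder for the real alignment kernel, and all names are illustrative.

    from itertools import combinations_with_replacement
    from multiprocessing import Pool

    import numpy as np

    def pair_distance(a, b):
        # Placeholder dissimilarity; the real computation would be an
        # alignment such as Smith-Waterman-Gotoh.
        return sum(x != y for x, y in zip(a, b)) / max(len(a), len(b))

    def compute_block(args):
        # One task/vertex: all pairs whose indices fall in block (bi, bj).
        seqs, block, bi, bj = args
        rows = range(bi * block, min((bi + 1) * block, len(seqs)))
        cols = range(bj * block, min((bj + 1) * block, len(seqs)))
        return [(i, j, pair_distance(seqs[i], seqs[j]))
                for i in rows for j in cols if i <= j]

    def pairwise_blocked(seqs, block=2, workers=4):
        n_blocks = -(-len(seqs) // block)              # ceiling division
        tasks = [(seqs, block, bi, bj)
                 for bi, bj in combinations_with_replacement(range(n_blocks), 2)]
        d = np.zeros((len(seqs), len(seqs)))
        with Pool(workers) as pool:                    # processes, not threads
            for result in pool.map(compute_block, tasks):
                for i, j, dij in result:
                    d[i, j] = d[j, i] = dij            # fill both symmetric halves
        return d

    if __name__ == "__main__":
        seqs = ["ACGTACGT", "ACGTTCGT", "TTGACCAA", "TTGACGAA", "GGGGCCCC"]
        print(pairwise_blocked(seqs).round(2))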

  10. [Figure-only slide; no recoverable text]

  11. [Figure-only slide; no recoverable text]

  12. Hierarchical Subclustering [figure-only slide]

  13. Pairwise Clustering: 30,000 Points on Tempest
     • Clustering by deterministic annealing
     • [Chart: parallel overhead vs. degree of parallelism (1 to 744), comparing MPI and thread-based parallelism]
     (A simplified annealing sketch follows below.)
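The pairwise, vector-free deterministic annealing clustering used here is too involved for a short sketch, but the annealing idea itself can be illustrated on vector data: soft cluster assignments at a temperature T, probability-weighted center updates, and gradual cooling. The code below is a simplified, assumption-laden stand-in, not the SALSA algorithm, and every parameter value is arbitrary.

    import numpy as np

    def da_clustering(x, k=3, t_start=5.0, t_min=0.01, cooling=0.9, iters=20, seed=0):
        # Annealed soft clustering on vectors (illustrative only; the SALSA
        # version works directly on pairwise dissimilarities).
        rng = np.random.default_rng(seed)
        centers = x[rng.choice(len(x), k, replace=False)].astype(float)
        t = t_start
        while t > t_min:
            for _ in range(iters):
                # Soft assignments p(k|i) proportional to exp(-||x_i - y_k||^2 / T)
                d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
                logp = -d2 / t
                logp -= logp.max(axis=1, keepdims=True)      # numerical stability
                p = np.exp(logp)
                p /= p.sum(axis=1, keepdims=True)
                # Centers are probability-weighted means of the points
                centers = (p.T @ x) / p.sum(axis=0)[:, None]
            t *= cooling                                     # anneal the temperature
        return centers, p.argmax(axis=1)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (0.0, 3.0, 6.0)])
        centers, labels = da_clustering(x)
        print(centers.round(2))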

  14. Dryad versus MPI for Smith Waterman
     • [Chart: performance of Dryad vs. MPI for SW-Gotoh alignment; time per distance calculation per core (milliseconds) against the number of sequences (0 to 60,000), for Dryad (replicated data), block-scattered MPI (replicated data), Dryad (raw data), space-filling-curve MPI (raw data), and space-filling-curve MPI (replicated data)]
     • Flat is perfect scaling
     (A sketch of the SW-Gotoh kernel behind these measurements follows below.)
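The per-pair kernel behind these measurements is local alignment with affine gap penalties (Smith-Waterman with Gotoh's recurrence). Below is a compact, hedged Python sketch of that scoring recurrence; the scoring parameters are arbitrary placeholders, and the experiments used optimized implementations rather than this code.

    def sw_gotoh_score(a, b, match=2, mismatch=-1, gap_open=-4, gap_extend=-1):
        # Local alignment score with affine gaps (Gotoh): three DP matrices,
        # H = best score ending in a match/mismatch, E = gap in a, F = gap in b.
        n, m = len(a), len(b)
        neg = float("-inf")
        h = [[0.0] * (m + 1) for _ in range(n + 1)]
        e = [[neg] * (m + 1) for _ in range(n + 1)]
        f = [[neg] * (m + 1) for _ in range(n + 1)]
        best = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                e[i][j] = max(e[i][j - 1] + gap_extend, h[i][j - 1] + gap_open)
                f[i][j] = max(f[i - 1][j] + gap_extend, h[i - 1][j] + gap_open)
                s = match if a[i - 1] == b[j - 1] else mismatch
                h[i][j] = max(0.0, h[i - 1][j - 1] + s, e[i][j], f[i][j])
                best = max(best, h[i][j])
        return best

    # A score can be turned into a dissimilarity, e.g. relative to the self-alignment scores.
    print(sw_gotoh_score("ACGTACGT", "ACGTTCGT"))   # higher score = more similar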

  15. Hadoop/Dryad Comparison: "Homogeneous" Data
     • [Chart: time per alignment (ms) vs. number of sequences (30,000 to 55,000) for Dryad and Hadoop]
     • Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex
     • Using real data with standard deviation/length = 0.1

  16. Hadoop/Dryad Comparison: Inhomogeneous Data I
     • Randomly distributed inhomogeneous data; mean sequence length 400, dataset size 10,000
     • [Chart: total time (s) vs. standard deviation of sequence length (0 to 300) for DryadLINQ SW-G, Hadoop SW-G, and Hadoop SW-G on VMs]
     • Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed
     • Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes)

  17. Hadoop/Dryad Comparison: Inhomogeneous Data II
     • Skewed distributed inhomogeneous data; mean sequence length 400, dataset size 10,000
     • [Chart: total time (s) vs. standard deviation of sequence length (0 to 300) for DryadLINQ SW-G, Hadoop SW-G, and Hadoop SW-G on VMs]
     • This shows the natural load balancing of Hadoop MapReduce's dynamic task assignment using a global pipeline, in contrast to DryadLINQ's static assignment
     • Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes)
     (A toy simulation contrasting static and dynamic task assignment follows below.)
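To illustrate why dynamic, pull-based task assignment copes better with skewed task lengths than a fixed up-front partitioning, here is a toy Python simulation. The task-time distribution, worker count, and scheduling models are assumptions for illustration only and do not reproduce the measured SW-G runs.

    import random

    def makespan_static(task_times, workers):
        # Static assignment: tasks are split into equal-sized contiguous chunks
        # up front (DryadLINQ-style partitioning in this toy model).
        chunk = -(-len(task_times) // workers)
        loads = [sum(task_times[w * chunk:(w + 1) * chunk]) for w in range(workers)]
        return max(loads)

    def makespan_dynamic(task_times, workers):
        # Dynamic assignment: each worker pulls the next task from a global
        # queue as soon as it becomes idle (Hadoop-style in this toy model).
        loads = [0.0] * workers
        for t in task_times:
            idx = loads.index(min(loads))   # the worker that frees up first
            loads[idx] += t
        return max(loads)

    random.seed(0)
    # Skewed task times: most tasks are short, a few are very long.
    tasks = [random.expovariate(1.0) ** 2 for _ in range(512)]
    print("static :", round(makespan_static(tasks, 32), 1))
    print("dynamic:", round(makespan_dynamic(tasks, 32), 1))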

  18. Hadoop VM Performance Degradation
     • [Chart: performance degradation on VM (Hadoop), 0% to 30%, vs. number of sequences (10,000 to 50,000)]
     • 15.3% degradation at the largest data set size
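The slide does not spell out the degradation metric; the usual convention, assumed here, is the relative slowdown of the virtualized run against the bare-metal run, so that a value of 0.153 corresponds to the quoted 15.3%.

    def vm_degradation(t_vm, t_bare_metal):
        # Relative performance degradation of a run on VMs vs. bare metal.
        return (t_vm - t_bare_metal) / t_bare_metal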

  19. PhyloD using Azure and DryadLINQ
     • Derive associations between HLA alleles and HIV codons, and between codons themselves

  20. Mapping of PhyloD to Azure [diagram slide]

  21. PhyloD Azure Performance
     • Number of active Azure workers during a run of the PhyloD application
     • Efficiency vs. number of worker roles in the PhyloD prototype run on the Azure March CTP

  22. Iterative Computations: K-means and Matrix Multiplication
     • [Charts: performance of K-means; parallel overhead of matrix multiplication]

  23. K-means Clustering: Time for 20 Iterations
     • An iteratively refining operation
     • New maps/reducers/vertices in every iteration and file-system-based communication lead to large overheads
     • Loop unrolling in DryadLINQ provides better performance
     • The overheads are extremely large compared to MPI
     • CGL-MapReduce is an example of MapReduce++: it supports the MapReduce model with iteration (data stays in memory and communication is via streams, not files)
     (A minimal map/reduce formulation of K-means follows below.)
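To see what gets re-executed in every pass, here is a hedged, in-memory Python sketch of K-means written as one map phase and one reduce phase per iteration; on Hadoop or DryadLINQ each such pass would spawn new tasks and move data through files, which is exactly the overhead the slide measures.

    import numpy as np

    def kmeans_map(points, centers):
        # Map: assign each point to its nearest center and emit
        # (center_id, (partial_sum, count)) pairs.
        out = {}
        for p in points:
            k = int(np.argmin(((centers - p) ** 2).sum(axis=1)))
            s, c = out.get(k, (np.zeros_like(p, dtype=float), 0))
            out[k] = (s + p, c + 1)
        return out

    def kmeans_reduce(partials, old_centers):
        # Reduce: merge partial sums/counts and emit new centers.
        centers = old_centers.copy()
        merged = {}
        for part in partials:
            for k, (s, c) in part.items():
                ms, mc = merged.get(k, (np.zeros_like(s), 0))
                merged[k] = (ms + s, mc + c)
        for k, (s, c) in merged.items():
            centers[k] = s / c
        return centers

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (0.0, 5.0)])
    splits = np.array_split(data, 4)                  # the "map" partitions
    centers = data[rng.choice(len(data), 2, replace=False)].astype(float)
    for _ in range(20):                               # 20 iterations, as in the slide
        centers = kmeans_reduce([kmeans_map(s, centers) for s in splits], centers)
    print(centers.round(2))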

  24. MapReduce++ (CGL-MapReduce)
     • Architecture: a driver program and the MR user program communicate over a pub/sub broker network with map workers (M) and reduce workers (R) hosted by an MRDaemon on each worker node; input data splits come from the file system
     • Streaming-based communication: intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating local files
     • Cacheable map/reduce tasks: static data remains in memory
     • Combine phase to combine reductions
     • The user program is the composer of MapReduce computations
     • Extends the MapReduce model to iterative computations
     (A generic sketch of this iterative pattern follows below.)
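The class below is a toy illustration of the pattern just described: static data splits are configured once and cached, map outputs stay in memory rather than going through files, and a combine step hands the merged result back to the driver loop each iteration. It is not the actual CGL-MapReduce API (which is Java); all class and method names here are made up for this sketch.

    class IterativeMapReduce:
        """Toy driver for the MapReduce++ pattern: cached static data + combine."""

        def __init__(self, map_fn, reduce_fn, combine_fn, static_splits):
            self.map_fn = map_fn
            self.reduce_fn = reduce_fn
            self.combine_fn = combine_fn
            self.static_splits = static_splits   # cached once, reused every iteration

        def run(self, variable_data, iterations, converged=lambda old, new: False):
            for _ in range(iterations):
                # Map tasks see the cached static data plus the small variable data;
                # outputs are kept in memory ("streamed"), never written to files.
                map_out = [self.map_fn(split, variable_data)
                           for split in self.static_splits]
                reduce_out = self.reduce_fn(map_out)
                new_data = self.combine_fn(reduce_out, variable_data)
                if converged(variable_data, new_data):
                    return new_data
                variable_data = new_data
            return variable_data

    # Paired with the K-means sketch above (assuming those definitions), a driver
    # call could look like:
    #   IterativeMapReduce(kmeans_map, lambda parts: parts,
    #                      lambda parts, old: kmeans_reduce(parts, old),
    #                      splits).run(centers, 20)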

  25. SALSA HPC Dynamic Virtual Cluster Hosting
     • Monitoring infrastructure over dynamically provisioned clusters running SW-G using Hadoop (Linux bare-system), SW-G using Hadoop (Linux on Xen VMs), and SW-G using DryadLINQ (Windows Server 2008 HPC bare-system)
     • Cluster switching from Linux bare-system, to Xen VMs, to Windows 2008 HPC, via the XCAT infrastructure on iDataplex bare-metal nodes (32 nodes)
     • SW-G: Smith Waterman Gotoh dissimilarity computation – a typical MapReduce style application

  26. Monitoring Infrastructure
     • A monitoring interface communicates over a pub/sub broker network with a summarizer and a switcher that manage the virtual/physical clusters provisioned by the XCAT infrastructure on the iDataplex bare-metal nodes (32 nodes)

  27. SALSA HPC Dynamic Virtual Clusters
