

  1. Using Cloud Technologies for Bioinformatics Applications. MTAGS Workshop, SC09, Portland, Oregon, November 16, 2009. Judy Qiu, xqiu@indiana.edu, http://salsaweb/salsa. Community Grids Laboratory, Pervasive Technology Institute, Indiana University.

  2. Collaborators in SALSA Project
     • Indiana University (SALSA Technology Team, Community Grids Lab): Geoffrey Fox, Judy Qiu, Scott Beason, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Saliya Ekanayake
     • Microsoft Research (Technology Collaboration): Azure (Clouds): Dennis Gannon, Roger Barga; Dryad (Parallel Runtime): Christophe Poulain; CCR (Threading): George Chrysanthakopoulos; DSS (Services): Henrik Frystyk Nielsen
     • Applications: Bioinformatics, CGB and UITS RT – PTI: Haixu Tang, Mina Rho, Peter Cherbas, Qunfeng Dong; IU Medical School: Gilbert Liu; Demographics (Polis Center): Neil Devadasan; Cheminformatics: David Wild, Qian Zhu; Physics: CMS group at Caltech (Julian Bunn)

  3. Convergence is Happening
     • Data intensive applications involve three basic activities: capture, curation, and analysis (visualization)
     • Data intensive paradigms converge on clouds (cloud infrastructure and runtimes) and on multicore (parallel threading and processes)

  4. MapReduce "File/Data Repository" Parallelism
     • Map = (data parallel) computation reading and writing data
     • Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram
     • Communication via messages/files
     • [Diagram: instruments and portals/users feed data on disks; Map 1, Map 2, Map 3 run on computers/disks and feed a Reduce stage]
     (A minimal sketch of the map/reduce histogram example follows below.)
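To make the map/reduce split concrete, here is a minimal, self-contained Python sketch, not tied to Hadoop or Dryad, of the histogram example above: data-parallel map tasks emit partial counts, and a reduce/consolidation phase forms the global sums. The function names are illustrative only.

    from collections import Counter
    from multiprocessing import Pool

    def map_partition(lines):
        # Map: data-parallel computation over one partition of the input,
        # emitting partial (key, count) sums.
        counts = Counter()
        for line in lines:
            for word in line.split():
                counts[word] += 1
        return counts

    def reduce_counts(partials):
        # Reduce: collective/consolidation phase forming global sums,
        # the "multiple global sums as in histogram" case from the slide.
        total = Counter()
        for part in partials:
            total.update(part)
        return total

    if __name__ == "__main__":
        data = ["a b a", "b c", "a c c"]        # stand-in for files on disk
        partitions = [data[0:2], data[2:3]]      # Map 1, Map 2, ...
        with Pool(2) as pool:
            partials = pool.map(map_partition, partitions)
        print(reduce_counts(partials))           # Counter({'a': 3, 'c': 3, 'b': 2})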

  5. Cluster Configurations
     Feature | GCB-K18 @ MSR | iDataplex @ IU | Tempest @ IU
     CPU | Intel Xeon L5420, 2.50 GHz | Intel Xeon L5420, 2.50 GHz | Intel Xeon E7450, 2.40 GHz
     # CPUs / # cores per node | 2 / 8 | 2 / 8 | 4 / 24
     Memory | 16 GB | 32 GB | 48 GB
     # Disks | 2 | 1 | 2
     Network | Gigabit Ethernet | Gigabit Ethernet | Gigabit Ethernet / 20 Gbps Infiniband
     Operating system | Windows Server Enterprise, 64-bit | Red Hat Enterprise Linux Server, 64-bit | Windows Server Enterprise, 64-bit
     # Nodes used | 32 | 32 | 32
     Total CPU cores used | 256 | 256 | 768
     Runtimes (Hadoop / Dryad / MPI) | DryadLINQ | Hadoop | DryadLINQ / MPI

  6. Dynamic Virtual Cluster Architecture
     • Applications (using DryadLINQ): Smith Waterman dissimilarities, CAP-3 gene assembly, PhyloD, High Energy Physics, clustering, multidimensional scaling, generative topographic mapping
     • Runtimes: Apache Hadoop / MapReduce++ / MPI, Microsoft DryadLINQ / MPI
     • Infrastructure software: Linux bare-system, Linux virtual machines (Xen virtualization), Windows Server 2008 HPC bare-system, Windows Server 2008 HPC on Xen virtualization; XCAT infrastructure
     • Hardware: iDataplex bare-metal nodes
     • Dynamic virtual cluster provisioning via XCAT; supports both stateful and stateless OS images

  7. Cloud Computing: Infrastructure and Runtimes
     • Cloud infrastructure: outsourcing of servers, computing, data, file space, etc.
       – Handled through Web services that control virtual machine lifecycles.
     • Cloud runtimes: tools (for using clouds) to do data-parallel computations.
       – Apache Hadoop, Google MapReduce, Microsoft Dryad, and others
       – Designed for information retrieval, but excellent for a wide range of science data analysis applications
       – Can also do much traditional parallel computing for data-mining if extended to support iterative operations
       – Not usually run on virtual machines

  8. Alu and Sequencing Workflow
     • Data is a collection of N sequences, 100's of characters long
       – These cannot be thought of as vectors because there are missing characters
       – "Multiple Sequence Alignment" (creating vectors of characters) doesn't seem to work if N is larger than O(100)
     • Can calculate N^2 dissimilarities (distances) between sequences (all pairs)
     • Find families by clustering (much better methods than K-means); as there are no vectors, use vector-free O(N^2) methods
     • Map to 3D for visualization using Multidimensional Scaling (MDS), also O(N^2)
     • N = 50,000 runs in 10 hours (all of the above) on 768 cores
     • Our collaborators just gave us 170,000 sequences and want to look at 1.5 million; we will develop new algorithms!
     • MapReduce++ will do all steps, as MDS and clustering just need MPI Broadcast/Reduce
     (A small-scale sketch of this pairwise-distance + MDS pipeline follows below.)
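As a small-scale, hedged illustration of the pipeline above, the sketch below computes an all-pairs dissimilarity matrix for toy sequences, using a normalized edit distance as a stand-in for the alignment-based dissimilarity, and embeds it in 3D with scikit-learn's MDS on the precomputed matrix. It mirrors only the shape of the workflow, not the SALSA implementation or its scale.

    import numpy as np
    from sklearn.manifold import MDS

    def edit_distance(a, b):
        # Classic dynamic-programming Levenshtein distance (stand-in for an
        # alignment-based dissimilarity such as Smith-Waterman-Gotoh).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def dissimilarity_matrix(seqs):
        # O(N^2) all-pairs distances; only the upper triangle is computed.
        n = len(seqs)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dij = edit_distance(seqs[i], seqs[j]) / max(len(seqs[i]), len(seqs[j]))
                d[i, j] = d[j, i] = dij
        return d

    seqs = ["ACGTACGT", "ACGTTCGT", "TTGACCAA", "TTGACGAA", "GGGGCCCC"]
    d = dissimilarity_matrix(seqs)
    coords3d = MDS(n_components=3, dissimilarity="precomputed",
                   random_state=0).fit_transform(d)   # 3D points for visualization
    print(coords3d.shape)   # (5, 3)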

  9. Pairwise Distances: ALU Sequences
     • 125 million distances computed in 4 hours and 46 minutes
     • Calculate pairwise distances for a collection of genes (used for clustering, MDS)
     • O(N^2) problem
     • "Doubly data parallel" at the Dryad stage
     • Performance close to MPI
     • Performed on 768 cores (Tempest cluster)
     • Processes work better than threads when used inside vertices: 100% utilization vs. 70%
     • [Chart: execution time for DryadLINQ vs. MPI at 35,339 and 50,000 sequences]
     (A block-decomposition sketch of the "doubly data parallel" structure follows below.)
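"Doubly data parallel" here means the symmetric N x N distance matrix is partitioned into rectangular index blocks, and each block on or above the diagonal is an independent coarse-grained task. The sketch below shows that decomposition with Python processes, echoing the observation that processes beat threads inside vertices; pair_distance is a placeholder for the real alignment kernel, and all names are illustrative.

    from itertools import combinations_with_replacement
    from multiprocessing import Pool

    import numpy as np

    def pair_distance(a, b):
        # Placeholder dissimilarity; the real computation would be an
        # alignment such as Smith-Waterman-Gotoh.
        return sum(x != y for x, y in zip(a, b)) / max(len(a), len(b))

    def compute_block(args):
        # One task/vertex: all pairs whose indices fall in block (bi, bj).
        seqs, block, bi, bj = args
        rows = range(bi * block, min((bi + 1) * block, len(seqs)))
        cols = range(bj * block, min((bj + 1) * block, len(seqs)))
        return [(i, j, pair_distance(seqs[i], seqs[j]))
                for i in rows for j in cols if i <= j]

    def pairwise_blocked(seqs, block=2, workers=4):
        n_blocks = -(-len(seqs) // block)              # ceiling division
        tasks = [(seqs, block, bi, bj)
                 for bi, bj in combinations_with_replacement(range(n_blocks), 2)]
        d = np.zeros((len(seqs), len(seqs)))
        with Pool(workers) as pool:                    # processes, not threads
            for result in pool.map(compute_block, tasks):
                for i, j, dij in result:
                    d[i, j] = d[j, i] = dij            # fill both symmetric halves
        return d

    if __name__ == "__main__":
        seqs = ["ACGTACGT", "ACGTTCGT", "TTGACCAA", "TTGACGAA", "GGGGCCCC"]
        print(pairwise_blocked(seqs).round(2))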

  10. [Figure-only slide; no recoverable text]

  11. [Figure-only slide; no recoverable text]

  12. Hierarchical Subclustering [figure-only slide]

  13. Pairwise Clustering: 30,000 Points on Tempest
     • Clustering by deterministic annealing
     • [Chart: parallel overhead vs. degree of parallelism (1 to 744), comparing MPI and thread-based parallelism]
     (A simplified annealing sketch follows below.)
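The pairwise, vector-free deterministic annealing clustering used here is too involved for a short sketch, but the annealing idea itself can be illustrated on vector data: soft cluster assignments at a temperature T, probability-weighted center updates, and gradual cooling. The code below is a simplified, assumption-laden stand-in, not the SALSA algorithm, and every parameter value is arbitrary.

    import numpy as np

    def da_clustering(x, k=3, t_start=5.0, t_min=0.01, cooling=0.9, iters=20, seed=0):
        # Annealed soft clustering on vectors (illustrative only; the SALSA
        # version works directly on pairwise dissimilarities).
        rng = np.random.default_rng(seed)
        centers = x[rng.choice(len(x), k, replace=False)].astype(float)
        t = t_start
        while t > t_min:
            for _ in range(iters):
                # Soft assignments p(k|i) proportional to exp(-||x_i - y_k||^2 / T)
                d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
                logp = -d2 / t
                logp -= logp.max(axis=1, keepdims=True)      # numerical stability
                p = np.exp(logp)
                p /= p.sum(axis=1, keepdims=True)
                # Centers are probability-weighted means of the points
                centers = (p.T @ x) / p.sum(axis=0)[:, None]
            t *= cooling                                     # anneal the temperature
        return centers, p.argmax(axis=1)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (0.0, 3.0, 6.0)])
        centers, labels = da_clustering(x)
        print(centers.round(2))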

  14. Dryad versus MPI for Smith Waterman
     • [Chart: performance of Dryad vs. MPI for SW-Gotoh alignment; time per distance calculation per core (milliseconds) against the number of sequences (0 to 60,000), for Dryad (replicated data), block-scattered MPI (replicated data), Dryad (raw data), space-filling-curve MPI (raw data), and space-filling-curve MPI (replicated data)]
     • Flat is perfect scaling
     (A sketch of the SW-Gotoh kernel behind these measurements follows below.)
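The per-pair kernel behind these measurements is local alignment with affine gap penalties (Smith-Waterman with Gotoh's recurrence). Below is a compact, hedged Python sketch of that scoring recurrence; the scoring parameters are arbitrary placeholders, and the experiments used optimized implementations rather than this code.

    def sw_gotoh_score(a, b, match=2, mismatch=-1, gap_open=-4, gap_extend=-1):
        # Local alignment score with affine gaps (Gotoh): three DP matrices,
        # H = best score ending in a match/mismatch, E = gap in a, F = gap in b.
        n, m = len(a), len(b)
        neg = float("-inf")
        h = [[0.0] * (m + 1) for _ in range(n + 1)]
        e = [[neg] * (m + 1) for _ in range(n + 1)]
        f = [[neg] * (m + 1) for _ in range(n + 1)]
        best = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                e[i][j] = max(e[i][j - 1] + gap_extend, h[i][j - 1] + gap_open)
                f[i][j] = max(f[i - 1][j] + gap_extend, h[i - 1][j] + gap_open)
                s = match if a[i - 1] == b[j - 1] else mismatch
                h[i][j] = max(0.0, h[i - 1][j - 1] + s, e[i][j], f[i][j])
                best = max(best, h[i][j])
        return best

    # A score can be turned into a dissimilarity, e.g. relative to the self-alignment scores.
    print(sw_gotoh_score("ACGTACGT", "ACGTTCGT"))   # higher score = more similar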

  15. Hadoop/Dryad Comparison: "Homogeneous" Data
     • [Chart: time per alignment (ms) vs. number of sequences (30,000 to 55,000) for Dryad and Hadoop]
     • Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex
     • Using real data with standard deviation/length = 0.1

  16. Hadoop/Dryad Comparison: Inhomogeneous Data I
     • Randomly distributed inhomogeneous data; mean sequence length 400, dataset size 10,000
     • [Chart: total time (s) vs. standard deviation of sequence length (0 to 300) for DryadLINQ SW-G, Hadoop SW-G, and Hadoop SW-G on VMs]
     • Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed
     • Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes)

  17. Hadoop/Dryad Comparison: Inhomogeneous Data II
     • Skewed distributed inhomogeneous data; mean sequence length 400, dataset size 10,000
     • [Chart: total time (s) vs. standard deviation of sequence length (0 to 300) for DryadLINQ SW-G, Hadoop SW-G, and Hadoop SW-G on VMs]
     • This shows the natural load balancing of Hadoop MapReduce's dynamic task assignment using a global pipeline, in contrast to DryadLINQ's static assignment
     • Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes)
     (A toy simulation contrasting static and dynamic task assignment follows below.)
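To illustrate why dynamic, pull-based task assignment copes better with skewed task lengths than a fixed up-front partitioning, here is a toy Python simulation. The task-time distribution, worker count, and scheduling models are assumptions for illustration only and do not reproduce the measured SW-G runs.

    import random

    def makespan_static(task_times, workers):
        # Static assignment: tasks are split into equal-sized contiguous chunks
        # up front (DryadLINQ-style partitioning in this toy model).
        chunk = -(-len(task_times) // workers)
        loads = [sum(task_times[w * chunk:(w + 1) * chunk]) for w in range(workers)]
        return max(loads)

    def makespan_dynamic(task_times, workers):
        # Dynamic assignment: each worker pulls the next task from a global
        # queue as soon as it becomes idle (Hadoop-style in this toy model).
        loads = [0.0] * workers
        for t in task_times:
            idx = loads.index(min(loads))   # the worker that frees up first
            loads[idx] += t
        return max(loads)

    random.seed(0)
    # Skewed task times: most tasks are short, a few are very long.
    tasks = [random.expovariate(1.0) ** 2 for _ in range(512)]
    print("static :", round(makespan_static(tasks, 32), 1))
    print("dynamic:", round(makespan_dynamic(tasks, 32), 1))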

  18. Hadoop VM Performance Degradation
     • [Chart: performance degradation on VM (Hadoop), 0% to 30%, vs. number of sequences (10,000 to 50,000)]
     • 15.3% degradation at the largest data set size
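The slide does not spell out the degradation metric; the usual convention, assumed here, is the relative slowdown of the virtualized run against the bare-metal run, so that a value of 0.153 corresponds to the quoted 15.3%.

    def vm_degradation(t_vm, t_bare_metal):
        # Relative performance degradation of a run on VMs vs. bare metal.
        return (t_vm - t_bare_metal) / t_bare_metal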

  19. PhyloD using Azure and DryadLINQ
     • Derive associations between HLA alleles and HIV codons, and between codons themselves

  20. Mapping of PhyloD to Azure [diagram slide]

  21. PhyloD Azure Performance
     • Number of active Azure workers during a run of the PhyloD application
     • Efficiency vs. number of worker roles in the PhyloD prototype run on the Azure March CTP

  22. Iterative Computations: K-means and Matrix Multiplication
     • [Charts: performance of K-means; parallel overhead of matrix multiplication]

  23. K-means Clustering: Time for 20 Iterations
     • An iteratively refining operation
     • New maps/reducers/vertices in every iteration and file-system-based communication lead to large overheads
     • Loop unrolling in DryadLINQ provides better performance
     • The overheads are extremely large compared to MPI
     • CGL-MapReduce is an example of MapReduce++: it supports the MapReduce model with iteration (data stays in memory and communication is via streams, not files)
     (A minimal map/reduce formulation of K-means follows below.)
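To see what gets re-executed in every pass, here is a hedged, in-memory Python sketch of K-means written as one map phase and one reduce phase per iteration; on Hadoop or DryadLINQ each such pass would spawn new tasks and move data through files, which is exactly the overhead the slide measures.

    import numpy as np

    def kmeans_map(points, centers):
        # Map: assign each point to its nearest center and emit
        # (center_id, (partial_sum, count)) pairs.
        out = {}
        for p in points:
            k = int(np.argmin(((centers - p) ** 2).sum(axis=1)))
            s, c = out.get(k, (np.zeros_like(p, dtype=float), 0))
            out[k] = (s + p, c + 1)
        return out

    def kmeans_reduce(partials, old_centers):
        # Reduce: merge partial sums/counts and emit new centers.
        centers = old_centers.copy()
        merged = {}
        for part in partials:
            for k, (s, c) in part.items():
                ms, mc = merged.get(k, (np.zeros_like(s), 0))
                merged[k] = (ms + s, mc + c)
        for k, (s, c) in merged.items():
            centers[k] = s / c
        return centers

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (0.0, 5.0)])
    splits = np.array_split(data, 4)                  # the "map" partitions
    centers = data[rng.choice(len(data), 2, replace=False)].astype(float)
    for _ in range(20):                               # 20 iterations, as in the slide
        centers = kmeans_reduce([kmeans_map(s, centers) for s in splits], centers)
    print(centers.round(2))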

  24. MapReduce++ (CGL-MapReduce)
     • Architecture: a driver program and the MR user program communicate over a pub/sub broker network with map workers (M) and reduce workers (R) hosted by an MRDaemon on each worker node; input data splits come from the file system
     • Streaming-based communication: intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating local files
     • Cacheable map/reduce tasks: static data remains in memory
     • Combine phase to combine reductions
     • The user program is the composer of MapReduce computations
     • Extends the MapReduce model to iterative computations
     (A generic sketch of this iterative pattern follows below.)
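The class below is a toy illustration of the pattern just described: static data splits are configured once and cached, map outputs stay in memory rather than going through files, and a combine step hands the merged result back to the driver loop each iteration. It is not the actual CGL-MapReduce API (which is Java); all class and method names here are made up for this sketch.

    class IterativeMapReduce:
        """Toy driver for the MapReduce++ pattern: cached static data + combine."""

        def __init__(self, map_fn, reduce_fn, combine_fn, static_splits):
            self.map_fn = map_fn
            self.reduce_fn = reduce_fn
            self.combine_fn = combine_fn
            self.static_splits = static_splits   # cached once, reused every iteration

        def run(self, variable_data, iterations, converged=lambda old, new: False):
            for _ in range(iterations):
                # Map tasks see the cached static data plus the small variable data;
                # outputs are kept in memory ("streamed"), never written to files.
                map_out = [self.map_fn(split, variable_data)
                           for split in self.static_splits]
                reduce_out = self.reduce_fn(map_out)
                new_data = self.combine_fn(reduce_out, variable_data)
                if converged(variable_data, new_data):
                    return new_data
                variable_data = new_data
            return variable_data

    # Paired with the K-means sketch above (assuming those definitions), a driver
    # call could look like:
    #   IterativeMapReduce(kmeans_map, lambda parts: parts,
    #                      lambda parts, old: kmeans_reduce(parts, old),
    #                      splits).run(centers, 20)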

  25. SALSA HPC Dynamic Virtual Cluster Hosting
     • Monitoring infrastructure over dynamically provisioned clusters running SW-G using Hadoop (Linux bare-system), SW-G using Hadoop (Linux on Xen VMs), and SW-G using DryadLINQ (Windows Server 2008 HPC bare-system)
     • Cluster switching from Linux bare-system, to Xen VMs, to Windows 2008 HPC, via the XCAT infrastructure on iDataplex bare-metal nodes (32 nodes)
     • SW-G: Smith Waterman Gotoh dissimilarity computation – a typical MapReduce style application

  26. Monitoring Infrastructure
     • A monitoring interface communicates over a pub/sub broker network with a summarizer and a switcher that manage the virtual/physical clusters provisioned by the XCAT infrastructure on the iDataplex bare-metal nodes (32 nodes)

  27. SALSA HPC Dynamic Virtual Clusters
