A cloud framework for high throughput biological data processing Martin Koehler 1 , Yuriy Kaniovskyi 1 , Siegfried Benkner 1 , Volker Egelhofer 2 , Wolfram Weckwerth 2 Faculty of Computer Science 1 Faculty of Life Sciences 2 University of Vienna, Austria ISGC 2011 Biomedicine & Life Science Applications 25.03.2011
Contents • Description of Application Mapping tryptid peptide fragmentation mass spectra data by utilizing the ProMEX database • Description of Infrastructure The Vienna Cloud Environment (VCE) The Vienna Cloud Environment (VCE) • Cloud-enabling molecular systems biology application • Evaluation of ProMEX Application • Ongoing work Martin Köhler - 2
Problem statement of molecular systems biologists • Molecular systems biology community has built – ProMEX database storing public mass spectral references – 23810 tryptic peptide product ion spectra entries • Chlamydomonas reinhardii, Arabidopsis thaliana, … – http://promex.pph.univie.ac.at/promex/ • Application – Matching tryptic peptide fragmentation mass spectra data against mass spectral reference database (ProMEX) • ProMEX algorithm – Matches input file to each entry in ProMEX database – Computes similarity of input file to database entries Martin Köhler - 3
ProMEX database – Web interface Martin Köhler - 4
Recast application to Cloud programming model and framework • ProMEX database – Increasing amount of data • User requests – Execution of parameter studies – Execution via Client API – • MoSys application • MoSys application – Algorithm compares input file with each database entry – Many independent tasks -> no need for communication – Use of massively parallel programming model • Provisioning of MoSys application – On demand compute and storage resources – MapReduce Programming Model – Vienna Cloud Environment (VCE) Martin Köhler - 5
Recast application to MapReduce model • Migrate ProMEX database to HDFS • Each map task compares – input file with local part of database • Similarity (percentage) of all entries reduced and sorted • Parameter studies supported via DistributedCache Martin Köhler - 6
Vienna Cloud Environment (VCE) • Cloud framework for on-demand supercomputing – Software as a Service (SaaS) – Web Service and Virtualization Technologies • Partly developed within EU Projects: – FP 5 GEMSS - Grid-enabling Medical Simulation Services – FP 6 @neurIST - Integrated Biomedical Informatics for the Management of Cerebral Aneurysms – FP 7 VPHShare - Virtual Physiological Human: Sharing for Healthcare • Provisioning of virtual appliances – Scientific applications – Data access and integration – Scientific workflows • VCE enables resources as – Web Services – hosted on virtual appliances Martin Köhler - 7
VCE Virtual Appliances • Data Appliances – Data access and integration – Based on OGSA-DAI/DQP – Integrated data mediation engine • Global data scheme • Mapping rules • Workflow Appliances – Integrated Workflow enactment engine (WEEP) – composed execution of BPEL workflows • Application Appliances – Providing applications as Services – HPC applications (OpenMP, MPI) – MapReduce application • Adaptive Execution framework Martin Köhler - 8
VCE MoSys Architecture • VCE Web Service interface • Resources – Local cluster resources – Private Cloud • Job execution – DRMAA – Sun Grid Engine (SGE) – Dynamic creation and configuration of Hadoop execution environment Martin Köhler - 9
System Configuration • Recasting and optimizing an application for MapReduce results in configuration challenges at three layers: • Application layer – Configuring parameter studies • Execution framework (Apache Hadoop) • Execution framework (Apache Hadoop) – Number of Reduce Tasks – Memory Allocation – Parallel tasks,… • Resources – Compute Resources: number of nodes – Storage Resources (HDFS): block size, replication factor Martin Köhler - 10
Evaluation of application • Utilization of CORA cluster resources – 72 CPU-cores (8 x 2 x Intel Xeon quad-core X5550) – 9088 GPU-cores (8 x 2 x Tesla C2050) – main memory of 216GB – Disk space of 64TB – Disk space of 64TB • Private Cloud environment – 4 nodes with two 12-core AMD Opteron™ – main memory is 192GB – KVM Virtualization Martin Köhler - 11
Configuration of experiments • First experiments on – Resource scaling – Parameter studies • Database size – 100 GB and 500 GB – Block Size 128 MB, Replication factor: 1 – Block Size 128 MB, Replication factor: 1 • Hadoop Configuration – 8 parallel map and reduce tasks per node – new JVM per task – 1GB memory per JVM • Parameter studies – Number of input files: 1 to 1000 Martin Köhler - 12
Performance Results: Node Scaling Martin Köhler - 13
Performance Results: Parameter Study Martin Köhler - 14
Ongoing Work • Performance tests regarding virtual execution nodes – Regarding to integration of VMs into Cluster as Hadoop execution nodes hosted in private/public Clouds • Adaptive Framework for automatic configuration of application, environment, and job – Based on MAPE-K loop (autonomic computing) – Hadoop environment configuration • Number of map/reduce tasks, memory allocation • Parallel tasks per node, hadoop scheduler, … – HDFS configuration • Block size, replication factor – Resources • Data-local computation • Number of resources (cpus) Martin Köhler - 15
Questions? Questions? Martin Köhler - 16
Recommend
More recommend