Workflows as an Operational Tool Scientific Computing using Data

Workflows as an Operational Tool Scientific Computing using Data Scien
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence

SDSC is 31 Years Young! Providing Cyberinfrastructure for Research and Education
Established as a national supercomputer resource center in 1985 by NSF.
A world leader in HPC, data-intensive computing, and scientific data management.
Today: current strategic focus on “Big Data” and “HPC Cloud”: versatile computing.
Two discoveries in dr design from 1987 and 1

SDSC continues to focus on versatile computing and big data!
Gordon: First Flash-based Supercomputer for Data-intensive Apps
ss Walker Group
Comet: Serving the Long Tail of Science
standard racks = 1944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD
• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance virtualization
Pflop/s

SDSC Data Science Office -- Expertise, Systems and Training for Data Science Applications --
Big Data Platforms, Applications, Training, Industry
SDSC Expertise and Strengths
SDSC Data Science Office (DSO)
SDSC DSO is a collaborative virtual organization at SDSC for colle lasting innovation in data science research, development and educa

Computing Today has Many Shapes and Sizes
COMPUTING AT BIG DATA SCALE
Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and stateful resource optimization
• Skilled interdisciplinary workforce
Enables dynamic data-driven applications: Computer-Aided Drug Discovery, Smart Cities, Disaster Resilience and Response, Personalized Precision Medicine, Smart Grid and Energy Management, Manufacturing
New era of data science

Needs and Trends for Scientific Computing under the Influence of Big Data and Cloud Systems
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous
New era of data science!

BIG DATA: Size, Volume, Application-Specific Value

Data Management and Processing in the Big Data Era has Unique Challenges!
• Volume: scalable batch processing
• Velocity: stream processing
• Variety: extensible data storage, access and integration

These challenges come with new tools to tackle them
Higher levels: Expression and interactivity
Hive, Pig, Giraph, Spark, Storm, Flink, MapReduce, HBase, Cassandra, MongoDB, YARN, HDFS
Lower levels: Storage and scheduling

How do we use these new tools and combine them with existing solutions in scientific computing and data science?
COORDINATION AND WORKFLOW MANAGEMENT
DATA INTEGRATION AND PROCESSING
DATA MANAGEMENT AND STORAGE

Example Big Data Processing Pipelines
Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-
Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html
Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink

COORDINATION AND WORKFLOW MANAGEMENT
ACQUIRE PREPARE ANALYZE
ACQUIRE PREPARE … ANALYZE
ACQUIRE PREPARE ANALYZE
REPORT ACT
http://kepler-project.org

Research Challenges
• How to easily program a workflow using the Big Data Patterns?
• How to parallelize legacy tools for Big Data?
• Which pattern(s) to use under which Big Data engine, e.g., Hadoop, Flink or Spark?
• End-to-end performance prediction for Big Data applications/workflows (how long to run)
  • Knowledge based: Analyze performance using profiling techniques and dependency analysis
  • Data driven: Predict performance based on execution history (provenance) using machine learning techniques
• On demand resource provisioning and scheduling for Big Data applications (where and how to run)
  • Find the best resource allocation based on execution objectives and performance predictions
  • Find the best workflow and task configuration on the allocated resources
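
The last challenge ("where and how to run") can be read as a small selection problem over candidate allocations. The sketch below is illustrative only: the `Allocation` fields, the platform labels, the numbers, and the cheapest-within-deadline objective are all assumptions made for the example, not something stated in the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Allocation:
    """A candidate resource allocation (all fields are hypothetical)."""
    name: str                   # illustrative platform label
    predicted_runtime_s: float  # would come from a performance prediction model
    cost_per_hour: float        # arbitrary cost units

    @property
    def predicted_cost(self) -> float:
        # Cost of running for the predicted duration.
        return self.cost_per_hour * self.predicted_runtime_s / 3600.0


def cheapest_within_deadline(
    options: Iterable[Allocation], deadline_s: float
) -> Optional[Allocation]:
    """Return the lowest-cost allocation predicted to finish by the deadline."""
    feasible = [o for o in options if o.predicted_runtime_s <= deadline_s]
    return min(feasible, key=lambda o: o.predicted_cost, default=None)


# Made-up candidates: runtimes and prices are placeholders.
candidates = [
    Allocation("local-cluster", 7200.0, 1.0),
    Allocation("national-hpc", 1800.0, 6.0),
    Allocation("cloud", 3600.0, 2.5),
]
choice = cheapest_within_deadline(candidates, deadline_s=4000.0)
print(choice.name if choice else "no feasible allocation")  # prints "cloud"
```

Other execution objectives (for example, minimum runtime under a budget) would only change the filter and the key used in `min`.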

Using Big Data Patterns in Kepler Workflow
• We define a separate DDP (Distributed Data-Parallel) task/actor for each pattern
• These DDP actors partition input data and process each partition separately
• User-defined functions are described as sub-workflows of DDP actors
• DDP director: executes DDP workflows on top of Big Data engines
Benefits:
• Visual programming
• Parallel execution of the sub-workflows
• Existing actors can easily be reused for new tools
Figure panels: (a) Top-level Workflow (b) Sub-workflow for tRNAscan-SE (c) Sub-workflow for
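
The idea that a DDP actor partitions its input and runs a user-defined function on each partition separately can be sketched in plain Python. This is not Kepler or bioKepler code: `partition`, `ddp_map`, and the use of a thread pool as the "engine" are hypothetical stand-ins chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def partition(records: Sequence[T], num_partitions: int) -> List[Sequence[T]]:
    """Split records into at most num_partitions contiguous, near-equal chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, extra = divmod(len(records), num_partitions)
    chunks: List[Sequence[T]] = []
    start = 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few records
            chunks.append(records[start:end])
        start = end
    return chunks


def ddp_map(
    sub_workflow: Callable[[Sequence[T]], R],
    records: Sequence[T],
    num_partitions: int = 4,
) -> List[R]:
    """Apply sub_workflow to each partition independently, preserving order.

    sub_workflow stands in for the user-defined sub-workflow of a DDP actor;
    the thread pool stands in for a Big Data engine.
    """
    chunks = partition(records, num_partitions)
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(sub_workflow, chunks))


# Toy usage: sum each of three partitions of the numbers 0..9.
print(ddp_map(sum, list(range(10)), num_partitions=3))  # prints [6, 15, 24]
```

Because the per-partition function is passed in as an argument, the same wrapper can be reused with a different tool, which mirrors the reuse point made on the slide.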

Workflow is a combination of modules running in places and interacting with each other via data or message passing via a connection.
Workflow Performance == Composed Module Performance on an Infrastructure Insta

[Figure: RAMMCAP pipeline tools (QC, tRNA, metagene, cd-hit, hmmer, blast) for NGS data, placed along four scales. Data size: KB, MB, GB, TB. CPU time: Minute, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]

Optimization of Heterogeneous Resource Utilization using bioKepler
Execution Platforms:
• Local Cluster Resources
• National Resources (Gordon) (Co (Stamp (Lonestar)
• Cloud Resources

Add more traditional HPC and HTC workloads to this
Dynamic data-driven coordination & resource optimization
Requires: Ability to explore and scale on multiple platforms
Are workflows increasingly becoming the dynamic operations research tool for science?

Challenge: Make workflows more aware of distributed system and application state!

Some steps to get there …
1. Analyze each task in a workflow as an individual module based on all past executions of that executable or task.
2. Model workflow performance as an aggregate of predictions of individual tasks to form a prediction for the entire workflow.
3. Include system-level analytics at the workflow level so that scheduling can take system-level information into account in a dynamic data-driven way.
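
Steps 1 and 2 can be sketched as follows. The talk does not say which predictor or which aggregation is used, so this example makes two loud assumptions: each task's prediction is simply the mean of its past runtimes, and the workflow prediction is the longest dependency chain (critical path) with independent tasks running in parallel. Task names and numbers are made up, and step 3 (system-level analytics) is not modeled.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from statistics import mean
from typing import Dict, List


def predict_task_runtimes(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Step 1: predict each task's runtime from all of its past executions.

    Assumption: the mean of past runtimes stands in for a learned model.
    """
    return {task: mean(runs) for task, runs in history.items()}


def predict_workflow_runtime(
    deps: Dict[str, List[str]], task_pred: Dict[str, float]
) -> float:
    """Step 2: aggregate task predictions into one workflow prediction.

    deps maps each task to the tasks it depends on. Assumption: a task starts
    as soon as all of its dependencies finish, so the result is the length of
    the critical path.
    """
    finish: Dict[str, float] = {}
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(task, [])), default=0.0)
        finish[task] = start + task_pred[task]
    return max(finish.values(), default=0.0)


# Hypothetical execution history in seconds, keyed by task name.
history = {
    "qc": [10.0, 14.0],
    "trna": [30.0, 34.0],
    "blast": [100.0, 140.0],
    "report": [5.0, 7.0],
}
deps = {"qc": [], "trna": ["qc"], "blast": ["qc"], "report": ["trna", "blast"]}

task_pred = predict_task_runtimes(history)
print(predict_workflow_runtime(deps, task_pred))  # prints 138.0
```

For a strictly sequential workflow the same function reduces to a plain sum of the task predictions.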

3. Module Performance Prediction a
1. Profiling Framework
Workflow Composition (f)
Uses existing tools and computing systems!
Computing is just one part of big data workflows… new methods needed!
RMSE = 42.58 sec
Mean(Validation Set) = 1727.58
3-a: Module Prediction: Single Predictor
Feature Selection and Training For Two Independent Software Tools
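
To put the reported error in context, the RMSE can be compared with the mean of the validation set. The short sketch below shows the usual RMSE definition and that ratio; it assumes the validation-set mean is also in seconds (the slide gives no unit for it), and the helper `rmse` is illustrative rather than the authors' code.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Figures taken from the slide; the unit of the mean is assumed to be seconds.
reported_rmse_s = 42.58
validation_mean_s = 1727.58

relative_rmse = reported_rmse_s / validation_mean_s
print(f"{relative_rmse:.2%}")  # prints 2.46%
```

Under that assumption, the typical prediction error is about 2.5% of the average runtime in the validation set.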

Workflows for Data Science Center of Excellence at SDSC
WorDS.sdsc.edu
Focus on the question, not the technology!
• Access and query data
• Support exploratory design
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Goal: Methodology and tool development to build automated and operational workflow-driven solution architectures on big data and HPC platforms.
Data-Parallel Bioinformatics: bioKepler.org
Real-Time Hazards Management: wifire.ucsd.edu
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu

Examples: Use of Workflows as an Application Integration Tool for “Big” Data and Computational Science

ACQUIRE PREPARE ANALYZE REPORT ACT
COORDINATION AND WORKFLOW MANAGEMENT
DATA INTEGRATION AND PROCESSING
DATA MANAGEMENT AND STORAGE
COMMUNICATION AND FEEDBACK, PROVENANCE, SCALABILITY, EXPLORATION
