dispel4py: A Python Framework for Data-Intensive eScience (PyHPC 2015)
  1. Virtual Earthquake and seismology Research Community: an e-science environment in Europe. Project 283543 – FP7-INFRASTRUCTURES-2011-2 – www.verce.eu – info@verce.eu. dispel4py: A Python Framework for Data-Intensive eScience. PyHPC2015, 15 November 2015. Amy Krause et al., University of Edinburgh. WP7 VERCE Science Gateway.

  2. Outline
  • Introduction
  • dispel4py features
  • dispel4py basic concepts
  • dispel4py advanced concepts
  • dispel4py workflows
  • Evaluations
  • Current work
  • Conclusions and future work

  3. Introduction – What is dispel4py?
  • A user-friendly tool
  • Develop scientific methods and applications on local machines
  • Run them at scale on a wide range of computing resources without making changes

  4. Introduction – What is dispel4py?
  Open source project: www.dispel4py.org & https://github.com/dispel4py/dispel4py
  Publications:
  • IJHPCA journal, “Data-Intensive High Performance Computing” Special Issue, 2015
  • 11th IEEE eScience Conference, 2015
  • Book chapter in “Conquering Big Data Using High Performance”, 2015
  Users:
  • Computational seismologists
  • Astrophysicists
  • Bioinformatics
  Contributors:
  • University of Edinburgh
  • KNMI
  • LMU

  5. dispel4py features
  • Stream-based: tasks are connected by streams; multiple streams in & out; optimisation based on avoiding I/O
  • Python for describing tasks and connections
  • Modular
  • Multiple enactment systems
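The stream-based model can be sketched with plain Python generators: each stage consumes one stream and yields another, so intermediate results never need to touch disk. This is an illustration of the idea only, not the dispel4py API:

```python
# Generator-based sketch of stream-connected tasks (illustrative only;
# dispel4py's real PEs use a richer API and parallel enactment).

def numbers(n):
    # Source stage: emit a stream of integers.
    for i in range(n):
        yield i

def square(stream):
    # Transform stage: consume one stream, produce another.
    for item in stream:
        yield item * item

def total(stream):
    # Sink stage: reduce the stream to a single value.
    return sum(stream)

# Stages are connected by streams; no intermediate I/O is needed.
result = total(square(numbers(5)))
print(result)  # 0 + 1 + 4 + 9 + 16 = 30
```

Because each stage pulls items lazily from the previous one, the whole pipeline runs in constant memory regardless of stream length.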

  6. dispel4py basic concepts – Processing element
  • PEs represent the basic computational unit: a data transformation, scientific method or service request
  • PEs are the “Lego bricks” of tasks, and users can assemble them into a workflow as they wish
  • General PE features: consumes any number and types of input streams; produces any number and types of output streams

  7. dispel4py basic concepts – Instance and graph
  Graph:
  • Topology of the workflow: connections between PEs
  • Users focus on the algorithm to implement or the service to use
  [Diagram: example topologies – pipeline, split & merge, tree]

  8. dispel4py basic concepts – Instance and graph
  PE instance:
  • An executable copy of a PE that runs in a process
  • Each PE is translated into one or more instances at run-time
  [Diagram: pipeline example]

  9. dispel4py basic concepts – Groupings
  • “Grouping by” a feature (as in MapReduce)
  • All data items that share the same feature value are guaranteed to be delivered to the same instance of a PE
  [Diagram: items p1–p3 tagged with timestamps t=10:00 and t=11:00 are routed from P2 instances to P3 instances according to timestamp]

  10. dispel4py basic concepts – Groupings
  • One-to-all – grouping “all”: P2 instances send copies of their output data to all the connected P3 instances
  • Global – grouping “global”: all the instances of P2 send all their data to one instance of P3
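“Grouping by” a feature behaves like hash partitioning: every item with the same key is routed to the same PE instance. A minimal sketch under that assumption (the instance count, keys and routing function are made up for illustration; dispel4py's actual routing is handled by the enactment engine):

```python
def target_instance(key, n_instances):
    # All data items sharing the same grouping key are routed to the
    # same PE instance (illustrative hash-partitioning scheme).
    return hash(key) % n_instances

n = 4  # hypothetical number of P3 instances
items = [("t=10:00", "p1"), ("t=10:00", "p2"), ("t=11:00", "p3")]

routed = {}
for key, payload in items:
    routed.setdefault(target_instance(key, n), []).append(payload)

# Both items with key "t=10:00" land on the same instance,
# which is exactly the MapReduce-style guarantee on the slide.
```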

  11. dispel4py basic concepts – Composite PE and partition
  Composite PE:
  • A sub-workflow in a PE
  • Hides the complexity of an underlying process
  • Treated like any other PE

  12. dispel4py basic concepts – Composite PE and partition
  Partition:
  • PEs wrapped together
  • Runs several PEs in a single process

  13. dispel4py basic concepts – Example of a dispel4py workflow
  Users only have to implement the PEs and the connections:

    from dispel4py.workflow_graph import WorkflowGraph

    # PE objects
    pe1 = FilterTweet()
    pe2 = CounterHashTag()
    pe3 = CounterLanguage()
    pe4 = Statistics()

    # Graph
    graph = WorkflowGraph()

    # Connections
    graph.connect(pe1, 'hash_tag', pe2, 'input')
    graph.connect(pe1, 'language', pe3, 'input')
    graph.connect(pe2, 'hash_tag_count', pe4, 'input1')
    graph.connect(pe3, 'language_count', pe4, 'input2')

  14. dispel4py basic concepts – Example of a PE
  Users only have to implement the PEs and the connections:

    import json
    import re

    class FilterTweet(GenericPE):
        def __init__(self):
            # Declare inputs & outputs
            GenericPE.__init__(self)
            self.add_output('hash_tags')
            self.add_output('language')

        def process(self, inputs):
            # Logic of the PE
            twitterData = inputs['input']
            for line in twitterData:
                tweet = json.loads(line)
                language = tweet[u'lang'].encode('utf-8')
                text = tweet[u'text'].encode('utf-8')
                hashtags = re.findall(r"#(\w+)", text)
                # Stream out data
                self.write('hash_tags', hashtags)
                self.write('language', language)

  15. dispel4py advanced concepts – Mappings
  • Sequential: mapping for local testing; ideal for local resources (laptops and desktops)
  • Multiprocessing: Python’s multiprocessing library; ideal for shared-memory resources
  • MPI: distributed-memory, message-passing parallel programming model; ideal for HPC clusters
  • Storm: distributed real-time computation system; fault-tolerant and scalable; runs all the time
  • Spark (prototype)
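The point of multiple mappings is that the same logical workflow runs unchanged on different enactment engines. A rough analogy in plain Python (not the dispel4py mapping API): the same pipeline of stages executed sequentially and with the standard-library multiprocessing module yields identical results; the stage names are made up:

```python
import multiprocessing

def prepare(x):
    # Hypothetical preprocessing stage.
    return x + 1

def analyse(x):
    # Hypothetical analysis stage.
    return x * x

def pipeline(x):
    # The "workflow": a fixed composition of stages.
    return analyse(prepare(x))

data = list(range(8))

# "Sequential" mapping: run in one process, for local testing.
sequential = [pipeline(x) for x in data]

# "Multiprocessing" mapping: same workflow, shared-memory parallelism.
if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        parallel = pool.map(pipeline, data)
    assert parallel == sequential
```

The workflow definition is untouched in both cases; only the enactment changes, which is the property the slide attributes to dispel4py's mappings.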

  16. dispel4py advanced concepts – Provenance
  • Users can select which metadata to store
  • Searches over product metadata within and across runs
  • Data download and preview
  • Capturing of errors for diagnostic purposes
  • Data fabric: multi-directional navigation across data dependencies
  • W3C PROV-DM as reference model
  [Diagram: source and end PEs feeding a provenance web API]
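The W3C PROV-DM reference model named on the slide boils down to entities (data products), activities (executions), and relations between them. A minimal, library-free sketch of such records with searchable metadata; all identifiers and field names here are illustrative, not dispel4py's actual provenance schema:

```python
# Illustrative PROV-DM-style records; real dispel4py provenance is richer
# (user-selected metadata, error capture, cross-run search, data fabric).
provenance = []

def record(rec_type, **attrs):
    # Append one provenance record (entity / activity / relation).
    rec = {"type": rec_type, **attrs}
    provenance.append(rec)
    return rec

record("entity", id="data:trace42", station="STA01")        # a data product
record("activity", id="run:preprocess-7", pe="PreprocessPE") # a PE execution
record("wasGeneratedBy", entity="data:trace42",              # the relation
       activity="run:preprocess-7")

# Search over product metadata, as the slide describes:
hits = [r for r in provenance
        if r["type"] == "entity" and r.get("station") == "STA01"]
```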

  17. VERCE project
  • The VERCE project provides a framework for the seismological community to exploit the increasingly large volume of seismological data:
  • Supports data-intensive and HPC applications
  • e-Science gateway for submitting applications
  • Distributed and diversified data sources
  • Distributed HPC resources on grid, cloud and HPC clusters
  • Use cases for dispel4py:
  • Seismic noise cross-correlation
  • Misfit calculation

  18. dispel4py workflows – Seismology, cross-correlation
  • A data-intensive problem, commonly used in seismology
  • Phase 1 – Preprocess: time-series data (traces) from seismic stations are preprocessed in parallel
  • Phase 2 – Cross-correlation: pairs all of the stations and calculates the cross-correlation for each pair (complexity O(n²))
  • Input data: a list of 1000 stations (150MB); output data: 499,500 cross-correlations (39GB)
  [Diagram: read trace → preprocess → pair → xcorr → write results. Phase 1 is a composite PE that prepares the trace from a single seismometer (decimate, filter, whiten, detrend, demean, normalise, FFT, remove response); Phase 2 is the cross-correlation pipeline]
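The O(n²) pairing arithmetic on the slide checks out: with n = 1000 stations there are n(n−1)/2 = 499,500 unique pairs. A small sketch of the pairing step with a toy lag-zero correlation (the real workflow operates on full preprocessed seismic traces; station names and data here are made up):

```python
import itertools
import math

# Pair count for the slide's input: 1000 stations.
n_stations = 1000
pairs = math.comb(n_stations, 2)  # n * (n - 1) / 2
print(pairs)  # 499500, matching the 499,500 cross-correlations on the slide

def xcorr0(a, b):
    # Toy cross-correlation at lag 0: the dot product of two traces.
    return sum(x * y for x, y in zip(a, b))

# Tiny stand-in for preprocessed traces keyed by station.
stations = {"STA01": [1, 2, 3], "STA02": [0, 1, 0], "STA03": [2, 2, 2]}
results = {
    (s1, s2): xcorr0(stations[s1], stations[s2])
    for s1, s2 in itertools.combinations(stations, 2)
}
# 3 stations -> 3 unique pairs.
```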

  19. dispel4py workflows – Seismology, misfit computation
  • Phase 1 – Preprocess: align and prepare traces
  • Phase 2 – Misfit: compare synthetic and observed data

  20. dispel4py workflows – Misfit visualisation

  21. Evaluations – Computing resources

    Resource           Terracorrelator    SuperMUC         Amazon EC2         EDIM1
    Type               Shared-memory      Cluster          Cloud              Cloud
    Enactment systems  MPI, multi         MPI, multi       MPI, Storm, multi  MPI, Storm, multi
    Nodes              1                  16               18                 14
    Cores per node     32                 16               2                  4
    Total cores        32                 256              36                 14
    Memory             2TB                32GB             4GB                3GB
    Workflows          xcorr, int_ext,    xcorr,           xcorr              xcorr, int_ext,
                       sentiment          sentiment                           sentiment

  22. Evaluations – Performance measures

    xcorr: 1000 stations, input 150MB, output 39GB (times in seconds)

    Mode   Terracorrelator (32 cores)  SuperMUC (256 cores)  Amazon EC2 (36 cores)  EDIM1 (14 cores, 4 shared)
    MPI    1501.32 (~25 min)           1093.16 (~19 min)     16862.73 (~5 h)        38656.94 (~11 h)
    multi  1332.20 (~23 min)           –                     –                      –
    Storm  –                           –                     27898.89 (~8 h)        120077.123 (~33 h)

    int_ext: 1050 galaxies

    Mode   Terracorrelator (32 cores)  EDIM1 (14 cores, 4 shared)
    MPI    31.60                       96.12
    multi  14.50                       101.2
    Storm  –                           30.2
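The minute and hour figures in parentheses follow directly from the wall-clock times, and the relative performance of the systems can be computed the same way. A quick check using the xcorr MPI row of the table above:

```python
# Wall-clock times in seconds for xcorr under MPI, from the evaluation table.
times = {
    "Terracorrelator (32 cores)": 1501.32,
    "SuperMUC (256 cores)": 1093.16,
    "Amazon EC2 (36 cores)": 16862.73,
    "EDIM1 (14 cores)": 38656.94,
}

baseline = times["Terracorrelator (32 cores)"]
for system, t in times.items():
    # Minutes elapsed, and speed relative to the 32-core baseline.
    print(f"{system}: {t / 60:.1f} min, "
          f"{baseline / t:.2f}x relative to Terracorrelator")
```

For instance, 1501.32 s is ~25 minutes, matching the table, and SuperMUC at 256 cores is about 1.37x faster than the 32-core Terracorrelator on this workflow.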

  23. Current work
  • Diagnosis tool: how to partition the workflow automatically; how many processes execute each partition
  • Run-time stream-adaptive compression

  24. dispel4py – Monitoring

  25. Conclusions and future work
  • Python library for streaming and data-intensive processing
  • Users express their computational activities once
  • The same workflow can be executed on several parallel systems
  • Easy to use and open
  • Future:
  • Support for PE failures
  • Select the best computing resource and mapping
