dispel4py: A Python Framework for Data-Intensive eScience (PyHPC 2015)
  1. Virtual Earthquake and seismology Research Community: an e-science environment in Europe. Project 283543 – FP7-INFRASTRUCTURES-2011-2 – www.verce.eu – info@verce.eu. dispel4py: A Python Framework for Data-Intensive eScience. PyHPC2015, 15 November 2015. Amy Krause et al., University of Edinburgh. WP7 VERCE Science Gateway.

  2. Outline
  • Introduction
  • dispel4py features
  • dispel4py basic concepts
  • dispel4py advanced concepts
  • dispel4py workflows
  • Evaluations
  • Current work
  • Conclusions and future work

  3. Introduction – What is dispel4py?
  • A user-friendly tool
  • Develop scientific methods and applications on local machines
  • Run them at scale on a wide range of computing resources without making changes

  4. Introduction – What is dispel4py?
  Open source project: www.dispel4py.org & https://github.com/dispel4py/dispel4py
  Publications:
  • IJHPCA journal, “Data-Intensive High Performance Computing” Special Issue, 2015
  • 11th IEEE eScience Conference, 2015
  • Book chapter in “Conquering Big Data Using High Performance”, 2015
  Users:
  • Computational seismologists
  • Astrophysicists
  • Bioinformatics
  Contributors:
  • University of Edinburgh
  • KNMI
  • LMU

  5. dispel4py features
  • Stream-based: tasks are connected by streams; multiple streams in & out; optimisation based on avoiding I/O
  • Python for describing tasks and connections
  • Modular
  • Multiple enactment systems
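The stream-based model can be sketched with plain Python generators: each stage consumes one stream and yields another, so intermediate results never need to touch disk. This is an illustration of the idea only, not the dispel4py API:

```python
# Generator-based sketch of stream-connected tasks (illustrative only;
# dispel4py's real PEs use a richer API and parallel enactment).

def numbers(n):
    # Source stage: emit a stream of integers.
    for i in range(n):
        yield i

def square(stream):
    # Transform stage: consume one stream, produce another.
    for item in stream:
        yield item * item

def total(stream):
    # Sink stage: reduce the stream to a single value.
    return sum(stream)

# Stages are connected by streams; no intermediate I/O is needed.
result = total(square(numbers(5)))
print(result)  # 0 + 1 + 4 + 9 + 16 = 30
```

Because each stage pulls items lazily from the previous one, the whole pipeline runs in constant memory regardless of stream length.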

  6. dispel4py basic concepts – Processing element
  • PEs represent the basic computational unit: a data transformation, scientific method or service request
  • PEs are the “Lego bricks” of tasks, and users can assemble them into a workflow as they wish
  • General PE features: consumes any number and types of input streams; produces any number and types of output streams

  7. dispel4py basic concepts – Instance and graph
  Graph:
  • Topology of the workflow: connections between PEs
  • Users focus on the algorithm to implement or the service to use
  [Diagram: example topologies – pipeline, split & merge, tree]

  8. dispel4py basic concepts – Instance and graph
  PE instance:
  • An executable copy of a PE that runs in a process
  • Each PE is translated into one or more instances at run-time
  [Diagram: pipeline example]

  9. dispel4py basic concepts – Groupings
  • “Grouping by” a feature (as in MapReduce)
  • All data items that share the same feature value are guaranteed to be delivered to the same instance of a PE
  [Diagram: items p1–p3 tagged with timestamps t=10:00 and t=11:00 are routed from P2 instances to P3 instances according to timestamp]

  10. dispel4py basic concepts – Groupings
  • One-to-all – grouping “all”: P2 instances send copies of their output data to all the connected P3 instances
  • Global – grouping “global”: all the instances of P2 send all their data to one instance of P3
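“Grouping by” a feature behaves like hash partitioning: every item with the same key is routed to the same PE instance. A minimal sketch under that assumption (the instance count, keys and routing function are made up for illustration; dispel4py's actual routing is handled by the enactment engine):

```python
def target_instance(key, n_instances):
    # All data items sharing the same grouping key are routed to the
    # same PE instance (illustrative hash-partitioning scheme).
    return hash(key) % n_instances

n = 4  # hypothetical number of P3 instances
items = [("t=10:00", "p1"), ("t=10:00", "p2"), ("t=11:00", "p3")]

routed = {}
for key, payload in items:
    routed.setdefault(target_instance(key, n), []).append(payload)

# Both items with key "t=10:00" land on the same instance,
# which is exactly the MapReduce-style guarantee on the slide.
```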

  11. dispel4py basic concepts – Composite PE and partition
  Composite PE:
  • A sub-workflow in a PE
  • Hides the complexity of an underlying process
  • Treated like any other PE

  12. dispel4py basic concepts – Composite PE and partition
  Partition:
  • PEs wrapped together
  • Runs several PEs in a single process

  13. dispel4py basic concepts – Example of a dispel4py workflow
  Users only have to implement the PEs and the connections:

    from dispel4py.workflow_graph import WorkflowGraph

    # PE objects
    pe1 = FilterTweet()
    pe2 = CounterHashTag()
    pe3 = CounterLanguage()
    pe4 = Statistics()

    # Graph
    graph = WorkflowGraph()

    # Connections
    graph.connect(pe1, 'hash_tag', pe2, 'input')
    graph.connect(pe1, 'language', pe3, 'input')
    graph.connect(pe2, 'hash_tag_count', pe4, 'input1')
    graph.connect(pe3, 'language_count', pe4, 'input2')

  14. dispel4py basic concepts – Example of a PE
  Users only have to implement the PEs and the connections:

    import json
    import re

    class FilterTweet(GenericPE):
        def __init__(self):
            # Declare inputs & outputs
            GenericPE.__init__(self)
            self.add_output('hash_tags')
            self.add_output('language')

        def process(self, inputs):
            # Logic of the PE
            twitterData = inputs['input']
            for line in twitterData:
                tweet = json.loads(line)
                language = tweet[u'lang'].encode('utf-8')
                text = tweet[u'text'].encode('utf-8')
                hashtags = re.findall(r"#(\w+)", text)
                # Stream out data
                self.write('hash_tags', hashtags)
                self.write('language', language)

  15. dispel4py advanced concepts – Mappings
  • Sequential: mapping for local testing; ideal for local resources (laptops and desktops)
  • Multiprocessing: Python’s multiprocessing library; ideal for shared-memory resources
  • MPI: distributed-memory, message-passing parallel programming model; ideal for HPC clusters
  • Storm: distributed real-time computation system; fault-tolerant and scalable; runs all the time
  • Spark (prototype)
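The point of multiple mappings is that the same logical workflow runs unchanged on different enactment engines. A rough analogy in plain Python (not the dispel4py mapping API): the same pipeline of stages executed sequentially and with the standard-library multiprocessing module yields identical results; the stage names are made up:

```python
import multiprocessing

def prepare(x):
    # Hypothetical preprocessing stage.
    return x + 1

def analyse(x):
    # Hypothetical analysis stage.
    return x * x

def pipeline(x):
    # The "workflow": a fixed composition of stages.
    return analyse(prepare(x))

data = list(range(8))

# "Sequential" mapping: run in one process, for local testing.
sequential = [pipeline(x) for x in data]

# "Multiprocessing" mapping: same workflow, shared-memory parallelism.
if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        parallel = pool.map(pipeline, data)
    assert parallel == sequential
```

The workflow definition is untouched in both cases; only the enactment changes, which is the property the slide attributes to dispel4py's mappings.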

  16. dispel4py advanced concepts – Provenance
  • Users can select which metadata to store
  • Searches over product metadata within and across runs
  • Data download and preview
  • Capturing of errors for diagnostic purposes
  • Data fabric: multi-directional navigation across data dependencies
  • W3C PROV-DM as reference model
  [Diagram: source and end PEs feeding a provenance web API]
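The W3C PROV-DM reference model named on the slide boils down to entities (data products), activities (executions), and relations between them. A minimal, library-free sketch of such records with searchable metadata; all identifiers and field names here are illustrative, not dispel4py's actual provenance schema:

```python
# Illustrative PROV-DM-style records; real dispel4py provenance is richer
# (user-selected metadata, error capture, cross-run search, data fabric).
provenance = []

def record(rec_type, **attrs):
    # Append one provenance record (entity / activity / relation).
    rec = {"type": rec_type, **attrs}
    provenance.append(rec)
    return rec

record("entity", id="data:trace42", station="STA01")        # a data product
record("activity", id="run:preprocess-7", pe="PreprocessPE") # a PE execution
record("wasGeneratedBy", entity="data:trace42",              # the relation
       activity="run:preprocess-7")

# Search over product metadata, as the slide describes:
hits = [r for r in provenance
        if r["type"] == "entity" and r.get("station") == "STA01"]
```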

  17. VERCE project
  • The VERCE project provides a framework for the seismological community to exploit the increasingly large volume of seismological data:
  • Supports data-intensive and HPC applications
  • e-Science gateway for submitting applications
  • Distributed and diversified data sources
  • Distributed HPC resources on grid, cloud and HPC clusters
  • Use cases for dispel4py:
  • Seismic noise cross-correlation
  • Misfit calculation

  18. dispel4py workflows – Seismology, cross-correlation
  • A data-intensive problem, commonly used in seismology
  • Phase 1 – Preprocess: time-series data (traces) from seismic stations are preprocessed in parallel
  • Phase 2 – Cross-correlation: pairs all of the stations and calculates the cross-correlation for each pair (complexity O(n²))
  • Input data: a list of 1000 stations (150MB); output data: 499,500 cross-correlations (39GB)
  [Diagram: read trace → preprocess → pair → xcorr → write results. Phase 1 is a composite PE that prepares the trace from a single seismometer (decimate, filter, whiten, detrend, demean, normalise, FFT, remove response); Phase 2 is the cross-correlation pipeline]
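The O(n²) pairing arithmetic on the slide checks out: with n = 1000 stations there are n(n−1)/2 = 499,500 unique pairs. A small sketch of the pairing step with a toy lag-zero correlation (the real workflow operates on full preprocessed seismic traces; station names and data here are made up):

```python
import itertools
import math

# Pair count for the slide's input: 1000 stations.
n_stations = 1000
pairs = math.comb(n_stations, 2)  # n * (n - 1) / 2
print(pairs)  # 499500, matching the 499,500 cross-correlations on the slide

def xcorr0(a, b):
    # Toy cross-correlation at lag 0: the dot product of two traces.
    return sum(x * y for x, y in zip(a, b))

# Tiny stand-in for preprocessed traces keyed by station.
stations = {"STA01": [1, 2, 3], "STA02": [0, 1, 0], "STA03": [2, 2, 2]}
results = {
    (s1, s2): xcorr0(stations[s1], stations[s2])
    for s1, s2 in itertools.combinations(stations, 2)
}
# 3 stations -> 3 unique pairs.
```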

  19. dispel4py workflows – Seismology, misfit computation
  • Phase 1 – Preprocess: align and prepare traces
  • Phase 2 – Misfit: compare synthetic and observed data

  20. dispel4py workflows – Misfit visualisation

  21. Evaluations – Computing resources

    Resource           Terracorrelator    SuperMUC         Amazon EC2         EDIM1
    Type               Shared-memory      Cluster          Cloud              Cloud
    Enactment systems  MPI, multi         MPI, multi       MPI, Storm, multi  MPI, Storm, multi
    Nodes              1                  16               18                 14
    Cores per node     32                 16               2                  4
    Total cores        32                 256              36                 14
    Memory             2TB                32GB             4GB                3GB
    Workflows          xcorr, int_ext,    xcorr,           xcorr              xcorr, int_ext,
                       sentiment          sentiment                           sentiment

  22. Evaluations – Performance measures

    xcorr: 1000 stations, input 150MB, output 39GB (times in seconds)

    Mode   Terracorrelator (32 cores)  SuperMUC (256 cores)  Amazon EC2 (36 cores)  EDIM1 (14 cores, 4 shared)
    MPI    1501.32 (~25 min)           1093.16 (~19 min)     16862.73 (~5 h)        38656.94 (~11 h)
    multi  1332.20 (~23 min)           –                     –                      –
    Storm  –                           –                     27898.89 (~8 h)        120077.123 (~33 h)

    int_ext: 1050 galaxies

    Mode   Terracorrelator (32 cores)  EDIM1 (14 cores, 4 shared)
    MPI    31.60                       96.12
    multi  14.50                       101.2
    Storm  –                           30.2
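The minute and hour figures in parentheses follow directly from the wall-clock times, and the relative performance of the systems can be computed the same way. A quick check using the xcorr MPI row of the table above:

```python
# Wall-clock times in seconds for xcorr under MPI, from the evaluation table.
times = {
    "Terracorrelator (32 cores)": 1501.32,
    "SuperMUC (256 cores)": 1093.16,
    "Amazon EC2 (36 cores)": 16862.73,
    "EDIM1 (14 cores)": 38656.94,
}

baseline = times["Terracorrelator (32 cores)"]
for system, t in times.items():
    # Minutes elapsed, and speed relative to the 32-core baseline.
    print(f"{system}: {t / 60:.1f} min, "
          f"{baseline / t:.2f}x relative to Terracorrelator")
```

For instance, 1501.32 s is ~25 minutes, matching the table, and SuperMUC at 256 cores is about 1.37x faster than the 32-core Terracorrelator on this workflow.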

  23. Current work
  • Diagnosis tool: how to partition the workflow automatically; how many processes execute each partition
  • Run-time stream-adaptive compression

  24. dispel4py – Monitoring

  25. Conclusions and future work
  • Python library for streaming and data-intensive processing
  • Users express their computational activities once
  • The same workflow can be executed on several parallel systems
  • Easy to use and open
  • Future:
  • Support for PE failures
  • Select the best computing resource and mapping
