
Parallelization techniques: Applying Map, Reduce and Cross concepts

Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors. Ilkay ALTINTAS, Ph.D., Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD; Lab Director, Scientific Workflow Automation Technologies


  1. Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors
     Ilkay ALTINTAS, Ph.D.
     Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
     Lab Director, Scientific Workflow Automation Technologies
     altintas@sdsc.edu
     bioKepler.org
     bioKepler - September, 2012

  2. What is Parallelization?

  3. What is Parallelization?

  4. Distributed Computing Environments
     Figure 1: Grids and Clouds Overview. From: “Cloud Computing and Grid Computing 360-Degree Compared”, Ian Foster, Yong Zhao, Ioan Raicu, Shiyong Lu. Grid Computing Environments Workshop (GCE), 2008.
     Grid Computing aims to “enable resource sharing and

  5. Parallelization Solutions in Distributed Environments
     • Traditional parallel programming interfaces
       – Examples: MPI and OpenMP
       – Hard to implement
       – Original sequential tools cannot be reused
     • Parallel job execution
       – Examples: SGE and Condor
       – Original sequential tools can be reused
       – Create small jobs by splitting data or tasks
       – Hard to achieve data locality for each job
     • Data parallel job execution
       – Examples: Hadoop and Stratosphere
       – Original sequential tools can be reused
       – Support customized and automatic data partitioning and distribution
       – Support data locality for each job through a specialized distributed file system, HDFS

  6. Data Parallel Task Execution
     • Static executables run as processes
     • Independent data items are assigned to processes
     Diagram: four processes, each assigned two data items (P1: D1, D5; P2: D2, D6; P3: D3, D7; P4: D4, D8)
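The assignment shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function names `assign_round_robin` and `run_static_task` are hypothetical, the round-robin rule is inferred from the diagram (D1 and D5 on P1, and so on), and threads stand in for the processes.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_round_robin(items, num_processes):
    """Assign independent data items to processes in round-robin order."""
    partitions = [[] for _ in range(num_processes)]
    for index, item in enumerate(items):
        partitions[index % num_processes].append(item)
    return partitions


def run_static_task(partition):
    """Hypothetical stand-in for a static executable: handles each item on its own."""
    return [item.lower() for item in partition]


data_items = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"]
partitions = assign_round_robin(data_items, num_processes=4)

# Each worker receives one partition; no worker needs another worker's data.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_static_task, partitions))

for number, (partition, result) in enumerate(zip(partitions, results), start=1):
    print(f"P{number}: {partition} -> {result}")
```

Because the items are independent, the same task code runs unchanged on every partition, which is what allows an existing sequential tool to be reused.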

  7. Distributed Data Parallel (DDP) Task Execution
     • Static executables run as processes on distributed environments
     • Independent data items are assigned to processes
     Diagram: processes P1 to P4 and data items D1 to D8, spread across distributed nodes

  8. MapReduce: A Typical DDP Execution Pattern
     • Chop the data (value) based on a feature of interest (key)
     • Iterate a function on each value
     • Order the intermediate data products (intermediate value)
     • Stitch the intermediate values
     • Can execute using a specialized engine
       – Examples: Hadoop and Nephele
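The four steps on this slide (chop, iterate, order, stitch) can be sketched as a small in-memory pipeline. This is only an illustration of the pattern under stated assumptions: the sample records, the key choice (a sample name) and the helper names `map_phase`, `shuffle_phase` and `reduce_phase` are all hypothetical, and a real engine such as Hadoop would distribute each phase across nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Iterate a function on each (key, value) record."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    return intermediate


def shuffle_phase(pairs):
    """Order the intermediate values by grouping them under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reduce_fn):
    """Stitch each key's intermediate values into one result."""
    return {key: reduce_fn(key, values) for key, values in groups}


def read_length(key, value):
    """Map function: emit the length of one read under its sample key."""
    return [(key, len(value))]


def total_length(key, values):
    """Reduce function: sum all read lengths for one sample."""
    return sum(values)


# Data chopped into (key, value) records: key = sample name, value = read.
records = [("sampleA", "ACGT"), ("sampleB", "GG"), ("sampleA", "TTGCA")]

intermediate = map_phase(records, read_length)
grouped = shuffle_phase(intermediate)
totals = reduce_phase(grouped, total_length)
print(totals)  # {'sampleA': 9, 'sampleB': 2}
```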

  9. Many Other DDP Patterns
     Images taken from: http://www.stratosphere.eu

  10. Distributed Data-Parallel bioActors
     • Set of steps to execute a bioinformatics tool in a DDP environment
     • Customized from the ExecutionChoice actor
     • Includes:
       – Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping
       – I/O to interface with storage
       – Data format specifying how to split and join

  11. A Workflow with Three bioActors
     BLASTALL

  12. Configuring the BLASTALL bioActor

  13. Inside the LocalExecution Tab
     External Execution

  14. Inside the MapReduce Tab
     Stratosphere Blast

  15. Inside the MapReduce Tab

  16. BLASTALL with MapReduce

  17. Inside the Stratosphere Blast

  18. DDP BLAST Workflow via Splitting Query Sequences
     • Switch director to work with other DDP engines, such as Hadoop
     • Execute with data partition
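Splitting the query sequences can be sketched as follows. This is a rough illustration, not the bioKepler workflow itself: `parse_fasta`, `split_queries` and `placeholder_search` are hypothetical helpers, the input sequences are made up, and the search function only counts exact substring matches as a stand-in for running BLAST on one partition.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (query_id, sequence) records."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        else:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def split_queries(records, num_partitions):
    """Split whole query records into at most num_partitions chunks."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def placeholder_search(partition, reference):
    """Stand-in for BLAST (assumption): count exact matches of each query."""
    return [(query_id, reference.count(seq)) for query_id, seq in partition]


fasta_text = ">q1\nACG\n>q2\nTT\n>q3\nGGA\n"
reference = "ACGTTACGGGA"

partitions = split_queries(parse_fasta(fasta_text), num_partitions=2)

# Map: run the same search on every query partition.
partial_results = [placeholder_search(part, reference) for part in partitions]

# Reduce: merge the per-partition outputs into one result list.
merged = sorted(hit for result in partial_results for hit in result)
print(merged)  # [('q1', 2), ('q2', 1), ('q3', 1)]
```

Partitions are cut at record boundaries so that no query sequence is split between two executions.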

  19. DDP BLAST Workflow using Cross and Reduce
     • Same reduce sub-workflow as in the Map workflow
     • Reference data partition for each execution
     • Query data partition for each execution
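The Cross pattern pairs every query partition with every reference partition, and Reduce then merges the per-pair outputs. The sketch below illustrates that grouping only: the helper names, the toy sequences and the exact-match count used in place of BLAST are assumptions, and it presumes each reference partition holds whole reference sequences.

```python
from collections import defaultdict
from itertools import product


def cross(query_partitions, reference_partitions):
    """Pair every query partition with every reference partition."""
    return list(product(query_partitions, reference_partitions))


def search_pair(queries, reference_part):
    """Stand-in for one BLAST execution (assumption): count exact matches."""
    ref_name, ref_seq = reference_part
    return [(query_id, (ref_name, ref_seq.count(seq))) for query_id, seq in queries]


def reduce_by_query(pair_results):
    """Merge the hits from all executions, grouped by query id."""
    merged = defaultdict(dict)
    for results in pair_results:
        for query_id, (ref_name, count) in results:
            merged[query_id][ref_name] = count
    return dict(merged)


query_partitions = [[("q1", "ACG")], [("q2", "TT")]]
reference_partitions = [("ref1", "ACGTT"), ("ref2", "TTACG")]

pairs = cross(query_partitions, reference_partitions)
pair_results = [search_pair(queries, ref) for queries, ref in pairs]
hits = reduce_by_query(pair_results)
print(len(pairs))  # 4 executions: 2 query partitions x 2 reference partitions
print(hits)
```

Each of the four executions sees one query partition and one reference partition, so both inputs can be larger than a single node could hold at once.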

  20. What if the bioActor I need is not available?
     ExecutionChoice

  21. DDP bioActor Usage Model
     Diagram: a user (workflow developer) working with the bioActor Library (A1, A2, ..., An), a DDP Director and a workflow. Step labels in the diagram:
     1. Search
     2a. Choose Specific DDP (e.g., DDP Blast)
     2b. Choose Generic DDP
     2b. Create Sub-Workflow
     3. Add to Workflow
     4a. Execute (Results)
     4b. Add to Larger Workflow
     4c. Save in Library

  22. NEXT: Kepler Interface and Introductory Examples on Using Kepler
     Daniel Crawl
     1st Workshop on bioKepler Tools and Its Applications
