
Time-Cost Trade-offs of Pipelined Dataflow Applications



  1. Time-Cost Trade-offs of Pipelined Dataflow Applications
     Jonathan Kho (1), Erik Saule (4), Anas AbuDoleh (1), Xusheng Wang (1), Hao Ding (2), Kun Huang (1), Raghu Machiraju (1,2), Ümit V. Çatalyürek (1,3)
     (1) Biomedical Informatics, The Ohio State University
     (2) Computer Science and Engineering, The Ohio State University
     (3) Electrical and Computer Engineering, The Ohio State University
     (4) Computer Science, UNC Charlotte
     Aussois 2016
     Footer: Erik Saule (UNC Charlotte), Time-Cost in Pipelined Applications, Aussois 2016

  2. Featuring the three wise men (who never heard of this talk): Pierre-Francois, Loris, Guochuan

  3. Context
     - Cloud computing promises on-demand resources.
     - Different types of computing resources are available.
     - Arbitrary speedups are in principle possible.
     - The catch is that you have to pay for the resources you use.
     - The problem becomes a tradeoff between the runtime of an application and the cost of executing it.
     In this presentation, we show that the pipelined dataflow abstraction is good for expressing this tradeoff because runtime is “easy” to predict. We use a particular imaging application to exemplify the technique.

  4. Feature Extraction from Histopathological Slides
     [Figure: processing pipeline with the labels Biopsy slides, Preprocess, Non-blank patch extraction, SuperPixel segmentation, LBP feature.]
     - Slide sizes vary, on the order of 100k × 100k pixels.
     - Aperio format with thumbnail (about 1GB/file, 24GB uncompressed).
     - Available in a public repository (TCGA) with samples from thousands of participants, 3 slides per patient.
     - Can be used to predict whether the biopsy is cancerous.
     - We will consider two instances: twoparticipants (2 participants) and allslides (42 participants).

  5. Outline
     1. Introduction
     2. Predicting Runtime of Pipelined Dataflow Application
     3. A Flowshop Problem
     4. Time-Cost Tradeoff
     5. Conclusion

  6. Pipelined workflow
     Advantages:
     - Sequential processes
     - Heterogeneous
     - Replication for throughput
     - Comm/Comp overlap
     Application:
     - Medical imaging
     - Stock option pricing
     - Synthetic Aperture Radar
     - Incremental graph algorithm
     Layout:
     [Figure: dataflow layout with the stages Reader, Partitioner, Image analysis (Segmentation), LBP Features. Reader discards background tiles.]
     Placement:
     [Figure: placement on Node 0 (1 CPU Core), Node 1 (1 CPU Core), Node 2 (4 CPU Cores), Node 3 (4 CPU Cores + 1 GPU).]

  7. How to predict runtime?

  8. How to predict runtime?
     In a pipelined system, what matters is the steady state! The throughput is given by the most loaded node.

  9. Runtime in a (simple) pipelined dataflow model
     Model:
     - An application of $M$ stages
     - $J$ identical jobs
     - Stage $i$ processes a job in $p_i$
     One-to-one mapping:
     - With one processor per stage
     - The execution is constrained by the slowest stage
     - Period $P = \max_i p_i$
     - Throughput $T = \frac{1}{P}$
     [Figure: Gantt chart of stages S1, S2, S3 over time 1 to 8 for $p_1 = 1$, $p_2 = 2$, $p_3 = 1.5$, so $P = p_2$. Annotations: $p_1$, $p_3$, $JP$, and the span $JP + (\sum_i p_i - P)$.]
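The one-to-one model can be checked numerically with a minimal sketch like the one below. It uses the stage times from the slide's example; the job count of 3 and the function name `pipeline_runtime` are assumptions made for illustration only.

```python
def pipeline_runtime(p, jobs):
    """Runtime of `jobs` identical jobs on a one-to-one mapped pipeline.

    p[i] is the time stage i spends on one job (one processor per stage).
    """
    period = max(p)                   # P = max_i p_i, set by the slowest stage
    throughput = 1.0 / period         # T = 1 / P
    fill = sum(p) - period            # time to fill and drain the pipeline
    makespan = jobs * period + fill   # J * P + (sum_i p_i - P)
    return period, throughput, makespan


# Stage times from the slide; the number of jobs (3) is an assumption.
period, throughput, makespan = pipeline_runtime([1.0, 2.0, 1.5], jobs=3)
print(period, throughput, makespan)
```

For large job counts the fill term becomes negligible, which is why the steady-state period dominates the prediction.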

  10. Runtime in a (more complex) pipelined dataflow model
      Replication:
      - In some applications it is possible to replicate some stages to increase the throughput.
      - If stage $i$ is replicated $r_i$ times, it processes at a rate $\tau_i = \frac{r_i}{p_i}$.
      - Throughput $T = \min_i \tau_i$
      - Period $P = \frac{1}{T}$

  11. Runtime in a (more complex) pipelined dataflow model
      Replication:
      - In some applications it is possible to replicate some stages to increase the throughput.
      - If stage $i$ is replicated $r_i$ times, it processes at a rate $\tau_i = \frac{r_i}{p_i}$.
      - Throughput $T = \min_i \tau_i$
      - Period $P = \frac{1}{T}$
      Heterogeneity:
      - In some applications it is possible to replicate on different systems.
      - If stage $i$ is replicated on a CPU and a GPU, it processes at a rate $\tau_i = \frac{1}{p_i^{cpu}} + \frac{1}{p_i^{gpu}}$.
      - Throughput $T = \min_i \tau_i$
      - Period $P = \frac{1}{T}$

  12. Runtime in a (more complex) pipelined dataflow model
      Replication:
      - In some applications it is possible to replicate some stages to increase the throughput.
      - If stage $i$ is replicated $r_i$ times, it processes at a rate $\tau_i = \frac{r_i}{p_i}$.
      - Throughput $T = \min_i \tau_i$
      - Period $P = \frac{1}{T}$
      Heterogeneity:
      - In some applications it is possible to replicate on different systems.
      - If stage $i$ is replicated on a CPU and a GPU, it processes at a rate $\tau_i = \frac{1}{p_i^{cpu}} + \frac{1}{p_i^{gpu}}$.
      - Throughput $T = \min_i \tau_i$
      - Period $P = \frac{1}{T}$
      These two techniques combine!
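As a rough illustration of how replication and heterogeneity combine, the sketch below sums the rates of all units serving a stage and lets the slowest stage act as the bottleneck, which agrees with $P = \max_i p_i$ in the one-to-one case. The function names and the example stage times are hypothetical and are not taken from the talk.

```python
def stage_rate(unit_times):
    """Rate of one stage served by several processing units.

    unit_times lists the per-job time of each unit assigned to the stage:
    r identical replicas are r equal entries, a CPU plus a GPU are two
    different entries. The rates add up: tau_i = sum over units of 1 / p.
    """
    return sum(1.0 / t for t in unit_times)


def pipeline_throughput(stages):
    """Return (throughput, period) for a list of stages."""
    rates = [stage_rate(units) for units in stages]
    throughput = min(rates)          # the bottleneck stage limits the pipeline
    return throughput, 1.0 / throughput


# Hypothetical 3-stage pipeline: one fast unit, three identical replicas,
# and one stage shared by a slow unit (CPU-like) and a fast unit (GPU-like).
stages = [[0.5], [2.0, 2.0, 2.0], [1.5, 0.5]]
throughput, period = pipeline_throughput(stages)
print(throughput, period)
```

Representing each stage as a list of per-unit times covers both cases with one formula, since identical replicas are simply repeated entries.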

  13. Experimental settings and model calibration
      Machine:
      - 32-node cluster
      - Two Xeon E5520 (quad core)
      - An NVIDIA C2050
      - DDR 4x InfiniBand
      Software:
      - g++ 4.8.1
      - mvapich2 2.2
      - DataCutter (dcmpi)
      - OpenSlide 3.4.1
      - ImAn: gSLIC
      - nvcc 7.0.27
      Tile prediction based on thumbnail:
      [Figure: Total and Valid tile counts (0 to 400) per Slide Index (1 to 109).]
      Model Calibration:
      Slide                    Filesize  Width    Height  Estimated Total Tiles  Estimated Valid Tiles
      TCGA-BH-A18V-01A-01-TSA  432.93MB  98,631   33,244  225                    78
      TCGA-BH-A18J-01A-01-TSA  322.01MB  112,037  29,845  224                    75
      CPU / GPU                   Proc. Time  Local τ_IA   Average τ_IA  Speedup
      NVIDIA Tesla C2050          447.41 s    2.924M px/s  2.953M px/s   1
                                  422.03 s    2.981M px/s
      Intel Xeon E5520 (7 cores)  399.11 s    3.278M px/s  3.299M px/s   1.117
                                  378.83 s    3.321M px/s
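The averages and the speedup in the calibration table can be reproduced approximately with the short check below. It assumes that the average rate is the arithmetic mean of the two local rates and that the speedup is the ratio of the average rates; the slide does not state either definition, so both are assumptions.

```python
# Local rates of the image analysis stage in M px/s, copied from the table.
local_rates = {
    "NVIDIA Tesla C2050": [2.924, 2.981],
    "Intel Xeon E5520 (7 cores)": [3.278, 3.321],
}

# Assumption: the "Average" column is the arithmetic mean of the local rates.
average = {device: sum(r) / len(r) for device, r in local_rates.items()}

# Assumption: the "Speedup" column is relative to the GPU average rate.
baseline = average["NVIDIA Tesla C2050"]
speedup = {device: rate / baseline for device, rate in average.items()}

for device in local_rates:
    print(f"{device}: {average[device]:.4f} M px/s, speedup {speedup[device]:.3f}")
```

Under these assumptions the values agree with the table to within rounding.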

  14. A first log
      [Figure: number of Tiles (0 to 1200) against Walltime (s) (0 to 500); curves Total, Read, White, Valid, Analyzed; annotations MRT and MAT.]
      1 Reader. 3 GPUs. Two Patients. Natural ordering. (Eventually ImAn idles because too many White tiles are read.)

  15. How to fix this?

  16. How to fix this?
      The Valid tiles are more computationally expensive than the White ones. Valid first should work fine!

  17. Valid First does not always work
      [Figure: number of Tiles (0 to 16000) against Walltime (s) (0 to 10000); curves Total, Read, White, Valid, Analyzed; annotations MRT and MAT.]
      1 Reader. 2 GPUs. All Slides. Valid First. (The system has bounded memory and eventually Reader stalls.)

  18. Outline
      1. Introduction
      2. Predicting Runtime of Pipelined Dataflow Application
      3. A Flowshop Problem
      4. Time-Cost Tradeoff
      5. Conclusion

  19. Flowshop
      Deciding which job to process next is, in its simplest form, a Flowshop problem.
      Model:
      - $M$ stages
      - $J$ jobs
      - Job $j$ in stage $i$ takes time $p_{i,j}$
      - Order the jobs to minimize the makespan
      Bad News:
      - NP-Complete in this form
      - That is actually an abstraction of the real problem
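To make the flowshop model concrete, the sketch below evaluates the makespan of a given job order and finds the best order by exhaustive search. It assumes a permutation schedule (every stage processes jobs in the same order) and unbounded buffers between stages, which is a simplification of the real system. The processing times and the function names are hypothetical.

```python
from itertools import permutations


def flowshop_makespan(p, order):
    """Makespan of processing jobs in `order` through all stages.

    p[i][j] is the time job j needs in stage i.
    """
    done = [0.0] * len(p)            # done[i]: when stage i finished its last job
    for j in order:
        prev = 0.0                   # completion time of job j in the previous stage
        for i in range(len(p)):
            start = max(done[i], prev)   # wait for the stage and for the job
            done[i] = start + p[i][j]
            prev = done[i]
    return done[-1]


def best_order(p):
    """Exhaustive search over all job orders; only viable for a few jobs."""
    jobs = range(len(p[0]))
    return min(permutations(jobs), key=lambda order: flowshop_makespan(p, order))


# Hypothetical instance: 2 stages, 3 jobs.
p = [[3, 1, 2],
     [1, 3, 2]]
print(flowshop_makespan(p, (0, 1, 2)))
order = best_order(p)
print(order, flowshop_makespan(p, order))
```

The search space grows as $J!$, so exhaustive search only serves to show why a cheaper ordering rule is needed for thousands of tiles.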

  20. How to make the problem computationally simpler?

  21. How to make the problem computationally simpler?
      Since you have categories of jobs, the $p_{i,j}$ matrix is actually low rank. That helped in $R||C_{\max}$. Maybe it helps here?

  22. Interleave schedule
      Insight: we have
      - $C$ categories of jobs
      - $J_c$ jobs in category $c$
      - The $J_c$ are large numbers
      - Sounds like something cyclic should work
      Algorithm:
      - Build $k$ batches with $s_c = \frac{J_c}{k}$ jobs of category $c$
      Asymptotic optimality:
      - Each batch can be seen as a meta job in a one-to-one mapping.
      - When $k$ goes to infinity, the makespan of the flowshop problem converges to the optimal value of the pipelined scheduling problem.
      - So with lots of jobs, performance is good.
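A minimal sketch of the batching step is given below. It assumes that every $J_c$ is divisible by $k$ and that jobs inside a batch are grouped category by category; the talk does not specify the order inside a batch, so that choice, the function name, and the example counts are assumptions.

```python
def interleave_schedule(jobs_per_category, k):
    """Return a job order, as a list of category labels, made of k equal batches.

    jobs_per_category maps a category label c to its job count J_c.
    Each batch holds s_c = J_c / k jobs of every category c.
    """
    batch = []
    for category, count in jobs_per_category.items():
        if count % k != 0:
            # Simplifying assumption of this sketch: exact divisibility.
            raise ValueError(f"{count} jobs of {category!r} not divisible by {k}")
        batch.extend([category] * (count // k))
    return batch * k                 # the same batch repeated k times


# Hypothetical instance: 4 Valid tiles and 8 White tiles in 4 batches.
order = interleave_schedule({"Valid": 4, "White": 8}, k=4)
print(order)
```

Larger values of $k$ give smaller batches, which keeps every stage fed with a steady mix of job categories.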

  23. Dismissed Constraints
      Low-Rank:
      - Categories and low-rank are slightly different (low rank admits linear combinations of categories).
      - Low-rank can be solved by some weighted interleave schedule.
      Divisibility:
      - The number of jobs might be prime, but rational approximation works just fine.
      Heterogeneity:
      - Called the hybrid problem in the flowshop world.
      - Heterogeneity just makes different $p_{i,j}$.
      Communication:
      - Often modeled as an additional stage of processing.
      Blocking Writes:
      - As long as one batch does not saturate memory, pipelining will happen gracefully.
      Onlineness:
      - Non-clairvoyance can be solved with random ordering.
