Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications




  1. Coordinating the Use of GPU and CPU for Improving Performance of Compute Intensive Applications. George Teodoro 1, Rafael Sachetto 1, Olcay Sertel 2, Metin Gurcan 2, Wagner Meira Jr. 1, Umit Catalyurek 2, Renato Ferreira 1. 1. Federal University of Minas Gerais, Brazil; 2. The Ohio State University, US. IEEE Cluster 2009

  2. Motivation • High performance computing – Large clusters of off-the-shelf components – Multi-core/Many-core – GPGPU • Massively parallel • High speedups compared to the CPU

  3. Motivation • But... the GPU is not faster in all scenarios • Current frameworks – Assume exclusive use of the GPU or the CPU

  4. Goal • Target heterogeneous environments – Multiple CPU cores/GPUs – Distributed environments • Efficient coordination of the devices – Scheduling tasks according to their specificities • High level programming abstraction

  5. Outline • Anthill • Supporting heterogeneous environments • Experimental evaluation • Conclusions

  6. Anthill • Based on the filter-stream model (DataCutter) – Application decomposed into a set of filters – Communication using streams – Transparent instance copies – Data flow – Multiple dimensions of parallelism • Task parallelism • Data parallelism

  7. Anthill – [diagram: filters A, B, and C connected by streams]
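The A, B, C pipeline above can be pictured as three concurrent filters that exchange data only through streams. The sketch below is a minimal, self-contained C++ illustration of that filter-stream idea; the Stream class and the thread layout are illustrative assumptions, not Anthill's actual API.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A stream modeled as an unbounded thread-safe FIFO channel.
template <typename T>
class Stream {
public:
    void put(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    // An empty optional signals end of stream.
    std::optional<T> get() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    Stream<int> ab, bc;   // streams A -> B and B -> C

    std::thread A([&] {   // filter A: produces work items
        for (int i = 0; i < 8; ++i) ab.put(i);
        ab.close();
    });
    std::thread B([&] {   // filter B: transforms each item
        while (auto v = ab.get()) bc.put(*v * *v);
        bc.close();
    });
    std::thread C([&] {   // filter C: consumes the results
        while (auto v = bc.get()) std::printf("result: %d\n", *v);
    });

    A.join(); B.join(); C.join();
    return 0;
}
```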

  8. Filter programming abstraction • Event-driven interface – Aligned with the data flow model • The user provides data processing functions to be invoked upon availability of data • The system controls invocation of the user functions – Dependency analysis – Parallelism

  9. Event handlers • User-provided functions • Operate on data objects – Update the filter state (global) – May trigger communication – Return after processing the data element • Invoked automatically when data is available – And dependencies are met
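Slides 8 and 9 describe the shape of these handlers. Below is a rough, hedged C++ sketch of what such a user-provided event handler might look like; the Event and FilterState types and the handle_tile name are hypothetical, since the slides do not show Anthill's real interface, and the loop in main stands in for the run-time that would invoke the handler automatically.

```cpp
#include <cstdio>
#include <vector>

// One data element arriving on the filter's input stream.
struct Event {
    int tile_id;
    std::vector<float> pixels;
};

// Global (per filter instance) state that handlers may update.
struct FilterState {
    int processed = 0;
};

// User-provided handler: operates on one data object, updates the
// filter state, and returns after the element has been processed.
// A real filter could also forward results downstream here.
void handle_tile(FilterState& state, const Event& ev) {
    float sum = 0.0f;
    for (float p : ev.pixels) sum += p;
    ++state.processed;
    std::printf("tile %d, mean %.3f\n", ev.tile_id,
                ev.pixels.empty() ? 0.0f : sum / ev.pixels.size());
}

int main() {
    // The run-time would invoke the handler automatically whenever data
    // is available and its dependencies are met; this loop stands in
    // for that machinery.
    FilterState st;
    for (int i = 0; i < 3; ++i)
        handle_tile(st, Event{i, std::vector<float>(16, float(i))});
    std::printf("events handled: %d\n", st.processed);
    return 0;
}
```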

  10. Supporting heterogeneous resources • Event handlers implemented for multiple devices – Each filter may be implemented targeting the appropriate device • Multiple devices used in parallel • The Anthill run-time chooses the device for each event
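One way to read slide 10 is that a filter registers one implementation of its handler per device and the run-time dispatches each event to whichever implementation it picks. The C++ sketch below illustrates only that registration pattern; Device, register_handler, and the stubbed GPU path are assumptions for illustration, not the Anthill API.

```cpp
#include <cstdio>
#include <functional>
#include <map>

enum class Device { CPU, GPU };

using Handler = std::function<void(int /*tile_id*/)>;

// One implementation of the same event handler per device.
std::map<Device, Handler> handlers;

void register_handler(Device d, Handler h) { handlers[d] = std::move(h); }

int main() {
    register_handler(Device::CPU, [](int id) {
        std::printf("tile %d processed on the CPU\n", id);
    });
    register_handler(Device::GPU, [](int id) {
        // A real GPU implementation would launch a CUDA kernel here.
        std::printf("tile %d processed on the GPU\n", id);
    });

    // The run-time chooses a device per event; alternating here just
    // shows that both implementations can be used in parallel.
    for (int id = 0; id < 4; ++id)
        handlers[id % 2 == 0 ? Device::GPU : Device::CPU](id);
    return 0;
}
```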

  11. Heterogeneous support overview

  12. Device scheduler • Assumes – Events are independent – Out-of-order execution • Scheduling policies – FCFS – first-come, first-served – DWRR – dynamic weighted round robin • Orders events according to their relative performance on each device • Selects the event with the highest speedup • The speedup estimate is a user-given function
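The core of DWRR, as described on the slide, is to pick for a newly free device the pending event with the highest estimated relative speedup on that device, using a user-given speedup function. The sketch below illustrates that selection step only; the speedup numbers, event fields, and function names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

enum class Device { CPU, GPU };

struct Event {
    int id;
    std::string resolution;  // e.g. "32x32" or "512x512" tiles
};

// User-given estimate of how much faster this event runs on device d
// than on the other device. The numbers are made up for illustration.
double speedup(const Event& e, Device d) {
    double gpu_over_cpu = (e.resolution == "512x512") ? 15.0 : 1.2;
    return d == Device::GPU ? gpu_over_cpu : 1.0 / gpu_over_cpu;
}

// DWRR-style choice: take the queued event that benefits most from the
// device that just became free (FCFS would take the front of the queue).
Event pick_event(std::vector<Event>& queue, Device free_device) {
    auto it = std::max_element(queue.begin(), queue.end(),
        [&](const Event& a, const Event& b) {
            return speedup(a, free_device) < speedup(b, free_device);
        });
    Event chosen = *it;
    queue.erase(it);
    return chosen;
}

int main() {
    std::vector<Event> queue = {
        {0, "32x32"}, {1, "512x512"}, {2, "32x32"}, {3, "512x512"}};

    // The GPU frees up first and grabs a high-resolution tile; the CPU
    // then takes a low-resolution one, where its relative penalty is small.
    Event g = pick_event(queue, Device::GPU);
    Event c = pick_event(queue, Device::CPU);
    std::printf("GPU -> event %d (%s)\n", g.id, g.resolution.c_str());
    std::printf("CPU -> event %d (%s)\n", c.id, c.resolution.c_str());
    return 0;
}
```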

  13. Neuroblastoma Image Analysis (NBIA) System • Classify tissues into different subtypes of prognostic significance • Very high resolution slides – Divided into smaller tiles • Multi-resolution image analysis – Mimics the way pathologists examine them
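The slides only name the multi-resolution strategy; a plausible reading (consistent with the "Recalc (%)" note on slide 17) is that each tile is first classified at the low resolution and re-analyzed at the high resolution only when the cheap result is not conclusive. The sketch below shows that pattern under this assumption; the confidence field and the 0.9 threshold are illustrative, not the actual NBIA classifier.

```cpp
#include <cstdio>
#include <vector>

// A tile together with the classifier's confidence at the 32x32 level.
struct Tile {
    int id;
    double low_res_confidence;
};

// Decide at low resolution when confident; otherwise escalate the tile
// to the expensive 512x512 analysis. The threshold is illustrative.
const char* classify(const Tile& t, int& recalculated) {
    if (t.low_res_confidence >= 0.9) return "decided at low resolution";
    ++recalculated;
    return "recomputed at high resolution";
}

int main() {
    std::vector<Tile> tiles = {{0, 0.97}, {1, 0.55}, {2, 0.92}, {3, 0.80}};
    int recalculated = 0;
    for (const Tile& t : tiles)
        std::printf("tile %d: %s\n", t.id, classify(t, recalculated));
    std::printf("recalc: %.0f%% of tiles\n",
                100.0 * recalculated / tiles.size());
    return 0;
}
```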

  14. Anthill implementation

  15. Experimental results • Setup – 10 PCs with an Intel Core 2 Duo 2.13 GHz CPU / NVIDIA GeForce 8800 GT GPU – 4 PCs with dual quad-core AMD Opteron 2.00 GHz processors / NVIDIA GeForce GTX 260 GPU – Input data: images of 26,742 tiles using two resolution levels: 32x32 and 512x512

  16. NBIA task analysis – performance variation [chart: dual quad-core AMD Opteron 2.00 GHz / NVIDIA GeForce GTX 260]

  17. Heterogeneous scheduling analysis (16 + 1 = 30 ??) – Recalc (%): 12
      Resolution            Low      High
      1 CPU core – FCFS     263      215
      1 CPU core – DWRR     21592    4

  18. Heterogeneous scheduling analysis – [charts: FCFS vs. DWRR]

  19. Heterogeneous scheduling analysis
      # of CPU cores    FCFS Low    FCFS High    DWRR Low    DWRR High
      1                 637         58           10714       1
      2                 117         133          15748       2
      3                 1925        173          18614       5
      4                 2090        219          18634       28
      5                 2872        286          20070       40
      6                 3819        393          20147       76
      7                 4726        478          20266       57

  20. Distributed environment evaluation

  21. Conclusions • The relative performance between CPU and GPU is data dependent • Adequate scheduling among heterogeneous processors doubled the performance of the application • Neglecting the CPU is a mistake • Data-flow is an interesting model for exploiting parallelism

  22. Future work • New scheduling techniques • Execution on clusters with heterogeneity among the computing nodes

  23. Questions?
