Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks

Martha Kim and Stephen Edwards, Columbia University, Department of Computer Science. June 19, 2010, AMAS-BT Workshop.


  1. Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks. Martha Kim and Stephen Edwards, Columbia University, Department of Computer Science. June 19, 2010, AMAS-BT Workshop.

  3. Larus’ virtuous cycle: improved processor performance; fuller-featured software; larger development teams; HLLs & abstractions; slower programs. The power wall now stands in the way of improved processor performance, so sustaining the cycle calls for increases in power efficiency and performance.

  6. Efficiency of Specialized Hardware. [Figure: power vs. performance for a general-purpose processor, an ASIP (application-specific instruction processor), an FPGA (field-programmable gate array), a standard-cell ASIC, and a full-custom ASIC. Gap annotations in the figure: 40-500x, 10-350x, 10-40x, 3-10x, 3-10x, 6-8x.]

  7. Accelerator System. General-purpose core(s) surrounded by many, many special-purpose accelerators that are powered on only when their function is needed. Potential benefits are available only if applications actually make use of the accelerators. [Diagram: processors P0 and P1 and accelerators A0 through A7 connected by shared communication / memory.]

  9. Talk Outline • Overall Vision • Accelerator System Model • Methodology Overview • Methodology in Practice • Image Rotation • JPEG • Conclusion

  10. System-Level Vision. Programmer targets standard accelerator libraries (e.g., Java). Presence of high-level types supported by accelerators determines boundary of accelerator code. Functions define boundaries of acceleration.

  11. Each accelerator must do two things: 1. Compute. 2. Communicate externally. [Diagram: CPU with accelerators bar, foo, and baz; regions labeled Computation and Communication.]

  15. Computation Model. 1. Dynamic function invocations are mapped to the corresponding accelerator. 2. Invocations without an accelerator are executed on the same core as their parent. [Diagram: dynamic invocation tree with main(), foo(), bar(), bar(), baz(); hardware units CPU, bar, foo, baz.]
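
The two rules of the computation model can be sketched as a walk up the invocation tree. This is only an illustration under assumptions: the struct layout, the function name `executor`, and the array-of-names accelerator table are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One dynamic invocation; parent indexes the same array (-1 for the root).
   Hypothetical layout, chosen for illustration only. */
struct invocation {
    const char *function;
    int parent;
};

/* Rule 1: an invocation whose function has an accelerator runs there.
   Rule 2: otherwise it runs wherever its parent runs.
   The root, lacking both, falls back to "CPU". */
const char *executor(const struct invocation *tree, int id,
                     const char *const *accelerated, size_t n_accel)
{
    assert(tree != NULL && id >= 0);
    for (; id >= 0; id = tree[id].parent) {
        for (size_t k = 0; k < n_accel; ++k)
            if (strcmp(tree[id].function, accelerated[k]) == 0)
                return accelerated[k];
    }
    return "CPU";
}
```

With an assumed tree in which main() calls foo() and bar(), and foo() calls bar() and baz(), and with accelerators for foo and baz only, the bar() called from main() stays on the CPU while the bar() called from foo() runs on the foo accelerator.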

  26. Communication Model. Invocations communicate via load/store dependencies. [Diagram: the dynamic invocation tree main(), foo(), bar(), bar(), baz(), annotated with the pairs st A / ld A and st B / ld B; hardware units CPU, bar, foo, baz.]
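
One way to picture load/store dependencies is to remember, for every byte, which invocation stored to it last, and to charge a transfer whenever a different invocation loads it. The sketch below assumes a tiny fixed address space and fixed invocation count so it can use plain arrays; every name in it is hypothetical, and it is not the talk's tool.

```c
#include <assert.h>
#include <string.h>

enum { N_ADDR = 64, N_INV = 8, NO_WRITER = -1 };

struct xfer_tracker {
    int last_writer[N_ADDR];       /* invocation that last stored to each byte */
    unsigned bytes[N_INV][N_INV];  /* bytes[src][dst]: stored by src, loaded by dst */
};

void tracker_init(struct xfer_tracker *t)
{
    memset(t->bytes, 0, sizeof t->bytes);
    for (int a = 0; a < N_ADDR; ++a)
        t->last_writer[a] = NO_WRITER;
}

/* A store makes `inv` the producer of every byte it touches. */
void on_store(struct xfer_tracker *t, int inv, int addr, int size)
{
    assert(inv >= 0 && inv < N_INV);
    assert(addr >= 0 && size >= 0 && addr + size <= N_ADDR);
    for (int a = addr; a < addr + size; ++a)
        t->last_writer[a] = inv;
}

/* A load of a byte produced by another invocation counts as one byte
   transferred from that producer to `inv`. */
void on_load(struct xfer_tracker *t, int inv, int addr, int size)
{
    assert(inv >= 0 && inv < N_INV);
    assert(addr >= 0 && size >= 0 && addr + size <= N_ADDR);
    for (int a = addr; a < addr + size; ++a) {
        int src = t->last_writer[a];
        if (src != NO_WRITER && src != inv)
            t->bytes[src][inv] += 1;
    }
}
```

As a simplification, a byte that is loaded twice is counted twice here; a real tracker would need a policy for repeated loads and a sparse map instead of the fixed arrays.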

  32. Methodology Pt 1: Examine Application. Pintool ➔ decorated call graph. [Diagram: the invocation tree main(), foo(), bar(), bar(), baz(), decorated with the labels i, j, k, l, m and a through f.]

  35. Pintool Functionality
      • Tool instruments calls, returns, loads and stores
      • Four logs are generated, all keyed off of a unique invocation identifier:
          Function: invocation ID ➔ function name (-tfunction, -bfunction)
          Subcalls: invocation ID ➔ list of subcall IDs (-tsubcalls, -bsubcalls)
          Instruction Count: invocation ID ➔ dynamic instruction count (-ticount, -bicount)
          Data Transfers: invocation ID ➔ invocation ID, bytes (-txfers, -bxfers, -xfer-chunk
      • At present, significant overhead relative to native (approximately 2000x on the short-running applications to be presented here).
      • Majority of overhead attributable to hash lookups (for tracking data transfers) and logfile writing
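
Because all four logs share the invocation ID as their key, they can be joined into one record per invocation. The sketch below shows such a join and one query over it. The struct layout, the `MAX_SUBCALLS` bound, the single `bytes_in` total, and the assumption that the logged instruction count excludes callees are all illustrative guesses, not the tool's actual format.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SUBCALLS = 4 };

/* In-memory join of the four logs for one invocation ID (hypothetical layout). */
struct invocation_record {
    const char *function;        /* function log */
    int subcalls[MAX_SUBCALLS];  /* subcalls log: IDs of direct callees */
    size_t n_subcalls;
    unsigned long icount;        /* instruction-count log, assumed to exclude callees */
    unsigned long bytes_in;      /* data-transfer log, summed over all producers */
};

/* Instructions executed by invocation `id` and everything it calls. */
unsigned long inclusive_icount(const struct invocation_record *recs,
                               size_t n, int id)
{
    assert(recs != NULL && id >= 0 && (size_t)id < n);
    unsigned long total = recs[id].icount;
    for (size_t k = 0; k < recs[id].n_subcalls; ++k)
        total += inclusive_icount(recs, n, recs[id].subcalls[k]);
    return total;
}
```

Indexing the array by invocation ID keeps each lookup constant-time, which matters because every query over the decorated call graph starts from that key.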

  36. Methodology Pt 2: Evaluate Execution on Accelerators. Program runtime is a function of: • computation rates • computation • communication rates • communication. Enables evaluation of: • acceleration potential in the limit • sensitivity of potential to hardware parameters, program inputs, etc.
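
The slide names the four inputs to runtime but not the formula that combines them. A minimal sketch, assuming computation and communication do not overlap and each proceeds at a constant rate, is time = instructions / compute rate + bytes / communication rate. The struct and function names below are hypothetical.

```c
#include <assert.h>

struct workload { double instructions; double bytes; };
struct hardware { double instr_per_sec; double bytes_per_sec; };

/* Assumed model: compute time and transfer time simply add. */
double estimated_seconds(struct workload w, struct hardware h)
{
    assert(h.instr_per_sec > 0.0 && h.bytes_per_sec > 0.0);
    return w.instructions / h.instr_per_sec + w.bytes / h.bytes_per_sec;
}

/* Returns 1 when transfers take longer than computation, else 0. */
int is_communication_bound(struct workload w, struct hardware h)
{
    assert(h.instr_per_sec > 0.0 && h.bytes_per_sec > 0.0);
    return w.bytes / h.bytes_per_sec > w.instructions / h.instr_per_sec;
}
```

Sweeping `instr_per_sec` or `bytes_per_sec` while holding the workload fixed gives the kind of sensitivity study the slide mentions, and letting one rate grow very large approximates the potential in the limit.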

  37. Why is gprof not sufficient? • Runtimes are machine- and algorithm-dependent • Does not capture data transfers between invocations

  39. Methodology In Practice • Image rotation • JPEG decode

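  40. Simple Example: Image Rotation

      #include <stdio.h>
      #include <stdlib.h>

      #define PIX(x,y) raster[(x) + (y)*wd]

      unsigned wd, ht, maxval, *raster;

      /* Defined elsewhere in the program; return types are assumed here. */
      void read_ppm(void);
      void write_ppm(void);
      void rec_rot(int x, int y, int s);
      void iter_rot(void);

      int main(int argc, char** argv) {
        /* Exactly one argument is expected: 'r' or 'i'. */
        if (argc != 2 || (argv[1][0] != 'r' && argv[1][0] != 'i')) {
          printf("USAGE: rotate [ir]\n");
          exit(0);
        }
        read_ppm();
        if (argv[1][0] == 'r') rec_rot(0, 0, wd);
        else iter_rot();
        write_ppm();
        return 0;
      }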

  41. Image Rotation: Recursive Algorithm (main() is repeated from the previous slide)

      #define PIX(x,y) raster[(x) + (y)*wd]

      unsigned wd, ht, maxval, *raster;

      /* Cycle the four quadrants of the square of side s at (x, y) by one
         position, then rotate each quadrant recursively. */
      void rec_rot(int x, int y, int s) {
        int i, j;
        s >>= 1;  /* s is now the quadrant side */
        for (i = 0 ; i < s ; ++i)
          for (j = 0 ; j < s ; ++j) {
            unsigned rgb = PIX(x+i, y+j);
            PIX(x+i,   y+j  ) = PIX(x+i,   y+j+s);
            PIX(x+i,   y+j+s) = PIX(x+i+s, y+j+s);
            PIX(x+i+s, y+j+s) = PIX(x+i+s, y+j);
            PIX(x+i+s, y+j  ) = rgb;
          }
        if (s <= 1) return;  /* single-pixel quadrants need no further work */
        rec_rot(x, y+s, s);
        rec_rot(x+s, y+s, s);
        rec_rot(x+s, y, s);
        rec_rot(x, y, s);
      }
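
To connect the recursive algorithm to the dynamic invocation tree, the sketch below wraps the slide's rec_rot with an invocation counter. The counter, the `rotate_and_count` helper, and the power-of-two width check are additions for illustration and are not part of the original program.

```c
#include <assert.h>

#define PIX(x,y) raster[(x) + (y)*wd]

unsigned wd, *raster;
unsigned long invocations;  /* hypothetical counter, not on the slide */

void rec_rot(int x, int y, int s) {
  int i, j;
  ++invocations;            /* one node in the dynamic invocation tree */
  s >>= 1;
  for (i = 0 ; i < s ; ++i)
    for (j = 0 ; j < s ; ++j) {
      unsigned rgb = PIX(x+i, y+j);
      PIX(x+i,   y+j  ) = PIX(x+i,   y+j+s);
      PIX(x+i,   y+j+s) = PIX(x+i+s, y+j+s);
      PIX(x+i+s, y+j+s) = PIX(x+i+s, y+j);
      PIX(x+i+s, y+j  ) = rgb;
    }
  if (s <= 1) return;
  rec_rot(x, y+s, s);
  rec_rot(x+s, y+s, s);
  rec_rot(x+s, y, s);
  rec_rot(x, y, s);
}

/* Rotates a square image in place and returns the number of rec_rot
   invocations. The width must be a power of two and at least 2. */
unsigned long rotate_and_count(unsigned *image, unsigned width) {
  assert(image != 0 && width >= 2 && (width & (width - 1)) == 0);
  raster = image;
  wd = width;
  invocations = 0;
  rec_rot(0, 0, (int)width);
  return invocations;
}
```

A 2x2 image needs a single invocation, a 4x4 image needs 5 (the root plus four quadrant calls), and an 8x8 image needs 21, so the invocation tree grows by a factor of four with each doubling of the width. Applying the rotation four times restores the original image.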
