An Initial Characterization of the Emu Chick


  1. An Initial Characterization of the Emu Chick
     Eric Hein, Tom Conte (ECE); Jeff Young (CS); Srinivas Eswar, Jiajia Li, Patrick Lavin, Richard Vuduc, Jason Riedy (CSE)
     5/21/2018

  2. Migratory Memory-side Processing
     • Main innovation: thread contexts migrate to the data
       - Threads always read from local memory
       - Migration is hardware-controlled, triggered on a remote read
     • CRNCH Rogue’s Gallery
       - Early access to the Emu Chick hardware prototype
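A minimal sketch of what the migration model means for ordinary code, assuming an Emu-style distributed array whose blocks live on different nodelets (the names here are illustrative, not the vendor API):

```c
/* On the Emu, an ordinary load through a pointer into another
 * nodelet's memory causes the hardware to migrate the calling
 * thread's context to that nodelet; the load then completes locally.
 * The code is plain C; migration is invisible at the source level. */
long sum_remote(const long *data, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        /* If data[i] resides on a different nodelet, this read
         * triggers a hardware-controlled migration, so every read
         * is ultimately satisfied from local memory. */
        sum += data[i];
    }
    return sum;
}
```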

  3. Outline
     • Emu Architecture Description
     • Data Allocation and Thread Spawning
     • Benchmark Results
       - STREAM
       - Sparse Matrix-Vector Multiply (SpMV)
       - Pointer Chasing
     • Simulator Validation
     • Conclusion

  4. Emu Architecture

  5. Fine-grained Memory Accesses
     • Narrow-channel DRAM (NCDRAM)
       - An 8-bit bus allows access at 8-byte granularity without waste
       - Many narrow channels instead of a few wide channels
     • Remote Writes
       - Write to a remote nodelet without migrating
       - Proceed directly to the memory front-end, bypassing the Gossamer Core (GC)
     • Remote Atomics
       - Performed in the Memory Front-End (MFE), near memory
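A sketch of how remote operations avoid migration, for example in a histogram update. The intrinsic name REMOTE_ADD follows the conventions of the Emu toolchain's memoryweb.h, but both the name and signature should be treated as assumptions and checked against the actual header:

```c
#include <memoryweb.h>  /* Emu toolchain header; intrinsic assumed */

/* Each update is a write-like add that travels to the owning
 * nodelet's memory front-end (MFE). Because no value is read back,
 * the calling thread never migrates. */
void histogram(long *bins, const long *keys, long n)
{
    for (long i = 0; i < n; i++) {
        /* Assumed intrinsic: remote atomic add performed near
         * memory on the nodelet that owns bins[keys[i]]. */
        REMOTE_ADD(&bins[keys[i]], 1);
    }
}
```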

  6. Emu Cilk
     • cilk_spawn: create a child thread to execute a function in parallel
       - Creates an actual thread, not just a continuation stack frame
       - No work stealing across nodelets
     • cilk_sync: wait for all child threads to complete
       - Threads die instead of waiting; the last thread to arrive continues
     • "Remote spawn": create a thread on a remote nodelet
       - Determines the location of the cactus stack frame
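A minimal sketch of these semantics; worker() and NWORKERS are illustrative names, not part of the Emu API:

```c
#include <cilk/cilk.h>

#define NWORKERS 64

/* Per-thread work. If this function touches remote data, the spawn
 * can be issued as a "remote spawn" so the child's cactus-stack
 * frame is allocated on the nodelet that owns that data. */
void worker(long id)
{
    (void)id;  /* ... work elided ... */
}

void spawn_all(void)
{
    for (long i = 0; i < NWORKERS; i++)
        cilk_spawn worker(i);  /* each spawn is a real hardware
                                  thread context; no work stealing */
    cilk_sync;  /* threads die on completion rather than wait; the
                   last thread to arrive continues past the sync */
}
```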

  7. Emu system scales

     |                          | Nodelet (Current) | Nodelet (Future) | Node Card, 8 nodelets (Current) | Node Card (Future) | Emu Chick, 8 nodes (Current) | Emu Chick (Future) | Emu1 Rack, 256 nodes |
     |--------------------------|-------------------|------------------|---------------------------------|--------------------|------------------------------|--------------------|----------------------|
     | # of cores               | 1                 | 4                | 8                               | 32                 | 64                           | 256                | 8192                 |
     | # of threads             | 64                | 256              | 512                             | 2048               | 4096                         | 16384              | > 2 million          |
     | Memory capacity          | 2 GiB             | 8 GiB            | 16 GiB                          | 64 GiB             | 128 GiB                      | 512 GiB            | 16 TiB               |
     | # of 8-bit DDR4 channels | 1                 | 1                | 8                               | 8                  | 64                           | 64                 | 2048                 |
     | Memory bandwidth         | 120 MB/s          | 2.5 GB/s         | 1.2 GB/s                        | 20 GB/s            | 8 GB/s                       | 160 GB/s           | 5.12 TB/s            |

     Images and data from www.emutechnology.com

  8. STREAM: Thread spawning
     [Code panels: Serial Spawn vs. Recursive Spawn; reconstructed as a sketch below]
     https://www.cilkplus.org/tutorial-cilk-plus-keywords#cilk_for
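A sketch of the two spawn strategies, shown here on the STREAM copy kernel (copy_range and grain are names of our own choosing):

```c
#include <cilk/cilk.h>

/* STREAM "copy" over one chunk. */
void copy_range(double *a, const double *b, long lo, long hi)
{
    for (long i = lo; i < hi; i++)
        a[i] = b[i];
}

/* Serial spawn: one thread loops, spawning a worker per chunk.
 * The spawning thread is a potential sequential bottleneck. */
void serial_spawn(double *a, const double *b, long n, long grain)
{
    for (long lo = 0; lo < n; lo += grain)
        cilk_spawn copy_range(a, b, lo, lo + grain < n ? lo + grain : n);
    cilk_sync;
}

/* Recursive spawn: split the range in half and recurse, so thread
 * creation itself proceeds in parallel as a binary spawn tree; this
 * is the pattern cilk_for lowers to. */
void recursive_spawn(double *a, const double *b, long lo, long hi, long grain)
{
    if (hi - lo <= grain) { copy_range(a, b, lo, hi); return; }
    long mid = lo + (hi - lo) / 2;
    cilk_spawn recursive_spawn(a, b, lo, mid, grain);
    recursive_spawn(a, b, mid, hi, grain);
    cilk_sync;
}
```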

  9. Emu Memory Layouts

  10. [Diagram: Serial Spawn, Serial Remote Spawn, and Recursive Remote Spawn across Nodelets 0-3]
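A sketch of the distributed layouts these spawn strategies target, using the allocation helpers from the Emu toolchain's memoryweb.h; the exact signatures are assumptions:

```c
#include <memoryweb.h>  /* Emu toolchain header; signatures assumed */

/* mw_malloc1dlong stripes a long array element-by-element across
 * nodelets (cyclic layout); mw_malloc2d places each "row" wholly on
 * one nodelet, so a remote spawn for row i can run where row i
 * lives. */
void make_layouts(long n, long rows, long cols)
{
    long  *striped = mw_malloc1dlong(n);                      /* cyclic */
    long **blocked = (long **)mw_malloc2d(rows,
                                          cols * sizeof(long)); /* row per nodelet */
    (void)striped;
    (void)blocked;
}
```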

  11. STREAM: Emu hardware results (single-node)
      • ~140 MB/s per nodelet; ~1.2 GB/s per node (8 nodelets)
      • Surprisingly, serial spawn performance matches recursive spawn
      • Remote spawn is necessary to saturate global bandwidth

  12. STREAM: Emu hardware results (multi-node)

  13. Proxy applications
      • Pointer chasing -> streaming graphs
        - Designed to mimic the traversal of an edge list in a streaming graph data structure
        - Using results to tune a streaming graph engine for Emu
      • SpMV -> sparse tensor analysis
        - Exploring data layout options with a sparse matrix
        - Will transfer knowledge to the design of a sparse tensor library

  14. Sparse matrix-vector multiply (SpMV)
      Data layout is very important on Emu. We experimented with three layouts for the sparse matrix; in each case, the vector x was replicated onto all nodelets (see the sketch below).
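A sketch of one such layout in CSR form, with field names of our own choosing: rows are divided among nodelets and the dense vector x is replicated per nodelet, so each row's dot product reads only a local copy of x. cilk_for stands in for the recursive-spawn pattern from the STREAM slides:

```c
#include <cilk/cilk.h>

typedef struct {
    long    nrows;
    long   *row_ptr;  /* nrows + 1 offsets into col_idx/vals */
    long   *col_idx;  /* column index per nonzero */
    double *vals;     /* value per nonzero */
} csr_t;

/* y = A * x, where x_local points at this nodelet's replica of x. */
void spmv(const csr_t *A, const double *x_local, double *y)
{
    cilk_for (long i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (long k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->vals[k] * x_local[A->col_idx[k]];
        y[i] = sum;
    }
}
```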

  15. SpMV results: Emu simulator vs. Haswell Xeon
      • The "2D" layout outperforms the "1D" layout
      • Spurious thread migrations are limiting performance on Emu

  16. Pointer Chasing Microbenchmark
      • Designed to mimic the access pattern of dynamically allocated data structures (e.g., streaming graphs)
        - Data-dependent loads: memory-level parallelism is severely limited, since each thread must wait for one pointer dereference to complete before accessing the next pointer
        - Fine-grained accesses: spatial locality is restricted, since all accesses are at a 16 B granularity. This is smaller than a 64 B cache line on x86 platforms, and much smaller than a typical DRAM page (around 8 KB)
        - Random access pattern: since each block of memory is read exactly once, in random order, caching and prefetching are mostly ineffective
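A sketch of the benchmark's inner loop under those constraints; the struct layout and names are illustrative. Each element is 16 B (an 8 B next pointer plus an 8 B payload), so every step is one fine-grained load whose address is unknown until the previous element arrives:

```c
/* One 16-byte list element. */
typedef struct node {
    struct node *next;    /* 8 B: the data-dependent pointer */
    long         weight;  /* 8 B: payload summed to keep loads live */
} node_t;

long chase(node_t *head)
{
    long sum = 0;
    /* The load of p->next must complete before the next element's
     * address is known, which defeats memory-level parallelism,
     * caching, and prefetching. */
    for (node_t *p = head; p != NULL; p = p->next)
        sum += p->weight;
    return sum;
}
```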

  17. Pointer chasing: Initialization
      1. Create a linked list of elements ("Ordered")
         - Access pattern is sequential and predictable
         - Plenty of spatial locality available

  18. Pointer chasing: Intra-block shuffle
      2. Randomize the traversal order of elements within each block ("Intra-block shuffle")
         - Creates small contiguous blocks of memory that are accessed in random order
         - The overall access pattern is still sequential

  19. Pointer chasing: Block shuffle
      3. Randomize the traversal order of the blocks ("Block shuffle")
         - The overall access pattern is now random
         - Small chunks of sequential locality are still available
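A sketch of this step, reusing node_t from the pointer-chasing sketch above; heads/tails are illustrative arrays holding the first and last element of each block. A Fisher-Yates shuffle picks a random block order, then the blocks are relinked tail-to-head in that order:

```c
#include <stdlib.h>

void block_shuffle(node_t **heads, node_t **tails, long nblocks)
{
    long *order = malloc(nblocks * sizeof(long));
    for (long i = 0; i < nblocks; i++)
        order[i] = i;
    for (long i = nblocks - 1; i > 0; i--) {      /* Fisher-Yates */
        long j = rand() % (i + 1);
        long tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    /* Stitch each block's tail to the next block's head in the
     * permuted order; the last block ends the traversal. */
    for (long i = 0; i + 1 < nblocks; i++)
        tails[order[i]]->next = heads[order[i + 1]];
    tails[order[nblocks - 1]]->next = NULL;
    free(order);
}
```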

  20. Pointer Chasing: Sandy Bridge Xeon Results

  21. Pointer Chasing: Emu Hardware Results (single-node)

  22. Pointer Chasing: Bandwidth Utilization

  23. Simulator Validation: STREAM
      When configured to match the current hardware specifications, the simulator results match closely for local STREAM and global STREAM.

  24. Simulator Validation: Migrations
      • Pointer chasing performs 2x better in the simulator
      • The simulator was over-estimating migration throughput
      • The updated simulator matches more closely

  25. Conclusions
      • First independent evaluation of the Emu Chick prototype
      • STREAM bandwidth is low, but scales well
      • Memory layout and thread management decisions are critical to achieving scalability in SpMV
      • Pointer chasing maintains 80% memory bandwidth utilization even in a worst-case access scenario
      • Future work: applying these lessons to streaming graph analytics and sparse tensor processing
