An Initial Characterization of the Emu Chick
Eric Hein, Tom Conte (ECE); Jeff Young (CS); Srinivas Eswar, Jiajia Li, Patrick Lavin, Richard Vuduc, Jason Riedy (CSE)
May 21, 2018
Migratory Memory-side Processing
• Main innovation: thread contexts migrate to the data
• Threads always read from local memory
• Migration is hardware-controlled, triggered on a remote read
CRNCH Rogues Gallery
• Early access to the Emu Chick hardware prototype
Outline
• Emu Architecture Description
• Data Allocation and Thread Spawning
• Benchmark Results: STREAM, Sparse Matrix-Vector Multiply (SpMV), Pointer Chasing
• Simulator Validation
• Conclusion
Emu Architecture
Fine-grained Memory Accesses
Narrow-Channel DRAM (NCDRAM)
• 8-bit bus allows access at 8-byte granularity without waste
• Many narrow channels instead of a few wide channels
Remote Writes
• Write to a remote nodelet without migrating
• Proceed directly to the memory front-end, bypassing the Gossamer Core (GC)
Remote Atomics
• Performed in the Memory Front-End (MFE), near memory (see the sketch below)
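To make the remote-atomic idea concrete, here is a minimal sketch of a histogram-style update in C. The memoryweb.h header and the ATOMIC_ADDMS intrinsic are assumptions drawn from the Emu programming model, not confirmed by this deck:

```c
#include <memoryweb.h>  /* assumed Emu intrinsics header */

/* Hypothetical histogram update: each ATOMIC_ADDMS is shipped to the
 * memory front-end on the nodelet that owns the bucket and executed
 * there, so the calling thread never migrates.
 * (Intrinsic name assumed from the Emu toolchain.) */
void histogram_update(const long *keys, long n, long *buckets)
{
    for (long i = 0; i < n; ++i)
        ATOMIC_ADDMS(&buckets[keys[i]], 1);
}
```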
Emu Cilk
cilk_spawn: create a child thread to execute a function in parallel
• Creates an actual thread, not just a continuation stack frame
• No work-stealing across nodelets
cilk_sync: wait for all child threads to complete
• Threads die instead of waiting; the last thread to arrive continues
"Remote spawn": create a thread on a remote nodelet
• Determines the location of the cactus stack frame
A minimal sketch of these keywords follows.
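A minimal sketch of this spawn/sync model in C, assuming the standard cilk.h header; the chunking scheme and function names are illustrative, and the comment about remote placement restates the slide's cactus-stack bullet rather than a confirmed mechanism:

```c
#include <cilk/cilk.h>  /* assumed header for the Emu Cilk keywords */

/* Worker that processes one chunk of the array. */
void process_chunk(long *chunk, long n)
{
    for (long i = 0; i < n; ++i)
        chunk[i] *= 2;
}

/* Each cilk_spawn creates a real hardware thread context (no lazy
 * continuation). Per the slide, the location of the child's cactus
 * stack frame determines where it runs; we assume here that spawning
 * on a pointer to remote data yields a remote spawn. */
void process_all(long *data, long n, long nchunks)
{
    long chunk = n / nchunks;
    for (long c = 0; c < nchunks; ++c)
        cilk_spawn process_chunk(&data[c * chunk], chunk);
    cilk_sync;  /* children die on completion; last arrival continues */
}
```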
                           Emu Nodelet          Emu Node Card        Emu Chick            Emu1 Rack
                                                (8 nodelets)         (8 nodes)            (256 nodes)
                           Current   Future     Current   Future     Current   Future
# of cores                 1         4          8         32         64        256          8192
# of threads               64        256        512       2048       4096      16384        > 2 million
Memory capacity            2 GiB     8 GiB      16 GiB    64 GiB     128 GiB   512 GiB      16 TiB
# of 8-bit DDR4 channels   1         1          8         8          64        64           2048
Memory bandwidth           120 MB/s  2.5 GB/s   1.2 GB/s  20 GB/s    8 GB/s    160 GB/s     5.12 TB/s
Images and data from www.emutechnology.com
STREAM: Thread Spawning
• Serial spawn
• Recursive spawn
(Both strategies are sketched below.)
https://www.cilkplus.org/tutorial-cilk-plus-keywords#cilk_for
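A hedged sketch of the two spawn strategies, assuming Emu's Cilk keywords; the STREAM-style triad kernel, global arrays, and grain handling are illustrative:

```c
#include <cilk/cilk.h>

/* Hypothetical STREAM-style triad over [lo, hi); arrays are globals
 * for brevity. */
#define N 1000000
static double a[N], b[N], c[N];

static void do_chunk(long lo, long hi)
{
    for (long i = lo; i < hi; ++i)
        a[i] = b[i] + 3.0 * c[i];
}

/* Serial spawn: a single thread loops, spawning every worker itself,
 * so thread creation is serialized through the parent. */
void serial_spawn(long n, long grain)
{
    for (long lo = 0; lo < n; lo += grain) {
        long hi = (lo + grain < n) ? lo + grain : n;
        cilk_spawn do_chunk(lo, hi);
    }
    cilk_sync;
}

/* Recursive spawn: split the range in half and spawn one side, so
 * thread creation itself proceeds in parallel as a binary tree. */
void recursive_spawn(long lo, long hi, long grain)
{
    while (hi - lo > grain) {
        long mid = lo + (hi - lo) / 2;
        cilk_spawn recursive_spawn(lo, mid, grain);
        lo = mid;
    }
    do_chunk(lo, hi);
    cilk_sync;
}
```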
Emu Memory Layouts
[Figure: thread spawn strategies across Nodelets 0–3 — serial spawn, serial remote spawn, recursive remote spawn]
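As a sketch of how an element-striped versus a per-nodelet-blocked layout might be allocated, assuming the mw_malloc1dlong and mw_malloc2d allocators and the NODELETS() intrinsic from the Emu toolchain (names not confirmed by this deck):

```c
#include <memoryweb.h>  /* assumed Emu allocator/intrinsics header */

/* Two distributed layouts (allocator names assumed):
 *  - mw_malloc1dlong stripes individual longs across nodelets, so
 *    element i lives on nodelet i % NODELETS()
 *  - mw_malloc2d gives each nodelet one contiguous block, reached
 *    through a per-nodelet pointer array */
void layout_examples(long n)
{
    long  *striped = mw_malloc1dlong(n);
    long **blocked = (long **)mw_malloc2d(NODELETS(),
                                          (n / NODELETS()) * sizeof(long));

    striped[5] = 1;     /* may migrate to the nodelet owning element 5 */
    blocked[1][0] = 2;  /* dereferencing blocked[1] migrates to nodelet 1 */
}
```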
STREAM: Emu Hardware Results (single-node)
• ~140 MB/s per nodelet, ~1.2 GB/s per node (8 nodelets)
• Surprisingly, serial spawn performance matches recursive spawn
• Remote spawn is necessary to saturate global bandwidth
STREAM: Emu Hardware Results (multi-node)
Proxy Applications
Pointer chasing → streaming graphs
• Designed to mimic the traversal of an edge list in a streaming-graph data structure
• Results are being used to tune a streaming graph engine for Emu
SpMV → sparse tensor analysis
• Exploring data-layout options with a sparse matrix
• Knowledge will transfer to the design of a sparse tensor library
Sparse Matrix-Vector Multiply (SpMV)
Data layout is very important on Emu. We experimented with three layouts for the sparse matrix; in each case, the vector X was replicated onto all nodelets. (A sketch of the kernel follows.)
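For reference, a minimal CSR-style SpMV kernel for one thread's share of rows; the deck shows only results, so the CSR structure and every name here (including x_repl for the nodelet-local replica of X) are illustrative:

```c
/* CSR SpMV over rows [row_lo, row_hi). Because x_repl is a local
 * replica of X, the random column reads never leave the nodelet;
 * only the matrix layout decides where threads migrate. */
void spmv_range(long row_lo, long row_hi,
                const long *row_ptr, const long *col_idx,
                const double *vals, const double *x_repl, double *y)
{
    for (long i = row_lo; i < row_hi; ++i) {
        double sum = 0.0;
        for (long k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += vals[k] * x_repl[col_idx[k]];
        y[i] = sum;
    }
}
```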
SpMV Results: Emu Simulator vs. Haswell Xeon
• The "2D" layout outperforms the "1D" layout
• Spurious thread migrations limit performance on Emu
Pointer Chasing Microbenchmark
Designed to mimic the access pattern of dynamically allocated data structures (e.g., streaming graphs). A minimal traversal kernel is sketched below.
• Data-dependent loads: memory-level parallelism is severely limited, since each thread must wait for one pointer dereference to complete before accessing the next pointer.
• Fine-grained accesses: spatial locality is restricted, since all accesses are at a 16 B granularity. This is smaller than a 64 B cache line on x86 platforms, and much smaller than a typical DRAM page (around 8 KB).
• Random access pattern: since each block of memory is read exactly once in random order, caching and prefetching are mostly ineffective.
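A minimal sketch of the traversal loop, assuming a 16-byte element (an 8-byte next pointer plus an 8-byte payload, matching the granularity above); names are illustrative:

```c
#include <stddef.h>

/* One 16-byte list element: an 8-byte next pointer plus an 8-byte
 * payload, matching the access granularity described above. */
typedef struct node {
    struct node *next;
    long payload;
} node_t;

/* Chase the list; each load depends on the previous one, so a single
 * thread exposes no memory-level parallelism. */
long chase(node_t *head)
{
    long sum = 0;
    for (node_t *p = head; p != NULL; p = p->next)
        sum += p->payload;
    return sum;
}
```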
Pointer Chasing: Initialization
1. Create a linked list of elements ("Ordered")
• Access pattern is sequential and predictable
• Plenty of spatial locality available
Pointer Chasing: Intra-block Shuffle
2. Randomize the traversal order of elements within each block ("Intra-block shuffle")
• Elements within each small contiguous block are visited in random order
• The block-to-block access pattern is still sequential
Pointer Chasing: Block Shuffle
3. Randomize the traversal order of the blocks ("Block shuffle")
• Overall access pattern is now random
• Small chunks of sequential locality are still available
(A sketch of all three initialization modes follows.)
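A sketch of how the three initialization modes above could be combined; the deck gives no code, so this structure (and the assumption that block_size divides n) is illustrative:

```c
#include <stdlib.h>

typedef struct node { struct node *next; long payload; } node_t;  /* 16 B element */

/* Fisher-Yates shuffle of a permutation of indices. */
static void shuffle(long *idx, long n)
{
    for (long i = n - 1; i > 0; --i) {
        long j = rand() % (i + 1);
        long t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}

/* Link `nodes` in a traversal order built in three steps: identity
 * ("Ordered"), optional shuffle within each block ("Intra-block
 * shuffle"), optional shuffle of block positions ("Block shuffle"). */
void build_list(node_t *nodes, long n, long block_size,
                int intra_shuffle, int block_shuffle)
{
    long nblocks = n / block_size;
    long *order = malloc(n * sizeof(long));
    for (long i = 0; i < n; ++i)
        order[i] = i;

    if (intra_shuffle)                      /* step 2 */
        for (long b = 0; b < nblocks; ++b)
            shuffle(&order[b * block_size], block_size);

    if (block_shuffle) {                    /* step 3 */
        long *perm = malloc(nblocks * sizeof(long));
        long *tmp  = malloc(n * sizeof(long));
        for (long b = 0; b < nblocks; ++b)
            perm[b] = b;
        shuffle(perm, nblocks);
        for (long b = 0; b < nblocks; ++b)
            for (long i = 0; i < block_size; ++i)
                tmp[b * block_size + i] = order[perm[b] * block_size + i];
        for (long i = 0; i < n; ++i)
            order[i] = tmp[i];
        free(tmp);
        free(perm);
    }

    for (long i = 0; i + 1 < n; ++i)        /* link in traversal order */
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[n - 1]].next = NULL;
    free(order);
}
```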
Pointer Chasing: Sandy Bridge Xeon Results

Pointer Chasing: Emu Hardware Results (single-node)

Pointer Chasing: Bandwidth Utilization
Simulator Validation: STREAM
When configured to match the current hardware specifications, the simulator results closely match the hardware for both local STREAM and global STREAM.
Simulator Validation: Migrations
• Pointer chasing performs 2× better in the simulator
• The simulator was over-estimating migration throughput
• The updated simulator matches more closely
Conclusions
• First independent evaluation of the Emu Chick prototype
• STREAM bandwidth is low, but scales well
• Memory layout and thread management decisions are critical to achieving scalability in SpMV
• Pointer chasing maintains 80% memory-bandwidth utilization even in a worst-case access scenario
• Future work: applying these lessons to streaming graph analytics and sparse tensor processing