Data-Centric Execution of Speculative Parallel Programs
Mark Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, Daniel Sanchez
MICRO 2016

Executive summary
Many-cores must exploit cache locality to scale
Current speculative systems, e.g. thread-level speculation (TLS) or transactional memory (TM), do not exploit locality
Spatial Hints: run tasks likely to access the same data in the same place
  ◦ A software-given hint denotes the data a new task is likely to access
  ◦ Hardware maps tasks with the same hint to the same place
  ◦ Hardware uses hints to perform locality-aware load balancing
Our techniques make speculative parallelism practical at large scale
  ◦ It is easy to modify programs to convey locality through hints
  ◦ Performance improves by 3.3x at 256 cores
  ◦ We reduce wasted work by 6.4x and network traffic by 3.5x

Prior speculative systems scale poorly
TRANSACTIONAL MEMORY (TM) SCHEDULERS
  ◦ Reduce wasted work of coarse-grain transactions
  ◦ Limit concurrency: When to run a task?
SPATIAL HINTS
  ◦ Make accesses local for fine-grain tasks
  ◦ Less data movement: Where to run a task?
Spatially map tasks for improved locality and less waste

Prior non-speculative locality techniques do not work for speculation
STATIC TASK MAPPING
Data dependences known a priori
  ◦ Linear algebra, Anton 2 [ASPLOS '13]
  ◦ Limited application types
Graph partitioning
  ◦ Localizes communication and scheduling
  ◦ Slow preprocessing step
  ◦ Cannot adapt to imbalance
DYNAMIC TASK MAPPING
Work stealing
  ◦ Cheap, local enqueues
  ◦ Steals to adapt to imbalance
  ◦ Stealing interferes with speculation

Baseline Architecture: Swarm [MICRO '15]

Baseline Swarm execution model
Programs consist of timestamped tasks
  ◦ Tasks can create child tasks with a timestamp >= their own
  ◦ Tasks appear to execute in timestamp order
    swarm::enqueue(function_pointer, timestamp, arguments...);
General execution model supports ordered and unordered parallelism
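
The ordering rule above can be illustrated with a small sequential model. This is only a sketch of the semantics (tasks appear to run in timestamp order, and children carry a timestamp no smaller than their parent's); it is not Swarm's implementation or API. `OrderedExecutor`, `Task`, and `demoOrder` are hypothetical names, and the tie-break by creation order is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;

// Hypothetical task record: a timestamp, a creation sequence number
// (assumed tie-break), and the work to run.
struct Task {
    Timestamp ts;
    std::uint64_t seq;
    std::function<void(Timestamp)> fn;
};

// Orders the priority queue so the smallest timestamp is on top.
struct LaterFirst {
    bool operator()(const Task& a, const Task& b) const {
        if (a.ts != b.ts) return a.ts > b.ts;
        return a.seq > b.seq;
    }
};

// Sequential stand-in for timestamp-ordered execution (no speculation).
class OrderedExecutor {
public:
    void enqueue(std::function<void(Timestamp)> fn, Timestamp ts) {
        assert(ts >= now_);  // children must not precede the running task
        queue_.push(Task{ts, nextSeq_++, std::move(fn)});
    }

    void run() {
        while (!queue_.empty()) {
            Task t = queue_.top();  // copy before pop; fn may enqueue children
            queue_.pop();
            now_ = t.ts;
            t.fn(t.ts);
        }
    }

private:
    std::priority_queue<Task, std::vector<Task>, LaterFirst> queue_;
    std::uint64_t nextSeq_ = 0;
    Timestamp now_ = 0;
};

// Usage: two root tasks at timestamps 5 and 1; the task at 1 creates a child at 3.
std::vector<Timestamp> demoOrder() {
    OrderedExecutor ex;
    std::vector<Timestamp> order;
    ex.enqueue([&](Timestamp ts) { order.push_back(ts); }, 5);
    ex.enqueue([&](Timestamp ts) {
        order.push_back(ts);
        ex.enqueue([&](Timestamp child) { order.push_back(child); }, ts + 2);
    }, 1);
    ex.run();
    return order;
}
```

In this model `demoOrder()` visits timestamps 1, 3, 5. Real hardware runs such tasks speculatively out of order and only has to preserve this apparent order.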

Baseline Swarm architecture
Speculatively executes tasks out of order
Large hardware task queues
Scalable ordered speculation
Scalable ordered commits
Efficiently supports tiny speculative tasks
[Figure: 64-tile, 256-core chip with Mem / IO at the edges. Tile organization: four cores, each with L1I/D caches, plus an L2, an L3 slice, a router, and a task unit.]

Spatial Hints in Action
COMBINING SPECULATION AND LOCALITY

Example: Discrete event simulation (DES)
[Figure: a small gate-level circuit with nodes labeled A, B, C, D, E, r, s, and t, annotated with t = r XOR s. Below it, a timeline of tasks, each setting one signal value (e.g. A=1, B=1, r=1, s=1, t=0, t=1).]
Order = Simulated time (ns), from 0 to 6

Extracting parallelism in DES
Execute independent tasks out of order
[Figure: the DES tasks, their data dependences, and a valid schedule over simulated time (0 to 6 ns). The schedule is labeled 2.4x parallelism (more in larger circuits).]
Parallelism is plentiful despite data dependences

Speculation scales poorly without locality
Swarm sends new tasks to random tiles
  ◦ Good for load balance
  ◦ Poor locality hurts scalability beyond 100 cores
Work stealing: a non-speculative scheduler
  ◦ Enqueue new tasks locally
  ◦ Steal from the most-loaded tile
  ◦ Not a good strategy for DES
[Figure: des scalability plot comparing Random and Stealing.]

Where is the locality?
Each task operates on a single gate
The gate is known when the task is created
With fine-grain tasks, most data accessed is known at creation time
[Figure: the DES schedule from the earlier example.]

Data-centric speculation scales well
Hints: map each gate to a statically-chosen tile
Send new tasks for a gate to its corresponding tile
  1. Less data movement
  2. Conflicts are local, cheap, and less frequent
But we can do better!
[Figure: des scalability plot comparing Hints, Random, and Stealing; the Hints curve is labeled 186x.]

Load-balanced speculation scales best
Static gate-to-tile mapping may cause hotspots
  ◦ E.g. some gates toggle more frequently
Dynamically remap gates (Hints) across tiles
Programmer knows most of the data accessed
Spatial Hints convey program-level knowledge to exploit locality
[Figure: des scalability plot comparing Load-Balanced Hints, Hints, Random, and Stealing; the Load-Balanced Hints curve is labeled 236x.]

Spatial Hints Implementation

Hint mechanisms are straightforward
SOFTWARE
A Spatial Hint is an integer value
  ◦ Given at task creation time
  ◦ Denotes data likely to be accessed by the task
  ◦ E.g. the gate ID in DES
HARDWARE
Hashes each new task's Hint to a tile ID
Serializes same-Hint tasks
Localize most data accesses within a tile
Serialize tasks likely to conflict
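
A minimal sketch of the hint-to-tile step, assuming a software model of what the hardware does. The slides do not say which hash function is used; the splitmix64-style mixer below is a stand-in, and `mixHint` and `tileForHint` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in hash (splitmix64 finalizer). The real hardware hash is not
// specified in the slides; any well-mixing function would do for this sketch.
inline std::uint64_t mixHint(std::uint64_t hint) {
    hint = (hint ^ (hint >> 30)) * 0xbf58476d1ce4e5b9ULL;
    hint = (hint ^ (hint >> 27)) * 0x94d049bb133111ebULL;
    return hint ^ (hint >> 31);
}

// Maps an integer hint to a tile ID in [0, numTiles).
// Equal hints always yield the same tile, which is what localizes accesses.
inline std::uint32_t tileForHint(std::uint64_t hint, std::uint32_t numTiles) {
    assert(numTiles > 0);
    return static_cast<std::uint32_t>(mixHint(hint) % numTiles);
}
```

For DES, the hint would be the gate ID, so every task for a given gate lands on the same tile of a 64-tile chip: `tileForHint(gateId, 64)`.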

Load balance with a level of indirection
Static hint-to-tile mapping may cause imbalance
Instead, periodically remap hints across tiles to equalize load
[Figure: statically, a hint (e.g. 0xF00) is hashed (H) directly to a tile ID. With indirection, the hint is hashed to a bucket, and a reconfigurable tile map translates the bucket to a tile ID.]
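
The indirection can be sketched as a two-step lookup: hash the hint to a bucket, then read the bucket's current tile from a table that can be rewritten. This is a software illustration only; `TileMap`, its method names, the round-robin initial assignment, and the hash are assumptions, not details from the slides.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical software model of a reconfigurable tile map.
class TileMap {
public:
    TileMap(std::uint32_t numBuckets, std::uint32_t numTiles)
        : bucketToTile_(numBuckets) {
        assert(numBuckets > 0 && numTiles > 0);
        // Assumed initial assignment: spread buckets round-robin over tiles.
        for (std::uint32_t b = 0; b < numBuckets; ++b)
            bucketToTile_[b] = b % numTiles;
    }

    // Step 1: hint -> bucket (fixed hash, never changes).
    std::uint32_t bucketOf(std::uint64_t hint) const {
        return static_cast<std::uint32_t>(hashHint(hint) % bucketToTile_.size());
    }

    // Step 2: bucket -> tile (table lookup, can be reconfigured).
    std::uint32_t tileOf(std::uint64_t hint) const {
        return bucketToTile_[bucketOf(hint)];
    }

    // Reassigns one bucket; all hints in that bucket move together.
    void remap(std::uint32_t bucket, std::uint32_t tile) {
        assert(bucket < bucketToTile_.size());
        bucketToTile_[bucket] = tile;
    }

private:
    // Stand-in hash (splitmix64 finalizer); the real one is not specified.
    static std::uint64_t hashHint(std::uint64_t h) {
        h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
        h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
        return h ^ (h >> 31);
    }

    std::vector<std::uint32_t> bucketToTile_;
};
```

Because only the table changes, remapping a bucket moves every hint that hashes to it without touching the hash or the program.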

"Load" is different for speculation
Non-speculative systems use # queued tasks as a proxy for load
When imbalanced, speculative systems often
  ◦ Don't run out of work
  ◦ Abort more work or strain speculation resources
Remap hints to tiles to balance # of committed cycles per tile
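
To make the load metric concrete, here is one possible greedy remapping step that uses committed cycles per bucket as the load signal. It is an illustrative assumption, not the reconfiguration algorithm from the paper (which the slides only reference); `committedCyclesPerTile` and `rebalanceStep` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums committed cycles of all buckets currently mapped to each tile.
std::vector<std::uint64_t> committedCyclesPerTile(
        const std::vector<std::uint32_t>& bucketToTile,
        const std::vector<std::uint64_t>& bucketCycles,
        std::uint32_t numTiles) {
    std::vector<std::uint64_t> load(numTiles, 0);
    for (std::size_t b = 0; b < bucketToTile.size(); ++b)
        load[bucketToTile[b]] += bucketCycles[b];
    return load;
}

// One greedy step: move a single bucket from the most-loaded tile to the
// least-loaded tile, choosing the bucket that best narrows the gap between
// the two. Returns false if no move would narrow it.
bool rebalanceStep(std::vector<std::uint32_t>& bucketToTile,
                   const std::vector<std::uint64_t>& bucketCycles,
                   std::uint32_t numTiles) {
    assert(numTiles > 0 && bucketToTile.size() == bucketCycles.size());
    const std::vector<std::uint64_t> load =
        committedCyclesPerTile(bucketToTile, bucketCycles, numTiles);
    const auto hotIt = std::max_element(load.begin(), load.end());
    const auto coldIt = std::min_element(load.begin(), load.end());
    const auto hot = static_cast<std::uint32_t>(hotIt - load.begin());
    const auto cold = static_cast<std::uint32_t>(coldIt - load.begin());
    const std::uint64_t gap = *hotIt - *coldIt;

    bool found = false;
    std::size_t best = 0;
    std::uint64_t bestResidual = gap;
    for (std::size_t b = 0; b < bucketToTile.size(); ++b) {
        const std::uint64_t c = bucketCycles[b];
        // Moving c cycles leaves a gap of |gap - 2c|, smaller only if 0 < c < gap.
        if (bucketToTile[b] != hot || c == 0 || c >= gap) continue;
        const std::uint64_t residual = (2 * c > gap) ? 2 * c - gap : gap - 2 * c;
        if (residual < bestResidual) {
            found = true;
            best = b;
            bestResidual = residual;
        }
    }
    if (!found) return false;
    bucketToTile[best] = cold;
    return true;
}
```

With four buckets of 50, 30, 20, and 10 committed cycles mapped to tiles 0, 0, 0, and 1, one step moves the 50-cycle bucket and the tile loads go from (100, 10) to (50, 60). Counting queued tasks instead would miss this imbalance whenever both tiles still have work queued.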

Adding hints to applications is easy
    void desTask(Timestamp ts, GateInput* input) {
        Gate* g = input->gate();
        // g is a pointer, so members are accessed with ->
        bool toggledOutput = g->simulateToggle(input);
        if (toggledOutput) {
            // Toggle all inputs connected to this gate
            for (GateInput* i : g->connectedInputs())
                swarm::enqueue(desTask,
                               /*Timestamp*/ ts + delay(g, i),
                               /*Hint*/ i->gate()->id,
                               i);
        }
    }
One line of code to express the Gate ID as a Hint

Adding hints to applications is easy
Benchmark                  Hint                       Why?
des                        Gate ID                    Map tasks for same gate to same tile
nocsim                     Router ID                  Frequent intra-router communication
bfs, sssp, astar, color    Cache-line address         Several vertices reside on the same line
silo                       (Table ID, primary key)    Each task accesses one database tuple
genome, kmeans             Multiple
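
Two of the hint choices in the table reduce to a line of arithmetic each. The sketch below assumes 64-byte cache lines and a 16-bit/48-bit packing of (table ID, primary key); both are illustrative assumptions, and `cacheLineHint` and `tupleHint` are hypothetical helpers rather than part of any Swarm API.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64-byte cache lines (the slides do not state the line size).
constexpr std::uint64_t kLineBytes = 64;

// Cache-line-address hint: all data on the same line shares one hint,
// e.g. several graph vertices stored next to each other.
inline std::uint64_t cacheLineHint(const void* addr) {
    return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(addr)) / kLineBytes;
}

// (Table ID, primary key) hint for a database tuple.
// Hypothetical packing: table ID in the top 16 bits, key in the low 48 bits.
inline std::uint64_t tupleHint(std::uint16_t tableId, std::uint64_t primaryKey) {
    assert(primaryKey < (1ULL << 48));  // keeps the packing collision-free
    return (static_cast<std::uint64_t>(tableId) << 48) | primaryKey;
}
```

With 4-byte elements in a 64-byte-aligned array, elements 0 through 15 share a hint and element 16 starts the next line.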

See the paper for more details!
Load balance reconfiguration algorithm
Choice of application hints
Relationship between task size and hint effectiveness

Evaluation

Methodology
Event-driven, Pin-based simulator
Scalability experiments from 1 – 256 cores
  ◦ Scaled-down systems have fewer tiles
Target system: 256-core, 64-tile chip
  ◦ 64 MB shared L3 (1 MB/tile)
  ◦ 256 KB per-tile L2s
  ◦ 16 KB per-core L1s
  ◦ In-order, single-issue, scoreboarded cores
  ◦ 16K task queue entries (64/core)
  ◦ 4K commit queue entries (16/core)
[Figure: chip with Mem / IO at the edges; each tile has four cores with L1I/D caches, an L2, an L3 slice, a router, and a task unit.]
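
The target-system totals follow from the per-tile and per-core figures. As a quick consistency check, the arithmetic can be written as compile-time constants; the constant names are made up for this sketch, and four cores per tile is derived from 256 cores over 64 tiles.

```cpp
#include <cstdint>

// Per-tile and per-core parameters taken from the methodology slide.
constexpr std::uint32_t kTiles = 64;
constexpr std::uint32_t kCoresPerTile = 4;        // derived: 256 cores / 64 tiles
constexpr std::uint32_t kL3MBPerTile = 1;
constexpr std::uint32_t kTaskQueuePerCore = 64;
constexpr std::uint32_t kCommitQueuePerCore = 16;

// Chip-wide totals.
constexpr std::uint32_t kCores = kTiles * kCoresPerTile;
constexpr std::uint32_t kL3MB = kTiles * kL3MBPerTile;
constexpr std::uint32_t kTaskQueueEntries = kCores * kTaskQueuePerCore;
constexpr std::uint32_t kCommitQueueEntries = kCores * kCommitQueuePerCore;

static_assert(kCores == 256, "256-core chip");
static_assert(kL3MB == 64, "64 MB shared L3");
static_assert(kTaskQueueEntries == 16 * 1024, "16K task queue entries");
static_assert(kCommitQueueEntries == 4 * 1024, "4K commit queue entries");
```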

Hints make speculation practical on large-scale systems
Load-Balanced Hints 3.3x faster than Random (193x gmean vs 58x)
Load-Balanced Hints 17% – 27% faster than Hints
Stealing is inconsistent across benchmarks
[Figure: speedup vs. core count (1c to 256c) for bfs, sssp, astar, color, des, nocsim, silo, genome, and kmeans, comparing Random, Hints, and LBHints.]

Hints make speculation more efficient
Reduce wasted work by 6.4x
Reduce network traffic by 3.5x
[Figure: two bar charts, Aborted Cycles and NoC data transferred, each on a 0.0 to 1.0 axis, with bars labeled R and L for bfs, sssp, astar, color, des, nocsim, silo, genome, and kmeans.]