Flexible Architectural Support for Fine-Grain Scheduling
Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis
March 16th, 2010
Stanford University
Overview
• Our focus: User-level schedulers for parallel runtimes
  – Cilk, TBB, OpenMP, …
• Trends:
  – More cores per chip → need to exploit finer-grain parallelism
  – Deeper memory hierarchies and costlier cache coherence → communication through shared memory increasingly inefficient
• Existing fine-grain schedulers:
  – Software-only: Slow, do not scale
  – Hardware-only: Fast, but inflexible
• Our contribution: Hardware-aided approach
  – HW: Fast, asynchronous messages between threads (ADM)
  – SW: Scalable message-passing schedulers
  – ADM schedulers scale like HW, flexible like SW schedulers
Outline
• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation
Fine-grain parallelism
• Fine-grain parallelism: Divide the work in a parallel phase into small tasks (~1K–10K instructions)
• Potential advantages:
  – Expose more parallelism
  – Reduce load imbalance
  – Adapt to a dynamic environment (e.g. a changing number of cores)
• Potential disadvantages:
  – Large scheduling overheads
  – Poor locality (if the application has inter-task locality)
Task-stealing schedulers
[Figure: threads T0…Tn, each with its own task queue; threads enqueue and dequeue locally and steal from other queues]
• One task queue per thread
• Threads dequeue and enqueue tasks from their queues
• When a thread runs out of work, it tries to steal tasks from another thread
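To make the pattern concrete, here is a minimal lock-based sketch of task stealing in C. All names (task_t, queue_t, deq_pop, deq_steal) are illustrative, and production runtimes such as Cilk and TBB use lock-free deques instead of a mutex:

    /* Minimal task-stealing sketch: one queue per thread; the owner pops
     * LIFO from the tail, thieves steal FIFO from the head. Lock-based
     * for brevity; termination detection is omitted. */
    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_THREADS 128
    #define QUEUE_CAP   1024

    typedef struct { void (*run)(void *); void *arg; } task_t;

    typedef struct {
        task_t buf[QUEUE_CAP];
        int head, tail;              /* head: steal end, tail: owner end */
        pthread_mutex_t lock;
    } queue_t;

    static queue_t queues[MAX_THREADS];

    static bool deq_pop(queue_t *q, task_t *t) {   /* owner: LIFO */
        pthread_mutex_lock(&q->lock);
        bool ok = q->tail > q->head;
        if (ok) *t = q->buf[--q->tail];
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    static bool deq_steal(queue_t *q, task_t *t) { /* thief: FIFO */
        pthread_mutex_lock(&q->lock);
        bool ok = q->tail > q->head;
        if (ok) *t = q->buf[q->head++];
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    static void worker(int my_id, int nthreads) {
        task_t t;
        for (;;) {
            if (deq_pop(&queues[my_id], &t)) { t.run(t.arg); continue; }
            /* out of local work: scan the other threads for a task to steal */
            for (int v = (my_id + 1) % nthreads; v != my_id;
                 v = (v + 1) % nthreads)
                if (deq_steal(&queues[v], &t)) { t.run(t.arg); break; }
        }
    }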
Task-stealing: Components
1. Queues (enqueue/dequeue)
2. Policies
3. Communication (steals)
• In software schedulers:
  – Queues and policies are cheap
  – Communication through shared memory is increasingly expensive!
[Figure: execution time breakdown into App, Queues, Stealing, Starved]
Hardware schedulers: Carbon
• Carbon [ISCA '07]: HW queues, policies, and communication
  – One hardware LIFO task queue per core
  – Special instructions to enqueue/dequeue tasks
• Implementation:
  – Centralized queues for fast stealing (Global Task Unit)
  – One small task buffer per core to hide GTU latency (Local Task Units)
• Large benefits if the app matches the HW policies; useless if it doesn't
[Figure: execution time breakdown (App, Queues, Stealing, Starved) with 31x and 26x speedup annotations]
Approaches to fine-grain scheduling
• Software-only (OpenMP, TBB, Cilk, X10, …): SW queues & policies, SW communication
  – High-overhead, but flexible and requires no extra HW
• Hardware-only (Carbon, GPUs, …): HW queues & policies, HW communication
  – Low-overhead, but inflexible and requires special-purpose HW
• Hardware-aided (Asynchronous Direct Messages): SW queues & policies, HW communication
  – Low-overhead, flexible, and uses general-purpose HW
Outline
• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation
Asynchronous Direct Messages
• ADM: Messaging between threads, tailored to scheduling and control needs:
  – Low-overhead: send from / receive to registers; independent from coherence
  – Short, asynchronous messages: overlap communication with computation via user-level interrupts
  – Generic interface: general-purpose, allows reuse
ADM Microarchitecture
• One ADM unit per core:
  – Receive buffer holds messages until dequeued by the thread
  – Send buffer holds sent messages pending acknowledgement
  – Thread ID Translation Buffer translates TID → core ID on sends
  – Small structures (16–32 entries) that don't grow with the number of cores
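As a mental model only (field names and sizes are assumptions for exposition, not RTL), the per-core ADM state amounts to:

    /* Illustrative C model of the per-core ADM unit described above. */
    #include <stdint.h>

    #define ADM_MAX_WORDS 6
    #define ADM_ENTRIES   16   /* small and fixed; independent of core count */

    typedef struct {
        uint16_t src_tid;
        uint8_t  len;                     /* 0-6 words */
        uint64_t words[ADM_MAX_WORDS];
    } adm_msg_t;

    typedef struct {
        adm_msg_t rx_buf[ADM_ENTRIES];    /* messages waiting to be dequeued */
        adm_msg_t tx_buf[ADM_ENTRIES];    /* sent messages pending acknowledgement */
        uint16_t  ttb[ADM_ENTRIES];       /* Thread ID Translation Buffer: TID -> core ID */
    } adm_unit_t;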
ADM ISA

Instruction        Description
adm_send r1, r2    Sends a message of (r1) words (0–6) to the thread with ID (r2)
adm_peek r1, r2    Returns the source and message length at the head of the rx buffer
adm_rx r1, r2      Dequeues the message at the head of the rx buffer
adm_ei / adm_di    Enable / disable receive interrupts

• Send and receive are atomic (single instruction)
  – Send completes when the message is copied to the send buffer
  – Receive blocks if the buffer is empty
  – Peek doesn't block, enabling polling
• The ADM unit generates a user-level interrupt on the running thread when a message is received
  – No stack switching; handler code partially saves context (used registers) → fast
  – Interrupts can be disabled to preserve atomicity w.r.t. message reception
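The blocking/non-blocking semantics can be modeled in plain software. This is an illustrative model of the receive side only (not the hardware, which needs no locks), reusing adm_msg_t and ADM_ENTRIES from the sketch above:

    /* Software model of the receive-side semantics: adm_peek never
     * blocks, adm_rx blocks while the buffer is empty. A model sender
     * would append a message and signal rx->nonempty. */
    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        adm_msg_t slots[ADM_ENTRIES];
        int head, count;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
    } rx_model_t;

    /* adm_peek: report source/length at the head without dequeuing */
    bool model_adm_peek(rx_model_t *rx, int *src, int *len) {
        pthread_mutex_lock(&rx->lock);
        bool ok = rx->count > 0;
        if (ok) { *src = rx->slots[rx->head].src_tid;
                  *len = rx->slots[rx->head].len; }
        pthread_mutex_unlock(&rx->lock);
        return ok;
    }

    /* adm_rx: dequeue the head message, blocking while empty */
    adm_msg_t model_adm_rx(rx_model_t *rx) {
        pthread_mutex_lock(&rx->lock);
        while (rx->count == 0) pthread_cond_wait(&rx->nonempty, &rx->lock);
        adm_msg_t m = rx->slots[rx->head];
        rx->head = (rx->head + 1) % ADM_ENTRIES;
        rx->count--;
        pthread_mutex_unlock(&rx->lock);
        return m;
    }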
Outline
• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation
ADM Schedulers
• Message-passing schedulers
• Replace the parallel runtime's (e.g. TBB's) scheduler
  – The application programmer is oblivious to this
• Threads can perform two roles:
  – Worker: Execute the parallel phase, enqueue & dequeue tasks
  – Manager: Coordinate task stealing & parallel phase termination
• Centralized scheduler: A single manager coordinates all workers
[Figure: manager T0 above workers T0–T3; T0 is both manager and worker!]
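A hedged sketch of the worker role under the centralized scheduler, building on the earlier sketches (deq_pop, queues, task_t, ADM_MAX_WORDS). The message tags, the wrapper functions, and MANAGER_TID are assumptions, not the paper's exact protocol:

    /* Worker loop: run local tasks; when starved, notify the manager
     * and block on the next incoming message. */
    #include <stdint.h>

    enum { MSG_UPDATE, MSG_STEAL_REQ, MSG_TASK, MSG_STARVED };
    #define MANAGER_TID 0

    /* assumed thin wrappers over adm_send / adm_rx */
    extern void adm_send_words(const uint64_t *msg, int words, int dst_tid);
    extern int  adm_rx_words(uint64_t *msg);   /* blocks; returns word count */

    void worker_loop(int my_tid) {
        task_t t;
        uint64_t msg[ADM_MAX_WORDS];
        for (;;) {
            if (deq_pop(&queues[my_tid], &t)) { t.run(t.arg); continue; }
            msg[0] = MSG_STARVED;
            adm_send_words(msg, 1, MANAGER_TID);
            adm_rx_words(msg);                 /* wait for the manager / a victim */
            if (msg[0] == MSG_TASK) {
                /* unpack the task descriptor from msg[1..] and enqueue it */
            }
        }
    }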
Centralized Scheduler: Updates
[Figure: manager T0 keeps approximate per-worker task counts while workers T0–T3 send UPDATE <4>, UPDATE <8> messages as their queues grow]
• The manager keeps approximate task counts for each worker
• Workers only notify the manager at exponential thresholds
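A sketch of the exponential-threshold policy; the power-of-two bucketing is an assumption consistent with the slide, and it reuses the assumed MSG_UPDATE / adm_send_words names from above:

    /* Send an UPDATE only when the queue size crosses a power-of-two
     * boundary, so update traffic stays logarithmic in queue growth. */
    void maybe_send_update(int my_tid, int qsize, int *last_bucket) {
        /* __builtin_clz is a GCC/Clang builtin; qsize==0 maps to bucket 0 */
        int bucket = qsize ? 1 << (31 - __builtin_clz((unsigned)qsize)) : 0;
        if (bucket != *last_bucket) {
            uint64_t msg[2] = { MSG_UPDATE, (uint64_t)qsize };
            adm_send_words(msg, 2, MANAGER_TID);
            *last_bucket = bucket;
        }
    }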
Centralized Scheduler: Steals
[Figure: manager T0 sends STEAL_REQ <T1→T2, 1> to the victim, which sends a TASK message to the thief and an UPDATE <1> to the manager]
• The manager requests a steal from the worker with the most tasks
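The manager side of that exchange might look like the following sketch (message layout and names are assumptions; approx_tasks is the manager's approximate-count table from the previous slide):

    /* When a worker reports it is starved, direct the worker believed
     * to have the most tasks to send work to the thief. */
    extern int approx_tasks[MAX_THREADS];
    extern int num_workers;

    void manager_on_starved(int thief_tid) {
        int victim = -1, max = 0;
        for (int t = 0; t < num_workers; t++)
            if (approx_tasks[t] > max) { max = approx_tasks[t]; victim = t; }
        if (victim >= 0) {
            uint64_t msg[3] = { MSG_STEAL_REQ, (uint64_t)thief_tid, 1 /* # tasks */ };
            adm_send_words(msg, 3, victim);  /* victim replies with TASK message(s) */
        }
    }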
Hierarchical Scheduler
• Centralized scheduler:
  – Does all communication through messages
  – Enables directed stealing and task prefetching
  – But does not scale beyond ~16 threads
• Solution: Hierarchical scheduler
  – Workers and managers form a tree
[Figure: T1 as 2nd-level manager over 1st-level managers T0 and T4, which manage workers T0–T3 and T4–T7]
Hierarchical Scheduler: Steals
[Figure: a steal propagates up and across the manager tree; TASK (x2) and TASK (x4) messages flow down to rebalance the partitions]
• Steals can span multiple levels
  – A single steal rebalances two partitions at once
  – Scales to hundreds of threads
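One plausible shape for the multi-level case, as a sketch only (the escalation rule and helpers are assumptions, reusing manager_on_starved from the centralized sketch):

    /* If a first-level manager's whole partition is out of work,
     * escalate the steal request one level up the tree; the parent
     * rebalances across partitions. */
    extern int partition_total_tasks(void);   /* assumed helper */

    void manager_on_starved_hier(int thief_tid, int parent_tid) {
        if (partition_total_tasks() > 0) {
            manager_on_starved(thief_tid);          /* resolve locally */
        } else {
            uint64_t msg[2] = { MSG_STEAL_REQ, (uint64_t)thief_tid };
            adm_send_words(msg, 2, parent_tid);     /* escalate one level up */
        }
    }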
Outline
• Introduction
• Asynchronous Direct Messages (ADM)
• ADM schedulers
• Evaluation
Evaluation
• Simulated machine: Tiled CMP
  – 32, 64, or 128 in-order, dual-thread SPARC cores (64–256 threads)
  – 3-level cache hierarchy, directory coherence
• Benchmarks:
  – Loop-parallel: canneal, cg, gtfold
  – Task-parallel: maxflow, mergesort, ced, hashjoin
  – Focus on a representative subset of results; see the paper for the full set
[Figure: tile of the simulated 64-core, 16-tile CMP]
Results
[Figure: execution time breakdown (App, Queues, Stealing, Starved) for SW, Carbon, and ADM schedulers]
• SW scalability is limited by scheduling overheads
• Carbon and ADM: Small overheads that scale
• ADM matches Carbon → no need for a HW scheduler!
Flexible policies: gtfold case study
• In gtfold, FIFO queues allow tasks to clear critical dependences faster
  – FIFO queues are trivial in SW and ADM schedulers
  – Carbon (HW) is stuck with LIFO
• ADM achieves a 40x speedup over Carbon
• Can't implement all scheduling policies in HW!
[Figure: gtfold results with 31x and 26x speedup annotations]
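The flexibility point in code: in a software or ADM scheduler, switching the owner's local dequeue from LIFO to FIFO is a tiny local change (reusing queue_t from the first sketch), whereas a hardwired LIFO queue cannot be changed at all:

    /* FIFO variant of the owner's pop: take the oldest task first. */
    bool deq_pop_fifo(queue_t *q, task_t *t) {
        pthread_mutex_lock(&q->lock);
        bool ok = q->tail > q->head;
        if (ok) *t = q->buf[q->head++];   /* head instead of --tail */
        pthread_mutex_unlock(&q->lock);
        return ok;
    }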