EDGE, TRIPS, and CLP: Bending architecture to fit workload
Zachary Weinberg
22 Jan 2009
The superscalar problem
◮ To have many instructions in flight at once, need huge on-chip control structures
  ◮ issue queue, reorder buffer, rename registers, hazard detection, bypass network, result bus, ...
◮ Wire delays limit these to 100 instructions or so
  ◮ Alpha 21264: 80
  ◮ Intel Core 2: 96
  ◮ Intel Nehalem: 128 (64 with two threads)
◮ Heavy load on branch prediction
  ◮ Have only a few gate delays to make a prediction
  ◮ Need to issue a speculative branch nearly every cycle
  ◮ Need near-perfect accuracy to avoid frequent pipeline flushes
Grid processor
[Figure: TRIPS prototype block diagram — two processors, each a grid of tiles (G: global control, R: register, I: instruction, D: data, E: execution, surrounded by N network tiles and M memory tiles), plus DMA, SDC, EBC, and C2C controllers, two SDRAM channels, and a secondary memory system.]
[Figure: TRIPS micronetworks — the global dispatch network (GDN) issues block fetch commands and dispatches instructions; the global status network (GSN) signals completion of block execution, I-cache miss refill, and block commit; the operand network (OPN) handles transport of all data operands; the global control network (GCN) issues block commit and block flush commands.]
◮ Limit communication between tiles
◮ Limit size of global control logic (G-tile only)
◮ Tremendous execution bandwidth (64 insns per E-tile)
◮ This is what we’d like to build, but how?
Data flow architectures avoid the superscalar problem
◮ Instructions activate when they have all their inputs (see the sketch after this slide)
◮ There is no “program order” to maintain
◮ Each instruction says where its output goes
◮ Much control flow is replaced with predicated execution
which means...
◮ Issue queue is simpler and can be distributed
◮ No reorder buffer necessary
◮ No need for rename registers or a shared result bus
◮ Reduced load on branch prediction
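An aside: a minimal sketch of the firing rule in Python. The three-instruction program, the target-list encoding, and all names here are hypothetical, invented for illustration; real dataflow machines do this in hardware, but the rule is the same — an instruction fires once all its operand slots are full, and forwards its result directly to named consumers rather than through a shared register file.

    from operator import add, mul

    # Each instruction names the consumers of its result as (insn id, slot)
    # pairs; there is no program counter and no shared result bus.
    PROGRAM = {
        0: {"op": add, "arity": 2, "targets": [(2, 0)]},  # t0 = a + b
        1: {"op": mul, "arity": 2, "targets": [(2, 1)]},  # t1 = c * d
        2: {"op": add, "arity": 2, "targets": []},        # out = t0 + t1
    }

    def run(live_ins):
        """live_ins: {(insn id, slot): value} for the block's inputs."""
        operands = {i: {} for i in PROGRAM}
        ready, result = [], None

        def deliver(i, slot, value):
            operands[i][slot] = value
            if len(operands[i]) == PROGRAM[i]["arity"]:
                ready.append(i)                 # all inputs present: can fire

        for (i, slot), v in live_ins.items():
            deliver(i, slot, v)
        while ready:                            # no program order to maintain
            i = ready.pop()
            insn = PROGRAM[i]
            v = insn["op"](operands[i][0], operands[i][1])
            if not insn["targets"]:
                result = v                      # block output
            for j, slot in insn["targets"]:
                deliver(j, slot, v)             # forward straight to consumer
        return result

    print(run({(0, 0): 2, (0, 1): 3, (1, 0): 4, (1, 1): 5}))  # prints 25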
But they have their own problems
◮ Loops must be “throttled” to avoid swamping the system with tokens
◮ Some implementations require exotic memory hardware (e.g. I-structures)
◮ Arguably superior but unfamiliar concurrency model (also true for exceptions, virtual memory, multitasking)
◮ Difficult or impossible to support code written in a conventional language
EDGE: a middle ground
◮ ISA designed for grid processors
◮ Lay out code in hyperblocks
  ◮ one entry, many exits
  ◮ instructions within form a data flow graph
  ◮ static assignment to execution tiles
  ◮ commit all instructions at once (sketched below)
◮ Conventional control flow between blocks
◮ Exceptions delayed to block boundary
◮ Can speculatively execute ahead of current block
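A rough sketch of the block-atomic contract in Python. The block format, predicate closures, and register interface are invented here for illustration (and instructions run in order for simplicity, where real hardware fires them dataflow-style as above): writes are buffered while the block runs, a predicate stands in for an internal branch, and everything commits at once when the single taken exit is chosen.

    def execute_block(block, regs):
        """Run one hyperblock; commit all writes atomically at the end."""
        writes = {}                                 # buffered until commit
        read = lambda r: writes.get(r, regs[r])     # sees in-flight values
        for pred, dest, fn in block["insns"]:
            if pred is None or pred(read):          # predicated execution
                writes[dest] = fn(read)
        exit_id = block["exit"](read)               # one of the many exits
        regs.update(writes)                         # commit all at once
        return exit_id                              # next block chosen here

    # abs-difference hyperblock: r2 = |r0 - r1|, then take exit 0
    block = {
        "insns": [
            (None,                  "r2", lambda R: R("r0") - R("r1")),
            (lambda R: R("r2") < 0, "r2", lambda R: -R("r2")),  # if-converted
        ],
        "exit": lambda R: 0,
    }
    regs = {"r0": 3, "r1": 7, "r2": 0}
    print(execute_block(block, regs), regs["r2"])   # prints: 0 4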
Benefits
◮ Within a hyperblock, get the benefits of a data flow architecture
◮ Global structures only need to track inter-hyperblock state
  ◮ can be done with smaller structures
  ◮ gives more time to make decisions
  ◮ allows far more instructions in flight overall
◮ Avoids the problems of a pure data flow architecture
  ◮ Can execute conventional language code
  ◮ More familiar concurrency and exception model
  ◮ No looping within a hyperblock, so no loop throttling needed
Problems
◮ Only one branch target per hyperblock
  ◮ Not uncommon to need one per 5–10 instructions
  ◮ Compiler must aggressively if-convert and unroll loops
  ◮ Code size may increase significantly
◮ Intra-hyperblock scheduling is fragile
  ◮ Goal is to put dependent instructions near each other
  ◮ Optimal schedule depends on processor details
  ◮ Like VLIW, may need recompilation for good performance
◮ Exception model is awkward for virtual memory
  ◮ must repeat entire blocks for each page fault
  ◮ worst case O(n²) penalty (worked example below)
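To see where O(n²) comes from, consider an illustrative worst case — a back-of-the-envelope model, not a measurement from the papers: a block of n instructions touches n distinct unmapped pages, exceptions only surface at the block boundary, and each fault maps one page and restarts the whole block.

    def wasted_instructions(n):
        """Instructions discarded by restarts when a block of n instructions
        faults on n distinct pages, one fault handled per restart."""
        # the k-th fault throws away the ~k instructions executed before it
        return sum(range(1, n + 1))         # n(n+1)/2, i.e. O(n^2)

    for n in (8, 32, 128):
        print(n, wasted_instructions(n))    # 36, 528, 8256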
First iteration: TRIPS
◮ Concrete design of a grid processor
◮ Simulated with simplifications (e.g. no page faults)
◮ 128-instruction hyperblocks
◮ Very simple execution tiles
◮ Three operational modes:
  ◮ D-morph: from one thread, take one definite and many speculative blocks
  ◮ T-morph: from several threads, take one definite and a few speculative blocks each
  ◮ S-morph: unroll a computational kernel across many blocks and run them all at once
◮ Values forwarded from producer to consumer blocks as available, through the register file
D-morph
[Figure: IPC on the floating-point and integer benchmarks (ammp, art, dct, equake, hydro2d, mgrid, mpeg2, swim, tomcatv, turb3d, and the mean) at speculative depths 1, 2, 4, 8, 16, and 32, plus depth-32 runs with perfect memory and with perfect memory and branch prediction, against an Alpha 21264 baseline.]
◮ Most like a regular single-thread OOO processor
◮ Tested on a subset of SPEC (what the compiler could handle)
◮ Skip initialization, count only “useful” insns
◮ Competitive with the Alpha even with no speculation
◮ Leaves the Alpha in the dust with deeper speculation, especially for floating point
◮ Mispredictions hurt, especially on integer code
T-morph
◮ Like any SMT, sacrifices per-thread resources for concurrency
◮ Biggest hit is from lower speculative depth and higher network contention
◮ Selected eight SPEC benchmarks and ran them in pairs, fours, or all at once
  ◮ Two threads: 87% of single-thread throughput, 1.7x speedup
  ◮ Four threads: 61% of single-thread throughput, 2.4x speedup
  ◮ Eight threads: 39% of single-thread throughput, 2.9x speedup
◮ No comparison to multitasking on D-morph
S-morph
[Figure: compute instructions per cycle on the seven kernels and their mean, for D-morph, S-morph, and idealized S-morph, across design points including 1/4 load bandwidth, 4x store bandwidth, and no revitalization.]
◮ Unroll a loop into many concurrent hyperblocks
◮ Can repeat without refetching
◮ Software control of the L2 cache
◮ Benchmarked on 7 streaming kernels; hand-coded, machine-scheduled assembly
◮ Graph shows several different design points
◮ 2–4x D-morph performance
◮ Requires extra control logic
Second iteration: CLP/TFlex
[Figure: the 32-core TFlex array, and one TFlex core in detail — an 8-Kbit next-block predictor (local and global exit history vectors, BTB, CTB, RAS, select logic); a 128-entry architectural register file (2R/1W ports) with register forwarding logic and queues; operand network in/out queues and 128x64b operand buffers feeding an integer ALU and an FP ALU; a 128-entry instruction window; a 128-entry block header cache; a 4KB direct-mapped L1 I-cache; an 8KB 2-way L1 D-cache with a 40-entry load/store queue; and block control and sequencing logic.]
◮ Dynamically aggregates cores as the workload demands
◮ Distributes all control logic (abolish the G-tile)
◮ Cores are now dual-issue for integer ops
◮ Each core has its own L1 caches
◮ More operand network bandwidth
Dynamic aggregation of cores
[Figure: an array of processor (P) and L2 cache tiles shown under several different partitionings, with each thread running on a differently sized composition of cores.]
◮ Threads can run on any number of cores
◮ Authors anticipate the operating system (or even hardware!) will assign each task an appropriate number of cores
◮ Benchmarking done with core counts chosen by hand
◮ Will execution with a varying core count mess up scheduling?
Distributing all the control logic
[Figure: lifetimes of blocks A0 and A1 (thread0) and B0 and B1 (thread1) on their owner cores — fetch, next-block prediction, execution, and commit phases for each block.]
◮ Each EDGE block has an owner core
  ◮ Chosen by hash of the block address (sketch below)
  ◮ Directs the other cores through fetch, execution, and commit
◮ Instructions spread across cores as in TRIPS
◮ Fetch done by all cores in parallel
◮ Block commit by four-way handshake
◮ Branch history either kept with the owner core (local) or transmitted along with branch predictions (global)
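A sketch of owner selection in Python. The hash, the shift amount, and the composition interface are assumptions for illustration, not the published TFlex design; the point is that because the owner is a pure function of the block address, every core in a composition agrees on who directs fetch, prediction, and commit without any global arbiter.

    def owner_core(block_addr, cores):
        """Pick the owning core for a block among the cores currently
        composed into this logical processor."""
        # drop low-order offset bits (blocks are aligned), then hash
        return cores[(block_addr >> 7) % len(cores)]

    composition = [0, 1, 4, 5]              # a thread aggregated onto 4 cores
    for addr in (0x1000, 0x1080, 0x1100, 0x1180):
        print(hex(addr), "->", owner_core(addr, composition))
    # successive blocks land on cores 0, 1, 4, 5 in turn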