
Toward Extreme-Scale Manycore Architectures - Josep Torrellas



  1. Toward Extreme-Scale Manycore Architectures
     Josep Torrellas, Department of Computer Science, University of Illinois at Urbana-Champaign
     http://iacoma.cs.uiuc.edu
     USC, October 2016

  2. Accelerated Progress in Transistor Integration
     • Large multicores for data centers and cloud
     • 3D stacked chips
     [Images: Intel 3D XPoint memory; Intel Xeon Phi 7290F (Oct 2016): 72 cores, 288 contexts, 260 W; Micron's Hybrid Memory Cube]

  3. Research is Pushing Ever Farther Ahead
     • More integration → 1,000 cores/chip (Runnemede prototype [HPCA-13])
     • Research on stacking multiple processor and memory dies
     [Figure labels, 3D stack cross-section: Heat Sink; Integrated Heat Spreader (IHS); Thermal Interface Material (TIM); DRAM Silicon; Die to Die (D2D) Layer; DRAM Frontside Metal (Al); Through Silicon Vias (TSVs); Processor Silicon; Processor Frontside Metal (Cu); C4 pads; Motherboard]

  4. Meanwhile: Power Wall … and Performance Wall
     • University of Illinois Blue Waters Supercomputer
       Performance: 11 PF. Power: 6-11 MW (idle to loaded). 1 MW = $1M per year electricity
     • Technology improvements in speed and power slowing down → Computer architecture innovations become strategic
     Josep Torrellas 4 Toward Extreme Scale …

  5. What We Need
     • Very high energy efficiency
     • Faster communication and synchronization
     • Ease of programming
     All at the same time

  6. Today’s Discussion
     • Focus: Reducing the cost of basic primitives for parallelism
     • Flavor of other challenges: energy, programmability

  7. Quest: Making synchronization inexpensive

  8. Making Synchronization Inexpensive [ISCA-13][ASPLOS-15][ASPLOS-16]
     • Make memory fences free
       [Diagram: three processors executing wr x; fence; rd z / wr y; fence; rd x / wr z; fence; rd y]
     • Breaking serialization in lock-free synchronization
       [Diagram: several Compare&Swap (CAS) operations on the same variable x]
     • Scalable concurrent priority queues
       [Diagram: QueueHead followed by a chain of nodes]

  9. Making Synchronization Inexpensive [ISCA-13]
     • Make memory fences free (WeeFence)
       [Diagram: three processors executing wr x; fence; rd z / wr y; fence; rd x / wr z; fence; rd y]
     • Breaking serialization in lock-free synchronization
     • Scalable concurrent priority queues

  10. Fence: a Primitive for Parallelism
     • Instruction inserted by programmers or compilers
     • Prevents the compiler and HW from reordering memory accesses
     [Diagram, in time order: Read x; Write y; Fence; Read z]
     Read z cannot be observed by another processor until the pre-fence accesses are finished:
     • reads retired
     • writes retired + drained from write buffer
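The access sequence on this slide can be sketched in C++ with `std::atomic_thread_fence`. This is a minimal illustration and not code from the talk: the variables `x`, `y`, `z` follow the slide, while the function names `read_write_fence_read` and `fence_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Shared variables named after the slide's accesses.
std::atomic<int> x{0}, y{0}, z{0};

// Read x; Write y; Fence; Read z, in program order.
int read_write_fence_read() {
    int a = x.load(std::memory_order_relaxed);            // Read x
    y.store(a + 1, std::memory_order_relaxed);            // Write y
    // Full fence: asks the compiler and the hardware not to move
    // the accesses above past the accesses below.
    std::atomic_thread_fence(std::memory_order_seq_cst);  // Fence
    int b = z.load(std::memory_order_relaxed);            // Read z
    return a + b;
}

// Single-threaded sanity check of the values involved.
void fence_demo() {
    int sum = read_write_fence_read();
    assert(sum == 0);       // x and z are still 0
    assert(y.load() == 1);  // the write to y took effect
}
```

The relaxed loads and stores carry no ordering of their own here, so the fence is the only thing ordering Write y before Read z.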

  11. Use of Fences (I): Enforce the correct order between accesses
     • Programmers insert fences in codes with fine-grain sharing:
       – Work-stealing algorithm in Cilk
     Worker, take(): dequeues from tail and checks head
       Tail = … ; fence ; … = Head
     Thief, steal(): takes from head and checks tail
       Head = … ; fence ; … = Tail
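A rough C++ sketch of the take()/steal() pattern above, loosely modeled on a lock-assisted work-stealing deque. It is an illustration under assumptions, not Cilk's actual runtime code: the `WorkDeque` type, its fixed capacity, and the use of `std::mutex` for conflict resolution are all hypothetical choices.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical fixed-capacity deque; no wraparound, for illustration only.
struct WorkDeque {
    static constexpr int kCapacity = 64;
    int tasks[kCapacity];
    std::atomic<int> head{0};  // thieves take from here
    std::atomic<int> tail{0};  // the worker pushes and takes here
    std::mutex lock;           // serializes thieves and resolves conflicts

    // Worker only: append a task at the tail.
    void push(int task) {
        int t = tail.load(std::memory_order_relaxed);
        assert(t < kCapacity);
        tasks[t] = task;
        tail.store(t + 1, std::memory_order_release);
    }

    // Worker only: dequeue from the tail, then check the head.
    bool take(int& out) {
        int t = tail.load(std::memory_order_relaxed) - 1;
        tail.store(t, std::memory_order_relaxed);             // Tail = ...
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
        if (head.load(std::memory_order_relaxed) > t) {       // ... = Head
            // Possible conflict with a thief: undo and retry under the lock.
            tail.store(t + 1, std::memory_order_relaxed);
            std::lock_guard<std::mutex> guard(lock);
            tail.store(t, std::memory_order_relaxed);
            if (head.load(std::memory_order_relaxed) > t) {
                tail.store(t + 1, std::memory_order_relaxed);
                return false;                                  // deque is empty
            }
        }
        out = tasks[t];
        return true;
    }

    // Thief: take from the head, then check the tail.
    bool steal(int& out) {
        std::lock_guard<std::mutex> guard(lock);
        int h = head.load(std::memory_order_relaxed);
        head.store(h + 1, std::memory_order_relaxed);         // Head = ...
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
        if (h + 1 > tail.load(std::memory_order_acquire)) {   // ... = Tail
            head.store(h, std::memory_order_relaxed);         // nothing to steal
            return false;
        }
        out = tasks[h];
        return true;
    }
};
```

Each side writes its own index, fences, and then reads the other side's index. Without the fences, both could read stale indices and hand out the same task twice.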

  12. Use of Fences (II)
     • Compilers insert fences in C++:
       – Programmer uses intentional data race for performance → declares variable as atomic
       – Compiler inserts fence after the access, does not reorder
       – Hardware does not reorder across fence
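As a small illustration of the C++ case, the sketch below declares a flag as `std::atomic` and uses the default (sequentially consistent) ordering, which compilers typically implement with a fence or an equivalent ordering instruction. The names `payload`, `ready`, and `publish_and_consume` are hypothetical and not from the talk.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary data, published via `ready`
std::atomic<bool> ready{false};  // the intentionally shared flag, declared atomic

// Producer writes the payload and then sets the flag; consumer spins on the flag.
int publish_and_consume() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread producer([] {
        payload = 42;
        ready.store(true);       // default ordering: memory_order_seq_cst
    });
    std::thread consumer([&seen] {
        while (!ready.load()) {  // default ordering: memory_order_seq_cst
        }
        seen = payload;          // ordered after the producer's write
    });
    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

If `ready` were a plain `bool`, the program would contain a data race and neither the compiler nor the hardware would be obliged to keep the two writes in order.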

  13. If We Remove Fences: Incorrect Execution
     Initially x = y = 0
     PA: A0: x = 1 ; fence ; A1: t0 = y
     PB: B0: y = 1 ; fence ; B1: t1 = x
     • With fences: t1 = 1, t0 = 1, or both = 1
     • Without fences: t0 = t1 = 0 is possible, because A1 and B1 can read before the writes A0 and B0 are propagated to memory
     • Unintuitive bug: Sequential Consistency (SC) violation
     • SC: execution appears as if accesses from multiple threads were interleaved in a uniprocessor
     [Diagrams: possible orderings of A0, A1, B0, B1; PA: wr x, rd y; PB: wr y, rd x]
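The two-thread example on this slide can be written as a litmus test in C++. In this sketch, `run_once` and `count_violations` are hypothetical helper names. With the two `seq_cst` fences in place, the C++ memory model rules out the outcome t0 = t1 = 0; removing them makes that outcome legal.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Outcome {
    int t0;
    int t1;
};

// One run of the slide's example: PA executes A0, fence, A1; PB executes B0, fence, B1.
Outcome run_once() {
    std::atomic<int> x{0}, y{0};
    int t0 = -1, t1 = -1;
    std::thread pa([&] {
        x.store(1, std::memory_order_relaxed);                // A0: x = 1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
        t0 = y.load(std::memory_order_relaxed);               // A1: t0 = y
    });
    std::thread pb([&] {
        y.store(1, std::memory_order_relaxed);                // B0: y = 1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
        t1 = x.load(std::memory_order_relaxed);               // B1: t1 = x
    });
    pa.join();
    pb.join();
    return Outcome{t0, t1};
}

// Counts runs in which both reads returned 0 (the SC violation).
int count_violations(int iterations) {
    assert(iterations > 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        Outcome o = run_once();
        if (o.t0 == 0 && o.t1 == 0) {
            ++violations;
        }
    }
    return violations;
}
```

Deleting both fence lines turns this into the "without fences" case. The violation may then appear on real hardware, although how often depends on the machine and on timing.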

  14. Fence Overhead
     • Naïve implementation: stall all memory operations following the fence
       – The processor quickly stalls

  15. Modern Implementations: Perform Speculation
     • Reads following fences can load data speculatively
       – If no processor observes it, no problem
       – If a coherence transaction is received, the read is squashed and retried
     • Still: speculative reads cannot retire until the WB is drained
     [Diagram over time: Reorder Buffer (ROB) holding read, fence, and write w2; Write Buffer (WB) holding write w1]
     Expensive: a fence in a Xeon desktop stalls for 20 to 200 cycles. In a large MP?

  16. What if Fences Were Free?
     • Programmers could write faster fine-grained concurrent algorithms
     • C++/Java programmers would not have to worry about data races
       – Declare all shared variables as atomic
       – Compiler puts many fences, hardware still runs fast
       – Guaranteed Sequential Consistency (SC)

  17. Proposal: WeeFence (or WFence) [ISCA-13]
     • Goal: Eliminate any stall in the pipeline
     • Post-fence read retires before the pre-fence writes have drained
       – “Skip” the fence
     [Diagram over time: speculative execution; Reorder Buffer (ROB) holding read, fence, and write w2; Write Buffer (WB) holding write w1]
     Substantial gains when write misses pile up before the fence

  18. But … Not Stalling Can Cause Incorrect Execution
     Initially x = y = 0
     PA: A0: x = 1 ; WeeFence ; A1: t0 = y
     PB: B0: y = 1 ; WeeFence ; B1: t1 = x
     [Diagram over time: Write, Fence, Read; write propagated to memory]
     What we want: not stall but avoid these SC violations
     Solution: Allow the reorder, check for this case, and stall the read (B1)

  19. But … Not Stalling Can Cause Incorrect Execution
     Initially x = y = 0
     PA: A0: x = 1 ; WeeFence ; A1: t0 = y
     PB: B0: y = 1 ; WeeFence ; B1: t1 = x
     [Diagram over time: Write, Fence, Read; write propagated to memory]
     What we want: not stall but avoid these SC violations
     Solution: Allow the reorder, check for this case, and stall the read (B1)
     Conventional fences always conservatively stall ↔ Not WeeFence

  20. WeeFence: The Idea
     PA: wr x ; WeeFence ; rd y
     PB: wr y ; WeeFence ; rd x
     Prevent “rd x” from retiring early if:
       - There is a concurrent fence
       - The threads access the variables in opposite order
     • At a fence: record the thread’s incomplete writes in a HW structure
     • Allow post-fence reads to execute before pre-fence writes complete
     • Check post-fence reads (rd x) against the HW structure to find conflicts with other threads’ incomplete writes
       – Conflict? Stall read
       – Else: Retire

  21. How WFence Works
     PS: Pending Set. BS: Bypass Set
     PA: wr x ; Wfence1 ; rd y
     PB: wr y ; Wfence2 ; rd x
     [Diagram, steps (1) to (3). Labels: PS, Table (x), execute]

  22. How WFence Works
     PS: Pending Set. BS: Bypass Set
     PA: wr x ; Wfence1 ; rd y
     PB: wr y ; Wfence2 ; rd x
     [Diagram, steps (1) to (6). Labels: PS, Table (x, y), execute, local check, stall]

  23. How WFence Works (II)
     PS: Pending Set. BS: Bypass Set
     PA: wr x ; Wfence1 ; rd y
     PB: wr y ; wr x (no fence present in x86)
     [Diagram, steps (1) to (4). Labels: PS, Table (x), execute, BS (y)]

  24. How WFence Works (II)
     PS: Pending Set. BS: Bypass Set
     PA: wr x ; Wfence1 ; rd y
     PB: wr y ; wr x (no fence present in x86)
     [Diagram, steps (1) to (5). Labels: PS, Table (x), execute, BS (y), coherence, stall]

  25. Summary: How WFence Works
     PS: Pending Set. BS: Bypass Set
     PA: wr x ; Wfence1 ; rd y
     [Diagram, steps (1) to (6). Labels: PS, Table (z, x), check, BS (y), stall, execute & retire]
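The Pending Set / Bypass Set bookkeeping can be approximated by a small software model. This is a simplified sketch built only from the slides' description, not the hardware design from the paper. The `PendingTable` and `Processor` types, their methods, and the assumption that each fence receives a one-time snapshot of the other processors' Pending Sets are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Addr = std::string;
using AddrSet = std::set<Addr>;

// Hypothetical stand-in for the table that holds each active fence's Pending Set (PS).
class PendingTable {
public:
    // Deposit `ps` for processor `pid` and return the union of the other PSs.
    AddrSet deposit(int pid, const AddrSet& ps) {
        assert(table_.count(pid) == 0);  // one active fence per processor in this sketch
        AddrSet remote;
        for (const auto& entry : table_) {
            remote.insert(entry.second.begin(), entry.second.end());
        }
        table_[pid] = ps;
        return remote;
    }
    void complete(int pid) { table_.erase(pid); }

private:
    std::map<int, AddrSet> table_;
};

struct Processor {
    int id;
    AddrSet remote_ps;   // snapshot of other fences' incomplete writes
    AddrSet bypass_set;  // BS: post-fence reads that retired early

    // At a fence: record this thread's incomplete writes, fetch the remote PSs.
    void fence(PendingTable& table, const AddrSet& incomplete_writes) {
        remote_ps = table.deposit(id, incomplete_writes);
    }
    // Post-fence read: true = retires early (added to BS), false = must stall.
    bool try_early_read(const Addr& a) {
        if (remote_ps.count(a) > 0) return false;
        bypass_set.insert(a);
        return true;
    }
    // Incoming coherence request for a remote write: held while `a` is in the BS.
    bool blocks_remote_write(const Addr& a) const { return bypass_set.count(a) > 0; }
    // Pre-fence writes have drained: the fence is done.
    void fence_complete(PendingTable& table) {
        table.complete(id);
        remote_ps.clear();
        bypass_set.clear();
    }
};

// PA: wr x; Wfence1; rd y.  PB: wr y; Wfence2; rd x.
bool slide_scenario() {
    PendingTable table;
    Processor pa{0, {}, {}}, pb{1, {}, {}};
    pa.fence(table, {"x"});
    bool pa_early = pa.try_early_read("y");    // no conflict: retires early
    pb.fence(table, {"y"});
    bool pb_early = pb.try_early_read("x");    // conflicts with PA's PS: stalls
    bool held = pa.blocks_remote_write("y");   // PB's wr y is held by PA's BS
    pa.fence_complete(table);
    bool released = !pa.blocks_remote_write("y");
    return pa_early && !pb_early && held && released;
}
```

In this model only the second fence to arrive sees a conflict, which matches the slides' point that stalls happen only when concurrent fences access variables in opposite order.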

  26. WFence
     • Cycles are rare: Wfence typically executes without stalling the processor
     • Works with cycles with any number of processors
       PA: wr x ; Wfence ; rd z
       PB: wr y ; Wfence ; rd x
       PC: wr z ; Wfence ; rd y
     • No compiler support needed: unmodified off-the-shelf executable

  27. Improving WeeFence
     • The Global State is expensive to maintain with many threads
     • Can we eliminate the Global State (Pending Set)?

  28. Eliminating the Global State [ASPLOS-15]
     PA: wr x ; Wfence1 ; rd y
     PB: wr y ; Wfence2 ; rd x
     Deadlock …
     Insight: no deadlock if one processor stalls at the fence and generates no BS
     [Diagram, steps (1) to (4). Labels: y, x, stall, stall]

  29. Asymmetric Fences: Strong Fence + Weak Fences [ASPLOS-15]
     • Given a conflict cycle with N processors:
       – 1 strong fence (no BS) = conventional fence
       – N-1 weak fences that allow reordering = WeeFences without PS
     PA: wr x ; fence ; rd z
     PB: wr y ; fence ; rd x
     PC: wr z ; fence ; rd y
     [Diagram labels: Conventional fence; WeeFence without PS; z; x]
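The deadlock argument can be checked with a toy model. The sketch below is a simplification under stated assumptions and not the ASPLOS-15 mechanism itself. Every weak fence is assumed to have already retired its post-fence read into its Bypass Set, and a write is held while another unfinished weak fence has its address in that set. The `Proc` type and `all_fences_complete` are hypothetical names, and making PA the strong fence is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Proc {
    std::string write_addr;  // pre-fence write
    std::string read_addr;   // post-fence read
    bool strong;             // true = conventional fence: no early read, no BS
};

// Returns true if every processor's fence eventually completes (no deadlock).
bool all_fences_complete(const std::vector<Proc>& procs) {
    assert(!procs.empty());
    const std::size_t n = procs.size();
    std::vector<bool> done(n, false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (done[i]) continue;
            // i's write is held if an unfinished weak fence bypassed a read of it.
            bool blocked = false;
            for (std::size_t j = 0; j < n; ++j) {
                if (j != i && !done[j] && !procs[j].strong &&
                    procs[j].read_addr == procs[i].write_addr) {
                    blocked = true;
                }
            }
            if (!blocked) {
                done[i] = true;  // write drains, fence completes, BS is cleared
                progress = true;
            }
        }
    }
    for (std::size_t i = 0; i < n; ++i) {
        if (!done[i]) return false;
    }
    return true;
}
```

For example, the three-processor cycle from the slide deadlocks when all fences are weak and completes once one of them is strong:

```cpp
std::vector<Proc> all_weak = {{"x", "z", false}, {"y", "x", false}, {"z", "y", false}};
std::vector<Proc> one_strong = {{"x", "z", true}, {"y", "x", false}, {"z", "y", false}};
bool deadlocks = !all_fences_complete(all_weak);
bool completes = all_fences_complete(one_strong);
```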
