

  1. Architecting and Programming a Hardware-Incoherent Multiprocessor Cache Hierarchy. Wooil Kim, Sanket Tavarageri, P. Sadayappan, Josep Torrellas. University of Illinois at Urbana-Champaign and Ohio State University. IPDPS 2016, May 2016

  2. Motivation • Continued progress in transistor integration → 1,000 cores/chip • Need to improve energy efficiency • Example: Intel Runnemede [Carter HPCA 2013]

  3. Intel Runnemede • Simplifies the architecture: – Narrow-issue cores – Cores and memories hierarchically organized in clusters – Single address space – On-chip cache hierarchy without hardware cache coherence • Hardware-incoherent caches: – Easier to implement – But how to program them?

  4. Goal and Contributions • Goal: a programming environment for a hardware-incoherent cache hierarchy • Contributions: – Hardware extensions to manage hardware-incoherent caches: flavors of Writeback (WB) and Self-Invalidate (INV) instructions, two small buffers next to the L1 cache, and a hardware table in the cache controllers – Two user-friendly programming models that rely on annotating synchronization operations and relatively simple compiler analysis – Average performance only 5% lower than hardware-coherent caches

  5. How to Ensure Data Coherence? • Hardware-incoherent caches do not rely on snooping or a directory [Figure: P0 and P1 with private caches above a shared level. P0: (1) write A, (2) write back A (wr x; WB x), then sync. P1: after the sync, (3) self-invalidate A, (4) read A (INV x; rd x)]
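
A minimal sketch of these four steps in C, assuming hypothetical intrinsics WB()/INV() that emit the proposed instructions and a generic sync_point() primitive; the (address, size) signatures are illustrative, not the paper's exact interface:

    #include <stddef.h>

    extern void WB(void *addr, size_t bytes);   /* assumed intrinsic */
    extern void INV(void *addr, size_t bytes);  /* assumed intrinsic */
    extern void sync_point(void);               /* barrier, or an unlock/lock pair */

    int A;

    void producer(void) {
        A = 7;                 /* 1. write A in the private cache        */
        WB(&A, sizeof A);      /* 2. write back A to the shared level    */
        sync_point();          /* synchronization                        */
    }

    void consumer(void) {
        sync_point();          /* synchronization                        */
        INV(&A, sizeof A);     /* 3. self-invalidate the local copy of A */
        int v = A;             /* 4. read A from the shared level        */
        (void)v;
    }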

  6. WB and INV Instructions • Memory instructions that give commands to the cache controller • WB(variable): writes back the variable to the shared cache – Cache lines have fine-grain dirty bits [Figure: cache line with a valid bit (V) and a dirty bit (D) per data chunk] – WB operates on the whole line but writes back only the modified bytes, so different cores do not overwrite each other in case of false sharing • INV(variable): self-invalidates the variable from the local cache – Uses a per-line valid bit – Writes back the dirty bytes in the line, then invalidates the line, which prevents losing any dirty data • WB ALL, INV ALL: write back / invalidate the whole cache
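
A software model of the fine-grain writeback merge, assuming per-byte dirty bits on a 64-byte line; the names (Line, wb_merge) are illustrative, not the paper's hardware interface:

    #include <stdint.h>

    #define LINE_SIZE 64

    typedef struct {
        uint8_t data[LINE_SIZE];
        uint8_t dirty[LINE_SIZE];  /* fine-grain (per-byte) dirty bits */
        int     valid;             /* per-line valid bit used by INV   */
    } Line;

    /* WB merge: copy only the locally dirty bytes into the shared-level
       copy, so two cores that falsely share a line never overwrite each
       other's bytes. */
    static void wb_merge(Line *shared, const Line *priv) {
        for (int i = 0; i < LINE_SIZE; i++)
            if (priv->dirty[i])
                shared->data[i] = priv->data[i];
    }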

  7. Programming Models [Figure: three blocks (Block 0, 1, 2), each with four cores (P0-P11) and a per-block L2 cache; the blocks share an L3 cache] 1. Shared-memory model inside each block and MPI across blocks 2. Shared-memory model across all cores

  8. Model 1: Shared Inside Block + MPI Across Blocks • The software must decide what to write back and when (WB before the synchronization) and what to invalidate and when (INV after the synchronization) [Figure: P0 executes wr A, wr B, ..., wr C, then WB, then sync; P1 executes sync, then INV, then rd A, rd B, ..., rd C]

  9. Orchestrating Communication • Use synchronization as hints for communication – WB(vars) before every synchronization point; INV(vars) after – If the communicated variables cannot be computed, use WB/INV ALL [Figure: ... epoch E_i-1, then WB for E_i-1, sync, INV for E_i; epoch E_i, then WB for E_i, sync, INV for E_i+1; epoch E_i+1 ...]
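
A short sketch of this epoch structure, reusing the WB()/INV() intrinsics assumed above; the helper names (compute_epoch, produced, consumed) are illustrative:

    #include <stddef.h>

    extern void WB(void *addr, size_t bytes);   /* assumed intrinsics, as above */
    extern void INV(void *addr, size_t bytes);
    extern void barrier(void);
    extern void compute_epoch(int e, double *out, const double *in);

    void run_epochs(int num_epochs, double *produced, double *consumed, size_t n) {
        for (int e = 0; e < num_epochs; e++) {
            compute_epoch(e, produced, consumed);  /* epoch E_i              */
            WB(produced, n * sizeof(double));      /* WB for E_i             */
            barrier();                             /* synchronization point  */
            INV(consumed, n * sizeof(double));     /* INV for E_i+1          */
        }
    }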

  10. Annotations for Different Communication Patterns • Barriers: epoch ordering • Critical sections and flags: happens-before ordering • Dynamic happens-before (e.g., a task queue): the communication must be detected and enforced with WB/INV
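
As an illustration of the happens-before case, a hedged sketch of a flag-based handoff annotated with the assumed WB()/INV() intrinsics; the variables (flag, data) are illustrative:

    #include <stddef.h>

    extern void WB(void *addr, size_t bytes);   /* assumed intrinsics, as above */
    extern void INV(void *addr, size_t bytes);

    volatile int flag = 0;
    int data;

    void producer(void) {
        data = 42;
        WB(&data, sizeof data);              /* push the data before the flag */
        flag = 1;
        WB((void *)&flag, sizeof flag);      /* then push the flag            */
    }

    void consumer(void) {
        do {
            INV((void *)&flag, sizeof flag); /* refetch the flag each time    */
        } while (flag == 0);
        INV(&data, sizeof data);             /* drop any stale copy of data   */
        int v = data;                        /* now guaranteed to see 42      */
        (void)v;
    }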

  11. Application Analysis
      Application   Barrier   Critical Section/Flag   Dyn. Happens-Before
      FFT              x
      LU               x
      CHOLESKY         x               x                      x
      BARNES           x               x                      x
      RAYTRACE                         x                      x
      VOLREND          x               x
      OCEAN            x               x
      WATER            x               x
  • Instrumentation procedure: analyze the communication patterns; with a more sophisticated compiler, the WB/INV placement could be made more efficient

  12. Hardware Support for Small Critical Sections • Modified Entry Buffer (MEB): – For small code sections such as critical sections – Accumulates the entry numbers (set and way) of written lines → at the unlock, WB only those lines instead of the whole cache [Figure: lock; wr A; wr B; wr B; ...; unlock. The MEB next to the cache holds entries {4, 1} and {2, 0}, so the whole cache does not need to be written back]
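
A minimal software model of the MEB, assuming entries are {set, way} pairs and a fixed capacity with fallback to WB ALL on overflow; the structure and names are illustrative:

    #include <stdbool.h>

    #define MEB_SIZE 16  /* 16-entry MEB, as in the evaluation */

    typedef struct { int set, way; } MebEntry;

    typedef struct {
        MebEntry entries[MEB_SIZE];
        int      count;
        bool     overflow;  /* too many distinct lines: fall back to WB ALL */
    } MEB;

    /* Record a written line; duplicates are filtered so each line is
       written back only once at the unlock. */
    void meb_record(MEB *m, int set, int way) {
        for (int i = 0; i < m->count; i++)
            if (m->entries[i].set == set && m->entries[i].way == way)
                return;
        if (m->count == MEB_SIZE) { m->overflow = true; return; }
        m->entries[m->count++] = (MebEntry){ set, way };
    }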

  13. Hardware Support for Small Critical Sections (II) • Invalidated Entry Buffer (IEB): – For small code sections such as critical sections – Accumulates the addresses of invalidated lines → avoids invalidating the same line twice [Figure: lock; rd A; rd B; rd B; ...; unlock. The IEB next to the cache holds tag A and tag B, so the whole cache does not need to be invalidated]
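
A companion model of the IEB, assuming it holds line tags and filters repeated self-invalidations inside one critical section; again an illustrative sketch, not the paper's hardware:

    #include <stdbool.h>

    #define IEB_SIZE 4  /* 4-entry IEB, as in the evaluation */

    typedef struct {
        unsigned long tags[IEB_SIZE];
        int           count;
    } IEB;

    /* Returns true if the line still needs to be invalidated, i.e., it
       has not been self-invalidated earlier in this critical section. */
    bool ieb_should_invalidate(IEB *b, unsigned long tag) {
        for (int i = 0; i < b->count; i++)
            if (b->tags[i] == tag)
                return false;                 /* already invalidated */
        if (b->count < IEB_SIZE)
            b->tags[b->count++] = tag;
        return true;
    }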

  14. Programming Models [Figure: three blocks (Block 0, 1, 2), each with four cores (P0-P11) and a per-block L2 cache; the blocks share an L3 cache] 1. Shared-memory model inside each block and MPI across blocks 2. Shared-memory model across all cores

  15. Model 2: Shared-Memory Across All Cores • Inefficient solution: always WB/INV through the L3 cache [Figure: two blocks of four cores (P0-P7), each block with an L2 cache, sharing the L3 cache] • Proposal: level-adaptive WB and INV – WB/INV automatically communicate through the closest shared level of the cache • The closest shared cache level depends on the thread mapping, which is unknown at compile time

  16. Idea: Exploit Producer-Consumer Information • Software identifies producer-consumer thread pairs – e.g., thread i produces data that will be consumed by thread j • The thread → core mapping is unknown at compile time • Software instruments the code with level-adaptive WB/INV – Producer: WB_CONS(addr, ConsID) – Consumer: INV_PROD(addr, ProdID) [Figure: in one epoch, thread i executes X = ... followed by WB_CONS(x, j); in the next epoch, thread j executes INV_PROD(x, i) followed by ... = X]
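
A sketch of this instrumentation, assuming WB_CONS()/INV_PROD() intrinsics with (address, size, thread-id) signatures; the signatures and the barrier are illustrative assumptions:

    #include <stddef.h>

    extern void WB_CONS(void *addr, size_t bytes, int consumer_tid);
    extern void INV_PROD(void *addr, size_t bytes, int producer_tid);
    extern void barrier(void);

    double X[1024];

    void thread_i(int i, int j) {   /* producer */
        (void)i;
        /* ... X = ... ; */
        WB_CONS(X, sizeof X, j);    /* push X toward the level shared with consumer j */
        barrier();
    }

    void thread_j(int i, int j) {   /* consumer */
        (void)j;
        barrier();
        INV_PROD(X, sizeof X, i);   /* drop stale copies of producer i's data */
        /* ... = X ... ; */
    }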

  17. Hardware Support for Level-Adaptive WB/INV • The L2 cache controller has a hardware table (ThreadMap) – Contains the IDs of the threads that have been mapped to the block • When executing WB_CONS(addr, ConsID): – Hardware checks whether ConsID is running on the block – If so, WB pushes the data to L2 only; otherwise, to both L2 and L3 • Same when executing INV_PROD(addr, ProdID) [Figure: two blocks (P0-P3 with ThreadMap Th0-Th3, P4-P7 with ThreadMap Th4-Th7) sharing the L3 cache; an INV_PROD(T0) from one block and a WB_CONS(T7) from the other consult the ThreadMaps]
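
A sketch of the ThreadMap lookup performed by the L2 controller, modeled in C; the table layout and the writeback helpers are assumptions, not the paper's exact hardware:

    #include <stdbool.h>

    #define THREADS_PER_BLOCK 8

    extern void writeback_to_L2(unsigned long line_addr);  /* assumed helpers */
    extern void writeback_to_L3(unsigned long line_addr);

    typedef struct {
        int ids[THREADS_PER_BLOCK];  /* IDs of threads mapped to this block */
        int count;
    } ThreadMap;

    static bool in_block(const ThreadMap *tm, int tid) {
        for (int i = 0; i < tm->count; i++)
            if (tm->ids[i] == tid)
                return true;
        return false;
    }

    /* WB_CONS: if the consumer runs in this block, stop at the shared L2;
       otherwise push the line to both L2 and L3. */
    void wb_cons(const ThreadMap *tm, unsigned long line_addr, int cons_tid) {
        writeback_to_L2(line_addr);
        if (!in_block(tm, cons_tid))
            writeback_to_L3(line_addr);
    }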

  18. Compiler Support to Extract P-C Pairs • Approach: use the ROSE compiler to – Find P-C relations across OpenMP for constructs – Build an inter-procedural CFG – Run dataflow analysis between potential P-C pairs • Assumption: static scheduling of threads to processors
      #pragma omp parallel for
      for (i = 0; i < N; i++) {
          A[i] = ... ;
          B[i] = ... ;
      }
      #pragma omp parallel for
      for (i = 0; i < N; i++) {
          ... = A[i] + ... ;
          ... = B[i+1] + ... ;
      }
  • Region analysis (th = number of threads, myid = thread ID): – A[i]: Region(A, 0, N, th) → each thread writes A[(N/th)*myid, (N/th)*(myid+1)) – B[i]: Region(B, 0, N, th) → each thread writes B[(N/th)*myid, (N/th)*(myid+1)) – B[i+1]: Region(B, 1, N, th) → each thread reads B[(N/th)*myid + 1, (N/th)*(myid+1) + 1) • Instrumentation: WB B[i]; the producer issues WB_CONS for the boundary element of B consumed by thread myid-1, and the consumer issues INV and INV_PROD for the element of B[i+1] produced by thread myid+1
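
A hedged sketch of what the instrumented loops could look like under static scheduling, where only the boundary element of B crosses threads; the intrinsic signatures are assumed as before, and the boundary placement is my reading of the region analysis, not the compiler's verbatim output:

    #include <stddef.h>

    extern void WB_CONS(void *addr, size_t bytes, int consumer_tid);
    extern void INV_PROD(void *addr, size_t bytes, int producer_tid);
    extern void barrier(void);

    /* Assumes N divisible by nth and B with at least N+1 elements,
       so that B[N] is a valid read. */
    void pc_example(double *A, double *B, int N, int nth, int myid) {
        int lo = (N / nth) * myid, hi = (N / nth) * (myid + 1);

        for (int i = lo; i < hi; i++) { A[i] = 1.0; B[i] = 2.0; }
        if (myid > 0)
            WB_CONS(&B[lo], sizeof(double), myid - 1);   /* B[lo] is read by myid-1    */
        barrier();
        if (myid < nth - 1)
            INV_PROD(&B[hi], sizeof(double), myid + 1);  /* B[hi] is written by myid+1 */

        double s = 0.0;
        for (int i = lo; i < hi; i++) s += A[i] + B[i + 1];
        (void)s;
    }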

  19. Evaluation • SESC simulator • 4-issue out-of-order cores with 32KB L1 caches • MESI coherence protocol • Intra-block experiments: – 16 cores sharing a 2MB banked L2 cache – Each core: 16-entry MEB, 4-entry IEB – SPLASH2 applications • Configurations: – HCC: directory-based hardware cache coherence – Base: basic WB/INV – B+M: Base + MEB – B+I: Base + IEB – B+M+I: Base + MEB + IEB

  20. Execution Time • With MEB/IEB: average performance is only 2% lower than HCC • Not shown: network traffic is also comparable

  21. Evaluation • Inter-block experiments: – 4 blocks of 8 cores each – Each block has a 1MB L2 cache – Blocks share a 16MB banked L3 – NAS applications analyzed with the ROSE compiler • Configurations: – Base: WB/INV all cached data to L3 – Addr: WB/INV selective data to L3 (compiler analysis) – Addr+L: level-adaptive WB_CONS/INV_PROD

  22. Execution Time [Chart: normalized execution time of Jacobi, EP, IS, and CG under Base, Addr, and Addr+L] • When level-adaptive WB/INV is applicable, performance improves – EP and IS have reductions → no ordering, hence no P-C pairs • Not shown: performance is on average 5% lower than HCC

  23. Conclusions • Programming a hardware-incoherent cache hierarchy is challenging • Proposed HW extensions to manage it: – Flavors of WB and INV, including level-adaptive ones – Small MEB and IEB buffers next to the L1 cache – A ThreadMap table in the cache controllers • Proposed two user-friendly programming models • Average performance only 5% lower than hardware-coherent caches • Future work: enhance performance with advanced compiler support

  24. Architecting and Programming a Hardware-Incoherent Multiprocessor Cache Hierarchy. Wooil Kim, Sanket Tavarageri, P. Sadayappan, Josep Torrellas. University of Illinois at Urbana-Champaign and Ohio State University. IPDPS 2016, May 2016

  25. Instruction Reordering by HW or Compiler • Instruction ordering requirement: WR → WB → synchronization → INV → RD – Required: a WB must not be reordered above the WR it covers, and a RD must not be reordered above the INV that precedes it – Other orderings are desirable (e.g., to reduce traffic) • Cache lines can be evicted at any time
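
A sketch of the required ordering, using the WB()/INV() intrinsics assumed earlier and a GCC/Clang compiler barrier as one possible way to keep the compiler from reordering; the enforcement mechanism is an assumption, not necessarily the paper's:

    #include <stddef.h>

    extern void WB(void *addr, size_t bytes);   /* assumed intrinsics, as above */
    extern void INV(void *addr, size_t bytes);
    extern void barrier(void);

    /* Compiler barrier: prevents the compiler from moving memory
       operations across it. */
    #define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

    int x, y, r;

    void ordered_handoff(void) {
        x = 1;                  /* WR                       */
        WB(&x, sizeof x);       /* WB: must follow the WR   */
        COMPILER_BARRIER();
        barrier();              /* synchronization          */
        COMPILER_BARRIER();
        INV(&y, sizeof y);      /* INV: must precede the RD */
        r = y;                  /* RD                       */
    }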

  26. BACK-UP SLIDES

  27. WB and INV Instructions • Operate at line granularity to minimize cache modifications • The user is unaware of data placement • The granularity of dirty bits may vary (from byte to entire line) • Different flavors: – WB_byte // variable is a byte – WB_halfword // variable is a halfword – WB ALL // write back the whole cache – WB_L3 // push all the way to L3
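
An illustrative use of the flavors, assuming intrinsics named after the instructions; the exact spellings and signatures are assumptions:

    extern void WB_byte(void *addr);      /* assumed intrinsic spellings */
    extern void WB_halfword(void *addr);
    extern void WB_ALL(void);
    extern void WB_L3(void *addr);

    char  status_byte;   /* one byte   */
    short counter;       /* a halfword */

    void flush_examples(void) {
        WB_byte(&status_byte);    /* write back a single dirty byte   */
        WB_halfword(&counter);    /* write back a dirty halfword      */
        WB_L3(&counter);          /* push the line all the way to L3  */
        WB_ALL();                 /* write back the whole local cache */
    }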

  28. Issued WB/INV [Chart: normalized number of issued WB/INV for Jacobi, EP, IS, and CG, broken down into WB to L3 in Addr, WB to L3 in Addr+L, INV L2 in Addr, and INV L2 in Addr+L] • Reduction in issued WB/INV in some applications
