Understanding POWER multiprocessors
Susmit Sarkar (1), Peter Sewell (1), Jade Alglave (2,3), Luc Maranget (3), Derek Williams (4)
(1) University of Cambridge, (2) Oxford University, (3) INRIA, (4) IBM
June 2011

Programming shared-memory multiprocessors
No Sequential Consistency (SC), and not since 1972
But what do we get? “Relaxed Memory”, differing on different architectures:
  x86, SPARC: relatively strong, better understood
  POWER/ARM: weaker, widely used, not widely understood
  High-level languages: different again (models informed by POWER/ARM features)

Relaxed memory behaviour: Message Passing
  Thread 0        Thread 1
  x = 1           while (y == 0) {} ;
  y = 1           r = x   (read 0?)
Execution in question:
  Thread 0: a: W[x]=1, then in program order (po) b: W[y]=1
  Thread 1: c: R[y]=1, then in program order (po) d: R[x]=0
  Reads-from (rf): c reads from b; d reads the initial value of x
Forbidden on SC, or x86-TSO
Allowed on POWER (∼1e6 in 2e9 on a POWER7)
Test MP: Allowed
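The claim that the MP outcome is forbidden under SC can be checked by brute force. The following is a minimal sketch, not part of the original slides: it replaces the `while` loop on Thread 1 with a single read of `y` (as in the event diagram), enumerates every interleaving that respects program order, and runs each one against a single shared memory. The names `THREAD0`, `THREAD1`, `interleavings` and `run_sc` are illustrative.

```python
from itertools import combinations

# Each step is (kind, location, value-to-write or register-to-fill), in program order.
THREAD0 = [("W", "x", 1), ("W", "y", 1)]          # a: W[x]=1 ; b: W[y]=1
THREAD1 = [("R", "y", "r_y"), ("R", "x", "r_x")]  # c: R[y]   ; d: R[x]


def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that preserves each thread's program order."""
    n = len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):
        it0, it1 = iter(t0), iter(t1)
        yield [next(it0) if i in slots else next(it1) for i in range(n)]


def run_sc(schedule):
    """Execute one interleaving against a single shared memory (the SC assumption)."""
    memory = {"x": 0, "y": 0}  # both locations initially hold 0
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "W":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r_y"], regs["r_x"]


# All (r_y, r_x) pairs reachable under SC.
sc_outcomes = {run_sc(s) for s in interleavings(THREAD0, THREAD1)}
print(sorted(sc_outcomes))        # [(0, 0), (0, 1), (1, 1)]
print((1, 0) in sc_outcomes)      # False: seeing y == 1 but x == 0 is not an SC outcome
```

The outcome the slide asks about is `(r_y, r_x) == (1, 0)`. It is absent from the SC set, which is what makes its observation on POWER hardware a visibly relaxed behaviour.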

What is going on?
Visible Microarchitectural Effects:
  Out-of-order, and Speculative Execution
  Buffering of Stores and Loads
  Topology of Interconnection

Enforcing order where needed
  Thread 0        Thread 1
  x = 1           while (y == 0) {} ;
  sync()          r = *y   (read 0?)
  y = &x
sync: writes in order
  On the same thread; and
  When propagating to other threads
Dependency: reads in order
  Later read not issued until resolved
Execution in question:
  Thread 0: a: W[x]=1, then (sync) b: W[y]=&x
  Thread 1: c: R[y]=&x, then (addr) d: R[x]=0
  Reads-from (rf): c reads from b; d reads the initial value of x
Test MP+sync+addr: Forbidden

POWER model in general: ... How do we find out?
Architecture Manuals: ambiguous prose
  “all that horrible horribly incomprehensible and confusing [...] text that no-one can parse or reason with -- not even the people who wrote it”
  Anonymous Processor Architect, 2011 [1]
Concrete Implementation:
  Proprietary
  Extremely complex, and too low-level
  Changes across generations
[1] Not Derek Williams!

Our work
Rigorous Architecture
  Do lots of tests (borrow, handwrite, autogenerate) on Power G5, 6, and 7
  Discuss with designers/architects
  Develop an abstract operational model
    Matches observed behaviour (intentionally looser in some aspects)
    Simple enough to understand
Only considering application and common OS code, with no unaligned/mixed-size accesses (no self-modifying code, device memory, or page table changes)

The model structure
Overall structure:
  [Figure: several Threads connected to a single Storage Subsystem; the messages between them are labelled Write request, Write announce, Barrier request, Barrier ack]
Some aspects are thread-only, some storage-only, some both
Threads and Storage Subsystem: abstract state machines
Speculative execution in Threads; topology-independent Storage Subsystem
Formally: transitions, guarded by preconditions, change state, and synchronize with each other

Cumulativity: Programming on many threads
  Thread 0        Thread 1                Thread 2
  x = 1           while (x == 0) {} ;     while (y == 0) {} ;
                  sync()                  r = *y   (read 0?)
                  y = &x
Execution in question:
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1, then (sync) c: W[y]=&x
  Thread 2: d: R[y]=&x, then (addr) e: R[x]=0
  Reads-from (rf): b reads from a; d reads from c; e reads the initial value of x
Test WRC+sync+addr: Forbidden
The sync is cumulative: it keeps (a) and (c) in order for all threads
Flipping the dependency and barrier does not recover SC

Model Excerpt
Propagate write to another thread
The storage subsystem can propagate a write w (by thread tid) that it has seen to another thread tid′, if:
  the write has not yet been propagated to tid′;
  w is coherence-after any write to the same address that has already been propagated to tid′; and
  all barriers that were propagated to tid before w (in s.events_propagated_to(tid)) have already been propagated to tid′.
Action: append w to s.events_propagated_to(tid′).
Explanation: This rule advances the thread tid′ view of the coherence order to w, which is needed before tid′ can read from w, and is also needed before any barrier that has w in its “Group A” can be propagated to tid′.
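The propagate-write rule can be read as a guarded transition: a precondition check followed by a state update. The following is a minimal sketch under stated assumptions, not the authors' LEM definition. `Write`, `Barrier`, `StorageState`, `can_propagate_write` and `propagate_write` are hypothetical names. The coherence order is taken as a given set of (earlier, later) pairs, "has seen w" is approximated by w appearing in its own thread's list, and distinct events are assumed to have distinct field values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Write:
    tid: int
    addr: str
    value: int


@dataclass(frozen=True)
class Barrier:
    tid: int
    name: str = "sync"


@dataclass
class StorageState:
    # Per-thread list of events (writes and barriers) propagated to that thread, in order.
    events_propagated_to: dict
    # Assumed representation: coherence as (earlier, later) pairs of same-address writes.
    coherence: set = field(default_factory=set)


def can_propagate_write(s, w, tid2):
    """Precondition: may write w (by thread w.tid) be propagated to thread tid2?"""
    source = s.events_propagated_to[w.tid]
    target = s.events_propagated_to[tid2]
    if w not in source:   # assumption: "seen" means recorded for the writing thread
        return False
    if w in target:       # must not already be propagated to tid2
        return False
    for e in target:      # w must be coherence-after same-address writes already at tid2
        if isinstance(e, Write) and e.addr == w.addr and (e, w) not in s.coherence:
            return False
    # Barriers propagated to the writing thread before w must already be at tid2.
    before_w = source[:source.index(w)]
    return all(b in target for b in before_w if isinstance(b, Barrier))


def propagate_write(s, w, tid2):
    """Action: append w to the events propagated to tid2, if the precondition holds."""
    if not can_propagate_write(s, w, tid2):
        raise ValueError("propagate-write precondition does not hold")
    s.events_propagated_to[tid2].append(w)


# Thread 0 of MP+sync+addr: W[x], sync, W[y] (the address &x is modelled as the int 1).
wx, sync, wy = Write(0, "x", 1), Barrier(0), Write(0, "y", 1)
state = StorageState({0: [wx, sync, wy], 1: []})
print(can_propagate_write(state, wy, 1))   # False: the sync has not reached thread 1
propagate_write(state, wx, 1)
state.events_propagated_to[1].append(sync)  # barrier propagation is a separate rule, not sketched
print(can_propagate_write(state, wy, 1))   # True
```

In this sketch the write to y cannot reach Thread 1 until the sync has, and the example only appends the sync after the write to x has been propagated. That ordering is the informal reason Test MP+sync+addr is forbidden.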

Overall Model Size
Explanation in ∼3 pages of prose
  Microarchitectural intuitions
  No extraneous concrete details
∼2500 lines of machine-processed math
  In LEM [ITP’11], a simple new semantic metalanguage
  Can extract executable code, and theorem-prover code

Validating the model
Extract executable code from definition, exhaustively enumerate possible behaviours of tests
Run many iterations of tests on real hardware (Power G5, 6, 7)
Excerpt of results:
  Test             Model    POWER 6              POWER 7
  WRC+sync+addr    Forbid   ok      0 / 16G      ok      0 / 110G
  WRC+data+sync    Allow    ok      150k / 12G   ok      56k / 94G
  PPOCA            Allow    unseen  0 / 39G      ok      62k / 141G
  PPOAA            Forbid   ok      0 / 39G      ok      0 / 157G
  LB               Allow    unseen  0 / 31G      unseen  0 / 176G
Agreed with key IBM Power designers/architects
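The status column in the results excerpt appears to follow a simple rule, inferred here from the ten entries rather than stated on the slide: a forbidden outcome that is never observed is "ok", an allowed outcome that is observed is "ok", and an allowed outcome that is never observed is "unseen". The sketch below encodes that reading and checks it against the excerpt. `classify` and the "violation" label are hypothetical, and the counts are the rounded figures from the slide.

```python
# (test name, model verdict, {machine: (observed count, status shown on the slide)})
RESULTS = [
    ("WRC+sync+addr", "Forbid", {"POWER 6": (0, "ok"), "POWER 7": (0, "ok")}),
    ("WRC+data+sync", "Allow", {"POWER 6": (150_000, "ok"), "POWER 7": (56_000, "ok")}),
    ("PPOCA", "Allow", {"POWER 6": (0, "unseen"), "POWER 7": (62_000, "ok")}),
    ("PPOAA", "Forbid", {"POWER 6": (0, "ok"), "POWER 7": (0, "ok")}),
    ("LB", "Allow", {"POWER 6": (0, "unseen"), "POWER 7": (0, "unseen")}),
]


def classify(model_verdict, observed):
    """Compare the model's verdict with how often hardware showed the outcome."""
    if model_verdict == "Forbid":
        # Any observation of a forbidden outcome would contradict the model.
        return "ok" if observed == 0 else "violation"
    # An allowed outcome that never shows up is permitted, just not exhibited.
    return "ok" if observed > 0 else "unseen"


mismatches = [
    (test, machine)
    for test, verdict, machines in RESULTS
    for machine, (observed, status) in machines.items()
    if classify(verdict, observed) != status
]
print(mismatches)  # [] means the inferred rule reproduces every status in the excerpt
```

Under this reading "unseen" is not a failure of the model, which is consistent with the earlier remark that the model is intentionally looser than observed behaviour in some aspects.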

Summing up
A mathematically precise, empirically validated, operational model of POWER
  Microarchitectural intuitions, but abstract: no implementation details
Rigorous Architecture
  Can reason about low-level code above it (static analysis tools)
  Can build on for software verification (e.g. compiler verification)
  Can use as specification to test implementations
  ...
Lots to be done!
