
Relaxed Systems Architecture: Instruction Fetching (Ben Simner)



  1. Relaxed Systems Architecture: Instruction Fetching. Ben Simner, University of Cambridge. In collaboration with Shaked Flur, Christopher Pulte, Alasdair Armstrong, Jean Pichon, Luc Maranget (INRIA Paris), and Peter Sewell. 1/41

  2. Motivation: Why? Want to understand: TLBs, Instruction Caches, Interrupts. Want to prove: Operating Systems, JITs, Hypervisors. 2/41

  3. But first... Computers are fast... but terrible! 3/41

  4. Intel (Skylake) die. Source: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client) 4/41

  5. Intel (Skylake) die 5/41


  9. x86: Observable complexity. Dekker’s/Peterson’s mutual exclusion algorithm (extract):
      Thread A: flagA ← 1; while flagB {}; print(“A”)
      Thread B: flagB ← 1; while flagA {}; print(“B”)
      x86 hardware can execute both prints! 6/41

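  10. x86: TSO Architecture. Source code: Thread A and Thread B each write their own flag (flagA ← 1; flagB ← 1) and then print a flag (print(flagA); print(flagB)). Model (diagram): CPU0 and CPU1 each have a Store Buffer (holding flagA = 1 and flagB = 1 respectively) in front of a shared RAM that still holds flagA = 0 and flagB = 0. 7/41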
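
The store-buffer picture above can be explored exhaustively in a few lines. The following is a minimal sketch, not the authors' model: it assumes one FIFO store buffer per thread, reads that check the thread's own buffer before memory, and a straight-line version of the Dekker extract in which the wait loop is replaced by a single read (a read of 0 means the thread would fall through to its print). The names `explore` and `DEKKER` are hypothetical.

```python
def explore(program, buffered=True):
    """Return every reachable tuple of per-thread read results."""
    n = len(program)
    init = (tuple(0 for _ in program),    # program counters
            tuple(() for _ in program),   # store buffers, oldest write first
            frozenset(),                  # memory as (addr, value) pairs; absent = 0
            tuple(() for _ in program))   # values read so far, per thread
    seen, stack, finals = set(), [init], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        memd = dict(mem)
        successors = []
        for t in range(n):
            # Transition 1: flush the oldest buffered write of thread t to memory.
            if bufs[t]:
                (addr, val), rest = bufs[t][0], bufs[t][1:]
                new_mem = frozenset({**memd, addr: val}.items())
                successors.append((pcs, bufs[:t] + (rest,) + bufs[t + 1:], new_mem, regs))
            # Transition 2: execute the next instruction of thread t.
            if pcs[t] < len(program[t]):
                op, addr, *val = program[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op == "W" and buffered:
                    new_buf = bufs[t] + ((addr, val[0]),)
                    successors.append((new_pcs, bufs[:t] + (new_buf,) + bufs[t + 1:], mem, regs))
                elif op == "W":
                    # Sequentially consistent baseline: write straight to memory.
                    new_mem = frozenset({**memd, addr: val[0]}.items())
                    successors.append((new_pcs, bufs, new_mem, regs))
                else:
                    # Read: newest matching entry in own buffer, else memory.
                    pending = [v for (a, v) in bufs[t] if a == addr]
                    value = pending[-1] if pending else memd.get(addr, 0)
                    new_regs = regs[:t] + (regs[t] + (value,),) + regs[t + 1:]
                    successors.append((new_pcs, bufs, mem, new_regs))
        if not successors:      # all threads finished and all buffers drained
            finals.add(regs)
        stack.extend(successors)
    return finals


DEKKER = [
    [("W", "flagA", 1), ("R", "flagB")],   # Thread A
    [("W", "flagB", 1), ("R", "flagA")],   # Thread B
]

both_enter = ((0,), (0,))                  # both threads read 0
print(both_enter in explore(DEKKER, buffered=True))
print(both_enter in explore(DEKKER, buffered=False))
```

With buffering enabled, the outcome in which both threads read 0 is reachable; with writes going straight to memory it is not, which is the difference the slide is pointing at.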

  11. State of the Art. Models: ◮ Abstract Hardware Operational ◮ Axiomatic-Style 8/41

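  12. x86-TSO: Operational Semantics
      ◮ State = Abstracted Machine State: $m : \langle\, M : \mathit{addr} \to \mathit{value};\ B : \mathit{tid} \to (\mathit{addr} \times \mathit{value})\ \mathit{list} \,\rangle$
      ◮ Structural Operational Semantics, rule WB: $\dfrac{m' = \langle\, m \text{ with } B := m.B \oplus (t \mapsto ((x, v) : m.B\ t)) \,\rangle}{m \xrightarrow{\;t\,:\,W\,x = v\;} m'}$ 9/41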

  13. x86-TSO: Axiomatic-Style
      Source code: Thread A: x ← 1; print(y). Thread B: y ← 1; print(x).
      Potential Execution #1: W x=1, W y=1, R y=0, R x=1.
      Potential Execution #2: W x=1, W y=1, R y=1, R x=0.
      ... 10/41

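  14. A Candidate Execution
      Pre-execution = Set of Events + Induced Binary Relations (po/data/addr)
      Candidate = Pre-execution + Existentially Quantified Relations (co/rf)
      Definition of a valid Candidate (“Axiomatic Model”):
        $\mathit{poWR} = \mathit{po} \cap (W \times R)$
        $\mathit{uniproc} = \mathit{po\text{-}loc} \cup (\mathit{po} \setminus \mathit{poWR})$
        $\mathit{fr} = \mathit{rf}^{-1} ; \mathit{co}$
        $\mathit{tso} = \mathit{rf} \cup \mathit{fr} \cup \mathit{co}$
        axiom: $\mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})$
      where po = Program-Order and rf = Reads-From.
      Allowed Execution (diagram): events W y=1, W x=1, R y=0, R x=1, connected by po and rf edges. 11/41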

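  15. TSO: Forbidden Execution
      Axiomatic Model:
        $\mathit{poWR} = \mathit{po} \cap (W \times R)$
        $\mathit{uniproc} = \mathit{po\text{-}loc} \cup (\mathit{po} \setminus \mathit{poWR})$
        $\mathit{fr} = \mathit{rf}^{-1} ; \mathit{co}$
        $\mathit{tso} = \mathit{rf} \cup \mathit{fr} \cup \mathit{co}$
        axiom: $\mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})$
      where po = Program-Order, rf = Reads-From, and fr = From-Reads.
      Forbidden Execution (diagram): events W x=1, W y=1, R y=1, R x=0, connected by po, rf, and fr edges. 12/41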

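  16. TSO: Allowed Execution
      Axiomatic Model:
        $\mathit{poWR} = \mathit{po} \cap (W \times R)$
        $\mathit{uniproc} = \mathit{po\text{-}loc} \cup (\mathit{po} \setminus \mathit{poWR})$
        $\mathit{fr} = \mathit{rf}^{-1} ; \mathit{co}$
        $\mathit{tso} = \mathit{rf} \cup \mathit{fr} \cup \mathit{co}$
        axiom: $\mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})$
      where po = Program-Order, rf = Reads-From, and fr = From-Reads.
      Allowed Execution (diagram): events W y=1, W x=1, R y=0, R x=0, connected by po, rf, and fr edges. 13/41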
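
The axiom on the slides above can be checked mechanically on small candidate executions. The sketch below is an illustration under assumptions, not the authors' tooling: relations are plain Python sets of pairs, initial-state writes are modelled as explicit events (`Ix`, `Iy`), and the two candidates `SB` and `MP` are hand-built examples whose thread layout is assumed from the listed events. All function and variable names are hypothetical.

```python
def compose(r, s):
    """Relational composition r ; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r):
    return {(b, a) for (a, b) in r}


def acyclic(edges):
    """Depth-first search for a cycle in the directed graph given by edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph[node]:
            if colour[succ] == GREY:
                return False                      # back edge: cycle found
            if colour[succ] == WHITE and not visit(succ):
                return False
        colour[node] = BLACK
        return True

    return all(visit(node) for node in graph if colour[node] == WHITE)


def tso_allowed(events, po, rf, co):
    """events maps an event name to (kind, location), kind being 'W' or 'R'."""
    kind = lambda e: events[e][0]
    loc = lambda e: events[e][1]
    po_wr = {(a, b) for (a, b) in po if kind(a) == "W" and kind(b) == "R"}
    po_loc = {(a, b) for (a, b) in po if loc(a) == loc(b)}
    uniproc = po_loc | (po - po_wr)
    fr = compose(inverse(rf), co)
    tso = rf | fr | co
    return acyclic(uniproc | tso)


INIT = {"Ix": ("W", "x"), "Iy": ("W", "y")}       # assumed initial-state writes

# Store buffering: T0 = W x=1; R y=0   T1 = W y=1; R x=0
SB = ({**INIT, "a": ("W", "x"), "b": ("R", "y"), "c": ("W", "y"), "d": ("R", "x")},
      {("a", "b"), ("c", "d")},                   # po
      {("Iy", "b"), ("Ix", "d")},                 # rf: both reads see the initial state
      {("Ix", "a"), ("Iy", "c")})                 # co

# Message passing: T0 = W x=1; W y=1   T1 = R y=1; R x=0
MP = ({**INIT, "a": ("W", "x"), "b": ("W", "y"), "c": ("R", "y"), "d": ("R", "x")},
      {("a", "b"), ("c", "d")},
      {("b", "c"), ("Ix", "d")},
      {("Ix", "a"), ("Iy", "b")})

print(tso_allowed(*SB), tso_allowed(*MP))
```

In `SB` both program-order edges go from a write to a read, so they drop out of uniproc and no cycle remains. In `MP` the program-order edges are kept and close the cycle a → b → c → d → a through rf and fr.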

  17. “user-mode” concurrency Much work not covered here: ◮ Fences ◮ Atomics ◮ Mixed-size ◮ Multi-copy atomicity ◮ Other Architectures: IBM Power, Arm, RISC-V 14/41

  18. Systems Architecture Semantics Exceptions and Interrupts Instruction Fetch ESOP2020 with Ohad Kammar Pagetables and TLBs Devices and NVME Future Work . . . 15/41

  19. JITs: Just-In-Time Compilation. [Diagram with three panels: Source Code (functions f and g, with CALL f and CALL g sites), Jump Table (Jump 0x1000, Jump 0x2000), and Compiled Code.] Optimized code now unsound, have to re-compile! 16/41

  20. JITs: JIT de-opt after executing g. [Diagram with three panels: Source Code (functions f and g, with CALL f and CALL g sites, and a PC marker), Jump Table (Jump 0x1000, Jump 0x2000), and Compiled Code.] Optimized code now unsound, have to re-compile! 17/41

  21. JITs: JIT re-compile. [Diagram with three panels: Source Code (functions f and g, with CALL f and CALL g sites, a PC marker, and a second f label), Jump Table (Jump 0x1000, Jump 0x2000, Jump 0x3000), and Compiled Code.] Optimized code now unsound, have to re-compile! 18/41

  22. ARMv8: How to safely modify code? 19/41

  23. RISC-V/x86/Power: How to? Similar for IBM Power. Much easier on x86. RISC-V not decided yet... Focus on ARMv8-A for rest of talk... 20/41

  24. An Instruction Fetching Test. Overwrite code of function f (Write f = “print(2)”), then call f (CALL f). Memory initially holds f: print(1); RETURN. 21/41

  25. Real A64 Assembly
      Initial state: 0:W0="B l1", 0:X1=f
      Thread 0:
        STR W0,[X1]
        BL f
      f:
        f:  B l0
        l1: MOV X0,#2
            RET
        l0: MOV X0,#1
            RET
      Allowed: 1:X0=1
      Relaxed Result Observed in ~99% of experimental runs on multiple devices. 22/41

  26. An Architectural Model! Source code: Write f = “print(2)”; CALL f; with f: print(1); RETURN in memory. [Diagram labels: Thread, fetch request, new fetch, per-thread Fetch Queue, decode, Abstract icache (Prefetching, Stale instructions), add to icache, write data, read data, Abstract global dcache (Data buffering), Memory.] 23/41
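
A toy version of this picture shows why the test above can still execute the old instruction. The sketch is a deliberate simplification with assumptions stated up front: the icache is a per-thread dictionary that may or may not have prefetched f before the write, the write is applied to memory immediately, and `invalidate_icache` is a hypothetical idealised step standing in for whatever cache maintenance the architecture requires. It is not the model from the talk.

```python
def fetch_outcomes(invalidate_icache):
    """Enumerate which instruction a fetch of f can return after f is overwritten."""
    outcomes = set()
    for prefetched in (False, True):       # did the icache load f before the write?
        memory = {"f": "print(1)"}
        icache = {}
        if prefetched:
            icache["f"] = memory["f"]      # speculative prefetch of the old code
        memory["f"] = "print(2)"           # the write that overwrites f
        if invalidate_icache:
            icache.pop("f", None)          # hypothetical idealised invalidation
        # Fetch: a hit in the abstract icache wins, otherwise read memory.
        outcomes.add(icache.get("f", memory["f"]))
    return outcomes


print(sorted(fetch_outcomes(invalidate_icache=False)))
print(sorted(fetch_outcomes(invalidate_icache=True)))
```

Without the invalidation step both the stale and the new instruction are possible results of the fetch; with it, only the new one remains in this toy setting.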


  28. Unexpected Coherence!
      Thread A: f = “print(2)”
      Thread B: CALL f; print(f)
      f: print(1); RETURN
      If f executes print(2), then print(f) must print the updated memory (2). 24/41

  29. Real A64 Assembly
      Initial state: 0:W0="B l1", 0:X1=f, 1:X2=f
      Thread 0:
        STR W0,[X1]
      Thread 1:
        BL f
        LDR X1,[X2]
      f:
        f:  B l0
        l1: MOV X0,#2
            RET
        l0: MOV X0,#1
            RET
      Forbidden: 1:X0=2, 1:X1="B l0" 25/41

  30. Other Phenomena Not Mentioned Here: ◮ (In)coherence ◮ Multiple images in I-cache ◮ Multiple images in D-cache(s) ◮ Direct Data Intervention ◮ Speculating cache maintenance ◮ O/S Migration ◮ and others . . . 26/41

  31. Operational Model. [Diagram labels: Thread, fetch request, new fetch, per-thread Fetch Queue, decode, Abstract icache, add to icache, write data, read data, Abstract global dcache, Memory.] 27/41

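  32. Operational State
      $m : \langle\, \mathit{ts} : \mathit{tid} \mapsto \mathit{instruction\_tree};\ \mathit{ss} : \mathit{storage\_subsystem} \,\rangle$
      $\mathit{storage\_subsystem} : \langle\, \mathit{mem} : \mathit{write}\ \mathit{list};\ \mathit{icache} : \mathit{tid} \mapsto \mathit{write}\ \mathit{set};\ \mathit{dcache} : \mathit{write}\ \mathit{list};\ \ldots \,\rangle$ 28/41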

  33. Thread State Explicit Speculation Sequential ISA Spec 29/41


  36. Operational: Transitions. Transitions: ◮ Step ISA Spec ◮ Memory Read/Write ◮ ... ◮ Fetch Request ◮ Fetch Instruction (from icache) ◮ Decode Instruction (New!) ◮ ... ◮ Update Instruction Cache ◮ Flow Writes into Memory ◮ Reset Instruction. *Exact names may vary. 30/41

  37. Operational Rule (prose): Flow Writes into Memory
      An instruction i in the state Perform_DC (address, state_cont) can complete if all po-previous DMB ISH and DSB ISH instructions have finished.
      Action:
      1. For the most recent writes ws which are in the same data cache line of minimum size in the abstract data cache as address, update the memory with ws.
      2. Remove all those writes from the abstract data cache.
      3. Set the state of i to Plain (state_cont). 31/41

  38. Operational Rule (lem)
      let flat_propagate_dc params state _cmr addr =
        (* remove all to that cacheline from buffer *)
        let (overlapping, fetch_buf) =
          List.partition (write_overlaps_with_addr (cache_line_fp addr)) state.flat_ss_fetch_buf in
        (* flow the overlapping writes into memory *)
        List.foldr
          (fun write state -> flat_write_to_memory params state write)
          (<| state with flat_ss_fetch_buf = fetch_buf |>)
          overlapping
      32/41
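
The storage-side part of the rule (actions 1 and 2) can be approximated outside Lem as follows. This is a rough sketch under assumptions, not a translation of the definition above: the 64-byte `CACHE_LINE_BYTES`, the `Write` and `StorageSubsystem` types, and the function name are all hypothetical, writes are single-address values, and the thread-side precondition and the transition to Plain (state_cont) are left out.

```python
from dataclasses import dataclass, field

CACHE_LINE_BYTES = 64   # hypothetical minimum data cache line size


@dataclass
class Write:
    addr: int
    value: int


@dataclass
class StorageSubsystem:
    mem: dict = field(default_factory=dict)     # address -> value
    dcache: list = field(default_factory=list)  # buffered writes, oldest first


def cache_line(addr):
    return addr // CACHE_LINE_BYTES


def flow_writes_into_memory(ss, addr):
    """Flush every buffered write in the cache line of addr to memory."""
    line = cache_line(addr)
    overlapping = [w for w in ss.dcache if cache_line(w.addr) == line]
    remaining = [w for w in ss.dcache if cache_line(w.addr) != line]
    # Applying oldest first means the most recent write to each address wins.
    for w in overlapping:
        ss.mem[w.addr] = w.value
    ss.dcache = remaining
    return ss


ss = StorageSubsystem(dcache=[Write(0x1000, 1), Write(0x2000, 7),
                              Write(0x1000, 2), Write(0x1008, 3)])
flow_writes_into_memory(ss, 0x1004)
print(ss.mem, ss.dcache)
```

After the call, memory holds the latest values for the two addresses in the flushed line, and the write to the unrelated line at 0x2000 stays buffered.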

  39. RMEM https://www.cl.cam.ac.uk/~pes20/rmem/ 33/41

  40. Axiomatic-Style Model
      let iseq = [W];(wco&scl);[DC];
                 (wco&scl);[IC]

      (* Observed-by *)
      let obs = rfe | fr | wco
        | irf | (ifr;iseq)

      (* Fetch-ordered-before *)
      let fob = [IF]; fpo; [IF]
        | [IF]; fe
        | [ISB]; fe^-1; fpo

      (* Dependency-ordered-before *)
      let dob = addr | data
        | ctrl; [W]
        | (ctrl | (addr; po)); [ISB]
        | addr; po; [W]
        | (addr | data); rfi

      (* Atomic-ordered-before *)
      let aob = rmw
        | [range(rmw)]; rfi; [A|Q]

      (* Barrier-ordered-before *)
      let bob = [R|W]; po; [dmb.sy]
        | [dmb.sy]; po; [R|W]
        | [L]; po; [A]
        | [R]; po; [dmb.ld]
        | [dmb.ld]; po; [R|W]
        | [A|Q]; po; [R|W]
        | [W]; po; [dmb.st]
        | [dmb.st]; po; [W]
        | [R|W]; po; [L]
        | [R|W|F|DC|IC]; po; [dsb.ish]
        | [dsb.ish]; po; [R|W|F|DC|IC]
        | [dmb.sy]; po; [DC]

      (* Cache-op-ordered-before *)
      let cob = [R|W]; (po&scl); [DC]
        | [DC]; (po&scl); [DC]

      (* Ordered-before *)
      let ob = obs|fob|dob|aob|bob|cob

      (* Internal visibility requirement *)
      acyclic (po-loc|fr|co|rf) as internal

      (* External visibility requirement *)
      acyclic ob as external

      (* Atomic *)
      empty rmw & (fre; coe) as atomic

      (* Constrained unpredictable *)
      let cff = ([W];loc;[IF]) \ ob+^-1 \ (co;iseq;ob+)
      cff_bad cff ≡ CU
      34/41
