TSO-Atomicity: TSO Enforcement for Aggressive Program Optimization
Cheng Wang, Youfeng Wu, Jaewoong Chung
Programming Systems Lab / MPR Intel Labs
Motivation
• Two different usage models in architecture support for atomic region execution
– HW primitive to enforce exclusive memory access
– Supports transactional memory, lock elision, etc.
– HW primitive to enforce Sequential Consistency (SC)
– Supports SC-preserving aggressive program optimization
• Popular architectures like x86 / SPARC implement Total Store Ordering (TSO) instead of SC
– Weaker consistency in TSO leads to more efficient HW implementation
• TSO-Atomicity: new HW primitive to enforce TSO
– Weaker consistency than atomicity leads to more efficient HW implementation than atomicity
– Supports TSO-preserving aggressive program optimization
Relationship between Different Consistency Models
[Figure: the HW primitives Atomicity and TSO-Atomicity set against the memory consistency models SC and TSO. Vertical label: "Enforce Memory Consistency for Optimization". Horizontal arrow: "Weaker Consistency with More Efficient Implementation". Four links in the diagram are marked "?".]
SC vs. TSO
SC Execution Trace: ld1, st2_start, st2_end, st3_start, st3 cache miss, st3_end, ld4, st5_start, st5_end, ld6, ld7
TSO Execution Trace: ld1, st2_start, st2_end, st3_start, ld4 (store forward), st5_start, ld6, st3_end, st5_end, ld7 (the st3 cache miss is shown between st3_start and st3_end)
• st_start writes store data into the store buffer
• st_end writes store data from the store buffer to cache/memory
• A later load may be reordered with an earlier store
• Data is forwarded from the store buffer to a later load
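The store-buffer behavior described on this slide can be illustrated with a minimal sketch. The class name `StoreBufferCore` and its methods are hypothetical names chosen for illustration, not part of the presentation; the model assumes a single core with a FIFO store buffer and ignores caches and coherence.

```python
from collections import deque


class StoreBufferCore:
    """Toy single-core model with a FIFO store buffer (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # shared dict: address -> value
        self.buffer = deque()  # pending stores as (address, value), oldest first

    def st_start(self, addr, value):
        # st_start: write the store data into the store buffer
        self.buffer.append((addr, value))

    def st_end(self):
        # st_end: move the oldest buffered store to cache/memory
        addr, value = self.buffer.popleft()
        self.memory[addr] = value

    def load(self, addr):
        # Forward from the youngest matching buffered store, else read memory
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory[addr]


memory = {"X": 0, "Y": 0}
core = StoreBufferCore(memory)
core.st_start("X", 1)
forwarded = core.load("X")      # forwarded from the store buffer
other = core.load("Y")          # performed before the store to X reaches memory
visible_before = memory["X"]    # other cores still see the old value here
core.st_end()
visible_after = memory["X"]
```

In this sketch the load of `Y` completes while the store to `X` is still buffered, which is the load/store reordering that TSO permits and SC forbids.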
Relationship between Different Consistency Models
[Figure: Atomicity and TSO-Atomicity set against SC and TSO, with the vertical label "Enforce Memory Consistency for Optimization" and the horizontal arrow "Weaker Consistency with More Efficient Implementation". Four links are marked "?".]
Execution with Atomicity
[Figure: region execution timeline. Labels: first inst execution; first load execution; first store execution; snoop load; snoop store; commit all load/store atomically.]
Inefficiency of Atomicity
[Figure: execution trace split into region1 and region2: ld1, st2_start, st2_end, st3_start, st3 cache miss, st3_end, ld4, st5_start, st5_end, ld6, ld7]
Across a region boundary, atomicity suffers inefficiency similar to SC.
Atomicity vs. TSO-Atomicity
[Figure: two region execution timelines side by side.
Execution with Atomicity: first inst execution; first load execution; first store execution; snoop load; snoop store; commit all load/store atomically.
Execution with TSO-Atomicity: first inst execution; first load execution; first store execution; snoop load; snoop store; commit all load atomically; commit all store atomically.]
Efficient Implementation of TSO-Atomicity
[Figure: TSO-Atomicity execution trace across region1 and region2: ld1, st2_start, st2_end, st3_start, ld4 (store forward), st5_start, ld6, st3_end, st5_end, ld7; the st3 cache miss is shown between st3_start and st3_end]
• Load snoop in region2 uses the speculative bits released by load commit in region1
• No snooping for store data in the store buffer until it is written to cache
• Across a region boundary, TSO-atomicity achieves the same efficiency as TSO
Prevent Unnecessary Abort
Init: A = 0, B = 0
thread1: r1 ← ld B; st A ← 1
thread2: r2 ← ld A; st B ← 1
Results: r1 = 0, r2 = 0
• By committing loads earlier, TSO-atomicity reduces unnecessary aborts due to memory conflicts
• Because it enforces TSO rather than exclusive memory access, TSO-atomicity is not constrained to produce only serializable schedules
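A small sketch can check why the result r1 = 0, r2 = 0 needs no abort. It assumes each thread is one region, that atomicity admits only the two serial orders of the regions, and that TSO-atomicity can be approximated by letting each region's load commit and store commit interleave freely with the other region's (load commit first). The helper names are hypothetical.

```python
from itertools import permutations


def run(events):
    """Execute (thread, op, variable) events against memory; return (r1, r2)."""
    mem = {"A": 0, "B": 0}
    regs = {}
    for tid, op, var in events:
        if op == "ld":
            regs[tid] = mem[var]
        else:  # "st" always writes 1 in this example
            mem[var] = 1
    return (regs[1], regs[2])


T1 = [(1, "ld", "B"), (1, "st", "A")]  # thread1: r1 <- ld B; st A <- 1
T2 = [(2, "ld", "A"), (2, "st", "B")]  # thread2: r2 <- ld A; st B <- 1

# Atomicity (assumed model): whole regions execute one after the other.
serial = {run(T1 + T2), run(T2 + T1)}


def keeps_commit_order(order):
    # Within each region, the load commit must precede the store commit.
    return all(order.index(t[0]) < order.index(t[1]) for t in (T1, T2))


# TSO-atomicity (assumed model): load commit and store commit are separate
# atomic steps that may interleave with the other region's steps.
split_commit = {
    run(order) for order in permutations(T1 + T2) if keeps_commit_order(order)
}
```

Under these assumptions `(0, 0)` is absent from `serial` but present in `split_commit`, which matches the slide's point that the schedule is not serializable yet still acceptable.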
Semantics of Memory Consistency Models
Init: A = 0, B = 0
thread1: st A ← 1; r1 ← ld B
thread2: st B ← 1; r2 ← ld A
Results (r1, r2):
TSO: (0, 0), (0, 1), (1, 0), (1, 1)
SC: (0, 1), (1, 0), (1, 1)
Atomicity: (0, 1), (1, 0)
TSO-Atomicity: (0, 0), (0, 1), (1, 0)
[Figure: diagram relating the result sets of TSO, SC, Atomicity, and TSO-Atomicity]
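The four result sets can be reproduced by brute-force enumeration. The sketch below is an illustrative model, not the authors' formalism: SC interleaves program-ordered operations, TSO splits each store into `st_start`/`st_end` around a per-thread store buffer, atomicity runs the two regions serially, and TSO-atomicity treats each region as a load commit followed by a store commit. All names are hypothetical.

```python
from itertools import product


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(trace):
    """Execute (thread, op, variable) events; every store writes the value 1."""
    mem = {"A": 0, "B": 0}
    buf = {1: {}, 2: {}}  # per-thread store buffers
    regs = {}
    for tid, op, var in trace:
        if op == "st_start":
            buf[tid][var] = 1
        elif op == "st_end":
            mem[var] = buf[tid].pop(var)
        elif op == "st":
            mem[var] = 1
        else:  # "ld": forward from own buffer if possible, else read memory
            regs[tid] = buf[tid].get(var, mem[var])
    return (regs[1], regs[2])


def outcomes(t1_orders, t2_orders):
    return {
        run(trace)
        for o1, o2 in product(t1_orders, t2_orders)
        for trace in interleavings(o1, o2)
    }


def tso_orders(tid, st_var, ld_var):
    # The buffered store may drain before or after the later load.
    start, end, ld = (tid, "st_start", st_var), (tid, "st_end", st_var), (tid, "ld", ld_var)
    return [[start, end, ld], [start, ld, end]]


sc_t1 = [(1, "st", "A"), (1, "ld", "B")]
sc_t2 = [(2, "st", "B"), (2, "ld", "A")]

sc = outcomes([sc_t1], [sc_t2])
tso = outcomes(tso_orders(1, "A", "B"), tso_orders(2, "B", "A"))
atomicity = {run(sc_t1 + sc_t2), run(sc_t2 + sc_t1)}
# Region = load commit, then store commit (both loads are upward-exposed).
tso_atomicity = outcomes(
    [[(1, "ld", "B"), (1, "st", "A")]],
    [[(2, "ld", "A"), (2, "st", "B")]],
)
```

In this model the atomicity results are contained in both the SC and the TSO-atomicity results, and all of them are contained in the TSO results, while SC and TSO-atomicity each allow one outcome the other does not.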
Relationship between Different Consistency Models
[Figure: Atomicity and TSO-Atomicity set against SC and TSO, with the vertical label "Enforce Memory Consistency for Optimization" and the horizontal arrow "Weaker Consistency with More Efficient Implementation". Three links are still marked "?".]
Atomicity Enforces SC for Optimization
[Figure: Physical Execution vs. Logical Execution of the region's computation]
• Logically, all computation executes atomically at region commit time
• The semantics of a region with atomicity are independent of optimizations within the region
• Enforces SC for aggressive program optimization
Relationship between Different Consistency Models
[Figure: Atomicity and TSO-Atomicity set against SC and TSO, with the vertical label "Enforce Memory Consistency for Optimization" and the horizontal arrow "Weaker Consistency with More Efficient Implementation". One link is still marked "?".]
TSO-Atomicity Enforces TSO for Optimization
[Figure: Physical Execution vs. Logical Execution; the logical execution separates Upward-Exposed Loads, Other Computation, and Downward-Exposed Stores]
• Logically, all upward-exposed loads execute atomically at load commit
• Non-upward-exposed loads get forwarded data
• Logically, all downward-exposed stores execute atomically at store commit
• Non-downward-exposed stores are overwritten
• The semantics of a region with TSO-atomicity are also independent of optimizations within the region
• Enforces TSO for aggressive program optimization
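The split between exposed and non-exposed accesses can be sketched as a single pass over a region. This assumes the usual compiler meaning of the terms, which the slide does not define explicitly: a load is upward-exposed if no earlier store in the region writes its address, and a store is downward-exposed if it is the last store to its address in the region. The function name `classify_region` is hypothetical.

```python
def classify_region(ops):
    """ops: list of ("ld" | "st", address) pairs in program order.

    Returns (indices of upward-exposed loads, indices of downward-exposed stores).
    """
    stored = set()       # addresses already written inside the region
    upward_loads = []    # loads that must read memory at load commit
    last_store = {}      # address -> index of the latest store to it
    for index, (kind, addr) in enumerate(ops):
        if kind == "ld":
            if addr not in stored:
                upward_loads.append(index)
            # otherwise the load gets data forwarded from an earlier store
        else:
            stored.add(addr)
            last_store[addr] = index  # any earlier store to addr is overwritten
    return upward_loads, sorted(last_store.values())


region = [
    ("ld", "A"),  # 0: upward-exposed
    ("st", "B"),  # 1: overwritten by op 3
    ("ld", "B"),  # 2: forwarded from op 1
    ("st", "B"),  # 3: downward-exposed
    ("st", "C"),  # 4: downward-exposed
    ("ld", "A"),  # 5: upward-exposed (A is never stored in the region)
]
up, down = classify_region(region)
```

Only the accesses in `up` and `down` are visible outside the region in this model, which is why reordering or eliminating the other accesses inside the region does not change its semantics.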
Conclusions and Future Work
• TSO-atomicity is a new HW primitive to enforce TSO for aggressive program optimization
– Weaker than atomicity, for efficient HW implementation
– Prevents unnecessary aborts due to conflicts
• Experimental results show that TSO-atomicity efficiently supports dynamic binary optimization
– 20% performance improvement through dynamic binary optimization with TSO-atomicity support
• Interesting future work is to study new HW primitives that enforce other relaxed memory consistency models for program optimization
– Weaker atomicity than TSO-atomicity to enforce weaker memory consistency than TSO