Self-Tuning Intel TSX
3rd Euro-TM Workshop on Transactional Memory
Nuno Diegues and Paolo Romano
To appear at the 11th USENIX ICAC 2014
Using TSX
    _xbegin
    // your transactional code
    _xend
A transaction may abort because of:
• Data contention
• Forbidden instructions
• Hardware buffers' capacity
• Signals and faults
An aborted transaction restarts transparently.
Using TSX
Best-effort nature, so we cannot rely exclusively on TSX.
Best-effort nature
Not *that* specific to Intel TSX; this partly applies to IBM HTMs too.
    begin:
      unsigned int status = _xbegin();
      if (status == ok)
        goto code;        // fast path
      if (shouldRetry)    // retry policy
        goto begin;
      else
        acquire(lock);    // fallback; transactions need to be aware of this
    code:
      // your transactional code
      if (shouldRetry)
        _xend();          // fast path
      else
        release(lock);    // fallback
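The pattern on this slide can be sketched in C with the RTM intrinsics from `<immintrin.h>`. This is a minimal sketch, not the authors' implementation: `run_atomic`, `fallback_lock`, `MAX_ATTEMPTS`, and `increment` are hypothetical names, the budget of 4 is an arbitrary assumption, and the retry policy is simply "retry until the budget is spent". The hardware path is compiled only when the compiler defines `__RTM__` (for example with `-mrtm`); otherwise every call takes the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#ifdef __RTM__
#include <immintrin.h> /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#endif

#define MAX_ATTEMPTS 4 /* assumed fixed budget; the talk tunes this at run time */

atomic_int fallback_lock; /* hypothetical global lock: 0 = free, 1 = held */

void lock_acquire(void) {
    while (atomic_exchange(&fallback_lock, 1)) { /* spin until we own it */ }
}

void lock_release(void) { atomic_store(&fallback_lock, 0); }

/* Runs body(arg) atomically: in hardware when possible, else under the lock. */
void run_atomic(void (*body)(void *), void *arg) {
    assert(body != 0);
#ifdef __RTM__
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        /* Wait for the lock to be free before starting a transaction. */
        while (atomic_load(&fallback_lock)) { }
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Read the lock inside the transaction so that a later
               acquire by another thread aborts this transaction. */
            if (atomic_load(&fallback_lock)) _xabort(0xff);
            body(arg); /* fast path */
            _xend();
            return;
        }
        /* Aborted: status holds cause bits such as _XABORT_CAPACITY. */
    }
#endif
    lock_acquire(); /* fallback path */
    body(arg);
    lock_release();
}

/* Example body: increments the long that p points to. */
void increment(void *p) { ++*(long *)p; }
```

Calling `run_atomic(increment, &counter)` from several threads keeps `counter` consistent on either path, because a hardware transaction reads the lock and therefore cannot commit while a fallback holder is running.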
Summary of issues
• Lemming effect
• Number of attempts
• Retry policy
• Management of fall-back
[Figure: Genome from STAMP suite; speedup vs. threads (1 to 8). Series: wait-stubborn-4, wait-giveup-4, wait-stubborn-11, wait-half-8, wait-half-11, aux-giveup-3, none-giveup-1, GCC, and (Possible) Self-Tuning.]
Number of attempts
[Figure: Kmeans from STAMP; speedup vs. retries, for high contention and low contention.]
Gradient Descent for exploration
Gradient Descent: tuning the number of attempts
[Figure: performance vs. #attempts, showing optimization rounds 1 to 7, a random jump, and a threshold for stabilization.]
• Randomly search some direction; explore it while profitable
• Revert direction when not profitable
• Random jumps to avoid local maxima
• Memorize maxima to recover from unlucky jumps
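The exploration steps listed on this slide can be sketched as a small hill-climbing tuner. This is an illustrative sketch under assumptions, not the talk's implementation: the `tuner` struct, `tuner_step`, the relative-change stabilization test, and the uniform random jump are all hypothetical choices. The caller measures performance for the current `attempts` value and passes it in; the function returns the next value to try.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int attempts;       /* configuration currently being measured */
    int direction;      /* +1 or -1: where the next step goes */
    int min_attempts, max_attempts;
    double threshold;   /* relative change below which we stay put */
    double jump_prob;   /* chance of a random jump per round (0 disables) */
    double last_perf;   /* performance of the previous configuration */
    double best_perf;   /* memorized maximum ... */
    int best_attempts;  /* ... and the configuration that produced it */
    int has_last, jumped;
} tuner;

/* Feed the performance measured for t->attempts; returns the next value. */
int tuner_step(tuner *t, double perf) {
    assert(t->min_attempts <= t->max_attempts);
    if (!t->has_last || perf > t->best_perf) { /* memorize maxima */
        t->best_perf = perf;
        t->best_attempts = t->attempts;
    }
    int compare = t->has_last, stay = 0;
    if (t->jumped) {
        t->jumped = 0;
        compare = 0; /* the previous sample is not a neighbour */
        if (perf < t->best_perf) { /* unlucky jump: go back to the maximum */
            t->attempts = t->best_attempts;
            t->last_perf = t->best_perf;
            return t->attempts;
        }
    }
    if (compare) {
        double scale = t->last_perf < 0 ? -t->last_perf : t->last_perf;
        double rel = scale > 0 ? (perf - t->last_perf) / scale
                               : perf - t->last_perf;
        double mag = rel < 0 ? -rel : rel;
        if (mag < t->threshold) stay = 1;               /* stabilized */
        else if (rel < 0) t->direction = -t->direction; /* not profitable */
    }
    t->last_perf = perf;
    t->has_last = 1;
    if (t->jump_prob > 0 && rand() < t->jump_prob * RAND_MAX) {
        int span = t->max_attempts - t->min_attempts + 1;
        t->attempts = t->min_attempts + rand() % span;  /* random jump */
        t->jumped = 1;
    } else if (!stay) {
        t->attempts += t->direction; /* keep exploring this direction */
    }
    if (t->attempts > t->max_attempts) { t->attempts = t->max_attempts; t->direction = -1; }
    if (t->attempts < t->min_attempts) { t->attempts = t->min_attempts; t->direction = 1; }
    return t->attempts;
}
```

With `jump_prob` set to 0 the tuner is deterministic: on a single-peaked performance curve it climbs to the peak and then oscillates one step around it, while `best_attempts` records the peak.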
Retry policy
• Give up on capacity aborts?
• How should we "consume" the attempts' budget?
• How to manage the fall-back?
Reinforcement learning: Upper Confidence Bound
UCB: tuning the retry policy
[Figure: three levers with unknown payoffs, labelled Lever A, Lever B, and Lever C.]
• A quest for exploration vs. benefit from current knowledge
• UCB adapts the strategy to maximize reward
• Logarithmic bound on the optimization error
UCB: tuning the retry policy
Model the belief about capacity aborts:
• giveup: exhaust attempts
• half: drops half the attempts
• stubborn: decrements attempts
Reward: function of processor cycles (RDTSC)
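The three beliefs about capacity aborts can be written as a function from the remaining attempts budget to the new budget. This is a sketch of one reading of the slide; the enum, the function name, and the rounding down in the half policy are assumptions.

```c
#include <assert.h>

typedef enum { POLICY_GIVEUP, POLICY_HALF, POLICY_STUBBORN } capacity_policy;

/* Returns the attempts left after a capacity abort under the given policy. */
int attempts_after_capacity_abort(capacity_policy p, int remaining) {
    assert(remaining >= 0);
    switch (p) {
    case POLICY_GIVEUP:   return 0;             /* exhaust the budget at once */
    case POLICY_HALF:     return remaining / 2; /* drop half (rounds down) */
    case POLICY_STUBBORN: return remaining > 0 ? remaining - 1 : 0;
    }
    return 0; /* unknown policy: treat as giveup */
}
```

A budget of 0 means the next step is the fall-back path, so giveup goes to the lock immediately while stubborn treats a capacity abort like any other abort.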
Adaptation of one atomic block in Yada
[Figure: adaptation of one atomic block in Yada.]
The optimizers are *not* independent.
Transparency to the User
[Flow diagram: the application calls atomic_begin and atomic_end; the tuning runs inside GCC libitm.]
• Begin Tx procedure (atomic_begin): fetch the atomic block's stats; if it is time to re-optimize, profile cycles; fetch the last configuration; govern retry management (retry on abort).
• The application executes the atomic block's logic.
• End Tx procedure (atomic_end): if it is time to re-optimize, profile cycles, run grad() and run ucb(), and GCC libitm changes the next configuration; otherwise continue the program.
Summary of Evaluation
Peek view on results
[Figure: Intruder from STAMP; speedup vs. threads (1 to 8). Series: "ideal" and self-tuning.]
Peek view on results
[Figure: Yada with 8 threads; throughput (1000 txs/sec) vs. execution time (sec). Series: GCC, Heuristic, AdaptiveLocks, and Tuner. Annotation: benchmark finished.]