Understanding Transactional Memory Performance



  1. Understanding Transactional Memory Performance
     Donald E. Porter and Emmett Witchel, The University of Texas at Austin

  2. Multicore is here
     This laptop: 2 Intel cores
     Intel Single-chip Cloud Computer: 48 cores
     Tilera Tile GX: 100 cores
      Only concurrent applications will perform better on new hardware

  3. Concurrent programming is hard
      Locks are the state of the art
        Correctness problems: deadlock, priority inversion, etc.
        Scaling performance requires more complexity
      Transactional memory makes correctness easy
        Trade correctness problems for performance problems
      Key challenge: performance tuning transactions
      This work:
        Develops a TM performance model and tool
        Systems integration challenges for TM

  4. Simple microbenchmark

     Locks:                         Transactions:
     lock();                        xbegin();
     if(rand() < threshold)         if(rand() < threshold)
         shared_var = new_value;        shared_var = new_value;
     unlock();                      xend();

      Intuition:
        Transactions execute optimistically
        TM should scale at a low contention threshold
        Locks always execute serially

  5. Ideal TM performance

     xbegin();
     if(rand() < threshold)
         shared_var = new_value;
     xend();

     [Chart: execution time (s) vs. probability of conflict (%), Locks 32 CPUs vs. Ideal TM 32 CPUs; lower is better. Ideal, not real data.]

      Performance win at low contention
      Higher contention degrades gracefully

  6. Actual performance under contention

     xbegin();
     if(rand() < threshold)
         shared_var = new_value;
     xend();

     [Chart: execution time (s) vs. probability of conflict (%), Locks 32 CPUs vs. TM 32 CPUs; lower is better. Actual data.]

      Comparable performance at modest contention
      40% worse at 100% contention

  7. First attempt at microbenchmark

     xbegin();
     if(rand() < threshold)
         shared_var = new_value;
     xend();

     [Chart: execution time (s) vs. probability of conflict (%), Locks 32 CPUs vs. TM 32 CPUs; lower is better. Approximate data.]

  8. Subtle sources of contention

     Microbenchmark code:
     if(a < threshold)
         shared_var = new_value;

     gcc-optimized code:
     eax = shared_var;
     if(edx < threshold)
         eax = new_value;
     shared_var = eax;

      Compiler optimization to avoid branches
      Optimization causes 100% restart rate
      Can’t identify the problem by source inspection alone

  9. Developers need TM tuning tools
      Transactional memory can perform pathologically
        Contention
        Poor integration with system components
        HTM “best effort” not good enough
      Causes can be subtle and counterintuitive
      Syncchar: model that predicts TM performance
        Predicts poor performance → remove contention
        Predicts good performance + poor performance → system issue

  10. This talk
      Motivating example
      Syncchar performance model
      Experiences with transactional memory
      Performance tuning case study
      System integration challenges

  11. The Syncchar model
      Approximate transaction performance model
      Intuition: scalability limited by serialized length of critical regions
      Introduces two key metrics for critical regions:
        Data independence: likelihood executions do not conflict
        Conflict density: how many threads must execute serially to resolve a conflict
      Model inputs: sampled critical region executions
        Memory accesses and execution times

  12. Data independence (I_n)
      Expected number of non-conflicting, concurrent executions of a critical region. Formally:

         I_n = n - |C_n|
         n   = thread count
         C_n = set of conflicting critical region executions

      Linear speedup when all critical regions are data independent (I_n = n)
        Example: thread-private data structures
      Serialized execution when I_n = 0
        Example: concurrent updates to a shared variable

  13. Example:

                Scenario A    Scenario B
     Thread 1:  Read a        Write a
     Thread 2:  Write a       Read a
     Thread 3:  Write a       Write a
                (time →)

      Same data independence (0)
      Different serialization

  14. Conflict density (D_n)
      Intuition:

                Low density    High density
     Thread 1:  Write a        Read a, Write a
     Thread 2:  Read a
     Thread 3:  Write a        Write a
                (time →)

      How many threads must be serialized to eliminate a conflict?
      Similar to dependence density introduced by von Praun et al. [PPoPP ‘07]

  15. Syncchar metrics in STAMP

     [Chart: conflict density and data independence, as projected speedup over locking, at 8/16/32 threads for intruder, kmeans, bayes, and ssca2; higher is better]

  16. Predicting execution time
      Speedup limited by conflict density
      Amdahl’s law: transaction speedup limited to time executing transactions concurrently

         Execution_Time = cs_cycles / (n / max(D_n, 1)) + other

         cs_cycles = time executing a critical region
         other     = remaining execution time
         D_n       = conflict density

  17. Syncchar tool
      Implemented as a module for the Simics machine simulator
      Samples lock-based application behavior
      Predicts TM performance
      Features:
        Identifies contention “hot spot” addresses
        Sorts by time spent in critical region
        Identifies potential asymmetric conflicts between transactions and non-transactional threads

  18. Syncchar validation: microbenchmark

     [Chart: execution time (s) vs. probability of conflict (%) for Locks 8 CPUs, TM 8 CPUs, and the Syncchar prediction; lower is better]

      Tracks trends, does not model pathologies
      Balances accuracy with generality

  19. Syncchar validation: STAMP

     [Chart: predicted vs. measured execution time (s) for intruder and ssca2 at 8, 16, and 32 CPUs]

      Coarse predictions track scaling trend
      Mean error 25%
      Additional benchmarks in paper

  20. Syncchar summary
      Model: data independence and conflict density
        Both contribute to transactional speedup
      Syncchar tool predicts scaling trends
        Predicts poor performance → remove contention
        Predicts good performance + poor performance → system issue
      Distinguishing high contention from system issues is a key step in performance tuning

  21. This talk
      Motivating example
      Syncchar performance model
      Experiences with transactional memory
      Performance tuning case study
      System integration challenges

  22. TxLinux case study
      TxLinux: modifies Linux synchronization primitives to use hardware transactions [SOSP 2007]

     [Chart: % kernel time spent synchronizing (aborts and spins) for Linux, TxLinux-xs, and TxLinux-cx on pmake, bonnie++, mab, find, config, and dpunish; 16 CPUs; graph taken from the SOSP talk; lower is better]

  23. Bonnie++ pathology
      Simple execution profiling indicated the ext3 file system journaling code was the culprit
      Code inspection yielded no clear culprit
      What information was missing?
        Which variable is causing the contention
        What other code is contending with the transaction
      Syncchar tool showed:
        The contended variable
        High probability (88-92%) of asymmetric conflict

  24. Bonnie++ pathology, explained

     Non-transactional thread:      Transactional thread:
     lock(buffer->state);           xbegin();
     ...                            ...
                                    assert(locked(buffer->state));  /* Tx reads state, writes other bits */
                                    ...
     ...                            xend();
     unlock(buffer->state);

     struct bufferhead {
         ...
         bit state;
         bit dirty;
         bit free;
         ...
     };

      False asymmetric conflicts for unrelated bits
      Tuned by moving state lock to a dedicated cache line

  25. Tuned performance – 16 CPUs

     [Chart: execution time (s) for TxLinux vs. TxLinux Tuned on bonnie++ (untuned >10 s), MAB, pmake, and radix; lower is better]

      Tuned performance strictly dominates TxLinux

  26. This talk
      Motivating example
      Syncchar performance model
      Experiences with transactional memory
      Performance tuning case study
      System integration challenges:
        Compiler (motivation)
        Architecture
        Operating system

  27. HTM designs must handle TLB misses
      Some best-effort HTM designs cannot handle TLB misses
        Example: Sun Rock
      What percent of STAMP transactions would abort for TLB misses?
        From 2% (kmeans) up to 50-100%
      How many times will these transactions restart?
        From 3 (ssca2) to 908 (bayes)
      Practical HTM designs must handle TLB misses

  28. Input size
      Simulation studies need scaled inputs
        Simulating 1 second takes hours to weeks
      STAMP comes with parameters for real and simulated environments

  29. Input size

     [Chart: speedup normalized to 1 CPU for Big vs. Sim inputs at 8, 16, and 32 CPUs on genome, ssca2, and yada; higher is better]

      Simulator inputs too small to amortize costs of scheduling threads

  30. System calls – memory allocation

     Thread 1:
     xbegin();
     malloc();
     xend();

     [Diagram: heap of 2 pages, with allocated and free blocks]

     Common case behavior: rollback of the transaction rolls back heap bookkeeping

  31. System calls – memory allocation

     Thread 1:
     xbegin();
     malloc();
     xend();

     [Diagram: allocator grows the heap from 2 to 3 pages inside the transaction]

     Uncommon case behavior: the allocator adds pages to the heap; rollback undoes the bookkeeping, leaking the pages
     Pathological memory leaks in the STAMP genome and labyrinth benchmarks
