Chip Multiprocessors (ACS MPhil)
6 • Transactional Memory
Robert Mullins

Overview
• Limitations of lock-based programming
• Transactional memory
  – Programming with TM
  – Software TM (STM)
  – Hardware TM (HTM)

Lock-based programming
• Lock-based programming is a low-level model
  – Close to basic hardware primitives
  – For some problems, lock-based solutions that perform well are complex and error-prone
    • Difficult to write, debug, and maintain
    • Not true of all problems
• Parallel programming for the masses
  – The majority of programmers will need to be able to produce highly parallel and robust software

Lock-based programming
• Challenges:
  – Must remember to use (the correct) locks
    • Take care to avoid locks when they are not required (for performance)
  – Coarse-grain vs. fine-grain locks
    • Simplicity
    • Unnecessary serialisation of operations
  – The lock may not actually be required in most cases (this is data dependent), so lock-based programming may be pessimistic
• We must also consider the time taken to acquire and release locks! (Even uncontended locks have a cost.)
  – What is the optimal granularity of locking? It is HW dependent.
Lock-based programming
• Other issues:
  – Deadlock
  – Scheduling threads
    • Priority inversion (e.g. the Mars Pathfinder problems)
      – Low-priority thread is preempted (while holding a lock)
      – Medium-priority thread runs
      – High-priority thread (needing the lock) can't make progress
    • Convoying
      – Thread holding a lock is descheduled, and a queue of threads forms behind it
  – Lost wake-ups (wait on a CV, but forget to signal)
  – Horribly complicated error recovery
  – Cannot even easily compose lock-based programs

Lost wake-up example
    push
      mutex::scoped_lock lock(pushMutex)
      queue.push(item)
      if (queue.size()==1)
        m_emptyCond.notify_one()
      // (implicit lock release when leaving scope)

    pop
      mutex::scoped_lock lock(popMutex)
      while (queue.empty())
        m_emptyCond.wait()
      item = queue.front()
      queue.pop()
      return item
• push and pop take different mutexes, so a push can run between pop's queue.empty() check and its wait(): the notify_one() arrives before anyone is waiting and is lost. Because push only notifies on the empty to non-empty transition, later pushes never wake pop either.

Lock-based programming
    // Trivial deadlock example
    // Thread 1      // Thread 2
    a.lock();        b.lock();
    b.lock();        a.lock();
    ...              ...
• Deadlock
  – We are free to do anything when we hold a lock, even take a lock on another mutex
  – This can quickly lead to deadlock if we are not careful
• Limiting ourselves to only being able to take a single lock at a time would force us to use coarse-grain locks
• e.g. consider maintaining two queues, each accessed by many different threads. We are infrequently required to transfer data from one queue to the other (atomically)

Lock-based programming
• Avoiding deadlock
  – Requires the programmer to adopt some sort of policy (although this isn't automatically enforced)
  – Often difficult to maintain/understand
• Lock hierarchies
  – All code must take locks in the same order
  – Lock chaining: take first lock, take second, release first, etc.
• Try and back off
  – More flexible than imposing a fixed order
  – Get the first lock
  – Then try to lock the additional mutexes in the required set. If we fail, release the locks and retry
    • pthread_mutex_trylock
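The two-queue scenario can be sketched in C++ with the standard library's std::mutex and std::condition_variable standing in for the mutex::scoped_lock style used above. This is a minimal sketch under stated assumptions: BlockingQueue and transfer are hypothetical names. Each queue uses one mutex for both its data and its condition variable, which removes the lost wake-up, and transfer moves one item atomically using try and back off, with std::unique_lock constructed with std::try_to_lock playing the role of pthread_mutex_trylock. Note that transfer has to reach into each queue's mutex and storage, which is exactly the exposure of synchronisation state that makes lock-based code hard to compose.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical blocking queue. A single mutex guards both the data and
// the condition variable, so a push cannot slip in between pop's
// emptiness check and its wait (the lost wake-up shown earlier).
class BlockingQueue {
public:
    void push(int item) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push_back(item);
        }
        notEmpty.notify_one();  // notify on every push, not only on 0 -> 1
    }

    int pop() {
        std::unique_lock<std::mutex> lock(m);
        // wait() atomically releases m while sleeping; the predicate is
        // re-checked after every wake-up, including spurious ones.
        notEmpty.wait(lock, [this] { return !q.empty(); });
        int item = q.front();
        q.pop_front();
        return item;
    }

    // Synchronisation state left public so that transfer() can lock two
    // queues at once: the composition problem in miniature.
    std::mutex m;
    std::condition_variable notEmpty;
    std::deque<int> q;
};

// Try and back off: hold the first lock, only *try* the second, and if
// that fails release everything and start again, so this thread never
// blocks while holding a lock. Returns false if src is empty.
bool transfer(BlockingQueue& src, BlockingQueue& dst) {
    assert(&src != &dst);  // try_lock on a mutex we already own is UB
    for (;;) {
        std::unique_lock<std::mutex> first(src.m);
        std::unique_lock<std::mutex> second(dst.m, std::try_to_lock);
        if (!second.owns_lock()) {
            first.unlock();
            std::this_thread::yield();  // back off, then retry
            continue;
        }
        if (src.q.empty()) return false;
        dst.q.push_back(src.q.front());
        src.q.pop_front();
        second.unlock();
        dst.notEmpty.notify_one();
        return true;
    }
}
```

Two threads transferring in opposite directions can, in principle, keep backing off indefinitely (livelock). In practice the standard library's std::lock (or C++17 std::scoped_lock), which acquires several mutexes using a deadlock-avoidance algorithm, is usually preferable to hand-written back-off.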
Lock-based programming
• Composing lock-based programs
  – Consider our example of two queues
  – There is no simple way of dequeuing from one and enqueuing to the other in an atomic fashion
    • We would need to expose synchronization state and force the caller to manage locks
  – Can't compose methods that block either (wait/notify)
    • How do we describe the operation where we want to dequeue from either queue, whichever has data?
    • Each queue implementation blocks internally

Transactions
    atomic {
      x = q0.deq();
      q1.enq(x);
    }
• Focus on where atomicity is necessary rather than on specific locking mechanisms
• The transactional memory system will ensure that the transaction is run in isolation from other threads
  – Transactions are typically run in parallel, optimistically
  – If transactions perform conflicting memory accesses, we must abort and ensure none of the side-effects of the abandoned transactions are visible

Transactions
• Atomicity (all-or-nothing)
  – We guarantee that it appears that either all the instructions are executed or none of them are (if the transaction fails: failure atomicity)
  – The transaction either commits or aborts
• Transactions execute in isolation
  – Other operations cannot access a transaction's intermediate state
  – The result of executing concurrent transactions must be identical to a result in which the transactions executed sequentially (serializability)

Transactions
    void Queue::enq(int v) {
      atomic {
        // queue is full
        if (count == MAX_LEN) retry;
        buf[tail] = v;
        if (++tail == MAX_LEN) tail = 0;
        count++;
      }
    }
• Retry
  – Abandon the transaction and try again
  – An implementation could wait until some changes occur in memory locations read by the aborted transaction
    • Or specify a specific watch set [Atomos/PLDI'06]
"Composable memory transactions", Harris et al.
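One way to pin down what atomic and retry promise is a deliberately naive reference implementation: a single global lock serialises all transactions, writes are buffered so that retry can discard them, and a retrying transaction sleeps until some other transaction commits. This is a sketch under those assumptions, not how a real TM works (there is no optimistic concurrency at all). Tx, atomically, Retry and the Queue layout are hypothetical names, and unlike the behaviour described on the Retry slide, retry here wakes on any commit rather than only on changes to the locations the transaction read.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>

struct Retry {};  // thrown to abandon the current transaction

// A transaction buffers its writes privately; reads see its own
// buffered writes first, then memory.
class Tx {
public:
    int read(int* addr) const {
        auto it = writes.find(addr);
        return it != writes.end() ? it->second : *addr;
    }
    void write(int* addr, int value) { writes[addr] = value; }
    [[noreturn]] void retry() const { throw Retry{}; }
    void commit() const {
        for (const auto& [addr, value] : writes) *addr = value;
    }

private:
    std::unordered_map<int*, int> writes;
};

std::mutex gLock;                 // serialises *every* transaction
std::condition_variable gCommit;  // signalled after every commit

// Run body atomically. On retry, drop its buffered writes and sleep
// until another transaction commits, then run it again from scratch.
void atomically(const std::function<void(Tx&)>& body) {
    std::unique_lock<std::mutex> lock(gLock);
    for (;;) {
        Tx tx;
        try {
            body(tx);
        } catch (const Retry&) {
            gCommit.wait(lock);  // releases gLock while sleeping
            continue;
        }
        tx.commit();
        gCommit.notify_all();
        return;
    }
}

constexpr int MAX_LEN = 4;

// The bounded queue from the slide, with every field access routed
// through the transaction.
struct Queue {
    int buf[MAX_LEN] = {};
    int head = 0, tail = 0, count = 0;

    void enq(Tx& tx, int v) {
        if (tx.read(&count) == MAX_LEN) tx.retry();  // queue is full
        int t = tx.read(&tail);
        tx.write(&buf[t], v);
        tx.write(&tail, t + 1 == MAX_LEN ? 0 : t + 1);
        tx.write(&count, tx.read(&count) + 1);
    }

    int deq(Tx& tx) {
        if (tx.read(&count) == 0) tx.retry();  // queue is empty
        int h = tx.read(&head);
        int v = tx.read(&buf[h]);
        tx.write(&head, h + 1 == MAX_LEN ? 0 : h + 1);
        tx.write(&count, tx.read(&count) - 1);
        return v;
    }
};
```

Because enq and deq only touch memory through the Tx, they compose: atomically([&](Tx& tx) { q1.enq(tx, q0.deq(tx)); }) moves one item as a single transaction, and blocks (via retry) while q0 is empty or q1 is full; in the latter case the buffered effects of the deq are simply discarded.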
Transactions
    atomic {
      x = q0.deq();
    } orElse {
      x = q1.deq();
    }
• Choice
  – Try to dequeue from q0 first; if this retries (i.e. the queue is empty), then try the second
  – If both retry, retry the whole orElse block
"Composable memory transactions", Harris et al.

Critical sections ≠ transactions
• Converting critical sections to transactions
  – Pitfall: "A critical section that was previously atomic only with respect to other critical sections guarded by the same lock is now atomic with respect to all other critical sections."
    proc1 {                    proc2 {
      acquire(m1)                acquire(m2)
      while (!flagA) {}          flagA = true
      flagB = true               while (!flagB) {}
      ....                       ....
      release(m1)                release(m2)
    }                          }
  – With distinct locks m1 and m2 the two critical sections can overlap, so the flag handshake completes. As transactions, each must appear to run entirely before or after the other, and neither order can finish: proc1 alone spins on flagA, proc2 alone spins on flagB.
"Deconstructing Transactional Semantics: The Subtleties of Atomicity", Colin Blundell, E Christopher Lewis, Milo M. K. Martin, WDDD, 2005

Implementing a TM system
• Transaction granularity
  – Object, word or block
• How do we provide isolation?
  – Direct or deferred update?
    • Update the object directly and keep an undo log
    • Update a private copy; on abort discard it, on commit replace the object
  – Also called eager and lazy versioning
• When and how do we detect conflicts?
  – Eager or lazy conflict detection?
• A software or hardware-supported implementation?

Hardware support for TM
• An introduction to hardware mechanisms for supporting transactional memory
  – See the Larus/Rajwar book for a more complete survey
  – We'll look at:
    • Knight, "An architecture for mostly functional languages", in LFP, 1986
    • A simple HTM with lazy conflict detection
    • Herlihy/Moss (1993)
  – Discuss others in reading group
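The design choices listed under "Implementing a TM system" can be made concrete with a small word-based STM that picks deferred update (lazy versioning) and lazy conflict detection. This is a sketch under simplifying assumptions, with hypothetical names (TWord, Transaction, atomically): each word carries a version number, a transaction records the version of every word it reads and buffers its writes privately, and at commit it takes a single global commit lock, checks that none of those versions has changed, and only then publishes its writes. It does not guarantee that a doomed transaction sees a consistent snapshot before validation fails.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// One transactional word: a value plus a version number that every
// committed write increments.
struct TWord {
    std::atomic<long> value{0};
    std::atomic<std::uint64_t> version{0};
};

std::mutex commitLock;  // serialises commits only, not whole transactions

class Transaction {
public:
    long read(TWord& w) {
        if (auto it = writes.find(&w); it != writes.end())
            return it->second;                 // see our own deferred write
        std::uint64_t ver = w.version.load();  // version first, then value
        long val = w.value.load();
        reads.emplace(&w, ver);                // keep the first version seen
        return val;
    }

    // Deferred update: the write goes to a private buffer.
    void write(TWord& w, long v) { writes[&w] = v; }

    // Lazy conflict detection: validate the read set at commit time.
    bool commit() {
        std::lock_guard<std::mutex> lock(commitLock);
        for (const auto& [w, ver] : reads)
            if (w->version.load() != ver) return false;  // someone committed
        for (const auto& [w, v] : writes) {
            w->value.store(v);        // publish the value...
            w->version.fetch_add(1);  // ...then bump its version
        }
        return true;
    }

private:
    std::unordered_map<TWord*, std::uint64_t> reads;
    std::unordered_map<TWord*, long> writes;
};

// Re-execute body until it commits; an aborted attempt just discards
// its private write buffer (nothing needs undoing).
template <typename Body>
void atomically(Body body) {
    for (;;) {
        Transaction tx;
        body(tx);
        if (tx.commit()) return;
        std::this_thread::yield();
    }
}
```

Choosing direct update instead would mean writing each word in place, keeping an undo log to restore on abort, and detecting conflicts eagerly so that no other transaction reads an uncommitted value.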
Hardware support for TM
• 1. Tom Knight (1986)
  – Not really a TM scheme: Knight describes a scheme for parallelising the execution of a single thread
  – Blocks are identified by the compiler and executed in parallel, assuming there are no memory-carried dependencies between them
  – Hardware support is provided to detect memory dependency violations
  – This work introduces the basic ideas of using caches and the cache coherence protocol to support TM
Larus/Rajwar book p.140 [Knight86]

Hardware support for TM
• Confirm Cache
  – A block executes to completion and then commits. Blocks are committed in the original program order
    • Any data written in the block is temporarily held in the confirm cache (not visible to other processors). This is swept and written back during commit.
    • On a processor read, priority is given to the data in the confirm cache
      – The block needs to see any writes it has made
[Knight86]

Hardware support for TM
• Dependency Cache
  – The dependency cache holds data read from memory. Data read during a block is held in state D (Depends)
    • A memory dependency violation is detected if a bus write (made by a block that is currently committing) updates a value in a dependency cache in state D
    • This indicates that the block read the data too early and must be aborted
[Knight86]
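The confirm-cache and dependency-cache behaviour described above can be captured in a toy software model. Each in-flight block keeps a write buffer standing in for its confirm cache and a set of addresses read from memory standing in for lines in state D; committing a block writes its buffered values back one at a time (the bus writes), and any other block holding a written address in state D is marked aborted. This is a sketch under stated assumptions, not Knight's hardware: SpeculativeBlock and commit are hypothetical names, the model is single-threaded and word-granular, ignores cache capacity, and leaves program-order commit to the caller.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Addr = int;
using Memory = std::map<Addr, int>;

// Toy model of one processor running a speculative block. Names are
// hypothetical and the model is word-granular with unbounded caches.
struct SpeculativeBlock {
    std::map<Addr, int> confirmCache;  // buffered writes, invisible to others
    std::set<Addr> dependsSet;         // locations held in state D
    bool aborted = false;

    int read(const Memory& mem, Addr a) {
        // Priority to the confirm cache: the block sees its own writes.
        if (auto it = confirmCache.find(a); it != confirmCache.end())
            return it->second;
        dependsSet.insert(a);  // value came from memory: mark it D
        auto it = mem.find(a);
        return it == mem.end() ? 0 : it->second;
    }

    void write(Addr a, int v) { confirmCache[a] = v; }

    // Snoop a bus write made by a block that is committing.
    void snoop(Addr a) {
        if (dependsSet.count(a) != 0) aborted = true;  // read too early
    }
};

// Commit: sweep the confirm cache onto the bus, writing each value back
// to memory while every other in-flight block snoops the write.
// The caller must commit blocks in original program order.
void commit(SpeculativeBlock& b, Memory& mem,
            const std::vector<SpeculativeBlock*>& others) {
    for (const auto& [a, v] : b.confirmCache) {
        mem[a] = v;
        for (SpeculativeBlock* o : others) o->snoop(a);
    }
    b.confirmCache.clear();
    b.dependsSet.clear();
}
```

For example, if a later block reads address 0 before an earlier block that writes address 0 has committed, that commit's bus write hits the later block's state-D entry and the later block is flagged for abort, exactly the "read the data too early" case on the slide.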