Transactional Memory: Architectural support for Lock-Free Data Structure Transactional Memory: Architectural Support for Lock-Free Data By Maurice Herlihy and J. Eliot B. Moss Structures 1993 By Maurice Herlihy and J Eliot B. Moss 1993 Presented by Minh Truong PSU May, 2013 Slide content heavily borrowed from Ashish Jh and Steve Coward PSU SP 2010 and 2011 CS-510
Agenda ● Overview of Lock - Base ● Overview of Lock - Free synchronization techniques ● Transactional Memory ○ Usage, ○ Implementation, ○ Cache Line states ○ Processor Actions ○ Snoopy Cache Actions ● Simulations ● Benchmarks & results ● Summary
Lock - Based synchronization ● Mutual Exclusion - only one threads at a time can access critical section ● Pessimistic approach ● Easy to uses ● Doesn't scale well do to test-and-set ● Cause problems as priority inversion, convoying and deadlock.
Lock - Free Synchronization and improvements ● Non-blocking ● Optimistic Approach ● Uses Read-Modify-Write (RMW) operations such as Compare-And-Swap (CAS) and LL&SC (Load link and Store Condition) ○ limited to operations on single-word or double-words ● Difficult program logic ● Doesn't have priority inversion, convoying and deadlock problems. What kinds of Improvement can we do to Lock-Free synchronization? Keeps the positive aspect of lock-free like non-blocking, and doesn't have ● issues such as priority inversion, convoying and deadlock. Allows RMW to operations on multiple, independently chosen words of ● memory Easy program logic ● Provide Mutual Exclusion ● This is where Transactional Memory comes in!!!
Transactional Memory "A transaction is a finite sequence of machine instructions, executed by a single process satisfying" the two properties below. (quote from paper) ● With serializability and atomicity properties the assumption here is even though both process can be running concurrently with every statements, only one process will be able to commit and the other will abort. ● Using these primitives Programmer can define customized read-modify-write operations that operate on ○ arbitrary regions of memory, not just with single words. ● Supporting non-transaction instruction, such as LOAD and STORE does not have any effect on the transaction read and write set.
Transactional Memory Primitive Instructions for Accessing Memory: Load-transactional(LT) - read value of shared memory location into private register Load-transactional-exclusive(LTX) - read value of shared memory location into private register and location to be updated. Store-transactional(ST)- write value from private register to shared memory location Primitive Instruction for manipulating transaction state: Commit (COMMIT) - make change permanent, Abort(ABORT) - discard all update to write set, Validate(VALIDATE) - test the current state.
Usage ● Intended to replace “short” critical section where there is no need to acquire lock, execute critical section and release_lock. ● It satisfies Atomicity and Serializability properties. ● Intended for activities that access small number of memory location in primary memory. ● Run complete within single scheduling quantum and the number of locations accessed not to exceed architecturally specified limit. Below is the code comparison between Transactional Memory and Non-block methods:
Implementation Design satisfies following criteria ● In absence of transactional memory, Non-transactional operation uses same caches, its control logic and coherency protocols ○ Otherwise uses two separate caches so traffic is isolated for non-transactional operations. Custom hardware support restricted to “primary caches” and instructions needed to ● communicate with them Committing or Aborting a transaction is an operation “local” to the cache ● Does not require communicating with other CPU’s or writing data back to memory ○ Transactional Memory exploits cache states associated with the cache coherency protocol Available on Bus based (Goodman's snoopy protocol) or network-based (Chaiken ● Directory protocol) architectures Cache State - could be in one the MESI (Modified, Exclusive, Shared, Invalid) states. ● SHARED - permitting READ when memory shared among all the processors. ● EXCLUSIVE - permitting WRITE but only exclusive to one processor. ● INVALID - not available to any processor. ● BASIC IDEA ● Any protocol capable of detecting accessibility conflicts can also detect transaction conflict at no extra cost. ● Implemented by modifying standard multiprocessor cache coherence protocols Apply this state logic with transaction counterparts ● If any transaction conflict detected - ABORT the transaction ● If transaction stalled - Use a timer or other sort of interrupt to abort the Tx ●
Cache Line States A dirty value “originally read” must either be write-back to memory, ● or set to XCOMMIT entry as its an “old” entry ○ this avoid continues writing back to memory which improve performance. ■ Transaction requests REFUSED by BUSY response this transaction will aborts and retries ● To prevents deadlock or continual mutual aborts Subject to starvation Every transaction operation takes 2 cache line entries ● Transactional cache cannot be big due to performance considerations - single cycle ○ abort/commit + cache management Therefore, only short transaction size supported ○
Processor Actions ● Non-transactional operations behave as in Goodman's original protocol. ● If action was issued by an aborted transaction then it will cause no bus cycle and may return arbitrary value. Other conditions for ABORT are interrupts and transaction cache overflow ● Commit does not force changes to memory it taken care of it (i.e. mem written) only ● when critical section(CL) is evicted or invalidated by cache coherence protocol (?)
Snoopy Cache actions Tables shows the response to each of the requests on both regular ● and transactional cache snoop on the bus. Either cache can issue a WRITE when it needs to replace a cache ● line. Main memory responds to all L1 read misses ● If TSTATUS==FALSE, Transactional cache acts as Regular cache (for ● NORMAL entries)
Simulations Bus-Based architecture (Goodman's snoopy protocol) ● Network-based architecture (Chaiken Directory Protocol) ● ● Used 32 processor ● Memory access (w/o contention) required 4 cycles. ● Network architecture used 2-stage network with wire and switch delay of 1 cycle each. ● Each access to regular or transaction cache, including commit and abort is a single cycle ○ reset transactional tag bits in parallel. ● Commit - not force newly-committed entries back to memory. Instead it replaced as they are evicted or invalidated ● Software mechanisms by the ongoing cache 1) test-and-test-and-set(TTS) coherence protocol. 2) spin locks with exponential backoff (MCS) ● Hardware mechanisms 1) LOAD_LINKED/STORE_COND (LL/SC) with exponential backoff (single-word counter benchmark, ran LL/SC directly on shared variable) 2) hardware queueing(QOSB) - incorporated into the cache coherence protocol
Counting Benchmark ● Each N processes increments a shared counter 2^16/n times. N range from 1 to 32 ● Two shared-memory accesses, high contention ● In absence of contention, TTS (same with SW & HW Queue) makes 5 references to mem for each increment ○ RD + test-and-set to acquire lock + RD and WR in CS ● TM requires only 3 mem accesses ○ RD & WR to counter and then COMMIT(no bus cycle)
Counting Results ● LL/SC doesn't required commit operation and without using lock therefore it perform the better.
Producer/Consumer Benchmark ● N processer share a bounded FIFO buffer. ● Half of the processors produce items and half consume items. ● 2^16 operations.
Producer/Consumer Benchmark Result ● Throughputs are flat for bus architecture network but transactional has higher throughput than other. ● Because of increase contention throughput suffer for network architecture. Transactional memory suffer less.
Doubly-Linked List Benchmark ● N processes share a doubly-linked list with head and tail pointer. ● Each transaction modifies head or tail but not both so one process can enqueues where it can't interference from other dequeuing. ● When queue is empty transaction must modify both head and tail which cause conflict. ● In lock-based algorithms will cause problem if both head and tail are not lock. ● Transaction uses VALIDATE to check for a pointer before dereference it.
Doubly - Linked List Result ● Lock-based algorithms throughput suffer because it doesn't allow enqueues and dequeues overlap.
Recommend
More recommend