ffwd: delegation is (much) faster than you think
Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
int get_seqno() { return ++seqno; } // ~1 Billion ops/s // single-threaded
int threadsafe_get_seqno() { acquire(lock); int ret=++seqno; release(lock); return ret; } // < 10 Million ops/s
[Plot: threadsafe_get_seqno throughput (Mops, 0-10) vs. hardware threads (0-128) for MCS, MUTEX, TTAS, CLH, and TAS locks.]
why so slow?
~70 ns intra-socket latency ~14 Mops
Quad-socket Intel system, QPI interconnect: 150-300 M cachelines/s, 10-20 GB/s, ~200 ns one-way latency. At one cross-socket cache line transfer per operation, 1 op / ~200 ns ≈ 5 million ops per second.
[Timeline: THREAD 1 and THREAD 2 passing a lock back and forth; each critical section is separated by interconnect latency and lock waiting, marked ~400x.]
[Timeline: the same with THREAD 3 added; the wait between a thread's critical sections grows further, marked ~600x.]
[Diagram: delegation, with many client threads and one dedicated server thread.]
[Timelines: CLIENT 1 sends a request to the dedicated server thread and waits ~400x (interconnect latency) around each critical section for the response. With CLIENT 1 .. CLIENT n, each client still waits ~400x per request, but the server runs critical sections back to back, overlapping the communication across clients.]
ffwd design (fast, fly-weight delegation)
• One dedicated 64-byte request line per client-server pair; requests are sent synchronously.
• Each group of 15 clients shares one 128-byte response line pair.
• client request (64 bytes): toggle bit, function pointer, arg count, argv[6]
• shared server response (128 bytes): toggle bits, return values[15]
• CLIENTS: write the request to the server, then spin on the shared server response for their thread group.
• SERVER: for each of the N thread groups, read all pending requests and act on them in batches of 15 clients, then write the responses, one line per thread group.
A request in more detail
client request (64 bytes): toggle bit, function pointer, arg count, argv[6]
shared server response (128 bytes): toggle bits, return values[15]
• a request is new if: request toggle bit != response toggle bit
• the server calls the function with the (64-bit) arguments provided
• the client polls the response line until the response toggle bit == its request toggle bit
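The slides do not show the actual ffwd code; below is a minimal C sketch of the cache line layout and client-side call path they describe. All names (ffwd_request, ffwd_delegate, seqno_fn, ...) are illustrative assumptions, not the real ffwd API, and the real implementation packs fields and handles memory ordering differently.

```c
#include <stdint.h>
#include <stdbool.h>

#define FFWD_ARGC   6     /* up to six 64-bit arguments per request          */
#define GROUP_SIZE  15    /* clients sharing one 128-byte response line pair */

/* One 64-byte request line per client-server pair.
   Bit 0 of 'flags' is the toggle bit; the remaining bits hold the argument
   count (packing both keeps the line at exactly 64 bytes). */
struct ffwd_request {
    uint64_t flags;
    uint64_t (*fptr)(uint64_t *argv);
    uint64_t argv[FFWD_ARGC];
} __attribute__((aligned(64)));

/* One 128-byte response line pair, shared by a group of 15 clients. */
struct ffwd_response {
    uint64_t toggles;            /* one toggle bit per client slot */
    uint64_t ret[GROUP_SIZE];
} __attribute__((aligned(128)));

/* Client side: write the request, flip the toggle bit, spin on the response. */
static uint64_t ffwd_delegate(struct ffwd_request *req,
                              struct ffwd_response *resp,
                              int slot,                      /* 0..14 in group */
                              uint64_t (*fptr)(uint64_t *),
                              uint64_t argc, const uint64_t *argv)
{
    uint64_t toggle = (req->flags & 1) ^ 1;   /* new toggle value */

    for (uint64_t i = 0; i < argc; i++)
        req->argv[i] = argv[i];
    req->fptr = fptr;
    /* Publishing the new flags word makes request toggle != response toggle,
       which is how the server recognizes a fresh request. */
    __atomic_store_n(&req->flags, (argc << 1) | toggle, __ATOMIC_RELEASE);

    /* Spin on the shared response line until the server flips our bit. */
    while (((__atomic_load_n(&resp->toggles, __ATOMIC_ACQUIRE) >> slot) & 1) != toggle)
        ;
    return resp->ret[slot];
}
```

With something like this, the opening example becomes a delegated call, e.g. `uint64_t s = ffwd_delegate(&my_req, &group_resp, my_slot, seqno_fn, 0, NULL);`, where seqno_fn is a hypothetical wrapper that increments the server-private counter.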
[Animation: the delegation server processing thread group 0. The server scans the group's request lines, executes each new request, and records the return value and flipped toggle bit in a local response buffer. Once the group is done, it copies the local buffer to the global response cache lines in one burst. The diagram tracks cache line states: request lines range from shared to modified as clients write new requests, and response lines from modified to shared as waiting clients read the freshly written responses.]
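A matching sketch of the server loop the animation walks through, continuing the structs above. The per-group batching into a local response buffer and the single write-back of the shared response line are the points the slides emphasize; everything else (group iteration, partial groups, fences) is simplified and assumed.

```c
/* Server side: scan each thread group, run all new requests into a local
   response buffer, then publish the whole buffer to the shared response
   line in one burst so every waiting client in the group wakes up. */
static void ffwd_server_loop(struct ffwd_request *reqs,    /* [nclients] */
                             struct ffwd_response *resps,  /* [ngroups]  */
                             int nclients)
{
    int ngroups = (nclients + GROUP_SIZE - 1) / GROUP_SIZE;

    for (;;) {
        for (int g = 0; g < ngroups; g++) {
            struct ffwd_response local = resps[g];   /* local response buffer */
            bool dirty = false;

            for (int s = 0; s < GROUP_SIZE; s++) {
                int c = g * GROUP_SIZE + s;
                if (c >= nclients)
                    break;
                uint64_t flags  = __atomic_load_n(&reqs[c].flags, __ATOMIC_ACQUIRE);
                uint64_t toggle = flags & 1;

                /* A request is new iff its toggle differs from the response toggle. */
                if (toggle == ((local.toggles >> s) & 1))
                    continue;

                local.ret[s]   = reqs[c].fptr(reqs[c].argv);
                local.toggles ^= 1ULL << s;          /* mark the slot as answered */
                dirty = true;
            }

            if (dirty) {
                /* Write the return values first, then release the toggles so
                   spinning clients never see a toggle without its result. */
                for (int s = 0; s < GROUP_SIZE; s++)
                    resps[g].ret[s] = local.ret[s];
                __atomic_store_n(&resps[g].toggles, local.toggles, __ATOMIC_RELEASE);
            }
        }
    }
}
```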
performance evaluation
evaluation systems
• 4 × 16-core Xeon E5-4660 (Broadwell, 2.2 GHz)
• 4 × 8-core Xeon E5-4620 (Sandy Bridge-EP, 2.2 GHz)
• 4 × 8-core Xeon E7-4820 (Westmere-EX, 2.0 GHz)
• 4 × 8-core AMD Opteron 6378 (Abu Dhabi, 2.4 GHz)
application benchmarks
• same benchmarks as Lozi et al. (RCL) [USENIX ATC '12]
• programs that spend a large fraction of their time in critical sections
• except BerkeleyDB (ran out of time)
[Plot: raytrace-car (SPLASH-2), duration (ms, 0-600) vs. # of threads (0-128) for FFWD, MUTEX, FC, MCS, TAS, and RCL. RCL experienced correctness issues above 82 threads.]
application benchmarks
• comparing best performance (at any thread count) across all methods
• up to 2.5x improvement over pthreads, each at its best thread count
• 10+ times speedup at the maximum thread count
[Plot: memcached-set, duration (sec, 0-300) vs. # of threads (0-128) for FFWD, MCS, MUTEX, TAS, and RCL. RCL experienced correctness issues above 24 threads; we did not get Flat Combining to work.]
microbenchmarks
• ffwd is much faster on largely sequential data structures
  • linked list (coarse locking), stack, queue
  • fetch-and-add, for few shared variables
• for highly concurrent data structures, ffwd falls behind when lock contention is low
  • fetch-and-add, with many shared variables
  • hashtable
• for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
  • lazy linked list
  • binary search tree
[Plot: naïve 1024-node linked list, coarse-grained locking: throughput (Mops, 0-1.2) vs. hardware threads (0-128) for FFWD, MUTEX, TICKET, TAS, STM, MCS, TTAS, CLH, HTICKET.]
[Plot: two-lock queue, throughput (Mops, 0-60) vs. hardware threads (0-128) for FFWD, TTAS, HTICKET, CC, MS, MCS, TICKET, FC, DSM, SIM, MUTEX, CLH, RCL, H, BLF.]
[Plot: stack, throughput (Mops, 0-60) vs. hardware threads (0-128) for FFWD, TTAS, HTICKET, CC, LF, MCS, TICKET, FC, DSM, SIM, MUTEX, CLH, RCL, H, BLF.]
[Plot: fetch-and-add, 1 variable: throughput (Mops, 0-60) vs. hardware threads (0-128) for FFWD, TTAS, TAS, RCL, MCS, TICKET, HTICKET, ATOMIC, MUTEX, CLH, FC. ATOMIC is the hardware-provided atomic increment instruction.]
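For reference, the ATOMIC baseline in these plots is a direct hardware fetch-and-add; a plausible minimal form of it (not the benchmark's actual code) is the GCC/Clang builtin below.

```c
#include <stdint.h>

/* Every thread updates the shared counter directly with a single atomic
   instruction (lock xadd on x86); no lock and no delegation involved. */
static inline uint64_t atomic_fetch_and_add(uint64_t *counter)
{
    return __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
}
```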
• ffwd is much faster on largely sequential data structures
  • naïve linked list, stack, queue
  • fetch-and-add, for few shared variables
• for highly concurrent data structures, ffwd falls behind when there are many locks
  • fetch-and-add, with many shared variables
  • hashtable
• for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
  • lazy linked list
  • binary search tree
[Plot: 128-thread hash table, throughput (Mops, 0-225) vs. # of buckets (1-1024) for FFWD, MUTEX, TICKET, TAS, MCS, TTAS, CLH, HTICKET. Locking takes the lead when #locks ≈ #threads.]
[Plots: fetch-and-add with 128 threads, throughput (Mops, 0-400) vs. # of shared variables / locks (1-1024), for FFWD, MUTEX, TICKET, TAS, FC, MCS, TTAS, CLH, HTICKET, RCL; the second plot adds the ATOMIC (hardware atomic increment) baseline.]
• ffwd is much faster on largely sequential data structures
  • naïve linked list, stack, queue
  • fetch-and-add, for few shared variables
• for highly concurrent data structures, once the number of locks approaches the number of threads, ffwd falls behind
  • fetch-and-add, with many shared variables
  • hashtable
• for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
  • lazy linked list
  • binary search tree
[Plot: 128-thread lazy concurrent lists, throughput (Mops, 0-60) vs. # of elements (1-16384) for FFWD-LZ, TTAS-LZ, TAS-LZ, FC-LZ, MCS-LZ, TICKET-LZ, HTICKET-LZ, RCL-LZ, MUTEX-LZ, CLH-LZ, HARRIS.]