ffwd: delegation is (much) faster than you think


  1. ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

  2. int get_seqno() {
       return ++seqno;
     }
     // ~1 billion ops/s, single-threaded

  3. int threadsafe_get_seqno() {
       acquire(lock);
       int ret = ++seqno;
       release(lock);
       return ret;
     }
     // < 10 million ops/s
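 A minimal runnable counterpart to slide 3's pseudocode, assuming a POSIX threads build; the mutex name and the pthread-based acquire/release are stand-ins for the slide's abstract lock:

   #include <pthread.h>

   static long seqno = 0;
   static pthread_mutex_t seqno_lock = PTHREAD_MUTEX_INITIALIZER;

   /* Every call bounces the lock's and the counter's cache lines
      between cores, which is what caps throughput far below the
      single-threaded case. */
   long threadsafe_get_seqno(void)
   {
       pthread_mutex_lock(&seqno_lock);
       long ret = ++seqno;
       pthread_mutex_unlock(&seqno_lock);
       return ret;
   }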

  4. [Plot: throughput (Mops, 0-10) vs. hardware threads (0-128) for MCS, MUTEX, TTAS, CLH, and TAS locks.]

  5. why so slow?

  6. ~70 ns intra-socket latency → at most ~14 Mops (one handoff per 70 ns ≈ 14 million ops/s)

  7. [Diagram: quad-socket Intel system, sockets linked by QPI.] QPI moves 150-300 M cache lines/s (10-20 gigabytes/s) with ~200 ns one-way latency; one cache-line transfer per operation gives 1 / 200 ns = 5 million ops per second.

  8. [Timeline: THREAD 1 and THREAD 2 alternate "critical section" and "wait for lock", separated by interconnect latency (400x); most of the time is spent waiting for the lock, not in the critical section.]

  9. [Timeline: THREAD 1, THREAD 2, THREAD 3 (600x): with three threads the same pattern repeats, and each thread spends even longer waiting for the lock between critical sections.]

  10. [Diagram: delegation, several client threads send their critical sections to one dedicated server thread.]

  11. [Timeline: CLIENT 1 and DEDICATED SERVER THREAD (400x): the client sends a request and waits for the response; the server waits for a request, runs the critical section, and sends the response back.]

  12. [Timeline: CLIENT 1 .. CLIENT n and DEDICATED SERVER THREAD (400x, still!): client requests overlap, so while one client waits for its response the server is already running the next client's critical section.]

  13. [Timeline: CLIENT 1 .. CLIENT n and DEDICATED SERVER THREAD (400x, still!): with enough clients, the server runs critical sections back to back with no idle time between them, while each client's wait overlaps the others'.]

  14. ffwd design (fast, fly-weight delegation)
 [Diagram: client thread groups 0..N and one dedicated server. Each client WRITEs a request to the server, spins on the shared server response, then READs its result; the server reads all requests, acts on them, and WRITEs all responses, one response line per group of 15 threads.]
 • client request (64 bytes): toggle bit, function pointer, arg count, argv[6]
 • shared server response (128 bytes): toggle bits, return values[15]
 • one dedicated 64-byte request line per client-server pair
 • each group of 15 clients shares one 128-byte response line pair
 • the server acts upon pending requests in batches
 • requests are sent synchronously
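 A C sketch of the two line formats above; the struct names, field packing, and function signature are assumptions read off the slide, not the paper's exact definitions:

   #include <stdint.h>

   struct ffwd_request {
       uint64_t toggle_and_argc;     /* toggle bit packed with arg count
                                        (exact packing assumed) */
       uint64_t (*func)(uint64_t *); /* pointer to the delegated function */
       uint64_t argv[6];             /* up to six 64-bit arguments */
   } __attribute__((aligned(64)));   /* exactly one 64-byte request line */

   struct ffwd_response {
       uint64_t toggle_bits;         /* one completion toggle bit per client */
       uint64_t ret[15];             /* one return-value slot per client */
   } __attribute__((aligned(64)));   /* 128 bytes: the response line pair
                                        shared by a group of 15 clients */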

  15. A request in more detail
 client request (64 bytes): toggle bit, function pointer, arg count, argv[6]
 shared server response (128 bytes): toggle bits, return values[15]
 • a request is new if: request toggle bit != response toggle bit
 • the server calls the function with the (64-bit) arguments provided
 • the client polls the response line until the response toggle bit matches its request toggle bit
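 Putting the protocol together, a hedged sketch of the client side of one synchronous delegated call, using the assumed structs above (ffwd itself generates this path with macros; ffwd_delegate and slot are illustrative names):

   /* 'slot' is this client's index (0..14) within its thread group. */
   uint64_t ffwd_delegate(struct ffwd_request *req,
                          struct ffwd_response *resp, int slot,
                          uint64_t (*func)(uint64_t *),
                          uint64_t argc, const uint64_t *argv)
   {
       /* Fill in the request body before publishing the toggle flip. */
       for (uint64_t i = 0; i < argc; i++)
           req->argv[i] = argv[i];
       req->func = func;

       /* Flipping the toggle bit marks the request as new: it now
          differs from our bit in the server's response line. */
       uint64_t toggle = (req->toggle_and_argc & 1) ^ 1;
       __atomic_store_n(&req->toggle_and_argc, (argc << 1) | toggle,
                        __ATOMIC_RELEASE);

       /* Poll the response line until the server flips our bit back to
          match, then read our return-value slot. */
       while (((__atomic_load_n(&resp->toggle_bits, __ATOMIC_ACQUIRE)
                >> slot) & 1) != toggle)
           ;  /* spin */
       return resp->ret[slot];
   }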

  16.–25. [Animation: the delegation server at work. The server reads thread group 0's request lines and spots pending requests (toggle bits in its local response buffer: ..111). It executes each pending request, writing the return value and flipping the matching toggle bit in the local response buffer (..110, then ..100). Once the group's batch is done, the server writes the local buffer out to the global response buffer in one step. In cache-coherence terms, the response cache lines move from modified (server-owned) to shared as clients read their results, and the request lines move between shared and modified as clients write new requests.]
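 A hedged sketch of the server loop the animation depicts, batching each group's responses in a local buffer and publishing the shared line once per pass (names and structs follow the earlier sketches; the real server is generated code):

   void ffwd_server_loop(struct ffwd_request (*req)[15], /* [group][slot] */
                         struct ffwd_response *resp,     /* one per group */
                         int ngroups)
   {
       for (;;) {
           for (int g = 0; g < ngroups; g++) {
               /* Work on a local copy so the shared response line is
                  written once per batch, not once per request. */
               struct ffwd_response local = resp[g];
               int handled = 0;
               for (int slot = 0; slot < 15; slot++) {
                   struct ffwd_request *r = &req[g][slot];
                   uint64_t word = __atomic_load_n(&r->toggle_and_argc,
                                                   __ATOMIC_ACQUIRE);
                   /* Pending iff the request toggle differs from our bit. */
                   if (((local.toggle_bits >> slot) & 1) != (word & 1)) {
                       local.ret[slot] = r->func(r->argv);
                       local.toggle_bits ^= 1ull << slot;
                       handled = 1;
                   }
               }
               if (handled)
                   resp[g] = local; /* publish the whole line pair at once */
           }
       }
   }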

  26. performance evaluation

  27. evaluation systems
 • 4 × 16-core Xeon E5-4660, Broadwell, 2.2 GHz
 • 4 × 8-core Xeon E5-4620, Sandy Bridge-EP, 2.2 GHz
 • 4 × 8-core Xeon E7-4820, Westmere-EX, 2.0 GHz
 • 4 × 8-core AMD Opteron 6378, Abu Dhabi, 2.4 GHz

  28. application benchmarks
 • same benchmarks as Lozi et al. (RCL) [USENIX ATC '12]
 • programs that spend a large % of their time in critical sections
 • except BerkeleyDB (we ran out of time)

  29. raytrace-car (SPLASH-2)
 [Plot: duration (ms, 0-600) vs. number of threads (0-128) for FFWD, MUTEX, FC, MCS, TAS, and RCL; labeled curves: mutex, ffwd, MCS, RCL.]
 RCL experienced correctness issues above 82 threads.

  30. application benchmarks
 • comparing the best performance (at any thread count) for each method
 • up to 2.5x improvement over pthreads, at any thread count
 • 10+ times speedup at the maximum thread count

  31. memcached-set
 [Plot: duration (sec, 0-300) vs. number of threads (0-128) for FFWD, MCS, MUTEX, TAS, and RCL; labeled curves: pthread, mutex.]
 RCL experienced correctness issues above 24 threads. We did not get Flat Combining to work.

  32. microbenchmarks

  33. • ffwd is much faster on largely sequential data structures
   • linked list (coarse locking), stack, queue
   • fetch-and-add, for few shared variables
 • for highly concurrent data structures, ffwd falls behind when the lock contention is low
   • fetch-and-add, with many shared variables
   • hashtable
 • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
   • lazy linked list
   • binary search tree

  34. naïve 1024-node linked list, coarse-grained locking
 [Plot: throughput (Mops, 0-1.2) vs. hardware threads (0-128) for FFWD, MUTEX, TICKET, TAS, STM, MCS, TTAS, CLH, HTICKET.]

  35. two-lock queue
 [Plot: throughput (Mops, 0-60) vs. hardware threads (0-128) for FFWD, TTAS, HTICKET, CC, MS, MCS, TICKET, FC, DSM, SIM, MUTEX, CLH, RCL, H, BLF.]

  36. stack
 [Plot: throughput (Mops, 0-60) vs. hardware threads (0-128) for FFWD, TTAS, HTICKET, CC, LF, MCS, TICKET, FC, DSM, SIM, MUTEX, CLH, RCL, H, BLF.]

  37. fetch-and-add, 1 variable
 [Plot: throughput (Mops, 0-60) vs. hardware threads (0-128) for FFWD, TTAS, TAS, RCL, MCS, TICKET, HTICKET, ATOMIC, MUTEX, CLH, FC; ATOMIC is the hardware-provided atomic increment instruction!]
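 For reference, the ATOMIC baseline in the plot is a single hardware fetch-and-add; a minimal sketch with C11 atomics (this compiles to lock xadd on x86):

   #include <stdatomic.h>

   static atomic_long seqno;

   /* Lock-free counter: one fetch-and-add instruction per call, with
      no lock cache line bouncing between cores. */
   long atomic_get_seqno(void)
   {
       return atomic_fetch_add(&seqno, 1) + 1;
   }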

  38. • ffwd is much faster on largely sequential data structures
   • naïve linked list, stack, queue
   • fetch-and-add, for few shared variables
 • for highly concurrent data structures, ffwd falls behind when there are many locks
   • fetch-and-add, with many shared variables
   • hashtable
 • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
   • lazy linked list
   • binary search tree

  39. 128-thread hash table
 [Plot: throughput (Mops, 0-225) vs. number of buckets (1-1024) for FFWD, MUTEX, TICKET, TAS, MCS, TTAS, CLH, HTICKET.]
 locking takes the lead when #locks ~ #threads

  40. fetch-and-add, 128 threads
 [Plot: throughput (Mops, 0-400) vs. number of shared variables (locks) (1-1024) for FFWD, MUTEX, TICKET, TAS, FC, MCS, TTAS, CLH, HTICKET, RCL.]

  41. fetch-and-add, 128 threads
 [Plot: as slide 40, with ATOMIC added (FFWD, TTAS, TAS, RCL, MCS, TICKET, HTICKET, ATOMIC, MUTEX, CLH, FC); an annotation marks the atomic-increment curve.]

  42. • ffwd is much faster on largely sequential data structures
   • naïve linked list, stack, queue
   • fetch-and-add, for few shared variables
 • for highly concurrent data structures, once the number of locks approaches the number of threads, ffwd falls behind
   • fetch-and-add, with many shared variables
   • hashtable
 • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
   • lazy linked list
   • binary search tree

  43. 128-thread lazy concurrent lists
 [Plot: throughput (Mops, 0-60) vs. number of elements (1-16384) for FFWD-LZ, TTAS-LZ, TAS-LZ, FC-LZ, MCS-LZ, TICKET-LZ, HTICKET-LZ, RCL-LZ, MUTEX-LZ, CLH-LZ, HARRIS.]
