
User-level Threading: Have Your Cake and Eat It Too
Martin Karsten and Saman Barghi
David R. Cheriton School of Computer Science, University of Waterloo
SIGMETRICS 2020, June 2020


  5. Motivation
     application programming paradigms
     • network service handling concurrent sessions
     event-based programming
     • explicit state management
     • asynchronous control flow → callback hell
     thread-per-session programming
     • automatic state management
     • synchronous control flow
     ⇒ performance?
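The contrast between the two paradigms on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the talk: the toy protocol (the client sends two integers on separate lines, the server replies with their sum) and the names `EventSession`, `threaded_session`, and `run_threaded` are hypothetical, and an ordinary kernel thread stands in for the per-session thread.

```python
import socket
import threading


class EventSession:
    """Event-based style: state must be stored explicitly between callbacks."""

    def __init__(self):
        self.buffer = b""      # partial input carried across callbacks
        self.first = None      # explicit state: first operand, if already seen
        self.replies = []

    def on_data(self, chunk):
        # Called by an event loop whenever some bytes arrive.
        self.buffer += chunk
        while b"\n" in self.buffer:
            line, self.buffer = self.buffer.split(b"\n", 1)
            value = int(line)
            if self.first is None:
                self.first = value
            else:
                self.replies.append(self.first + value)
                self.first = None


def threaded_session(conn):
    """Thread-per-session style: the state lives in local variables and the
    control flow is synchronous; each readline() simply blocks."""
    with conn, conn.makefile("rb") as reader:
        while True:
            a = reader.readline()
            b = reader.readline()
            if not a or not b:
                break
            conn.sendall(b"%d\n" % (int(a) + int(b)))


def run_threaded(chunks):
    """Drive one threaded session over a socket pair and collect its replies."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=threaded_session, args=(server,))
    worker.start()
    for chunk in chunks:
        client.sendall(chunk)
    client.shutdown(socket.SHUT_WR)   # signal end of input
    data = b""
    while True:
        part = client.recv(4096)
        if not part:
            break
        data += part
    worker.join()
    client.close()
    return [int(x) for x in data.split()]
```

Both versions compute the same replies for the same input. The difference is where the session state lives: in object fields that survive between callbacks, or on the stack of a thread that blocks while waiting for input.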

  8. Background
     parallel hardware → threads & synchronization
     kernel thread caveats
     • limit: typically 10Ks
     • (some) execution overhead
     • complex scheduling for fairness & control
     ⇒ user-level threads!
     • key aspect: scheduling
     • requirement: user-level I/O blocking

  11. Take Away
      user-level threads
      • similar throughput to event-based programming
      • load balancing can sometimes reduce tail latency
      kernel threads not that bad either
      • up to a limit
      Fred Runtime rules!

  12. Table of Contents
      1 Problem Statement
      2 Fred Runtime
      3 Evaluation
      4 Wrap Up

  14. Problem Statement
      minimum overhead of user-level threading?
      roadmap
      • build minimum viable user-level threading runtime
      • compare to state-of-the-art threading runtimes
      • evaluate production-grade application

  16. Approach
      [Diagram: Application on Event Handling vs. Application on Thread Runtime]
      Memcached - in-memory key/value store
      • minimum port to thread-per-session
      • fully preserved state machine
      • no structural benefits

  18. Scheduler
      • performance: simple and lightweight
      • scalability: local queueing
      • effectiveness: load sharing
      • efficiency: idle-sleep
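As a rough illustration of how local queueing and load sharing can fit together, the following Python sketch gives each processor its own ready queue and lets a processor with an empty queue take work from the next non-empty queue in a ring. It is not the Fred scheduler: the names `Processor` and `next_fred`, the choice to steal from the tail of the victim's queue, and the absence of any locking are all assumptions made to keep the example short.

```python
from collections import deque


class Processor:
    """A hypothetical processor with a local ready queue."""

    def __init__(self, name):
        self.name = name
        self.ready = deque()


def next_fred(ring, i):
    """Pick the next fred for processor i.

    Returns (fred, name_of_queue_owner), or None when every queue is empty,
    which is the point where a real runtime would put the processor to sleep.
    """
    own = ring[i]
    if own.ready:
        # Common case: serve the local queue in FIFO order.
        return own.ready.popleft(), own.name
    # Local queue is empty: walk the ring and steal from the first busy peer.
    for step in range(1, len(ring)):
        victim = ring[(i + step) % len(ring)]
        if victim.ready:
            return victim.ready.pop(), victim.name
    return None
```

A real multi-core runtime needs synchronization around each queue; the sketch is single-threaded so that only the queueing policy is visible.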

  19. Inverse Shared Ready Stack
      [Diagram: Processors 1-3, each with its own ready queue (Ready-Queue 1-3) and linked in a processor ring (for stealing); a fred passes through a benaphore with a counter and P()/V() operations; a staging queue; waiting processors form the "processor ready-stack"]
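The diagram labels a benaphore with a counter and P()/V() operations. A benaphore is commonly described as an atomic counter combined with a semaphore that is only touched when a caller actually has to wait or wake someone. The Python sketch below shows that general technique under stated assumptions: the class name `Benaphore`, the reading that a positive count means available freds and a negative count means waiting processors, and the lock used to emulate an atomic fetch-and-add are all hypothetical and not taken from the Fred sources.

```python
import threading


class Benaphore:
    """Counting semaphore with a counter-only fast path (illustrative sketch)."""

    def __init__(self):
        self._count = 0
        self._guard = threading.Lock()        # stand-in for an atomic fetch-and-add
        self._sleep = threading.Semaphore(0)  # used only when a caller must wait

    def _fetch_add(self, delta):
        # Return the old value and apply the delta in one indivisible step.
        with self._guard:
            old = self._count
            self._count += delta
            return old

    @property
    def count(self):
        with self._guard:
            return self._count

    def V(self):
        # Announce one unit of work. Wake a waiter only if one exists,
        # which is the case exactly when the old count was negative.
        if self._fetch_add(+1) < 0:
            self._sleep.release()

    def P(self):
        # Claim one unit of work. Block only if none was available.
        if self._fetch_add(-1) <= 0:
            self._sleep.acquire()
```

When work is plentiful, both operations reduce to a single counter update, and the blocking primitive is involved only on the transition to or from idle.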

  20. I/O Blocking
      automatically suspend thread during I/O wait
      essential for synchronous control flow
      suspend/resume user-level thread
      • user-level synchronization primitives
      • OS-level notifications

  21. I/O Notifications
      [Diagram: poller event loop; input: OS query of the epoll/kqueue interest set; output: freds resumed via the I/O Synchronization Vector (indexed by FD)]
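The interplay of user-level I/O blocking and a poller can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Fred implementation: generators stand in for freds, the standard `selectors` module stands in for direct epoll/kqueue calls, a dictionary keyed by file descriptor stands in for the synchronization vector, and the names `echo_upper` and `run_freds` are hypothetical.

```python
import selectors
import socket
from collections import deque


def echo_upper(sock):
    """A 'fred': synchronous-looking session code.

    `yield sock` means: suspend me until this socket is readable.
    """
    while True:
        yield sock
        data = sock.recv(4096)   # does not block: the poller saw readiness
        if not data:
            break
        sock.sendall(data.upper())
    sock.close()


def run_freds(freds):
    """Run freds until all have finished."""
    sel = selectors.DefaultSelector()   # epoll or kqueue where available
    waiting = {}                        # blocked freds, keyed by file descriptor
    ready = deque(freds)                # ready queue
    while ready or waiting:
        # Scheduler part: run every ready fred until it blocks or finishes.
        while ready:
            fred = ready.popleft()
            try:
                sock = next(fred)
            except StopIteration:
                continue
            waiting[sock.fileno()] = fred
            sel.register(sock, selectors.EVENT_READ)   # add to interest set
        # Poller part: query the OS and resume the freds whose FDs are ready.
        if waiting:
            for key, _events in sel.select():
                sel.unregister(key.fileobj)
                ready.append(waiting.pop(key.fd))
    sel.close()
```

The session code reads like blocking code, while the only place that actually blocks in the kernel is the poller's `select()` call.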

  23. Threading Benchmarks
      comparison of 9 different threading runtimes
      performance & scalability problems
      • Arachne, Mordor, µC++
      efficiency problems
      • Arachne, Boost, Qthreads
      • busy-looping scheduler
      solid results
      • Fred, Libfiber, Pthreads
      • Go: higher constant scheduling overhead

  24. Performance
      [Figure: throughput (×10^7, 32 cores) vs. duration of each work unit (µs); series: Libfiber, Qthreads, Fred, Pthread, Go, Boost, Arachne, Mordor, µC++]

  25. Efficiency
      [Figure: cost of iteration (µs) vs. core count; series: Libfiber, Pthread, Arachne, Qthreads, Go, Mordor, Fred, Boost, µC++]

  26. I/O Benchmarks
      I/O stress test for Fred, Go, Libfiber, Pthread
      compared to best-in-class event-based server
      • Libfiber breaks
      • Go and Pthread limited
      • only Fred competitive

  27. I/O Scalability
      [Figure: request throughput (×1000/sec) vs. cores; series: ULib, Fred (8 poller freds), Pthread, Go, µC++]

  29. Application Benchmarks
      only Fred competitive with original Memcached
      tail latency results from Arachne paper
      • only apply to special case: #RX queues < #cores
      • performance of Pthread for low connection count!

  30. Throughput
      [Figure: query throughput (×1000/sec) vs. cores; series: Fred, Vanilla, Pthread, Arachne, Fred (shared RQ)]

  31. Throughput - more connections
      [Figure: query throughput (×1000/sec) vs. cores; series: Fred, Vanilla, Pthread, Fred (shared RQ), Arachne]

  32. Tail Latency: Arachne Results
      [Figure: read latency (µs), 99th percentile vs. query throughput (×1000); series: Vanilla (pin/rfs), Fred (pin), Arachne, Pthread (rfs)]

  33. Tail Latency: Explanation
      original experiment: 8 RX queues for 12 cores
      head-of-line blocking?
      modified setup: 16 RX queues for 12 cores
      tail latency discrepancies largely gone...

  34. Tail Latency: Regular
      [Figure: read latency (µs), 99th percentile vs. query throughput (×1000); series: Vanilla (pin), Fred (pin), Arachne, Pthread]

  35. Tail Latency: Higher Connection Count
      1,536 → 7,680 connections
      [Figure: read latency (µs), 99th percentile vs. query throughput (×1000); series: Vanilla (pin), Fred (pin), Arachne, Pthread]

  37. Wrap Up
      • Fred: nimble user-level threading runtime
      • comprehensive performance evaluation
      • user-level threading possible at low overhead
      • scenarios with improved performance?
      • Fred currently the best reference platform
