Measuring and Optimizing Tail Latency
Kathryn S McKinley, Google
CRA-W Undergraduate Town Hall, April 5, 2018
Speaker & Moderator
Lori Pollock: Dr. Lori Pollock is a Professor in Computer and Information Sciences at the University of Delaware. Her current research focuses on program analysis for building better software maintenance tools, software testing, energy-efficient software, and computer science education. Dr. Pollock is an ACM Distinguished Scientist and was awarded the University of Delaware's Excellence in Teaching Award and the E.A. Trabant Award for Women's Equity.
Kathryn S McKinley: Dr. Kathryn S. McKinley is a Senior Research Scientist at Google; she previously was a Researcher at Microsoft and held an Endowed Professorship at The University of Texas at Austin. Her research spans programming languages, compilers, runtime systems, architecture, performance, and energy. She and her collaborators have produced several widely used tools: the DaCapo Java Benchmarks (30,000+ downloads), the TRIPS Compiler, the Hoard memory manager, the MMTk memory management toolkit, and the Immix garbage collector. She served as program chair for ASPLOS, PACT, PLDI, ISMM, and CGO. She is currently a CRA and CRA-W Board member. Dr. McKinley was honored to testify to the House Science Committee (Feb. 14, 2013). She is an IEEE and ACM Fellow. She has graduated 22 PhD students.
Measuring and Optimizing Tail Latency Kathryn S McKinley, Google Xi Yang, Stephen M Blackburn, Md Haque, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
Tail Latency Matters: TOP PRIORITY
• A 400 millisecond delay decreased searches/user by 0.59%. [Jack Brutlag, Google]
• A two second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
Photo: Google/Connie Zhou
Datacenter economics quick facts*
• ~$500,000: cost of a small datacenter
• ~3,000,000: US datacenters in 2016
• ~$1.5 trillion: US capital investment to date
• ~$3,000,000,000: kW dollars / year
• ~$30,000,000: savings from 1% less work
• Lots more by not building a datacenter
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
TOP PRIORITY: Tail Latency vs. Efficiency
BOTH?! Tail Latency & Efficiency
Server architecture: client → aggregator → workers
Characteristics of interactive services
• Bursty, diurnal load
• CDF changes slowly
• Slowest server dictates the tail
• Orders of magnitude difference between average & tail (99th percentile)
[Figure: latency CDF, percentage of requests vs. latency (ms)]
What is in the tail?
[Figure: latency CDF, percentage of requests vs. latency (ms), with the tail region marked "?"]
Cycle-level on-line profiling tool [ISCA'15 (Top Picks HM), ATC'16]
Insight: Hardware & software generate signals without instrumentation: performance counters, tags, memory locations.
SHIM observes from the sibling hyperthread: HT1 IPC = Core IPC − HT2 SHIM IPC
[Figure: IPC traces for HT1 (application) and HT2 (SHIM observer)]
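The observation style above can be made concrete with a toy sketch in Python (all names hypothetical; real SHIM reads hardware IPC counters and software tags from a sibling hyperthread, while here the "signal" is just a shared variable the worker updates as a side effect):

```python
import threading
import time

class ShimSampler:
    """Toy SHIM-style observer: polls shared state left behind by the
    worker thread instead of instrumenting the worker itself."""
    def __init__(self, shared):
        self.shared = shared        # dict the worker updates as a side effect
        self.samples = []           # (timestamp_ns, tag) pairs
        self.stop = threading.Event()

    def run(self):
        while not self.stop.is_set():
            # Read the worker's "tag" (e.g. current request id) plus a
            # timestamp; a real SHIM reads IPC and memory counters here.
            self.samples.append((time.perf_counter_ns(),
                                 self.shared.get("tag")))

shared = {"tag": None}
sampler = ShimSampler(shared)
t = threading.Thread(target=sampler.run)
t.start()
for req in range(3):                # worker: tags its work, no profiling calls
    shared["tag"] = req
    time.sleep(0.01)
sampler.stop.set()
t.join()
print(len(sampler.samples), "samples")
```

The worker pays only for a plain store per request; the cost of observing lives entirely on the sampler's thread, which is the property the tool exploits.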
What is in the tail?
[Figure: latency CDF, percentage of requests vs. latency (ms), with the tail region marked "?"]
The Tail: longest 200 requests
• Noise: network & OS imperfections (network and networking queueing time, idle time)
• Not noise: long requests (CPU time)
• Overload: queuing at the worker (dispatch queueing time)
[Figure: latency breakdown (ms) of the top 200 requests into network, idle, CPU, and dispatch queueing time]
Optimizing the tail: diagnosing the tail with continuous profiling
• Noise: systems are not perfect
• Queuing: too much load is bad, but so is over-provisioning
• Work: many requests are long
Insights
• Use the CDF off line
• Long requests reveal themselves; treat them specially
Insight: Long requests reveal themselves, regardless of the cause.
Noise: Replicate & reissue [The Tail at Scale, Dean & Barroso, CACM'13]
• All requests? Use the CDF for cost & potential.
• Fixed issue time: 5% reissued, 10% reissued
[Figure: latency CDF with the noise region marked and the 5% / 10% reissue deadlines]
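As a sketch, a fixed-issue-time policy can be simulated over a synthetic latency distribution (all numbers hypothetical): read the deadline d off the empirical CDF so that only the chosen budget of requests is ever duplicated, then keep whichever copy finishes first.

```python
import random

def reissue_latency(latencies, budget=0.05):
    """Fixed-issue-time sketch: pick deadline d at the (1 - budget)
    quantile of the latency CDF; any request still running at d gets a
    duplicate, and we take the first response to arrive."""
    d = sorted(latencies)[int(len(latencies) * (1 - budget))]
    out = []
    for lat in latencies:
        if lat <= d:
            out.append(lat)                     # finished before the deadline
        else:
            dup = d + random.choice(latencies)  # duplicate starts at time d
            out.append(min(lat, dup))
    return out

random.seed(0)
# Hypothetical workload: mostly fast, with 5% noisy stragglers.
lats = [random.uniform(1, 4) if random.random() > 0.05 else random.uniform(20, 100)
        for _ in range(10_000)]
fixed = reissue_latency(lats, budget=0.05)
p99 = lambda xs: sorted(xs)[int(len(xs) * 0.99)]
print("p99 before:", p99(lats), "after:", p99(fixed))
```

Because the duplicate is usually drawn from the fast bulk of the distribution, a small reissue budget collapses most of the noise-induced tail.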
Probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17]
• Adding randomness to reissue makes one earlier reissue time d (vs. n times) optimal
• Reissue 1-3% of requests with probability p
• Probability is proportional to the reissue budget & the noise in the tail
[Figure: latency CDF with the noise region marked and the 5% reissue deadline]
Single-R probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17]
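Under the same synthetic model, the single probabilistic reissue time can be sketched as follows (the policy shape follows the SPAA'17 idea, but the distribution and constants here are made up): spend the same expected budget earlier, reissuing at deadline d with probability p.

```python
import random

def prob_reissue(latencies, d, p):
    """Single-R sketch: at deadline d, reissue with probability p.
    Expected reissue cost = p * P(latency > d)."""
    out = []
    for lat in latencies:
        if lat > d and random.random() < p:
            out.append(min(lat, d + random.choice(latencies)))
        else:
            out.append(lat)
    return out

random.seed(1)
lats = [random.uniform(1, 4) if random.random() > 0.05 else random.uniform(20, 100)
        for _ in range(10_000)]
# Same 10% expected budget as a fixed policy, but spent earlier:
d = sorted(lats)[int(len(lats) * 0.80)]   # earlier deadline (80th percentile)
p = 0.10 / (1 - 0.80)                     # so p * P(lat > d) = 10% budget
improved = prob_reissue(lats, d, p)
print("mean before:", sum(lats) / len(lats),
      "after:", sum(improved) / len(improved))
```

Reissuing earlier gives stragglers' duplicates more time to finish before the tail percentile, which is why one randomized reissue time can beat several fixed ones at equal cost.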
Work: Speed up the tail efficiently
• Judicious parallelism [ASPLOS'15]
• DVFS faster on the tail [DISC'14, MICRO'17]
• Asymmetric multicore [DISC'14, MICRO'17]
[Figure: latency CDF with the work region of the tail marked]
Work: Parallelism
• Parallelism historically for throughput
• Idea: Parallelism for tail latency
Queuing theory
• Optimizing average latency maximizes throughput, but not the tail!
• Shortening the tail reduces queuing latency
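A toy single-server simulation (Lindley's recursion; the service and arrival distributions are made up) illustrates the second point: capping the service-time tail, with arrivals held fixed, cuts the queuing delay that every later request sees.

```python
import random

def mean_wait(service_times, mean_interarrival):
    """FIFO single-server queue via Lindley's recursion:
    W[n+1] = max(0, W[n] + S[n] - A[n])."""
    random.seed(42)                 # identical arrival sequence for both runs
    w, total = 0.0, 0.0
    for s in service_times:
        total += w
        a = random.expovariate(1.0 / mean_interarrival)
        w = max(0.0, w + s - a)
    return total / len(service_times)

random.seed(7)
# Mostly unit work, but 5% of requests are 20x stragglers.
heavy = [1.0 if random.random() > 0.05 else 20.0 for _ in range(50_000)]
short = [min(s, 4.0) for s in heavy]   # tail shortened, e.g. by reissue
                                       # or by adding parallelism
print("mean wait, heavy tail:", mean_wait(heavy, 2.0))
print("mean wait, short tail:", mean_wait(short, 2.0))
```

The stragglers are rare, but while one occupies the server every arrival behind it queues, so trimming the tail helps requests that were never slow themselves.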
Parallelism
• Parallelism historically for throughput
• Idea: Parallelism for tail latency
• Insight: Long requests reveal themselves
• Approach: Incrementally add parallelism to long requests (the tail) based on request progress & load
Few-to-Many
• Fixed: add a thread every d ms (sequential up to 4-way)
• Dynamic: use load
[Figure: tail latency (ms) vs. Lucene RPS for fixed intervals of 20, 100, and 500 ms: a short delay is good at low load, a long delay is good at high load, and dynamic is best at all loads]
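The interval-based promotion policy can be sketched as a small controller (the intervals, thread counts, and load threshold below are all hypothetical; the real system derives its schedule from profiled demand and measured load):

```python
def threads_at(elapsed_ms, load_rps):
    """Few-to-Many sketch: every request starts sequential; threads are
    added only once a request has run long enough to reveal itself as
    part of the tail. Under high load, promotion is delayed so that
    short requests are not starved of cores."""
    # Hypothetical schedule: (interval between promotions, max threads).
    interval_ms, max_threads = (20, 4) if load_rps < 40 else (100, 2)
    return min(1 + int(elapsed_ms // interval_ms), max_threads)

# Low load: promote aggressively.
assert threads_at(0, 30) == 1     # every request starts sequential
assert threads_at(45, 30) == 3    # a long request has gained two threads
# High load: be conservative with extra cores.
assert threads_at(45, 50) == 1
assert threads_at(250, 50) == 2
```

Short requests finish before their first promotion and pay nothing; only the revealed tail spends the extra cores.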
Evaluation: 2×8-core 64-bit 2.3 GHz Xeon, 64 GB
• Dynamic parallelism (Few-to-Many vs. Sequential): 21% fewer servers, or reduce the tail by 28%
[Figure: tail latency (ms) vs. requests per second for Sequential and Few-to-Many]
Work: Speed up the tail efficiently
• Judicious parallelism [ASPLOS'15] ✔
[Figure: latency CDF with the work region of the tail marked]
BOTH! Tail Latency & Efficiency
Efficiency at scale for interactive workloads
• Diagnosing the tail with continuous profiling
• Noise: systems are not perfect; replication + judicious choice
• Queuing & Work: judicious use of resources on long requests
• The request latency CDF is a powerful tool
• Tail efficiency ≠ average or throughput
• Hardware heterogeneity
Questions?
Professional and Research Relationships
Your Academic Village
• Peer students
• Students senior & junior to you
• Teaching assistants
• PhD students
• Faculty
My Professional Village
• Researchers in all career stages
  – Undergrads, PhD students, post docs
  – Faculty, industrial researchers, staff, administrators
• Industrial village
  – Software engineers in all career stages
  – Managers, directors, admins
  – In/out of my management chain
Faculty Mentors
• Don Johnson, My Professor
• Ken Kennedy, PhD Advisor
• Dave Stemple, Dept. Chair
Building a Village
Networking is….
• Building and sustaining professional relationships
• Participating in an academic / research community
• Finding people you like and learn from, and building a relationship
Networking is not ….
• Using people
• A substitute for quality work
But I am Horrible at Small Talk
• You have CS in common
• Networking is not genetic
• It is a research skill
  – Practice
  – Meet people
  – Learn
  – Go places
  – Volunteer!
  – Sustain your relationships
With whom do you network?
• People you like
• People senior to you, who can show you the way
• People at different career stages, so you can anticipate
• Your peers
Peer Mentors Mary Hall Doug Burger Margaret Martonosi
Your Village Will
• Write letters for grad school, jobs, etc.
• Help you solve problems
• Point you in good directions
• Encourage you
• Choose you for important roles
• You will do the same or more for them
• Make your life and work more fun and meaningful