Work Stealing for Interactive Services to Meet Target Latency
Jing Li*, Kunal Agrawal*, Sameh Elnikety†, Yuxiong He†, I-Ting Angelina Lee*, Chenyang Lu*, Kathryn S. McKinley†
*Washington University in St. Louis  †Microsoft Research
* This work was initiated and partly done during Jing Li's internship at Microsoft Research in summer 2014.
Interactive services must meet a target latency
- Interactive services: search, ads, games, finance. Users demand responsiveness.
- Problem setting: multiple requests arrive over time; each request is parallelizable; latency = completion time − arrival time; each request's latency should be below a target latency T.
- Goal: maximize the number of requests that meet the target latency T.
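In symbols (notation introduced here, not on the slides): if request i arrives at time a_i and completes at time c_i, the scheduler's objective is

```latex
% Scheduling objective; a_i, c_i, T as defined above.
\mathrm{latency}_i = c_i - a_i,
\qquad
\text{maximize}\quad \bigl|\{\, i \;:\; c_i - a_i \le T \,\}\bigr|
```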
Latency in Internet search
- In industrial interactive services, thousands of servers together serve a single user query.
- End-to-end latency ≥ latency of the slowest server; end-to-end response time must stay around 100 ms for the user to perceive the service as responsive.
[Pipeline diagram: parsing a search query → many parallel doc lookup & ranking servers → result aggregation & snippet generation; the target latency applies to each stage.]
Goal: meet the target latency in a single server
- Design a scheduler that maximizes the number of requests completed within the target latency on a single server.
[Same pipeline diagram, highlighting a single doc lookup & ranking server.]
Sequential execution is insufficient
- A large request must execute in parallel to meet the target latency.
[Figure: a request whose sequential execution time (work, in ms) exceeds the target latency.]
Full parallelism does not always work well
Target latency: 90 ms. A large request has 270 ms of work (arriving at time 0, it must finish by time 90); each small request has 60 ms of work (arriving at time 20, it must finish by time 110).
Case 1: 1 large request + 3 small requests.
[Gantt charts on 3 cores: running everything fully in parallel, the large request finishes by 90, but the small requests wait behind it and two finish after their deadline of 110 (at about 130 and 150): 2 misses. Serializing the large request on one core, the small requests all finish by 110 and only the large request, finishing at 270, misses: 1 miss.]
Some large requests require parallelism
Same target latency (90 ms) and request sizes.
Case 2: 1 large request + 1 small request.
[Gantt charts: with enough idle cores, running the large request in parallel lets both requests meet their deadlines: 0 misses. Serializing the large request makes it finish at time 270 and miss: 1 miss.]
Strategy: adapt scheduling to load
- High load (Case 1): we cannot afford to run all large requests in parallel → run large requests sequentially (1 miss instead of 2).
- Low load (Case 2): we do need to run some large requests in parallel → run all requests in parallel (0 misses).
Why does the adaptive strategy work?
Latency = processing time + waiting time.
- At low load, processing time dominates latency: parallel execution reduces a request's processing time, so all requests run in parallel.
- At high load, waiting time dominates latency: executing a large request in parallel increases the waiting time of many later-arriving requests, so each large request that is sacrificed (serialized) reduces the waiting time of many more later-arriving requests.
Challenge: which request to sacrifice?
Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially.
- Challenge 1 (non-clairvoyant): we do not know the work of a request when it arrives.
- Challenge 2 (no fixed definition of a large request): "large" is relative to the instantaneous load. At load 10, a large request is one with >180 ms of work; at load 20, >80 ms; at load 30, >20 ms.
Contributions: the tail-control scheduler
- Input: target latency T; request work distribution (available in highly engineered interactive services); request arrival rate (requests per second, RPS).
- Tail-control offline threshold calculation: computes a large-request threshold for each load value, producing a large-request threshold table (see the sketch below).
- Tail-control online runtime: uses the threshold table to decide which requests to serialize.
- We modify work stealing in Intel Threading Building Blocks (TBB) to implement tail-control scheduling; implementation details are in the paper.
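To make the threshold table concrete, here is a hypothetical C++ representation (the names and the fallback convention are mine; the table values come from the load examples earlier, and the paper's offline algorithm that actually computes the table is not shown in these slides):

```cpp
// Hypothetical large-request threshold table: maps instantaneous load
// (number of active requests) to the work threshold, in ms, beyond
// which a request is treated as "large" and serialized.
#include <map>

using Millis = double;

std::map<int, Millis> threshold_table = {
    {10, 180.0},  // light load: only very large requests are serialized
    {20,  80.0},
    {30,  20.0},  // heavy load: even modestly sized requests are serialized
};

// Look up the threshold for the current load, falling back to the
// nearest smaller entry (a simple convention assumed here).
Millis threshold_for(int load) {
    auto it = threshold_table.upper_bound(load);  // first key > load
    if (it == threshold_table.begin()) return it->second;
    return std::prev(it)->second;                 // largest key <= load
}
```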
Tail-control online runtime
Input: the threshold table produced by the offline threshold calculation.
Runtime functionality (see the sketch below):
- Execute all requests in parallel to begin with.
- Record the total computation time spent on each request so far.
- Detect large requests by comparing each request's processing time against the threshold for the current load.
- Serialize large requests to limit their impact on other waiting requests.
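A minimal sketch of the detection-and-serialization step just described, assuming the threshold_for() lookup from the previous sketch (all names are mine, not the paper's):

```cpp
// Sketch of the tail-control runtime check; not the paper's actual code.
#include <atomic>

struct Request {
    std::atomic<double> processed_ms{0.0};  // computation time spent so far
    std::atomic<bool>   serialized{false};  // once set, no more stealing
};

// Called periodically (e.g., at steal attempts or scheduling quanta).
// 'active_load' is the current number of admitted, unfinished requests.
void check_large(Request& r, int active_load) {
    // A request is "large" once its processing time so far exceeds
    // the threshold for the current load.
    if (!r.serialized.load(std::memory_order_relaxed) &&
        r.processed_ms.load(std::memory_order_relaxed) >
            threshold_for(active_load)) {
        // Serialize it: workers stop stealing its tasks, so it keeps
        // only one core and stops delaying waiting requests.
        r.serialized.store(true, std::memory_order_relaxed);
    }
}
```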
Work stealing for a single request
- Each worker has a local queue: execute work from the local queue if there is any; otherwise steal, to parallelize the request.
[Diagram: workers 1-3; worker 1 executes request A; workers 2 and 3 steal to parallelize A.]
Generalize work stealing to multiple requests
- Workers' local queues plus a global queue: execute work from the local queue if there is any; steal, to further parallelize a running request; admit, to start executing a new request from the global queue.
[Diagram: parallelizable requests B and C arrive at the global queue while worker 1 executes A and the others steal.]
Implement tail-control in TBB
We extend this work-stealing runtime with three policies, sketched below:
- Steal-first: try to parallelize running requests (reduce processing time).
- Admit-first: try to start waiting requests (reduce waiting time).
- Tail-control: steal-first plus large-request detection and serialization.
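A sketch of the three policies as a worker loop. The structure and the helper names (pop_local, try_steal, try_admit, and so on) are assumptions for illustration, stubbed out so the sketch compiles; TBB's real internals differ:

```cpp
// Worker-loop sketch for the three scheduling policies; illustrative only.
struct Task {};

// Stubs standing in for the real scheduler hooks:
Task* pop_local()                     { return nullptr; } // worker's local queue
void  execute(Task*)                  {}
bool  try_steal()                     { return false; }   // parallelize a running request
bool  try_steal_from_non_serialized() { return false; }   // like try_steal, but skip serialized requests
bool  try_admit()                     { return false; }   // admit a request from the global queue
bool  shutting_down()                 { return true;  }

enum class Policy { StealFirst, AdmitFirst, TailControl };

void worker_loop(Policy policy) {
    while (!shutting_down()) {
        if (Task* t = pop_local()) { execute(t); continue; }  // local work first
        if (policy == Policy::StealFirst) {
            // Further parallelize running requests (reduce processing time),
            // falling back to admitting a new request.
            if (!try_steal()) try_admit();
        } else if (policy == Policy::AdmitFirst) {
            // Start waiting requests first (reduce waiting time).
            if (!try_admit()) try_steal();
        } else {  // Policy::TailControl
            // Steal-first, but never steal tasks of a request the runtime
            // has marked serialized (see the detection sketch above).
            if (!try_steal_from_non_serialized()) try_admit();
        }
    }
}
```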
Evaluation
- Various request work distributions: Bing search, finance server, log-normal.
- Different request arrival processes: Poisson, log-normal.
- Each setting: 100,000 requests; we plot the target-latency miss ratio.
- Two baselines (generalized from work stealing for a single job): steal-first, which tries to parallelize requests and reduce processing time; admit-first, which tries to admit requests and reduce waiting time.
Improvement in target-latency miss ratio
[Plots: miss ratio across target latencies, from hard to easy to meet (i.e., relative load from high to low). Admit-first wins when the target is hard to meet; steal-first wins when it is easy; tail-control improves on both.]
The inner workings of tail-control
Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.
[Latency distribution plots with the target latency marked.]
Tail-control performs well with inaccurate input
A slightly inaccurate input work distribution is still useful.
[Plot: miss ratio as the input work distribution goes from less to more inaccurate.]