
  1. How to Make Decisions (Optimally) Siddhartha Sen Microsoft Research NYC

  2. AI for Systems • Vision: Infuse AI to optimize cloud infrastructure decisions, while being: • Minimally disruptive (agenda: Harvesting Randomness) • Synergistic with human solutions (agenda: HAIbrid algorithms) • Safe and reliable (agenda: Safeguards) • Impact: The criteria above differentiate us and ensure broader impact • Team: • MSR NYC, MSR India • Azure: Azure Compute, Azure Frontdoor • Universities: Columbia, NYU, Princeton, Yale, UBC, U. Chicago, Cornell

  3. Vision: Safe optimization without disruption [Diagram: a fuzzer exercises the (complex) system, an optimizer proposes alternatives, and a safeguard constrains them] • Can we evaluate alternatives without disrupting the system?

  4. Roadmap • A framework for making systematic decisions: Reinforcement Learning • A way to reason about decisions in the past: Counterfactual Evaluation • How to make this work in cloud systems? • Successes, fundamental obstacles, workarounds

  5. Decisions in the real world [Diagram: a policy maps each context to an action and receives a reward] • Which policy maximizes my total reward?

  6. Reinforcement learning (RL) [Same loop: context → policy → action → reward] • Which policy maximizes my total reward?

  7. Example: online news articles (MSN) • Context: user, browse history • Action: article on top • Reward: clicked/ignored

  8. Example: machine health (Azure cloud) • Context: machine, failure history • Action: wait time before reboot • Reward: total downtime

  9. Example: commute options • Context: weather, traffic • Action: bike, subway, car • Reward: trip time, cost

  10. Example: online dating • Context: user, dating history • Action: match • Reward: length of relationship

  11. Reinforcement learning reflects real life • Traditional (supervised) machine learning needs the answer as input: each example x comes labeled with its answer y (dog, cat, …) • The pair (x, y) gives you the full answer • Train a model: x → y

  12. Reinforcement learning reflects real life • RL interacts with the environment and learns from feedback: context x, action a, reward r • The triple (x, a, r) only gives a partial answer • Train a policy: x → a

  13. How to learn in an RL setting? • Explore to learn about new actions • Incorporate reward feedback • Do this systematically! (Humans are not good at this)
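A minimal sketch of what "explore, incorporate feedback, do it systematically" can look like in code: an epsilon-greedy contextual bandit in Python. The names (ACTIONS, choose_action, update) are illustrative, not from the talk; the propensity returned by choose_action is what later makes counterfactual evaluation possible.

    import random
    from collections import defaultdict

    ACTIONS = ["A", "B", "C"]   # hypothetical action set
    EPSILON = 0.1               # fraction of decisions spent exploring

    totals = defaultdict(float)  # running reward sums per (context, action)
    counts = defaultdict(int)

    def choose_action(context):
        # Exploit: the action with the best average reward seen so far for this context.
        greedy = max(ACTIONS,
                     key=lambda a: totals[(context, a)] / max(counts[(context, a)], 1))
        # Explore: with probability EPSILON, pick an action uniformly at random.
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy
        # Propensity of the chosen action; logged for counterfactual evaluation later.
        prob = EPSILON / len(ACTIONS) + (1 - EPSILON if action == greedy else 0.0)
        return action, prob

    def update(context, action, reward):
        # Incorporate the reward feedback for the action actually taken.
        totals[(context, action)] += reward
        counts[(context, action)] += 1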

  14. Simple example: online news articles • Policy A (Career) → Clicked • Policy B (Location) → Ignored • This is an A/B test! (Humans are bad at this.)

  15. Simple example: online news articles • Policy A (Career) → Clicked • Policy B (Location) → Ignored • Policy space: a giant table • RL: richer policy space, richer representation

  16. Aside: Deep Reinforcement Learning! • Superhuman ability in Go, Chess • Lots of engineering/tweaking • Learning from self‐play not new • Far from AI apocalypse • But (opinion): a glimpse of a more subtle, subconscious overtaking

  17. Policy space: a giant table • Policy A (e.g. Career) → Clicked • Policy B (e.g. Location) → Ignored

  18. Testing policies online is inefficient • Costly (production deployment) • Risky (live user traffic) • Slow (must split 100% of traffic across the policies under test)

  19. Testing policies online is inefficient • Problem: A/B tests randomize over entire policies • Instead: randomize directly over actions • Collect data first, then evaluate policies after the fact

  20. Test policies offline! [Figure: a log of randomized decisions, with contexts (Engineer/Teacher, Seattle/Texas, Female/Male), the articles shown, and whether each was clicked or ignored; the log is later replayed to evaluate the Career, Location, and Gender policies]
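As a concrete illustration of what gets logged, here is a hypothetical record format in Python. The field names and values are assumptions for the sketch, not a production schema: each record keeps the context the deployed policy saw, the action it randomized over, the probability (propensity) of that action, and the observed reward.

    logged_data = [
        {"context": {"occupation": "Engineer", "location": "Seattle", "gender": "Female"},
         "action": "career_article", "prob": 0.80, "reward": 1.0},   # clicked
        {"context": {"occupation": "Teacher", "location": "Texas", "gender": "Male"},
         "action": "location_article", "prob": 0.20, "reward": 0.0}, # ignored
    ]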

  21. Counterfactual evaluation (testing policies offline) • Ask “what if” questions about the past: how would this new policy have performed if I had run it? • Basic idea: use the (randomized) decisions made by a deployed policy to match and evaluate the decisions the new policy would make • Problem: the deployed policy’s decisions may be biased

  22. Counterfactual evaluation (testing policies offline) • Ask “what if” questions about the past: how would this new policy have performed if I had run it? • Basic idea: use the (randomized) decisions made by a deployed policy to match and evaluate the decisions the new policy would make • Use probabilities to over/underweight the matched decisions (see the estimator sketch below) • Test many different policies on the same dataset, offline!
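The standard way to "use probabilities to over/underweight decisions" is inverse propensity scoring (IPS). The sketch below is that textbook estimator, not code from the talk, and it assumes log records shaped like the hypothetical ones above. Because the estimator only reads the log, the same dataset can score any number of candidate policies.

    def ips_estimate(logged_data, new_policy):
        """Estimate, offline, the average reward the new policy would have earned.

        logged_data: list of {"context", "action", "prob", "reward"} records
        new_policy:  function mapping a context to an action
        """
        total = 0.0
        for rec in logged_data:
            if new_policy(rec["context"]) == rec["action"]:
                # The deployed policy happened to do what the new policy would do:
                # count that reward, up/down-weighted by how likely the match was.
                total += rec["reward"] / rec["prob"]
        return total / len(logged_data)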

  23. RL + Counterfactual Evaluation • Very powerful combination: evaluate a billion policies offline, find the best one • Exponential boost over online A/B testing • Can we apply this paradigm to cloud systems?

  24. Example: machine health (Azure Compute) • Context: machine, failure history • Action: wait time before reboot • Reward: total downtime

  25. Example: TCP config (Azure Frontdoor) • Context: OS, locale, traffic type • Action: TCP parameter settings • Reward: CPU utilization

  26. Example: replica selection (Azure LB) • Context: request, replica loads • Action: replica to handle the request • Reward: latency

  27. What if… • … we waited a different amount of time before rebooting? • … we used different TCP settings on an edge proxy machine? • … we sent a request to a different replica? • Counterfactual evaluation!

  28. Counterfactual evaluation in Systems • Opportunity: Many systems are naturally randomized • Load balancing, data replicas, cache eviction, fault handling, etc. • When we need to spread things, when choices are ambiguous → Free exploration! • Opportunity: Many systems provide implicit feedback • Naïve defaults, conservative parameter settings • Worse settings yield more information → Free feedback!
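A sketch of why natural randomization is "free exploration": a load balancer that already picks replicas uniformly at random has a known propensity for every choice, so making the choice usable for counterfactual evaluation costs one extra logged field. The function and field names here are hypothetical.

    import random

    def pick_replica(request_ctx, replica_names):
        # The system already spreads load uniformly at random, so exploration
        # comes for free and every choice has a known propensity.
        choice = random.choice(replica_names)
        propensity = 1.0 / len(replica_names)
        # Logging (context, action, propensity), plus the eventual latency, is all
        # that offline evaluation of alternative selection policies needs.
        return choice, {"context": request_ctx, "action": choice, "prob": propensity}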

  29. Counterfactual evaluation in Systems • Challenge: mess of methods/techniques spanning multiple disciplines → Technique: a taxonomy • Challenge: huge action spaces (coverage) → Technique: spatial coarsening • Challenge: stateful, non-independent decisions → Technique: temporal coarsening, time horizons • Challenge: dynamic environments → Technique: baseline normalization

  30. Taxonomy for counterfactual evaluation [decision tree] • Full feedback → supervised learning • Partial feedback → was there randomization? • No randomization → direct method; randomize/explore going forward • Randomization + independent decisions → reinforcement learning (contextual bandits): unbiased estimator (DR) • Randomization + non-independent decisions → reinforcement learning (general): unbiased estimator + time horizon (DR-Time)
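For reference, here is a sketch of the doubly robust (DR) estimator named in the taxonomy, in its standard contextual-bandit form; this is the textbook formula rather than code from the talk. It assumes log records shaped like the hypothetical ones sketched earlier, and reward_model stands for any learned predictor of reward given (context, action).

    def dr_estimate(logged_data, new_policy, reward_model):
        """Doubly robust estimate of the new policy's average reward."""
        total = 0.0
        for rec in logged_data:
            target = new_policy(rec["context"])
            # Direct-method part: model-predicted reward for the new policy's action.
            value = reward_model(rec["context"], target)
            if target == rec["action"]:
                # IPS-style correction using the observed reward wherever the
                # logged action matches the new policy's choice.
                value += (rec["reward"] - reward_model(rec["context"], rec["action"])) / rec["prob"]
            total += value
        return total / len(logged_data)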

  31. Example: Machine health in Azure Compute • Wait for some time, then reboot …

  32. Example: Machine health in Azure Compute • Wait for some time, then reboot • Wait for {1,2,…,10 min} Spatial coarsening …
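A tiny sketch of the spatial coarsening step, under the assumption that the raw decision is a continuous wait time: collapsing it into ten discrete choices keeps the action space small enough for the logged data to cover it.

    WAIT_CHOICES_MIN = range(1, 11)   # coarsened action space: 1..10 minutes

    def coarsen_wait(wait_seconds):
        # Map any continuous wait time onto the nearest of the ten discrete actions.
        minutes = round(wait_seconds / 60)
        return min(max(minutes, 1), 10)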

  33. Example: Machine health in Azure Compute • Logged data (decision, action, [-]reward): • Machine A: wait 10 min → 5 min • Machine B: wait 10 min → 3 min • Machine C: wait 10 min → 10 min + reboot • …

  34. Example: Machine health in Azure Compute • Logged data (decision, action, [-]reward, feedback): • Machine A: wait 10 min → 5 min; feedback for wait 1,2,…,9 • Machine B: wait 10 min → 3 min; feedback for wait 1,2,…,9 • Machine C: wait 10 min → 10 min + reboot; feedback for wait 1,2,…,9 • …

  35. Example: Machine health in Azure Compute • Logged data (decision, action, [-]reward, feedback): • Machine A: wait 6 min → 5 min; feedback for wait 1,2,…,9 • Machine B: wait 2 min → 3 min; feedback for wait 1,2,…,9 • Machine C: wait 10 min → 10 min + reboot; feedback for wait 1,2,…,9 • …

  36. Example: Machine health in Azure Compute • Logged data (decision, action, [-]reward, feedback): • Machine A: wait 6 min → 5 min; feedback for wait 1,2,…,9 • Machine B: wait 2 min → 2 min + reboot; feedback for wait 1 only • Machine C: wait 10 min → 10 min + reboot; feedback for wait 1,2,…,9 • … • Implicit feedback
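A sketch of where the implicit feedback in the last column comes from, with an assumed fixed reboot cost: one observed outcome pins down the counterfactual downtime for many of the other wait choices (all of them if the machine recovered on its own, only the shorter waits if it had to be rebooted). The names and the reboot cost are illustrative.

    REBOOT_COST_MIN = 5.0   # assumed extra downtime (minutes) added by a forced reboot

    def implicit_feedback(chosen_wait, recovered_at):
        """Counterfactual downtime for each wait choice, inferred from one observation.

        chosen_wait  -- minutes the deployed policy actually waited (1..10)
        recovered_at -- minutes until the machine came back on its own,
                        or None if it was still down when rebooted
        """
        feedback = {}
        for wait in range(1, 11):
            if recovered_at is not None and wait >= recovered_at:
                # Waiting at least this long would have seen the recovery.
                feedback[wait] = recovered_at
            elif recovered_at is not None or wait <= chosen_wait:
                # The machine was known to still be down at this point,
                # so this wait would have ended in a reboot.
                feedback[wait] = wait + REBOOT_COST_MIN
            # Otherwise: we rebooted before this point, so the outcome is unknown.
        return feedback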

  37. Results: Machine health in Azure Compute [Chart comparing the DR estimator with DR + implicit feedback]

  38. Results: Machine health in Azure Compute

  39. Example: TCP config in Azure Frontdoor • TCP parameters: initial cwnd, initial RTO, min RTO, max SYN retransmits, delayed ACK frequency, delayed ACK timeout [Diagram: clients in Mumbai, India and Atlanta, USA send requests over the WAN to edge proxy clusters, which forward them to Service 1 and Service 2 endpoints in the cloud datacenter]

  40. Example: TCP config in Azure Frontdoor • TCP parameters: initial cwnd, initial RTO, min RTO, max SYN retransmits, delayed ACK frequency, delayed ACK timeout • Pick from 17 different configurations, per hour per machine → spatial/temporal coarsening [Same edge proxy diagram as above]

  41. Example: TCP config in Azure Frontdoor • Dynamic workload and environment • Assign a “control” machine to each RL machine as a baseline, report the delta [Same edge proxy diagram as above]

  42. Example: TCP config in Azure Frontdoor • Dynamic workload and environment • Assign a “control” machine to each RL machine as a baseline, report the delta → baseline normalization
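A minimal sketch of baseline normalization as described on the slide: pair each RL machine with a control machine seeing comparable traffic and feed the estimator the delta, so that shifts in workload or environment cancel out. The names and the metric are illustrative, not from the deployment.

    def normalized_reward(rl_metric, control_metric):
        # Report the delta against the paired control machine so that changes in
        # traffic mix or environment do not masquerade as policy effects.
        return rl_metric - control_metric

    # e.g. per hour, per pair: reward = normalized_reward(rl_cpu_util, control_cpu_util)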

  43. Results: TCP config in Azure Frontdoor • Estimate / Reward / Error: • Ground truth: 0.713 / – • DR: 0.720 (0.637, 0.796) / 0.97%

  44. Lesson: Unbiased estimator vs. biased policy [Chart of estimates across configurations 1 through 17; configuration 2 is the default]
