Life lessons and datacenter performance analysis

Life lessons and datacenter performance analysis - Dan Ardelean, Amer Diwan, et al. - PowerPoint PPT presentation



  1. Life lessons and datacenter performance analysis
Dan Ardelean, Amer Diwan, Rick Hank, Christian Kurmann, Balaji Raghavan, Matt Seegmiller

  2. The need to solve performance crimes
Performance crimes are anything that unnecessarily increases:
● Latency
● Resource usage
Performance crimes...
● degrade the end-user experience
● waste valuable resources (energy, cost, etc.)
Solving performance crimes is a necessity, not a luxury. This talk shares our experiences in solving performance crimes in Gmail.
Google Confidential and Proprietary

  3. But email is such a simple app … right?
[Diagram: Your email → Delivery service → Storage system]
We have all used this... in the 1990s!

  4. Email today is a lot more...
[Diagram: Your email flows through the Delivery service into the Storage system, surrounded by services: Spam filtering, Virus detection, Web client support, Sync client support, Backups, Smart search, Mail classification, Labels/folders, Filters, Images/Attachments, Contacts, ...]
To enable sharing across applications (e.g., contacts), each component is a service.

  5. Each component is a service running in its own processes
[Diagram: Calendar and Gmail, each with its own App Logic, talking over RPC to shared Authentication, Events, Contacts, and Body services]
This provides modularity, parallelism, and reliability.

  6. Lesson 1: No RPC left behind

  7. Each user request involves O(100) RPCs
We cannot ignore the rarely slow RPCs:
● 1/100 slow RPCs affect 63% of the requests
● 1/1M slow RPCs affect 0.01% of the requests
We cannot ignore the rarely slow requests:
● a 1/1M event affects O(10M) requests daily
Blue moon... daily!
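The 1-in-100 and 1-in-1M figures follow from simple independence arithmetic. A minimal sketch, assuming each of the ~100 RPCs in a request is independently slow (an idealization, not from the talk):

```python
# Fraction of requests that contain at least one slow RPC,
# assuming independent per-RPC slowness (an assumption).
def fraction_affected(slow_rate: float, rpcs_per_request: int) -> float:
    """Probability that at least one RPC in a request is slow."""
    return 1.0 - (1.0 - slow_rate) ** rpcs_per_request

# 1-in-100 slow RPCs, 100 RPCs per request -> ~63% of requests affected
print(round(fraction_affected(0.01, 100), 2))        # 0.63
# 1-in-1M slow RPCs -> ~0.01% of requests affected
print(round(fraction_affected(1e-6, 100) * 100, 4))  # 0.01 (percent)
```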

  8. Latency of RPCs follows a complex distribution
Why: many layers of many servers; continuously varying load; countless code paths.
[Plot: fraction of RPCs vs. latency]

  9. Even for the “simplest” components, data is rarely normal
[Plot: log-scale latency distribution of a critical (but “simple”) component at Google; fraction of RPCs vs. latency, marking mean - sd, mean, and mean + sd (66% of the mass) and a long tail]
Abnormal is the new normal!
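The mean ± sd markers on a heavy-tailed distribution behave nothing like the familiar 68% rule for normal data. A small simulation, using a log-normal as a stand-in for the latency shape (the distribution choice and parameters are assumptions, not Google data):

```python
import random
import statistics

random.seed(1)
# Log-normal samples as a stand-in for a long-tailed latency distribution.
xs = [random.lognormvariate(0.0, 1.0) for _ in range(100_000)]

mean = statistics.fmean(xs)
sd = statistics.pstdev(xs)
median = statistics.median(xs)

print(mean > median)   # True: the long tail drags the mean above the median
print(mean - sd < 0)   # True: "mean - sd" is negative, a meaningless latency
within = sum(mean - sd <= x <= mean + sd for x in xs) / len(xs)
print(round(within, 2))  # ~0.91 here, not the normal distribution's 0.68
```

Statistics that assume normality (like "mean ± sd covers 68%") quietly stop being true on this data, which is the slide's point.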

  10. When is better actually better?
[Plot comparing two latency distributions: minimal overhead at every request but expensive recovery, versus constant overhead at every request but fast recovery]
A tighter distribution may be better than a better median.
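One way to see this: simulate one distribution with the better median but an occasional expensive recovery path, and one with constant overhead but a tight tail. The parameters below are illustrative, not measurements:

```python
import random
import statistics

random.seed(0)
N = 100_000
# A: minimal overhead on most requests, but 1% hit an expensive recovery path.
a = [50.0 if random.random() < 0.01 else random.lognormvariate(0.0, 0.3)
     for _ in range(N)]
# B: constant overhead on every request, but fast recovery -> tight tail.
b = [random.lognormvariate(0.2, 0.1) for _ in range(N)]

def percentile(xs, q):
    return sorted(xs)[int(q * (len(xs) - 1))]

print(statistics.median(a) < statistics.median(b))  # True: A wins the median...
print(percentile(a, 0.999) > percentile(b, 0.999))  # True: ...but B wins the tail
```

By the median, A looks "better"; at the 99.9th percentile, B is far better, and with O(100) RPCs per request the tail is what users feel.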

  11. Challenges
Normal distributions are rare at Google:
● Optimizing only for the “common case” is inadequate
Many statistical truths assume properties of the data:
● Ensure that the analysis is appropriate for the shape of the data
But... don’t invent statistics; just use them correctly!

  12. Lesson 2: Prepare for the storm

  13. Long tail also shows up in resource usage
We must reserve resources for peaks.
[Plot: CPU usage vs. time]
Two causes for peaks: (i) load; (ii) unusual events.

  14. Cause 1: load varies over time
Users are not “randomly” spread out over time zones. (Who is using Gmail here?)
Overlapping working hours between zones (e.g., North America and Europe) cause large usage peaks.

  15. Cause 2: storms!
[Plot: CPU usage vs. time, with a sudden spike. What happened here?]
We must plan for hardware and software updates.

  16. Consequences of Lessons 1 and 2
Aggregate metrics (e.g., mean latency) cannot distinguish between distributions with very different tails.
We must reason with traces for long-tail events.

  17. Challenges with traces
● Different traces may use different clocks
○ We use large amounts of data to do time alignment after the fact
● Reasoning is hard and laborious
○ Traces from a single machine may contain 100K events per second
○ We use a language based on temporal logic to reason over traces
● Need to coordinate tracing across machines
○ We use coordinated bursty tracing
Profiling tools are (mostly) OK; now we must invest in tracing.
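The coordination idea in the last bullet can be sketched simply: if every machine keys its trace on/off decision to the (loosely synchronized) wall clock, trace bursts line up across machines with no coordination messages. The period and burst length below are illustrative values, not the talk's:

```python
# Coordinated bursty tracing sketch: all machines trace during the same
# wall-clock-aligned windows. Assumes loosely synchronized clocks (e.g., NTP).
BURST_PERIOD_S = 60.0   # start a burst once a minute (illustrative)
BURST_LEN_S = 2.0       # trace for two seconds per burst (illustrative)

def tracing_enabled(now_s: float) -> bool:
    """True iff 'now' falls inside the current trace burst."""
    return (now_s % BURST_PERIOD_S) < BURST_LEN_S

print(tracing_enabled(120.5))   # True: 0.5 s into a burst
print(tracing_enabled(130.0))   # False: between bursts
```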

  18. Lesson 3: Ask nicely!

  19. Is your pattern of access reasonable?
[Cartoon: a polite client asks Gmail “my message please” and gets “here you go”; a sync client demands “all 50,000 messages NOW” and Gmail replies “Aaargh!”]
A single not-so-nice request can degrade many requests.
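"Asking nicely" often just means bounding each request. A hypothetical sketch of a sync client pulling a mailbox one bounded page at a time instead of demanding all 50,000 messages at once; the mailbox, `fetch_page`, and page size are all made up for illustration, not a Gmail API:

```python
# Toy server side: a mailbox and a paged read API (both hypothetical).
MAILBOX = [f"msg-{i}" for i in range(50_000)]

def fetch_page(page_size, token):
    """Return one bounded page plus a continuation token (None at the end)."""
    start = token or 0
    end = min(start + page_size, len(MAILBOX))
    return MAILBOX[start:end], (end if end < len(MAILBOX) else None)

# Client side: many small, polite requests instead of one huge one.
def sync_all(page_size=500):
    token, messages = None, []
    while True:
        page, token = fetch_page(page_size, token)
        messages.extend(page)
        if token is None:
            return messages

msgs = sync_all()
print(len(msgs))   # 50000 messages, fetched as 100 bounded requests
```

Each request now has a predictable cost, so one client's full sync cannot monopolize the server.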

  20. Some problems are more subtle
[Example: a Gmail search for "Chicken and Egg" is rewritten into "(Chicken OR Chickens OR ...) and Egg"]
Query rewriting can dramatically inflate a small request.

  21. Any layer can potentially overwhelm the next layer
[Diagram: one Gmail front end sends lots of small reads to the Gmail storage layer; another sends fewer but larger reads]
Requests to a layer should match its strengths.
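A minimal sketch of the "fewer but larger reads" idea: the front end coalesces individual message reads into bulk reads sized to what the storage layer handles well. The batch size and key names are made up for illustration:

```python
# Coalesce many small reads into fewer bulk reads (illustrative sketch).
def coalesce(keys, batch_size=64):
    """Yield batches of keys, each batch intended for one bulk-read RPC."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

reads = [f"msg-{i}" for i in range(200)]
batches = list(coalesce(reads))
print(len(batches))                  # 4 bulk reads instead of 200 small ones
print(sum(len(b) for b in batches))  # 200: every read is still served
```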

  22. Challenges
Our APIs capture functionality, not performance. How do we know that a method that we are calling...
● ...makes expensive RPC calls?
● ...acquires locks?
● ...performs IO?
Can we express and reason over performance in our APIs?

  23. Lesson 4: Share!

  24. Lock contention often causes long-tail latency
A burst can cause contention when there is normally none.

  25. Layer interactions may aggravate or alleviate contention
[Plots: # requests vs. time at the storage layer, and at the low-level layer that accesses disk]
An inefficient layer presents an easier request stream to the next layer.

  26. Remember Little’s law!
If processing a request holds a lock for 100 ms, we cannot process more than 10 requests per second.
Example: at 5 qps with a 0.1 sec hold, the average queue depth is 5 * 0.1 = ½.
If we want to double parallelism, we must halve the holding time.
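The arithmetic on this slide is Little's law, L = λ × W. A quick sketch of both numbers:

```python
# Little's law: average queue depth L = arrival rate (lambda) * time held (W).
def avg_queue_depth(arrival_rate_qps: float, hold_time_s: float) -> float:
    return arrival_rate_qps * hold_time_s

def max_throughput_qps(hold_time_s: float) -> float:
    """A lock held for hold_time_s per request admits at most 1/hold_time_s qps."""
    return 1.0 / hold_time_s

print(max_throughput_qps(0.1))     # 10.0: 100 ms holds cap us at 10 qps
print(avg_queue_depth(5, 0.1))     # 0.5: the slide's 5 * 0.1 = 1/2
print(avg_queue_depth(10, 0.05))   # 0.5 again: doubled load needs halved hold time
```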

  27. Lesson 5: Confront weaknesses

  28. Caches can hide the latency of “weak” components...
[Diagram: fast component → cache → slow component]
...but it is often much better to fix the root cause!

  29. Why should we care if caches reduce latency?
[Diagram: Gmail’s storage layer → cache → file system layer]
Glass half full: wow, a 98.5% cache hit rate!
Glass half empty: wow, why are we asking for the same data 98.5% of the time?
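The two readings of the hit rate are easy to make concrete. The 98.5% is from the slide; the request volume below is made up for illustration:

```python
# Two readings of a 98.5% cache hit rate (request count is illustrative).
hit_rate = 0.985
requests = 1_000_000

# Glass half full: only the misses reach the file system layer.
misses = requests * (1 - hit_rate)
print(int(round(misses)))        # 15000 reads take the slow path

# Glass half empty: the hits are repeat requests for data already served,
# which a better request pattern could largely eliminate.
redundant = requests * hit_rate
print(int(round(redundant)))     # 985000 redundant reads
```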

  30. Fix the problem, not the symptom
If a cache is performing too well, something is wrong with the requests.
Fix the request pattern → remove the cache → use fewer resources, cleaner code.
Caches are great abuse detectors!

  31. Lesson 6: Get your priorities right

  32. Priorities only matter in a crunch
[Diagram: high-prio and low-prio queues in front of a resource. Long queues: big problem! Short queues: no problem!]
Overprovisioned resources mask poor priority settings.

  33. Setting priorities is hard!
[Diagram: high-prio and low-prio request queues]
First try: put only user-facing requests in high priority.

  34. ...but that can easily backfire when there are dependencies
[Timeline: a high-prio request sits queued, waiting on a lower-priority request, before it can finally run]
Avoiding priority inversion makes for a complex priority model.

  35. Lesson 7: Question everything

  36. Suspect every layer, every component, every lock, ...
But even more so, look every gift horse in the mouth.
If data looks too good to be true, it is probably wrong!

  37. Rookie mistake: compare Monday data to Sunday data!
[Plot: requests per second vs. time]
The load varies day to day, hour to hour!

  38. Lesson 8: Real life is real life; tests are tests!

  39. Loadtest versus production
[Plots: CPU usage during the loadtest versus CPU usage in production]
The loadtest executes the same binaries... but the load is different.

  40. Why the differences?
● Real user patterns are more complex than what we can synthesize
○ Users use a variety of email clients, which affect the requests
○ Users differ in mailbox size and usage
■ e.g., Googlers are 10x more of everything compared to the “average” user
○ Users have different settings (e.g., filters) and usage styles (clean-inbox versus everything-in-inbox)
○ ...
● The loadtest attempts to model these…
○ but it cannot possibly model everything
Many performance problems must be debugged in production.

  41. Debugging in production
“If only I knew X, I could get to the bottom of this puzzle.”
Sure… but only if:
● recording X does not violate any compliance, contractual, or privacy promises to our users;
● it is carefully reviewed so that we do not introduce bugs or regressions;
● and it still goes through a careful rollout process.
The challenge is to infer what we really need from what we have.

  42. Call to action
● Trace analysis tools
○ How do we combine and analyze diverse sources of data?
○ How can we derive high-level knowledge from low-level data?
● Performance APIs
○ How can we express and check performance specifications?
● Focus on analyzing large production systems when possible
○ How can our students learn from and impact real systems?
Are you up to it?
