Life lessons and datacenter performance analysis
Dan Ardelean, Amer Diwan, Rick Hank, Christian Kurmann, Balaji Raghavan, Matt Seegmiller
The need to solve performance crimes
Performance crimes are anything that unnecessarily increases
● Latency
● Resource usage
Performance crimes...
● degrade end user experience
● waste valuable resources (energy, cost, etc.)
Solving performance crimes is a necessity, not a luxury
This talk shares our experiences in solving performance crimes in Gmail
Google Confidential and Proprietary
But email is such a simple app … right?
[Diagram labels: Your email, Delivery service, Storage system]
We have all used this...in the 1990s!
Email today is a lot more...
[Diagram labels: Your email, Delivery service, Storage system, Spam filtering, Virus detection, Web client support, Sync client support, Backups, Smart search, Mail classification, Labels/folder, Filters, Images/Attachments, Contacts, ...]
To enable sharing across applications (e.g., contacts), each component is a service
Each component is a service running in its own processes
[Diagram labels: Calendar, Gmail, RPC, App Logic, Authentication, App Logic, Events, Contacts, Body]
Provides modularity, parallelism, reliability
Lesson 1: No RPC left behind
Each user request involves O(100) RPCs
We cannot ignore the rarely slow RPCs
● 1/100 slow RPC affects 63% of the requests
● 1/1M slow RPC affects 0.01% of the requests
We cannot ignore the rarely slow requests
● 1/1M event affects O(10M) requests daily
Blue moon...daily!
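The two percentages above follow from a short calculation. The sketch below assumes that slow RPCs occur independently and that every request issues exactly 100 RPCs; both are simplifications, and the function name is illustrative.

```python
def fraction_of_requests_affected(p_slow_rpc: float, rpcs_per_request: int = 100) -> float:
    """Probability that at least one RPC in a request is slow.

    Assumes each RPC is slow independently with probability p_slow_rpc.
    """
    return 1.0 - (1.0 - p_slow_rpc) ** rpcs_per_request


for p in (1 / 100, 1 / 1_000_000):
    affected = fraction_of_requests_affected(p)
    print(f"slow RPC rate {p:.6%}: {affected:.2%} of requests affected")
```

With a 1/100 slow-RPC rate this gives roughly 63%, and with a 1/1M rate roughly 0.01%, matching the slide.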
Latency of RPCs follows a complex distribution
Many layers of many servers; continuously varying load; countless code paths
[Plot: Fraction of RPCs vs. Latency]
Even for “simplest” components data is rarely normal
[Plot: log-scale latency distribution of a critical (but “simple”) component at Google; axes: Fraction of RPCs vs. Latency; annotations: Mean - sd, Mean, Mean + sd, 66%, Long tail]
Abnormal is the new normal!
When is better actually better?
[Plot: Fraction of RPCs vs. Latency for two designs]
● Minimal overhead at every request, but expensive recovery
● Constant overhead at every request, but fast recovery
A tighter distribution may be better than a better median
Challenges
Normal distributions are rare at Google
● Optimizing only for the “common case” is inadequate
Many statistical truths assume properties of the data
● Ensure that analysis is appropriate for the shape of the data
But...don’t invent statistics; just use it correctly!
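A small illustration of why the shape of the data matters. The latencies below are made up for this sketch (not measurements from the talk), and the nearest-rank `percentile` helper is one of several common definitions. On long-tailed data, "mean - sd" can even be negative, which is meaningless for a latency, while percentiles stay interpretable.

```python
import math
from statistics import mean, pstdev

# Hypothetical long-tailed sample: mostly fast, a few slower, one very slow.
latencies_ms = [10] * 90 + [20] * 9 + [1000]


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


mean_ms = mean(latencies_ms)
sd_ms = pstdev(latencies_ms)
print(f"mean={mean_ms:.1f} ms, mean - sd={mean_ms - sd_ms:.1f} ms, mean + sd={mean_ms + sd_ms:.1f} ms")
for p in (50, 99, 100):
    print(f"p{p}={percentile(latencies_ms, p)} ms")
```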
Lesson 2: Prepare for the storm
Long tail also shows up in resource usage
Must reserve resources for peaks
[Plot: CPU usage vs. Time]
Two causes for peaks: (i) load; (ii) unusual events
Cause 1: load varies over time
Users are not “randomly” spread out over time zones
[Figure callout: Who is using Gmail here?]
Overlapping working hours between zones (e.g., NA and European) cause large usage peaks
Cause 2: storms!
[Plot: CPU usage vs. Time; callout: What happened here?]
Must plan for hardware and software updates
Consequences of Lessons 1 and 2
Aggregate metrics (e.g., mean latency) cannot distinguish between: [figure] and [figure]
We must reason with traces for long-tail events
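To make the point about aggregates concrete, here is a sketch with two invented latency series that share the same mean but behave very differently in the tail. Only the per-request view shows which request was slow. All values and names are illustrative.

```python
from statistics import mean

# Two hypothetical services with identical mean latency.
steady_ms = [20] * 100            # every request takes 20 ms
spiky_ms = [10] * 99 + [1010]     # 99 fast requests and one very slow one


def summarize(latencies_ms):
    """Aggregate view: hides the difference unless the tail is included."""
    return {"mean": mean(latencies_ms), "max": max(latencies_ms)}


def slow_requests(latencies_ms, threshold_ms):
    """Trace-like view: which individual requests exceeded the threshold."""
    return [i for i, latency in enumerate(latencies_ms) if latency > threshold_ms]


print(summarize(steady_ms), summarize(spiky_ms))
print("slow request indexes:", slow_requests(spiky_ms, threshold_ms=100))
```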
Challenges with traces
● Different traces may use different clocks
  ○ We use large amounts of data to do time alignment after the fact
● Reasoning is hard and laborious
  ○ Traces from a single machine may be 100K events per second
  ○ We use a language based on temporal logic to reason over traces
● Need to coordinate tracing across machines
  ○ Use coordinated-bursty tracing
Profiling tools are (mostly) ok; now we must invest in tracing
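The slide does not say how the time alignment is done. As one possible illustration only, the sketch below estimates the offset between two machines' clocks from many RPC timestamp pairs, in the style of NTP-like offset estimation. It assumes the smallest one-way delay seen in each direction is about the same; the `RpcTimestamps` type and its field names are hypothetical.

```python
from typing import NamedTuple, Sequence


class RpcTimestamps(NamedTuple):
    client_send: float   # read from the client's clock
    server_recv: float   # read from the server's clock
    server_send: float   # read from the server's clock
    client_recv: float   # read from the client's clock


def estimate_clock_offset(samples: Sequence[RpcTimestamps]) -> float:
    """Estimate (server clock - client clock) from many RPCs.

    Forward gap  = network delay + offset
    Backward gap = network delay - offset
    Taking the minimum of each over many samples approximates the
    best-case delay, which is assumed symmetric.
    """
    min_forward = min(s.server_recv - s.client_send for s in samples)
    min_backward = min(s.client_recv - s.server_send for s in samples)
    return (min_forward - min_backward) / 2.0
```

More samples make it more likely that each direction has been observed at its best-case delay, which is one reason a large amount of data helps.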
Lesson 3: Ask nicely!
Is your pattern of access reasonable?
[Diagram: a sync client asks “my message please” and Gmail answers “here you go”; a sync client asks “all 50,000 messages NOW” and Gmail answers “Aaargh!”]
A single not-so-nice request can degrade many requests
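One way a client could "ask nicely" is to split a huge fetch into bounded batches. This is a sketch of that idea only; the talk does not state that this is what Gmail clients do, and the function name and the batch size of 500 are assumptions.

```python
def in_batches(message_ids, batch_size=500):
    """Yield successive slices of message_ids, each at most batch_size long."""
    for start in range(0, len(message_ids), batch_size):
        yield message_ids[start:start + batch_size]


# A 50,000-message sync becomes 100 modest requests instead of one huge one.
all_ids = list(range(50_000))
batches = list(in_batches(all_ids))
print(len(batches), "requests of at most", max(len(b) for b in batches), "messages")
```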
Some problems are more subtle
[Diagram: Gmail Search receives the query “Chicken and Egg”; instead of results for "Chicken and Egg" it computes results for "(Chicken OR Chickens OR ...) and Egg"]
Query rewriting can dramatically inflate a small request
Any layer can potentially overwhelm the next layer
[Diagram: Gmail front end sending lots of small reads to the Gmail storage layer, versus Gmail front end sending fewer but larger reads to the Gmail storage layer]
Requests to a layer should match its strengths
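As an illustration of turning many small reads into fewer, larger ones, the sketch below merges byte-range reads that touch or nearly touch. The `(offset, length)` representation and the `max_gap` parameter are assumptions made for this example, not a description of Gmail's storage API.

```python
def coalesce_reads(reads, max_gap=0):
    """Merge (offset, length) reads that overlap or lie within max_gap bytes.

    Returns a sorted list of (offset, length) tuples covering the same bytes
    (plus any small gaps that were bridged).
    """
    merged = []  # list of [start, end) pairs, kept sorted
    for offset, length in sorted(reads):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            # Extends or overlaps the previous range: grow it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


small_reads = [(0, 4), (4, 4), (8, 4), (100, 10)]
print(coalesce_reads(small_reads))
```

Raising `max_gap` trades a few wasted bytes for fewer requests, which may suit a layer that is better at large sequential reads.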
Challenges
Our APIs capture functionality, not performance
How do we know that a method that we are calling...
● ...makes expensive RPC calls?
● ...acquires locks?
● ...performs IO?
Can we express and reason over performance in our APIs?
Lesson 4: Share!
Lock contention often causes long-tail latency
A burst can cause contention when there is normally none
Layer interactions may aggravate or alleviate contention
[Plots: # Requests vs. Time for the storage layer; # Requests vs. Time for the low-level layer for accessing disk]
An inefficient layer presents an easier request stream to the next layer
Remember Little’s law!
If processing a request holds a lock for 100ms we cannot process more than 10 requests per second
[Diagram: 5 qps arriving; hold for 0.1 sec]
Average queue depth: 5 * 0.1 = ½
If we want to double parallelism we must halve holding time
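The arithmetic on this slide can be written out directly. Little's law states L = λW: the average number of requests in the system equals the arrival rate times the time each one spends there. The function names below are illustrative, and the throughput bound assumes a single lock held for the full processing time.

```python
def average_in_flight(arrival_rate_qps: float, hold_time_s: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate_qps * hold_time_s


def max_throughput_qps(hold_time_s: float) -> float:
    """Upper bound on requests per second through one exclusive lock."""
    return 1.0 / hold_time_s


print(average_in_flight(5, 0.1))   # 5 qps * 0.1 s
print(max_throughput_qps(0.1))     # lock held for 100 ms
print(max_throughput_qps(0.05))    # halving the hold time doubles the bound
```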
Lesson 5: Confront weaknesses
Caches can hide the latency of “weak” components ...
[Diagram labels: Fast component, Cache, Slow component]
...but it is often much better to fix the root cause!
Why should we care if caches reduce latency?
[Diagram labels: Gmail’s storage layer, Cache, File system layer]
Glass is half full... Wow, a 98.5% cache hit rate!
Glass is half empty... Wow, why are we asking for the same data 98.5% of the time?
Fix the problem, not the symptom
If cache is performing too well, something is wrong with the requests
● Fix request pattern
● Remove cache
● Use fewer resources
● Cleaner code
Caches are great abuse detectors!
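To use a cache as an abuse detector, one can look at how often the same key is requested again and which keys dominate. The sketch below works on a plain list of request keys; the key format and function names are invented for illustration.

```python
from collections import Counter


def repeat_fraction(request_keys):
    """Fraction of requests whose key was already requested earlier.

    This is the hit rate an unbounded cache would see on this stream.
    """
    seen = set()
    repeats = 0
    for key in request_keys:
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats / len(request_keys) if request_keys else 0.0


def most_repeated(request_keys, n=3):
    """Keys requested most often: candidates for fixing the request pattern."""
    return Counter(request_keys).most_common(n)


stream = ["user42/settings"] * 3 + ["user42/msg/1"]
print(repeat_fraction(stream), most_repeated(stream, n=1))
```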
Lesson 6: Get your priorities right
Priorities only matter in a crunch
[Diagram labels: High prio, Low prio, Resource]
Long queues: big problem!
Short queues: no problem!
Overprovisioned resources mask poor priority settings
Setting priorities is hard!
[Diagram labels: High prio, Low prio]
First try: put only user-facing requests in high priority
...but that can easily backfire when there are dependencies
[Timeline labels: Queued, Run, Waiting for lower-priority req, Run; queues: High prio, Low prio]
Avoiding priority inversion makes for a complex priority model
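One textbook remedy for priority inversion is priority inheritance: work that a high-priority request is waiting on temporarily takes on that request's priority. The sketch below shows the idea on a small dependency map. It is not a description of Gmail's priority model, and the request names and numeric priorities are made up.

```python
def effective_priorities(base_priority, waits_on):
    """Propagate priorities along wait-for edges (higher number = more urgent).

    base_priority: dict mapping request id -> its own priority
    waits_on:      dict mapping request id -> ids it is blocked on
    A blocker inherits the highest priority of anything waiting on it,
    directly or transitively.
    """
    effective = dict(base_priority)
    changed = True
    while changed:  # repeat until no priority rises any further
        changed = False
        for waiter, blockers in waits_on.items():
            for blocker in blockers:
                if effective[blocker] < effective[waiter]:
                    effective[blocker] = effective[waiter]
                    changed = True
    return effective


base = {"user_read": 10, "index_update": 1, "compaction": 1, "backup": 1}
deps = {"user_read": ["index_update"], "index_update": ["compaction"]}
print(effective_priorities(base, deps))
```

Here the user-facing read lifts both the index update and the compaction it transitively depends on, while the unrelated backup stays at low priority.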
Lesson 7: Question everything
Suspect every layer, every component, every lock, ...
But even more so, look every gift horse in the mouth
If data looks too good to be true, it is probably wrong!
Rookie mistake: Compare Monday data to Sunday data!
[Plot: Requests per second vs. Time]
The load varies day to day, hour to hour!
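A simple guard against this mistake is to pick the baseline from the same weekday and hour in an earlier week. The sketch below shows that selection with the standard library; the function names are illustrative, and a one-week lookback is an assumption.

```python
from datetime import datetime, timedelta


def comparable_baseline(timestamp: datetime, weeks_back: int = 1) -> datetime:
    """Same weekday and time of day, weeks_back weeks earlier."""
    return timestamp - timedelta(weeks=weeks_back)


def relative_change(current: float, baseline: float) -> float:
    """Fractional change of current versus baseline (0.10 means +10%)."""
    return (current - baseline) / baseline


now = datetime(2024, 1, 8, 9, 0)
then = comparable_baseline(now)
print(now.strftime("%A %H:%M"), "vs", then.strftime("%A %H:%M"))
```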
Lesson 8: Real life is real life; tests are tests!
Loadtest versus production
[Plots: CPU usage during loadtest; CPU usage in production]
The loadtest executes the same binaries...but the load is different
Why the differences?
● Real user pattern is more complex than what we can synthesize
  ○ Users use a variety of email clients, which affects requests
  ○ Users differ in their mailbox size and usage
    ■ e.g., Googlers have 10x more of everything compared to the “average” user
  ○ Users have different settings (e.g., filters) and usage styles (clean-inbox versus everything-in-inbox)
  ○ ...
● The loadtest attempts to model these…
  ○ but it cannot possibly model everything
Many performance problems must be debugged in production
Debugging in production
If only I knew X I could get to the bottom of this puzzle
Sure… but only if
● Recording X does not violate any compliance, contractual, and privacy promises to our users
● It is carefully reviewed so that we do not introduce bugs or regressions
● And it still needs to go through a careful rollout process
Challenge is to infer what we really need from what we have
Call to action
● Trace analysis tools
  ○ How to combine and analyze diverse sources of data
  ○ How can we derive high-level knowledge from low-level data?
● Performance APIs
  ○ How can we express and check performance specifications?
● Focus on analyzing large production systems when possible
  ○ How can our students learn from and impact real systems?
Are you up to it?