IN-MEMORY CACHING: CURB TAIL LATENCY WITH PELIKAN
ABOUT ME
• 6 years at Twitter, on cache
• maintainer of Twemcache & Twitter’s Redis fork
• operated thousands of machines
• hundreds of (internal) customers
• now working on Pelikan, a next-gen cache framework to replace the above @twitter
• Twitter: @thinkingfish
THE PROBLEM: CACHE PERFORMANCE
CACHE RULES EVERYTHING AROUND ME (SERVICE → CACHE → DB)
😤 CACHE RUINS EVERYTHING AROUND ME 😤 (SERVICE → CACHE → DB)
LATENCY & FANOUT
• req: all tweets for #qcon ⇒ tid 1, tid 2, …, tid n (assume n is large)
• what determines the overall 99%-ile of the SERVICE req?
• the larger the fanout across CACHE nodes, the deeper the per-call percentile that matters:
  fanout 1 → p99
  fanout 10 → p99.9
  fanout 100 → p99.99
  fanout 1000 → p99.999
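A back-of-the-envelope way to see where that table comes from (a sketch assuming the n fanout calls are independent and identically distributed):

\[
P(\text{all } n \text{ calls} \le t) = P(\text{one call} \le t)^n = q^n
\quad\Rightarrow\quad
q = 0.99^{1/n} \approx 1 - \frac{0.01}{n}
\]

i.e. to keep the overall p99 at t, a fanout of 10 needs each cache call to meet t at its p99.9, a fanout of 100 at its p99.99, and so on: the service's p99 is governed by a far deeper tail of each cache.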
LATENCY & DEPENDENCY
• what determines the overall 99%-ile?
• latencies of dependent steps add together
• N steps ⇒ N × exposure to tail latency
• e.g. SERVICE A (get timeline) → SERVICE B (get tweets) → SERVICE C (get users for each tweet)
CACHE IS UBIQUITOUS
• every service in the chain (SERVICE A/B/C) fronts its own fleet of cache nodes (CACHE A/B/C)
• exposure to cache tail latency increases with both scale and dependency!
GOOD CACHE PERFORMANCE = PREDICTABLE LATENCY
GOOD CACHE PERFORMANCE = PREDICTABLE TAIL LATENCY
KING OF PERFORMANCE “MILLIONS OF QPS PER MACHINE” “SUB-MILLISECOND LATENCIES” “NEAR LINE-RATE THROUGHPUT” …
GHOSTS OF PERFORMANCE “USUALLY PRETTY FAST” “HICCUPS EVERY ONCE IN A WHILE” “TIMEOUT SPIKES AT THE TOP OF THE HOUR” “SLOW ONLY WHEN MEMORY IS LOW” …
I SPENT MY FIRST 3 MONTHS AT TWITTER LEARNING CACHE BASICS… …AND THE NEXT 5 YEARS CHASING GHOSTS
CHASING DOWN GHOSTS = MINIMIZING NONDETERMINISTIC BEHAVIOR
HOW? IDENTIFY AVOID MITIGATE
A PRIMER: CACHING IN DATACENTER
DATACENTER • geographically centralized • highly homogeneous network • relatively reliable infrastructure
CACHING
• mainly: request → response
• initially: connect
• also (because we are grown-ups): stats, logging, health check…
CACHE SERVER: BIRD’S-EYE VIEW
• data: protocol, storage
• event-driven server
• OS
• host: network infrastructure
HOW DID WE UNCOVER THE UNCERTAINTIES ?
“BANDWIDTH UTILIZATION WENT WAY UP, EVEN THOUGH REQUEST RATE WAS WAY LOWER.”
SYSCALLS
CONNECTING IS SYSCALL-HEAVY
• accept → config → register event → read
• 4+ syscalls per new connection
REQUEST IS SYSCALL-LIGHT
• read event → read (I/O) → post-read → parse → process → compose → write (I/O) → post-write → write event
• 3 syscalls* per request
• *: the event loop returns multiple read events at once; I/O syscalls can be further amortized by batching/pipelining
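A minimal sketch of the two paths above in an epoll-based loop (illustrative only, not Twemcache's or Pelikan's actual code; event_loop, MAX_EVENTS, and the echo-style write are made-up placeholders):

#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

#define MAX_EVENTS 1024

void event_loop(int epfd, int listen_fd)
{
    struct epoll_event evs[MAX_EVENTS];
    char buf[16 * 1024];

    for (;;) {
        /* one syscall, returns many ready connections at once */
        int n = epoll_wait(epfd, evs, MAX_EVENTS, -1);

        for (int i = 0; i < n; i++) {
            int fd = evs[i].data.fd;

            if (fd == listen_fd) {   /* connect path: syscall-heavy */
                int c = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK); /* 1: accept */
                int one = 1;
                setsockopt(c, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); /* 2: config */
                struct epoll_event ev = { .events = EPOLLIN, .data.fd = c };
                epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);                     /* 3: register event */
                /* first read happens on a later iteration: 4+ syscalls total */
            } else {                 /* request path: syscall-light */
                ssize_t r = read(fd, buf, sizeof(buf));   /* 1: read */
                if (r <= 0) { close(fd); continue; }
                /* parse / process / compose are pure user-space work */
                write(fd, buf, (size_t)r);                /* 2: write (echo as placeholder) */
            }
        }
    }
}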
TWEMCACHE IS MOSTLY SYSCALLS
• 1–2 µs of overhead per syscall
• syscalls dominate CPU time in a simple cache
• what if we have 100k conns/sec?
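Rough arithmetic on why that hurts (a sketch combining the 1–2 µs figure above with the 4+ syscalls per connect from two slides back):

\[
100{,}000\ \tfrac{\text{conns}}{\text{s}} \times 4\ \text{syscalls} \times 1.5\,\mu\text{s} \approx 0.6\ \text{CPU-seconds per second}
\]

i.e. more than half a core spent on connection setup alone, before a single request is served.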
culprit: CONNECTION STORM
“…TWEMCACHE RANDOM HICCUPS, ALWAYS AT THE TOP OF THE HOUR.”
(diagram: at ⏱ the top of the hour, the cache t_worker blocks on logging, whose DISK I/O is contended by cron job “x”)
culprit: BLOCKING I/O
“WE ARE SEEING SEVERAL “BLIPS” AFTER EACH CACHE REBOOT…”
A TIMELINE
• MEMCACHE RESTART
• (lock!) MANY REQUESTS TIMED OUT
• CONNECTION STORM
• (lock!) SOME MORE REQUESTS TIMED OUT
• (REPEAT A FEW TIMES)
culprit: LOCKING
LOCKING FACTS
• ~25 ns per lock operation
• more expensive on NUMA
• much more costly when contended
“HOSTS WITH LONG-RUNNING TWEMCACHE/REDIS TRIGGER OOM DURING LOAD SPIKES.”
“REDIS INSTANCES THAT STARTED EVICTING SUDDENLY GOT SLOWER.”
culprit: MEMORY LAYOUT / OPS
SUMMARY • CONNECTION STORM • BLOCKING I/O • LOCKING • MEMORY
HOW TO MITIGATE?
HIDE EXPENSIVE OPS PUT OPERATIONS OF DIFFERENT NATURE / PURPOSE ON SEPARATE THREADS
DATA PLANE, CONTROL PLANE
SLOW: CONTROL PLANE • STATS AGGREGATION • STATS EXPORTING • LOG DUMP • LOG ROTATION • …
FAST: DATA PLANE / REQUEST — on t_worker: read event → read (I/O) → post-read → parse → process → compose → write (I/O) → post-write → write event
FAST: DATA PLANE / CONNECT — on t_server: read event → accept → config → dispatch; on t_worker: register → read event
LATENCY-ORIENTED THREADING
• t_worker: REQUESTS (plus its own logging, stats update)
• t_server: CONNECTS — hands each new connection to t_worker
• t_admin: OTHER — logging, stats update
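A minimal sketch of that thread split with pthreads (illustrative only; the loop bodies are placeholders and Pelikan's actual orchestration differs in detail):

#include <pthread.h>
#include <unistd.h>

static void *server_loop(void *arg)   /* t_server: accept + hand off */
{
    (void)arg;
    for (;;) {
        /* accept new connections, configure them, then hand each one
         * to t_worker over a lockless ring (see the later slides) */
    }
    return NULL;
}

static void *worker_loop(void *arg)   /* t_worker: the request path only */
{
    (void)arg;
    for (;;) {
        /* epoll_wait -> read -> parse -> process -> compose -> write;
         * stats are bumped with atomics, log lines go into a ring buffer,
         * so nothing here ever blocks on disk or on another thread */
    }
    return NULL;
}

static void *admin_loop(void *arg)    /* t_admin: the slow control plane */
{
    (void)arg;
    for (;;) {
        /* aggregate/export stats, flush and rotate logs */
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t_server, t_worker, t_admin;

    pthread_create(&t_server, NULL, server_loop, NULL);
    pthread_create(&t_worker, NULL, worker_loop, NULL);
    pthread_create(&t_admin,  NULL, admin_loop,  NULL);
    pthread_join(t_worker, NULL);
    return 0;
}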
WHAT TO AVOID?
LOCKING
WHAT WE KNOW
• inter-thread communication in a cache: stats, logging, connection hand-off
• locking propagates blocking/delay between threads
LOCKLESS OPERATIONS: MAKE STATS UPDATE LOCKLESS — w/ atomic instructions
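A sketch of lockless stats with C11 atomics (illustrative; the struct and function names are made up, and Pelikan's lockless metrics differ in detail). Workers bump counters with relaxed atomic adds; t_admin reads them without taking any lock:

#include <stdatomic.h>
#include <stdint.h>

struct stats {
    atomic_uint_fast64_t request;
    atomic_uint_fast64_t hit;
    atomic_uint_fast64_t miss;
};

static struct stats st;

static inline void stats_incr(atomic_uint_fast64_t *c)
{
    atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

/* worker thread, per request:
 *   stats_incr(&st.request);
 *   found ? stats_incr(&st.hit) : stats_incr(&st.miss); */

/* admin thread, when exporting: */
static inline uint64_t stats_read(atomic_uint_fast64_t *c)
{
    return atomic_load_explicit(c, memory_order_relaxed);
}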
LOCKLESS OPERATIONS: MAKE LOGGING LOCKLESS — a ring/cyclic buffer with separate read and write positions (one writer, one reader)
LOCKLESS OPERATIONS: MAKE CONNECTION HAND-OFF LOCKLESS — a ring array, again with separate read and write positions
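A minimal single-producer/single-consumer ring sketch that fits both uses above: the writer (logging thread or t_server) only advances the write position, the reader (log flusher or t_worker) only advances the read position, so no lock is needed. Names and sizes are illustrative; Pelikan's waitless logging and ring array differ in detail:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 1024                  /* power of two */

struct ring {
    void *item[RING_CAP];
    atomic_size_t wpos;                /* only the writer advances this */
    atomic_size_t rpos;                /* only the reader advances this */
};

/* single producer, e.g. t_server pushing a new connection */
static bool ring_push(struct ring *r, void *it)
{
    size_t w  = atomic_load_explicit(&r->wpos, memory_order_relaxed);
    size_t rd = atomic_load_explicit(&r->rpos, memory_order_acquire);

    if (w - rd == RING_CAP) {
        return false;                  /* full: caller retries or drops */
    }
    r->item[w & (RING_CAP - 1)] = it;
    atomic_store_explicit(&r->wpos, w + 1, memory_order_release);
    return true;
}

/* single consumer, e.g. t_worker picking the connection up */
static bool ring_pop(struct ring *r, void **out)
{
    size_t rd = atomic_load_explicit(&r->rpos, memory_order_relaxed);
    size_t w  = atomic_load_explicit(&r->wpos, memory_order_acquire);

    if (rd == w) {
        return false;                  /* empty */
    }
    *out = r->item[rd & (RING_CAP - 1)];
    atomic_store_explicit(&r->rpos, rd + 1, memory_order_release);
    return true;
}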
MEMORY
WHAT WE KNOW
• alloc/free churn causes fragmentation
• internal vs. external fragmentation
• OOM/swapping is deadly
• memory alloc/copy is relatively expensive
PREDICTABLE FOOTPRINT • AVOID EXTERNAL FRAGMENTATION • CAP ALL MEMORY RESOURCES
PREDICTABLE RUNTIME • REUSE BUFFERS • PREALLOCATE
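A sketch of what "preallocate, cap, reuse" can look like for buffers (illustrative; names, sizes and the free-list layout are assumptions, and Pelikan's pooling module is more general). All memory is grabbed up front so the footprint is fixed, and buffers are recycled instead of going through malloc/free per request:

#include <stdlib.h>
#include <stddef.h>

#define BUF_SIZE (16 * 1024)
#define POOL_CAP 4096                  /* hard cap on the number of buffers */

struct buf {
    struct buf *next;                  /* free-list link */
    char        data[BUF_SIZE];
};

/* per-thread pool, so no locking is needed */
static struct buf *free_list;

int pool_create(void)                  /* preallocate everything at startup */
{
    for (int i = 0; i < POOL_CAP; i++) {
        struct buf *b = malloc(sizeof(*b));
        if (b == NULL) {
            return -1;
        }
        b->next = free_list;
        free_list = b;
    }
    return 0;
}

struct buf *buf_borrow(void)           /* O(1), no syscalls, no surprise OOM */
{
    struct buf *b = free_list;
    if (b != NULL) {
        free_list = b->next;
    }
    return b;                          /* NULL means the cap was reached */
}

void buf_return(struct buf *b)         /* reuse instead of free() */
{
    b->next = free_list;
    free_list = b;
}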
IMPLEMENTATION PELIKAN CACHE
WHAT IS PELIKAN CACHE?
• (datacenter-) caching framework
• a summary of Twitter’s cache ops
• perf goal: deterministically fast
• clean, modular design
• open-source — pelikan.io
(architecture diagram: per-process modules — server, orchestration, cache data model, parse/compose/trace, data store, request/response; common core — streams, events, channels, buffers, pooling, timer/alarm, waitless logging, lockless metrics, composed config, threading)
PERFORMANCE DESIGN DECISIONS: A COMPARISON
            latency-oriented threading | memory/fragmentation | memory/buffer caching | memory/pre-allocation, cap | locking
Memcached   partial                    | internal             | partial               | partial                    | yes
Redis       no → partial               | external             | no                    | partial                    | no → yes
Pelikan     yes                        | internal             | yes                   | yes                        | no
TO BE FAIR…
MEMCACHED: • multiple threads can boost throughput • binary protocol + SASL
REDIS: • rich set of data structures • RDB • master-slave replication • redis-cluster • modules • tools
SCALABLE CACHE IS… ALWAYS FAST
“CAREFUL ABOUT MOVING TO MULTIPLE WORKER THREADS”
QUESTIONS?