IN-MEMORY CACHING: CURB TAIL LATENCY WITH PELIKAN
ABOUT ME
• 6 years at Twitter, on cache
• maintainer of Twemcache & Twitter’s Redis fork
• operated thousands of machines
• hundreds of (internal) customers
• now working on Pelikan, a next-gen cache framework to replace the above @twitter
• Twitter: @thinkingfish
THE PROBLEM: CACHE PERFORMANCE
CACHE RULES EVERYTHING AROUND ME (SERVICE → CACHE → DB)
😤 CACHE RUINS EVERYTHING AROUND ME 😤 (SERVICE → CACHE → DB)
LATENCY & FANOUT
• req: all tweets for #qcon ⇒ tid 1, tid 2, …, tid n (assume n is large)
• what determines the overall 99%-ile of the SERVICE req?
• the larger the fanout across CACHE nodes, the deeper the per-call percentile that matters:
  fanout 1 → p99
  fanout 10 → p99.9
  fanout 100 → p99.99
  fanout 1000 → p99.999
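A back-of-the-envelope way to see where that table comes from (a sketch assuming the n fanout calls are independent and identically distributed):

\[
P(\text{all } n \text{ calls} \le t) = P(\text{one call} \le t)^n = q^n
\quad\Rightarrow\quad
q = 0.99^{1/n} \approx 1 - \frac{0.01}{n}
\]

i.e. to keep the overall p99 at t, a fanout of 10 needs each cache call to meet t at its p99.9, a fanout of 100 at its p99.99, and so on: the service's p99 is governed by a far deeper tail of each cache.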
LATENCY & DEPENDENCY
• what determines the overall 99%-ile?
• latencies of dependent steps add together
• N steps ⇒ N × exposure to tail latency
• e.g. SERVICE A (get timeline) → SERVICE B (get tweets) → SERVICE C (get users for each tweet)
CACHE IS UBIQUITOUS
• every service in the chain (SERVICE A/B/C) fronts its own fleet of cache nodes (CACHE A/B/C)
• exposure to cache tail latency increases with both scale and dependency!
GOOD CACHE PERFORMANCE = PREDICTABLE LATENCY
GOOD CACHE PERFORMANCE = PREDICTABLE TAIL LATENCY
KING OF PERFORMANCE “MILLIONS OF QPS PER MACHINE” “SUB-MILLISECOND LATENCIES” “NEAR LINE-RATE THROUGHPUT” …
GHOSTS OF PERFORMANCE “USUALLY PRETTY FAST” “HICCUPS EVERY ONCE IN A WHILE” “TIMEOUT SPIKES AT THE TOP OF THE HOUR” “SLOW ONLY WHEN MEMORY IS LOW” …
I SPENT MY FIRST 3 MONTHS AT TWITTER LEARNING CACHE BASICS… …AND THE NEXT 5 YEARS CHASING GHOSTS
CHASING DOWN GHOSTS = MINIMIZING NONDETERMINISTIC BEHAVIOR
HOW? IDENTIFY AVOID MITIGATE
A PRIMER: CACHING IN DATACENTER
DATACENTER • geographically centralized • highly homogeneous network • relatively reliable infrastructure
CACHING
• mainly: request → response
• initially: connect
• also (because we are grown-ups): stats, logging, health check…
CACHE SERVER: BIRD’S-EYE VIEW
• data: protocol, storage
• event-driven server
• OS
• host: network infrastructure
HOW DID WE UNCOVER THE UNCERTAINTIES ?
“BANDWIDTH UTILIZATION WENT WAY UP, EVEN THOUGH REQUEST RATE WAS WAY LOWER.”
SYSCALLS
CONNECTING IS SYSCALL-HEAVY
• accept → config → register event → read
• 4+ syscalls per new connection
REQUEST IS SYSCALL-LIGHT
• read event → read (I/O) → post-read → parse → process → compose → write (I/O) → post-write → write event
• 3 syscalls* per request
• *: the event loop returns multiple read events at once; I/O syscalls can be further amortized by batching/pipelining
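A minimal sketch of the two paths above in an epoll-based loop (illustrative only, not Twemcache's or Pelikan's actual code; event_loop, MAX_EVENTS, and the echo-style write are made-up placeholders):

#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

#define MAX_EVENTS 1024

void event_loop(int epfd, int listen_fd)
{
    struct epoll_event evs[MAX_EVENTS];
    char buf[16 * 1024];

    for (;;) {
        /* one syscall, returns many ready connections at once */
        int n = epoll_wait(epfd, evs, MAX_EVENTS, -1);

        for (int i = 0; i < n; i++) {
            int fd = evs[i].data.fd;

            if (fd == listen_fd) {   /* connect path: syscall-heavy */
                int c = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK); /* 1: accept */
                int one = 1;
                setsockopt(c, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); /* 2: config */
                struct epoll_event ev = { .events = EPOLLIN, .data.fd = c };
                epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);                     /* 3: register event */
                /* first read happens on a later iteration: 4+ syscalls total */
            } else {                 /* request path: syscall-light */
                ssize_t r = read(fd, buf, sizeof(buf));   /* 1: read */
                if (r <= 0) { close(fd); continue; }
                /* parse / process / compose are pure user-space work */
                write(fd, buf, (size_t)r);                /* 2: write (echo as placeholder) */
            }
        }
    }
}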
TWEMCACHE IS MOSTLY SYSCALLS
• 1–2 µs of overhead per syscall
• syscalls dominate CPU time in a simple cache
• what if we have 100k conns/sec?
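Rough arithmetic on why that hurts (a sketch combining the 1–2 µs figure above with the 4+ syscalls per connect from two slides back):

\[
100{,}000\ \tfrac{\text{conns}}{\text{s}} \times 4\ \text{syscalls} \times 1.5\,\mu\text{s} \approx 0.6\ \text{CPU-seconds per second}
\]

i.e. more than half a core spent on connection setup alone, before a single request is served.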
culprit: CONNECTION STORM
“…TWEMCACHE RANDOM HICCUPS, ALWAYS AT THE TOP OF THE HOUR.”
(diagram: at ⏱ the top of the hour, the cache t_worker blocks on logging, whose DISK I/O is contended by cron job “x”)
culprit: BLOCKING I/O
“WE ARE SEEING SEVERAL “BLIPS” AFTER EACH CACHE REBOOT…”
A TIMELINE
• MEMCACHE RESTART
• (lock!) MANY REQUESTS TIMED OUT
• CONNECTION STORM
• (lock!) SOME MORE REQUESTS TIMED OUT
• (REPEAT A FEW TIMES)
culprit: LOCKING
LOCKING FACTS
• ~25 ns per lock operation
• more expensive on NUMA
• much more costly when contended
“HOSTS WITH LONG-RUNNING TWEMCACHE/REDIS TRIGGER OOM DURING LOAD SPIKES.”
“REDIS INSTANCES THAT STARTED EVICTING SUDDENLY GOT SLOWER.”
culprit: MEMORY LAYOUT / OPS
SUMMARY • CONNECTION STORM • BLOCKING I/O • LOCKING • MEMORY
HOW TO MITIGATE?
HIDE EXPENSIVE OPS PUT OPERATIONS OF DIFFERENT NATURE / PURPOSE ON SEPARATE THREADS
DATA PLANE, CONTROL PLANE
SLOW: CONTROL PLANE • STATS AGGREGATION • STATS EXPORTING • LOG DUMP • LOG ROTATION • …
FAST: DATA PLANE / REQUEST — on t_worker: read event → read (I/O) → post-read → parse → process → compose → write (I/O) → post-write → write event
FAST: DATA PLANE / CONNECT — on t_server: read event → accept → config → dispatch; on t_worker: register → read event
LATENCY-ORIENTED THREADING
• t_worker: REQUESTS (plus its own logging, stats update)
• t_server: CONNECTS — hands each new connection to t_worker
• t_admin: OTHER — logging, stats update
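A minimal sketch of that thread split with pthreads (illustrative only; the loop bodies are placeholders and Pelikan's actual orchestration differs in detail):

#include <pthread.h>
#include <unistd.h>

static void *server_loop(void *arg)   /* t_server: accept + hand off */
{
    (void)arg;
    for (;;) {
        /* accept new connections, configure them, then hand each one
         * to t_worker over a lockless ring (see the later slides) */
    }
    return NULL;
}

static void *worker_loop(void *arg)   /* t_worker: the request path only */
{
    (void)arg;
    for (;;) {
        /* epoll_wait -> read -> parse -> process -> compose -> write;
         * stats are bumped with atomics, log lines go into a ring buffer,
         * so nothing here ever blocks on disk or on another thread */
    }
    return NULL;
}

static void *admin_loop(void *arg)    /* t_admin: the slow control plane */
{
    (void)arg;
    for (;;) {
        /* aggregate/export stats, flush and rotate logs */
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t_server, t_worker, t_admin;

    pthread_create(&t_server, NULL, server_loop, NULL);
    pthread_create(&t_worker, NULL, worker_loop, NULL);
    pthread_create(&t_admin,  NULL, admin_loop,  NULL);
    pthread_join(t_worker, NULL);
    return 0;
}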
WHAT TO AVOID?
LOCKING
WHAT WE KNOW
• inter-thread communication in a cache: stats, logging, connection hand-off
• locking propagates blocking/delay between threads
LOCKLESS OPERATIONS: MAKE STATS UPDATE LOCKLESS — w/ atomic instructions
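A sketch of lockless stats with C11 atomics (illustrative; the struct and function names are made up, and Pelikan's lockless metrics differ in detail). Workers bump counters with relaxed atomic adds; t_admin reads them without taking any lock:

#include <stdatomic.h>
#include <stdint.h>

struct stats {
    atomic_uint_fast64_t request;
    atomic_uint_fast64_t hit;
    atomic_uint_fast64_t miss;
};

static struct stats st;

static inline void stats_incr(atomic_uint_fast64_t *c)
{
    atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

/* worker thread, per request:
 *   stats_incr(&st.request);
 *   found ? stats_incr(&st.hit) : stats_incr(&st.miss); */

/* admin thread, when exporting: */
static inline uint64_t stats_read(atomic_uint_fast64_t *c)
{
    return atomic_load_explicit(c, memory_order_relaxed);
}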
LOCKLESS OPERATIONS: MAKE LOGGING LOCKLESS — a ring/cyclic buffer with separate read and write positions (one writer, one reader)
LOCKLESS OPERATIONS: MAKE CONNECTION HAND-OFF LOCKLESS — a ring array, again with separate read and write positions
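A minimal single-producer/single-consumer ring sketch that fits both uses above: the writer (logging thread or t_server) only advances the write position, the reader (log flusher or t_worker) only advances the read position, so no lock is needed. Names and sizes are illustrative; Pelikan's waitless logging and ring array differ in detail:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 1024                  /* power of two */

struct ring {
    void *item[RING_CAP];
    atomic_size_t wpos;                /* only the writer advances this */
    atomic_size_t rpos;                /* only the reader advances this */
};

/* single producer, e.g. t_server pushing a new connection */
static bool ring_push(struct ring *r, void *it)
{
    size_t w  = atomic_load_explicit(&r->wpos, memory_order_relaxed);
    size_t rd = atomic_load_explicit(&r->rpos, memory_order_acquire);

    if (w - rd == RING_CAP) {
        return false;                  /* full: caller retries or drops */
    }
    r->item[w & (RING_CAP - 1)] = it;
    atomic_store_explicit(&r->wpos, w + 1, memory_order_release);
    return true;
}

/* single consumer, e.g. t_worker picking the connection up */
static bool ring_pop(struct ring *r, void **out)
{
    size_t rd = atomic_load_explicit(&r->rpos, memory_order_relaxed);
    size_t w  = atomic_load_explicit(&r->wpos, memory_order_acquire);

    if (rd == w) {
        return false;                  /* empty */
    }
    *out = r->item[rd & (RING_CAP - 1)];
    atomic_store_explicit(&r->rpos, rd + 1, memory_order_release);
    return true;
}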
MEMORY
WHAT WE KNOW
• alloc/free churn causes fragmentation
• internal vs. external fragmentation
• OOM/swapping is deadly
• memory alloc/copy is relatively expensive
PREDICTABLE FOOTPRINT • AVOID EXTERNAL FRAGMENTATION • CAP ALL MEMORY RESOURCES
PREDICTABLE RUNTIME • REUSE BUFFERS • PREALLOCATE
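A sketch of what "preallocate, cap, reuse" can look like for buffers (illustrative; names, sizes and the free-list layout are assumptions, and Pelikan's pooling module is more general). All memory is grabbed up front so the footprint is fixed, and buffers are recycled instead of going through malloc/free per request:

#include <stdlib.h>
#include <stddef.h>

#define BUF_SIZE (16 * 1024)
#define POOL_CAP 4096                  /* hard cap on the number of buffers */

struct buf {
    struct buf *next;                  /* free-list link */
    char        data[BUF_SIZE];
};

/* per-thread pool, so no locking is needed */
static struct buf *free_list;

int pool_create(void)                  /* preallocate everything at startup */
{
    for (int i = 0; i < POOL_CAP; i++) {
        struct buf *b = malloc(sizeof(*b));
        if (b == NULL) {
            return -1;
        }
        b->next = free_list;
        free_list = b;
    }
    return 0;
}

struct buf *buf_borrow(void)           /* O(1), no syscalls, no surprise OOM */
{
    struct buf *b = free_list;
    if (b != NULL) {
        free_list = b->next;
    }
    return b;                          /* NULL means the cap was reached */
}

void buf_return(struct buf *b)         /* reuse instead of free() */
{
    b->next = free_list;
    free_list = b;
}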
IMPLEMENTATION PELIKAN CACHE
WHAT IS PELIKAN CACHE?
• (datacenter-) caching framework
• a summary of Twitter’s cache ops
• perf goal: deterministically fast
• clean, modular design
• open-source — pelikan.io
(architecture diagram: per-process modules — server, orchestration, cache data model, parse/compose/trace, data store, request/response; common core — streams, events, channels, buffers, pooling, timer/alarm, waitless logging, lockless metrics, composed config, threading)
PERFORMANCE DESIGN DECISIONS: A COMPARISON
            latency-oriented threading | memory/fragmentation | memory/buffer caching | memory/pre-allocation, cap | locking
Memcached   partial                    | internal             | partial               | partial                    | yes
Redis       no → partial               | external             | no                    | partial                    | no → yes
Pelikan     yes                        | internal             | yes                   | yes                        | no
TO BE FAIR…
MEMCACHED: • multiple threads can boost throughput • binary protocol + SASL
REDIS: • rich set of data structures • RDB • master-slave replication • redis-cluster • modules • tools
SCALABLE CACHE IS… ALWAYS FAST
“CAREFUL ABOUT MOVING TO MULTIPLE WORKER THREADS”
QUESTIONS?