

  1. IN-MEMORY CACHING: CURB TAIL LATENCY WITH PELIKAN

  2. ABOUT ME • 6 years at Twitter, on cache • maintainer of Twemcache (OSS) and Twitter’s Redis fork • operated thousands of machines • served hundreds of (internal) customers • Now working on Pelikan, a next-gen cache framework to replace the above @twitter • Twitter: @thinkingfish

  3. THE PROBLEM: CACHE PERFORMANCE

  4. CACHE RULES EVERYTHING AROUND ME: SERVICE → CACHE → DB

  5. 😤 CACHE RUINS EVERYTHING AROUND ME 😤: SERVICE (SENSITIVE!) → CACHE → DB

  6. GOOD CACHE PERFORMANCE = PREDICTABLE LATENCY

  7. GOOD CACHE PERFORMANCE = PREDICTABLE TAIL LATENCY

  8. KING OF PERFORMANCE “MILLIONS OF QPS PER MACHINE” “SUB-MILLISECOND LATENCIES” “NEAR LINE-RATE THROUGHPUT” …

  9. GHOSTS OF PERFORMANCE “USUALLY PRETTY FAST” “HICCUPS EVERY ONCE IN A WHILE” “TIMEOUT SPIKES AT THE TOP OF THE HOUR” “SLOW ONLY WHEN MEMORY IS LOW” …

  10. I SPENT FIRST 3 MONTHS AT TWITTER LEARNING CACHE BASICS… …AND THE NEXT 5 YEARS CHASING GHOSTS

  11. CONTAIN GHOSTS = MINIMIZE NONDETERMINISTIC BEHAVIOR

  12. HOW? IDENTIFY AVOID MITIGATE

  13. A PRIMER: CACHING IN THE DATACENTER

  14. CONTEXT • geographically centralized • highly homogeneous network • reliable, predictable infrastructure • long-lived connections • high data rate • simple data/operations

  15. CACHE IN PRODUCTION MAINLY: REQUEST → RESPONSE INITIALLY: CONNECT ALSO (BECAUSE WE ARE ADULTS): STATS, LOGGING, HEALTH CHECK…

  16. CACHE: A BIRD’S-EYE VIEW (layers: protocol, data storage, event-driven server, on top of the host OS and the network infrastructure)

  17. HOW DID WE UNCOVER THE UNCERTAINTIES?

  18. “BANDWIDTH UTILIZATION WENT WAY UP, BUT REQUEST RATE WAY DOWN.”

  19. SYSCALLS

  20. CONNECTING IS SYSCALL-HEAVY: accept → configure → register event → read (4+ syscalls)
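The per-connection syscall sequence on this slide can be sketched for a Linux epoll-based event loop. This is an illustration of the cost, not Pelikan’s actual code; `setup_conn` is a made-up name.

```c
/* Sketch: why each new connection costs 4+ syscalls in an
 * epoll-based server. accept() itself is syscall 1; configuring
 * the socket and registering it for read events adds three more,
 * before the first read() is even issued. */
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int setup_conn(int epfd, int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);                  /* syscall 2 */
    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)     /* syscall 3 */
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);     /* syscall 4 */
}
```

At 1-2 µs of overhead per syscall (slide 22), a storm of 100k new connections per second spends a large share of a core on this sequence alone.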

  21. REQUEST IS SYSCALL-LIGHT: read event posted → read → parse → process → compose → write event posted → write (3 syscalls*) *: the event loop returns multiple read events at once, and I/O syscalls can be further amortized by batching/pipelining

  22. TWEMCACHE IS MOSTLY SYSCALLS • 1-2 µs overhead per call • syscalls dominate CPU time in a simple cache • What if we have 100k conns/sec?

  23. culprit: CONNECTION STORM

  24. “…TWEMCACHE RANDOM HICCUPS, ALWAYS AT THE TOP OF THE HOUR.”

  25. [diagram: cache worker thread ⏱ blocked on logging DISK I/O while an hourly cron job “x” runs]

  26. culprit: BLOCKING I/O

  27. “WE ARE SEEING SEVERAL ‘BLIPS’ AFTER EACH CACHE REBOOT…”

  28. LOCKING FACTS • ~25 ns per operation • more expensive on NUMA • much more costly when contended

  29. A TIMELINE: MEMCACHE RESTART → EVERYTHING IS FINE → (lock!) REQUESTS SUDDENLY GET SLOW / TIME OUT → CONNECTION STORM → (lock!) CLIENTS TOPPLE → SLOWLY RECOVER → (REPEAT A FEW TIMES) → … → STABILIZE

  30. culprit: LOCKING

  31. “HOSTS WITH LONG-RUNNING CACHES TRIGGER OOM WHEN LOAD SPIKES.”

  32. “REDIS INSTANCES WERE KILLED BY THE SCHEDULER.”

  33. culprit: MEMORY

  34. SUMMARY CONNECTION STORM BLOCKING I/O LOCKING MEMORY

  35. HOW TO MITIGATE?

  36. DATA PLANE, CONTROL PLANE

  37. HIDE EXPENSIVE OPS: PUT OPERATIONS OF DIFFERENT NATURE/PURPOSE ON SEPARATE THREADS

  38. SLOW: CONTROL PLANE LISTENING (ADMIN CONNECTIONS) STATS AGGREGATION STATS EXPORTING LOG DUMP

  39. FAST: DATA PLANE / REQUEST (worker thread): read event posted → read → parse → process → compose → write event posted → write

  40. FAST: DATA PLANE / CONNECT: server thread: accept → configure → dispatch; worker thread: register event → read

  41. LATENCY-ORIENTED THREADING: worker thread handles REQUESTS; server thread handles CONNECTS and hands new connections to workers; admin thread handles OTHER work (logging, stats update)
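The three-way thread split can be sketched with POSIX threads. The loop bodies are stubbed out and all names are illustrative; the point is only that slow control-plane work lives on a thread that can never stall the data plane.

```c
/* Sketch: latency-oriented threading. One thread per concern, so
 * blocking work (logging, stats export) on the admin thread cannot
 * delay request processing on the worker thread. */
#include <pthread.h>
#include <stddef.h>

static void *worker_loop(void *arg) { (void)arg; /* serve requests */ return NULL; }
static void *server_loop(void *arg) { (void)arg; /* accept + dispatch connections */ return NULL; }
static void *admin_loop(void *arg)  { (void)arg; /* logging, stats aggregation/export */ return NULL; }

/* Spawn the worker, server, and admin threads; returns 0 on success. */
int start_threads(pthread_t t[3])
{
    if (pthread_create(&t[0], NULL, worker_loop, NULL) != 0) return -1;
    if (pthread_create(&t[1], NULL, server_loop, NULL) != 0) return -1;
    if (pthread_create(&t[2], NULL, admin_loop, NULL) != 0) return -1;
    return 0;
}
```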

  42. WHAT TO AVOID?

  43. LOCKING

  44. WHAT WE KNOW • inter-thread communication in cache: stats update, logging, connection hand-off • locking propagates blocking/delay between threads

  45. LOCKLESS OPERATIONS MAKE STATS UPDATE LOCKLESS w/ atomic instructions
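The atomic-instruction approach on this slide can be sketched with C11 atomics. The struct and metric names are made up for illustration; they are not Pelikan’s actual stats module.

```c
/* Sketch: lockless stats updates. Each metric is an atomic counter;
 * any thread can bump it without taking a lock, and relaxed ordering
 * is enough because counters carry no cross-thread dependencies. */
#include <stdatomic.h>
#include <stdint.h>

struct stats {
    atomic_uint_fast64_t request_count;
    atomic_uint_fast64_t hit_count;
};

static inline void stats_incr(atomic_uint_fast64_t *metric)
{
    atomic_fetch_add_explicit(metric, 1, memory_order_relaxed);
}
```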

  46. LOCKLESS OPERATIONS MAKE LOGGING WAITLESS with a RING/CYCLIC BUFFER: the writer advances the write position, the reader advances the read position
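A minimal sketch of such a single-producer/single-consumer ring buffer, assuming C11 atomics. The capacity and names are illustrative, not Pelikan’s implementation; the same write-position/read-position structure also serves the connection hand-off on the next slide. Note the waitless property: when the ring is full the writer drops rather than blocks.

```c
/* Sketch: waitless SPSC ring buffer. The producer writes only wpos,
 * the consumer writes only rpos, so neither ever waits on the other. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAP 8u  /* power of two in practice */

struct ring {
    uint8_t buf[RING_CAP];
    _Atomic uint32_t wpos;  /* advanced only by the producer */
    _Atomic uint32_t rpos;  /* advanced only by the consumer */
};

static bool ring_put(struct ring *r, uint8_t b)
{
    uint32_t w  = atomic_load_explicit(&r->wpos, memory_order_relaxed);
    uint32_t rd = atomic_load_explicit(&r->rpos, memory_order_acquire);
    if (w - rd == RING_CAP)
        return false;                       /* full: drop, never block */
    r->buf[w % RING_CAP] = b;
    atomic_store_explicit(&r->wpos, w + 1, memory_order_release);
    return true;
}

static bool ring_get(struct ring *r, uint8_t *b)
{
    uint32_t rd = atomic_load_explicit(&r->rpos, memory_order_relaxed);
    uint32_t w  = atomic_load_explicit(&r->wpos, memory_order_acquire);
    if (w == rd)
        return false;                       /* empty */
    *b = r->buf[rd % RING_CAP];
    atomic_store_explicit(&r->rpos, rd + 1, memory_order_release);
    return true;
}
```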

  47. LOCKLESS OPERATIONS MAKE CONNECTION HAND-OFF LOCKLESS with a RING ARRAY: the writer advances the write position, the reader advances the read position

  48. MEMORY

  49. WHAT WE KNOW • alloc/free cycles cause fragmentation • internal vs external fragmentation • OOM/swapping is deadly • memory alloc/copy is relatively expensive

  50. PREDICTABLE FOOTPRINT AVOID EXTERNAL FRAGMENTATION CAP ALL MEMORY RESOURCES

  51. PREDICTABLE RUNTIME REUSE BUFFER PREALLOCATE
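The reuse/preallocate idea can be sketched as a capped buffer pool with a free list. This is a hypothetical illustration with made-up names and sizes, not Pelikan’s buffer module: one allocation at startup, a hard cap on total footprint, and no malloc/free on the hot path.

```c
/* Sketch: preallocated, capped buffer pool. Buffers are carved out
 * of a single upfront allocation and recycled via a free list, so
 * runtime cost and memory footprint are both predictable. */
#include <stddef.h>
#include <stdlib.h>

#define BUF_SIZE  1024u
#define POOL_SIZE 64u   /* hard cap on total buffer memory */

struct buf {
    struct buf *next;          /* free-list link */
    char data[BUF_SIZE];
};

static struct buf *free_list;

int pool_init(void)            /* one allocation, at startup */
{
    struct buf *slab = calloc(POOL_SIZE, sizeof(struct buf));
    if (slab == NULL)
        return -1;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        slab[i].next = free_list;
        free_list = &slab[i];
    }
    return 0;
}

struct buf *buf_borrow(void)   /* NULL when the pool is exhausted */
{
    struct buf *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

void buf_return(struct buf *b) /* recycle instead of free() */
{
    b->next = free_list;
    free_list = b;
}
```

Capping the pool turns what would be an OOM kill under load spikes (slides 31-32) into an explicit, handleable exhaustion error.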

  52. IMPLEMENTATION PELIKAN CACHE

  53. WHAT IS PELIKAN CACHE? • (Datacenter-) caching framework • A summary of Twitter’s cache ops • Perf goal: deterministically fast • Clean, modular design • Open-source: pelikan.io [architecture diagram: server process with orchestration, parse/compose/trace, cache data model, data store, request/response; core with streams, events, channels, pooling, buffers, timer/alarm; common with waitless logging, lockless metrics, composed config, threading]

  54. PERFORMANCE DESIGN DECISIONS: A COMPARISON
            | latency-oriented threading | memory/fragmentation | memory/buffer caching | memory/pre-allocation, cap | locking
  Memcached | partial                    | internal             | partial               | partial                    | yes
  Redis     | no->partial                | external             | no                    | partial                    | no->yes
  Pelikan   | yes                        | internal             | yes                   | yes                        | no

  55. TO BE FAIR… MEMCACHED: • multiple worker threads • binary protocol + SASL; REDIS: • rich set of data structures • master-slave replication • redis-cluster • modules • tools

  56. THE BEST CACHE IS… ALWAYS FAST

  57. QUESTIONS?
