

  1. SCALING INSTAGRAM INFRA Lisa Guo, Nov 7th, 2016, lguo@instagram.com

  2. INSTAGRAM HISTORY (timeline): 2010 launch · 2011, 14M users · 2012/4/3, Android release · 2012/4/9, Facebook acquisition · 2014/1

  3. INSTAGRAM EVERY DAY: 300 Million users · 4.2 Billion likes · 95 Million photo/video uploads · 100 Million followers

  4. SCALING MEANS Scale up Scale out Scale dev team

  5. SCALE OUT

  6. SCALE OUT “To scale horizontally means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might involve scaling out from one Web server system to three.” - Wikipedia

  7. MICROSERVICE

  8. SCALING OUT: vertical partitioning → horizontal sharding
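The two partitioning styles on this slide can be sketched as follows; the shard count, table names, and database names are illustrative, not Instagram's actual setup:

```python
NUM_SHARDS = 4096  # hypothetical shard count

def shard_for_user(user_id: int) -> int:
    """Horizontal sharding: rows of the same table are spread
    across many shards by a key, here the user id."""
    return user_id % NUM_SHARDS

# Vertical partitioning: different tables/features live in
# different databases entirely (names are made up).
VERTICAL_PARTITIONS = {
    "users": "db-users",
    "media": "db-media",
    "likes": "db-likes",
}

def database_for_table(table: str) -> str:
    return VERTICAL_PARTITIONS[table]
```

With a modulo scheme like this, adding user rows never grows any single table without bound; a real system would also need a shard-to-host mapping so shards can move between machines.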

  9. SCALING OUT

  10. INSTAGRAM STACK Cassandra PostgreSQL Other Django Services memcache RabbitMQ Celery

  11. STORAGE VS. COMPUTING • Storage: needs to be consistent across data centers • Computing: driven by user traffic, on an as-needed basis

  12. SCALE OUT: STORAGE (Cassandra) — user feeds, stories, activities, and other logs • Masterless • Async, low latency • Multiple data center ready • Tunable latency vs. consistency trade-off

  13. SCALE OUT: STORAGE (PostgreSQL) — user, media, friendship, etc. • One master; replicas in each region • Reads are done locally • Writes go cross-region to the master

  14. COMPUTING

  15. [Diagram: two data centers, DC1 and DC2, each running the full stack: Django, memcache, PostgreSQL, RabbitMQ, Celery, Cassandra]

  16. MEMCACHE • Millions of reads/writes per second • Sensitive to network condition • Cross region operation is prohibitive

  17. [Diagram: single data center DC1. User C comments: Django inserts into PostgreSQL and sets memcache. User R reads the feed: Django gets it from memcache]

  18. [Diagram: two data centers. User C's comment is inserted into PostgreSQL in DC1 and replicated to DC2's PostgreSQL; Django in DC1 sets its local memcache, but User R's get in DC2 can still return stale data from DC2's memcache]

  19. [Diagram: same two-data-center flow, now with cache invalidation: the PostgreSQL replication stream invalidates memcache in each region, so the next get refills from the local database]

  20. COUNTERS select count(*) from user_likes_media where media_id=12345; (takes hundreds of milliseconds)

  21. COUNTERS

  22. COUNTER select count from media_likes where media_id=12345; (takes tens of microseconds)
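The two queries above contrast scanning every like row against reading one precomputed counter row. A minimal sketch of such a denormalized counter, using SQLite in memory as a stand-in for the real database (the schema mirrors the slide's table names but is otherwise invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table user_likes_media (user_id int, media_id int)")
conn.execute("create table media_likes (media_id int primary key, count int)")

def add_like(user_id: int, media_id: int) -> None:
    # One transaction: record the like and bump the per-media counter,
    # so the counter never drifts from the like rows.
    with conn:
        conn.execute("insert into user_likes_media values (?, ?)",
                     (user_id, media_id))
        conn.execute(
            "insert into media_likes values (?, 1) "
            "on conflict(media_id) do update set count = count + 1",
            (media_id,))

def like_count(media_id: int) -> int:
    # O(1) point read instead of count(*) over all like rows.
    row = conn.execute("select count from media_likes where media_id = ?",
                       (media_id,)).fetchone()
    return row[0] if row else 0
```

The upsert syntax requires SQLite 3.24+; the point is the shape of the trade: a little extra write work buys a constant-time read.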

  23. Cache invalidated: all Django workers try to access the DB at once (a thundering herd)

  24. MEMCACHE LEASE [Diagram: timeline of two Django workers d1 and d2 against memcache and the DB: d1's lease-get misses and wins the lease, reads from the DB, and lease-sets to fill the cache; d2's lease-get meanwhile waits or uses stale data; d2's next lease-get hits]

  25. INSTAGRAM STACK - MULTI REGION [Diagram: DC1 and DC2 each running Django, memcache, PostgreSQL, RabbitMQ, Celery, Cassandra]

  26. SCALING OUT buys: • Capacity • Reliability • Regional-failure readiness (measured in requests/second)

  27. LOAD TEST [Chart: CPU instructions over the hours of the day for loaded vs. regular Django servers behind the load balancer]

  28. [Chart: user growth outpacing server growth over time]

  29. “Don’t count the servers, make the servers count”

  30. SCALE UP

  31. SCALE UP Use as few CPU instructions as possible Use as few servers as possible

  32. SCALE UP Scale up Use as few CPU instructions as possible Use as few servers as possible

  33. CPU Monitor Analyze Optimize

  34. COLLECT (counting CPU instructions with Linux perf events; perf_event_open(2) has no glibc wrapper, so it is invoked via syscall(2)):
  struct perf_event_attr pe;
  memset(&pe, 0, sizeof(pe));
  pe.type = PERF_TYPE_HARDWARE;
  pe.size = sizeof(pe);
  pe.config = PERF_COUNT_HW_INSTRUCTIONS;
  pe.disabled = 1;
  int fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  <code you want to measure>
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  long long count;
  read(fd, &count, sizeof(long long));

  35. DYNOSTATS [Chart: per-endpoint CPU share over the day: Explore, Feed, Follow]

  36. REGRESSION [Chart: a sudden jump in CPU usage when a regression ships]

  37. GRADUAL REGRESSION [Chart: CPU usage creeping up slowly over time]

  38. [Chart: CPU with the new feature vs. without the new feature]

  39. CPU Monitor Analyze Optimize

  40. PYTHON CPROFILE
  import cProfile, io, pstats
  pr = cProfile.Profile()
  pr.enable()
  # ... do something ...
  pr.disable()
  s = io.StringIO()
  ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
  ps.print_stats()
  print(s.getvalue())

  41. CPU - ANALYZE continuous profiling: generate_profile explore --start <start-time> --duration <minutes>

  42. CPU - ANALYZE [Chart: continuous profiling broken down by caller and callee over the day]

  43. CPU - ANALYZE decorator
  @log_stats
  def get_follows():
      ......
  def follow():
      get_follows()

  @log_stats
  def get_photos():
      ......
  def feed():
      get_photos()
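The slide only shows @log_stats being applied, not its body. A hypothetical implementation (the stats structure and timing granularity are my assumptions) that accumulates per-function call counts and elapsed time, so endpoint cost can be attributed to shared helpers:

```python
import functools
import time
from collections import defaultdict

# Per-function accumulators; a real system would flush these to a
# metrics pipeline instead of keeping them in-process.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def log_stats(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper

@log_stats
def get_photos():
    return ["photo"]

def feed():
    return get_photos()
```

Because the decorator wraps the callee, every caller (feed, follow, ...) is charged automatically without instrumenting each call site.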

  44. [Diagram: feed and follow endpoints passing through log_stats into get_photos and get_follows]

  45. Keeping Demand in Check [Diagram: feed → get_photos, follow → get_follows]

  46. CPU Monitor Analyze Optimize

  47. igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/ s300x300/12345678_1234567890_987654321_a.jpg

  48. igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/ s300x300/12345678_1234567890_987654321_a.jpg igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/ s150x150/12345678_1234567890_987654321_a.jpg igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/ s400x600/12345678_1234567890_987654321_a.jpg igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/ s200x200/12345678_1234567890_987654321_a.jpg

  49. CPU - OPTIMIZE

  50. igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/ s300x300/12345678_1234567890_987654321_a.jpg 150x150 400x600 200x200

  51. CPU - OPTIMIZE C is really faster. Candidate functions to rewrite in Cython or C/C++: • Used extensively • Stable

  52. CPU - CHALLENGE • cProfile is not free • False-positive alerts • Better automation needed

  53. Scale up Use as few CPU instructions as possible Use as few servers as possible

  54. SCALE UP: MEMORY (memory budget/process) × (# of processes) < system memory. Less memory budget per process ===> more processes fit per server, but each worker hits its budget and dies sooner
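A worked example of the slide's inequality, with made-up numbers (the server size and per-worker budgets are assumptions, not Instagram's figures):

```python
# Hypothetical 64 GB server; how many Django workers fit under a
# given per-process memory budget?
SYSTEM_MEMORY_MB = 64 * 1024

def max_processes(budget_mb_per_process: int) -> int:
    return SYSTEM_MEMORY_MB // budget_mb_per_process

# Trimming each worker from 512 MB to 384 MB:
# max_processes(512) -> 128 workers, max_processes(384) -> 170 workers
```

Shaving 128 MB off each worker buys roughly a third more concurrency on the same hardware, which is why per-process memory work (shared memory, dead-code removal, -O) shows up on the next slides.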

  55. LOAD TEST [Chart: CPU instructions over the hours of the day for loaded vs. regular Django servers behind the load balancer]

  56. SCALE UP: MEMORY Where it goes: • Code • Large configuration

  57. SCALE UP: MEMORY • Run in optimized mode (-O) • Use shared memory • NUMA • Remove dead code

  58. SCALE UP: LATENCY • Synchronous processing model ===> worker starvation • Single-service degradation ===> all user experience impacted • Longer latency ===> fewer CPU instructions executed

  59. ASYNC IO [Diagram: a synchronous Django worker fetching Feed then Stories sequentially, vs. fetching Feed, Stories, and Suggested Users concurrently]
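The async-IO idea in the diagram can be sketched with asyncio (illustrative only: Instagram's 2016 Django stack used its own async mechanisms, and these fetch functions are stand-ins for real network calls):

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Stands in for an IO-bound call to a backend service.
    await asyncio.sleep(delay)
    return name

async def render_home():
    # Issue the three IO-bound fetches concurrently instead of
    # sequentially, so total latency is ~max(delays), not their sum,
    # and the worker is not parked on each wait in turn.
    feed, stories, suggested = await asyncio.gather(
        fetch("feed", 0.05),
        fetch("stories", 0.05),
        fetch("suggested_users", 0.05),
    )
    return [feed, stories, suggested]

result = asyncio.run(render_home())
```

With three 50 ms backend calls, the sequential version pays ~150 ms; the concurrent one pays ~50 ms, which directly attacks the "longer latency ===> fewer CPU instructions executed" problem on the previous slide.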

  60. Scale up Use as few CPU instructions as possible Use as few servers as possible

  61. SCALE DEV TEAM

  62. SCALING TEAM 30% of engineers joined in the last 6 months • Bootcampers - 1 week • Hack-A-Month - 4 weeks • Intern - 12 weeks

  63. Save Draft · Story Viewer Ranking · Comment Filtering · First Story Notification · Windows App · Video View Notification · Self-harm Prevention

  64. “Which server? Will I lock up the DB? New table or new column? Will I bring down Instagram? Should I cache it? What index?”

  65. WHAT WE WANT • Automatically handle caching • Define relations, don't worry about implementations • Self-service for product engineers • Infra focuses on scaling this service

  66. [Diagram: the social graph in TAO: USER1 posted media; USER2 and USER3 like it; posted/posted-by and likes/liked-by edges]

  67. SCALE DEV - END OF POSTGRES

  68. SHIPPING LOVE >120 engineers committed code last month; 60-80 daily diffs

  69. RELEASE • Master, no branch • All features developed on master gated by configuration • No branch integration overhead • Continuous integration • No surprises • Iterate fast, collaborate easily • Fast bisect and revert
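The "all features developed on master gated by configuration" point can be sketched as a toy feature gate (the gate names, rollout mechanics, and user-id hashing are all invented for illustration):

```python
# New-feature code ships dark on master; a config value ramps it up.
GATES = {
    "save_draft": {"enabled": True, "rollout_pct": 10},
    "windows_app_upsell": {"enabled": False, "rollout_pct": 0},
}

def gate_allows(feature: str, user_id: int) -> bool:
    gate = GATES.get(feature)
    if not gate or not gate["enabled"]:
        return False
    # Deterministic per-user bucketing: the same user always lands
    # in the same rollout bucket.
    return user_id % 100 < gate["rollout_pct"]

def render_composer(user_id: int) -> str:
    if gate_allows("save_draft", user_id):
        return "composer+draft"
    return "composer"
```

Ramping a feature is then a config change, not a deploy, and turning it off again is an instant revert, which is what makes one-branch, many-rollouts-per-day development workable.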

  70. Once a week? Once a day? Once a diff!! 40-50 rollouts per day

  71. CHECKS AND BALANCES Code review + unittest → code accepted/committed → dark launch → load test → canary → to the wild

  72. TAKEAWAYS Scaling is a continuous effort · Scaling is multi-dimensional · Scaling is everybody's responsibility

  73. QUESTIONS?
