scaling instagram infra

SCALING INSTAGRAM INFRA Lisa Guo March 7th, 2017 lguo@instagram.com


  1. SCALING INSTAGRAM INFRA Lisa Guo, March 7th, 2017 lguo@instagram.com

  2. INSTAGRAM HISTORY 2010 | 2012/4/9: joined Facebook | 2014/1 | 2017: 600M users/month

  3. INSTAGRAM EVERYDAY 400 Million Users 4+ Billion likes 100 Million photo/video uploads Top account: 110 Million followers

  4. SCALING MEANS Scale up Scale out Scale dev team

  5. SCALE OUT

  6. SCALE OUT

  7. SCALE OUT

  8. “Let’s all pray that Amazon gets everything sorted out in short order.”

  9. INSTAGRAM STACK Cassandra PostgreSQL Other Django Services memcache RabbitMQ Celery

  10. STORAGE VS. COMPUTING • Storage: needs to be consistent across data centers • Computing: driven by user traffic, on an as-needed basis

  11. SCALE OUT: STORAGE user, media, friendship etc

  12. SCALE OUT: STORAGE user, media, friendship etc Read Django Replica Write Master Replica

  13. SCALE OUT: STORAGE user, media, friendship etc Read Django Replica DC1 Write DC2 Master DC3 Replica
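
The read/write split on the slides above (reads go to a replica, writes go to the master) could be expressed with a Django database router. This is a minimal sketch, not the configuration used at Instagram: the aliases `master` and `replica` and the class name are hypothetical and would have to match entries in Django's `DATABASES` setting.

```python
class PrimaryReplicaRouter:
    """Route reads to a local replica and writes to the master.

    The aliases "master" and "replica" are assumptions for this sketch.
    """

    def db_for_read(self, model, **hints):
        # Reads are served by the replica in the local data center.
        return "replica"

    def db_for_write(self, model, **hints):
        # Writes always go to the single master, possibly in another DC.
        return "master"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases hold the same data, so relations are always allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Schema changes are applied on the master and replicated from there.
        return db == "master"
```

A router like this is enabled by listing its dotted path in the `DATABASE_ROUTERS` setting.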

  14. SCALE OUT: STORAGE user feeds, activities etc Replica Write - 2 Read - 1 Replica Replica

  15. SCALE OUT: STORAGE user feeds, activities etc Replica Write - 2 Read - 1 Replica Replica
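
Reading the slide above as three replicas with writes acknowledged by 2 and reads answered by 1, a read set does not always overlap the latest write set, so a read may briefly return older data. The sketch below applies the standard quorum-overlap rule (R + W > N); the function name is illustrative.

```python
def read_sees_latest_write(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > n_replicas


# Numbers from the slide: 3 replicas, write waits for 2, read waits for 1.
print(read_sees_latest_write(3, 2, 1))  # False: reads are eventually consistent
# Raising the read level to 2 would force the sets to overlap.
print(read_sees_latest_write(3, 2, 2))  # True
```

The lower read level trades strict consistency for cheaper, faster reads, which suits data such as feeds and activities.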

  16. COMPUTING

  17. [Diagram: DC1 and DC2, each running Django, memcache, PostgreSQL, Cassandra, RabbitMQ, and Celery]

  18. MEMCACHE • High performance key-value store in memory • Millions of reads/writes per second • Sensitive to network condition • Cross region operation is prohibitive No global consistency

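  19. [Diagram: DC1. User C posts a comment: Django -> insert into PostgreSQL, set in memcache. User R loads the feed: Django -> get from memcache.]

  20. [Diagram: DC1 and DC2, each with Django, memcache, and PostgreSQL. User C posts a comment: Django -> insert into PostgreSQL, set in memcache. PostgreSQL -> replication -> PostgreSQL in the other data center. User R loads the feed: Django -> get from memcache.]

  21. [Diagram: same as the previous slide, with a "Cache invalidate" step added from PostgreSQL to memcache in each data center.]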
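
The stale-read problem and the invalidation fix on the slides above can be imitated with plain dictionaries. This is a toy model under stated assumptions: the `DataCenter` class and `post_comment` function are invented for illustration, the originating region invalidates rather than sets its cache for simplicity, and replication is applied synchronously.

```python
class DataCenter:
    """One region: a PostgreSQL stand-in (dict) plus a local memcache stand-in."""

    def __init__(self, name):
        self.name = name
        self.db = {}      # media_id -> list of comments
        self.cache = {}   # media_id -> cached list of comments

    def read_comments(self, media_id):
        # Django read path: try memcache first, fall back to the local DB.
        if media_id not in self.cache:
            self.cache[media_id] = list(self.db.get(media_id, []))
        return self.cache[media_id]

    def apply_row(self, media_id, comment, invalidate):
        # Called for a local insert or for a row arriving via replication.
        self.db.setdefault(media_id, []).append(comment)
        if invalidate:
            # Drop the cached value so the next read refills from the DB.
            self.cache.pop(media_id, None)


def post_comment(origin, others, media_id, comment, invalidate):
    origin.apply_row(media_id, comment, invalidate=True)
    for dc in others:  # replication to every other region
        dc.apply_row(media_id, comment, invalidate)


dc1, dc2 = DataCenter("DC1"), DataCenter("DC2")
dc2.read_comments(1)                        # User R warms DC2's cache (empty)
post_comment(dc1, [dc2], 1, "nice!", invalidate=False)
stale = list(dc2.read_comments(1))          # DC2 still serves the old cached value
post_comment(dc1, [dc2], 1, "wow", invalidate=True)
fresh = list(dc2.read_comments(1))          # invalidation forces a DB read
print(stale, fresh)  # [] ['nice!', 'wow']
```

Tying invalidation to the replicated row, rather than to the Django process that wrote it, is what lets a remote region's cache learn about the change.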

  22. COUNTERS select count(*) from user_likes_media where media_id=12345; (100s of ms)

  23. COUNTER select count from media_likes where media_id=12345; (10s of µs)
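
The two queries above differ in that the second reads a precomputed (denormalized) count instead of scanning like rows. A minimal sketch of maintaining such a counter follows, using SQLite from the Python standard library as a stand-in for PostgreSQL. The table and column names come from the slides; the `like` helper and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_likes_media (user_id INTEGER, media_id INTEGER,
                                   PRIMARY KEY (user_id, media_id));
    CREATE TABLE media_likes (media_id INTEGER PRIMARY KEY, count INTEGER NOT NULL);
""")


def like(user_id, media_id):
    """Record a like and bump the denormalized counter in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO user_likes_media VALUES (?, ?)", (user_id, media_id))
        updated = conn.execute(
            "UPDATE media_likes SET count = count + 1 WHERE media_id = ?", (media_id,)
        ).rowcount
        if updated == 0:  # first like for this media
            conn.execute("INSERT INTO media_likes VALUES (?, 1)", (media_id,))


for user in (1, 2, 3):
    like(user, 12345)

# Slow path: counts every like row for the media.
slow = conn.execute(
    "SELECT count(*) FROM user_likes_media WHERE media_id = 12345").fetchone()[0]
# Fast path: single-row primary-key lookup.
fast = conn.execute(
    "SELECT count FROM media_likes WHERE media_id = 12345").fetchone()[0]
print(slow, fast)  # 3 3
```

The counter costs an extra write per like, which is usually acceptable when reads of the count far outnumber likes.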

  24. Cache invalidated All djangos try to access DB

  25. MEMCACHE LEASE [Sequence diagram over time: Django d1 and d2, memcache, db] d1: lease-get -> fill (read from DB) -> lease-set. d2: lease-get -> wait or use stale -> lease-get -> hit.
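
The lease sequence above can be sketched with a toy in-process cache. This is not the real memcache protocol or client API: the `LeaseCache` class and its method names are invented to mirror the labels on the slide, and the "wait or use stale" branch is reduced to returning no token.

```python
import itertools


class LeaseCache:
    """Toy cache with leases; only one caller at a time may fill a missing key."""

    def __init__(self):
        self._values = {}
        self._leases = {}                 # key -> outstanding lease token
        self._tokens = itertools.count(1)

    def lease_get(self, key):
        """Return (value, token). A token means: you fill the key from the DB."""
        if key in self._values:
            return self._values[key], None          # hit
        if key in self._leases:
            return None, None                       # someone else is filling: wait
        token = next(self._tokens)
        self._leases[key] = token
        return None, token                          # miss: caller holds the lease

    def lease_set(self, key, value, token):
        """Store the value only if the caller still holds the lease."""
        if self._leases.get(key) != token:
            return False
        del self._leases[key]
        self._values[key] = value
        return True

    def invalidate(self, key):
        self._values.pop(key, None)
        self._leases.pop(key, None)                 # outstanding fills are rejected


cache, db_reads = LeaseCache(), 0
v1, t1 = cache.lease_get("likes:12345")   # d1: miss, gets the lease
v2, t2 = cache.lease_get("likes:12345")   # d2: no lease, waits or uses stale
db_reads += 1                             # only d1 reads from the DB
cache.lease_set("likes:12345", 42, t1)    # d1 fills the cache
v3, t3 = cache.lease_get("likes:12345")   # d2 retries: hit
print(db_reads, v3)  # 1 42
```

Because only the lease holder goes to the database, an invalidated hot key produces one DB read instead of one per Django process.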

  26. INSTAGRAM STACK - MULTI REGION [Diagram: DC1 and DC2, each running Django, memcache, PostgreSQL, Cassandra, RabbitMQ, and Celery]

  27. SCALING OUT • Capacity • Reliability • Regional failure ready

  28. SCALING OUT - CHALLENGES, OPPORTUNITIES • Beyond North America • More localized social network • Direct messaging • Live streaming

  29. [Chart: User growth vs. Server growth]

  30. “Don’t count the servers, make the servers count”

  31. SCALE UP

  32. SCALE UP Use as few CPU instructions as possible Use as few servers as possible

  33. SCALE UP Use as few CPU instructions as possible Use as few servers as possible

  34. CPU Monitor Analyze Optimize

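  35. COLLECT
        struct perf_event_attr pe;
        memset(&pe, 0, sizeof(pe));           /* zero the fields not set below */
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(pe);
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.disabled = 1;                      /* start stopped; enabled by ioctl below */
        /* perf_event_open has no glibc wrapper, so it is invoked through syscall(2) */
        fd = syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        <code you want to measure>
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        read(fd, &count, sizeof(long long));  /* count: instructions retired */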

  36. DYNOSTATS [Chart: series for Explore, Feed, and Follow]

  37. REGRESSION [Chart]

  38. With new feature Without new feature
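
One way to turn the with/without comparison above into an automatic check is to compare the mean CPU instructions per request of the two groups. This is a sketch only: the sample values are made up for illustration, and the function name and threshold are hypothetical.

```python
from statistics import mean


def regression_pct(with_feature, without_feature):
    """Percent change in mean CPU instructions per request."""
    baseline = mean(without_feature)
    return 100.0 * (mean(with_feature) - baseline) / baseline


# Made-up sample values (instructions per request), for illustration only.
control = [100.0, 102.0, 98.0]
treated = [110.0, 112.0, 108.0]

change = regression_pct(treated, control)
THRESHOLD_PCT = 5.0  # hypothetical alerting threshold
print(round(change, 1), change > THRESHOLD_PCT)  # 10.0 True
```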

  39. CPU Monitor Analyze Optimize

  40. PYTHON CPROFILE
        import cProfile, io, pstats
        pr = cProfile.Profile()
        pr.enable()
        # ... do something ...
        pr.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())

  41. CPU - ANALYZE continuous profiling generate_profile explore --start <start-time> --duration <minutes>

  42. CPU - ANALYZE continuous profiling [Chart: one Caller series and two Callee series]

  43. CPU Monitor Analyze Optimize

  44. igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

  45. igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg
      igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s150x150/12345678_1234567890_987654321_a.jpg
      igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s400x600/12345678_1234567890_987654321_a.jpg
      igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s200x200/12345678_1234567890_987654321_a.jpg

  46. CPU - OPTIMIZE

  47. igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg 150x150 400x600 200x200

  48. CPU - OPTIMIZE C is really faster • Candidate functions: • Used extensively • Cython or C/C++ • Stable

  49. SCALE UP Use as few CPU instructions as possible Use as few servers as possible

  50. ONE WEB SERVER [Diagram: Process 1 through Process N, each with its own Private Memory, on top of Shared Memory]

  51. SCALE UP: MEMORY Reduce code • Run in optimized mode (-O) • Remove dead code

  52. SCALE UP: MEMORY Share more • Move configuration into shared memory • Disable garbage collection
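
Disabling garbage collection helps sharing because CPython's cyclic collector writes to per-object bookkeeping when it runs, which dirties pages that forked workers would otherwise keep sharing with the master through copy-on-write. The sketch below assumes a pre-fork web server; the hook name `prepare_for_fork` is hypothetical, and where it is called depends on the server in use.

```python
import gc


def prepare_for_fork():
    """Call in the master process after loading code and before forking workers."""
    gc.collect()      # clean up garbage created during startup
    gc.disable()      # stop automatic cyclic collection in this process and its forks
    if hasattr(gc, "freeze"):
        # Python 3.7+: move existing objects to a permanent generation so a
        # later manual gc.collect() will not touch (and dirty) them.
        gc.freeze()


prepare_for_fork()
print(gc.isenabled())  # False
```

Reference counting still frees most objects; only reference cycles go uncollected, so memory growth in long-lived workers needs to be watched.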

  53. SCALE UP: MEMORY 20+% capacity increase

  54. SCALE UP: NETWORK LATENCY Synchronous processing model with long latency ===> Worker starvation and fewer CPU instructions executed

  55. ASYNC IO Stories Feed Django Stories Feed Suggested Users
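
The benefit of async IO for a request that depends on several backends (Stories, Feed, Suggested Users on the slide above) can be sketched with `asyncio`. The `fetch` coroutine, the service names as strings, and the latencies are stand-ins; no real service calls are made.

```python
import asyncio
import time


async def fetch(service, latency_s):
    """Stand-in for a network call to a backend service (names are illustrative)."""
    await asyncio.sleep(latency_s)   # the worker is free to do other work here
    return f"{service}: ok"


async def render_home():
    # Issue the three backend calls concurrently instead of one after another.
    return await asyncio.gather(
        fetch("stories", 0.05),
        fetch("feed", 0.05),
        fetch("suggested_users", 0.05),
    )


start = time.perf_counter()
results = asyncio.run(render_home())
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.3f}s")  # close to one call's latency, not the sum
```

With a synchronous model the same request would hold a worker for roughly the sum of the three latencies while executing almost no CPU instructions.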

  56. Scale up Use as few CPU instructions as possible Use as few servers as possible

  57. SCALE UP: CHALLENGES, OPPORTUNITIES • Faster python run-time • Async web framework • Better memory analysis • etc etc

  58. SCALE DEV TEAM

  59. SCALING TEAM 30% engineers joined in last 6 months Intern - 12 weeks Hack-A-Month - 4 weeks Bootcampers - 1 week

  60. Saved Posts Multiple media in one post Comment Filtering Instagram Live Instagram First Story Stories Notification Windows App Video View Notification Self-harm Prevention

  61. Which server? Will I lock up DB? New Table or New Column? Will I bring down Instagram? Should I cache it? What Index?

  62. WHAT WE WANT • Automatically handle cache • Define relations, not worry about implementations • Self service by product engineers • Infra focuses on scale

  63. TAO [Graph: USER1, USER2, and USER3 connected to media by "likes" / "liked by" and "posted" / "posted by" edges]

  64. Saved Posts Multiple media in one post Comment Filtering Instagram Live Instagram First Story Stories Notification Windows App Video View Notification Self-harm Prevention

  65. SOURCE CONTROL Live Master Direct

  66. SOURCE CONTROL With branches • Context switching • Code sync/merge overhead • Surprises • Refactor/major upgrade • Performance tracking harder

  67. SOURCE CONTROL Live Master Direct

  68. SOURCE CONTROL Master Live Direct

  69. SOURCE CONTROL No branches • Continuous integration • Collaborate easily • Fast bisect and revert • Continuous performance monitoring

  70. FEATURE LAUNCH Engineers -> Dogfooders -> Employees -> Some demographics -> World

  71. FEATURE LOAD TEST

  72. Once a week? Once a day? Every diff!! 40-60 rollouts per day

  73. CHECKS AND BALANCES Code review -> unittest -> Code accepted -> committed -> Canary -> To the Wild

  74. SCALING MEANS Scale up Scale out Scale dev team

  75. TAKEAWAYS Scaling is a continuous effort • Scaling is multi-dimensional • Scaling is everybody’s responsibility

  76. QUESTIONS?
