The Hardware & Software Implications of Microservices and How Big Data Can Help
Christina Delimitrou, Cornell University
with Yu Gan, Yanqi Zhang, Shuang Chen, Neeraj Kulkarni, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Ankitha Shetty, Nayan Katarki, Brett Clancy, Chris Colen, Dailun Cheng, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Meghna Pancholi, Siyuan Hu
ASBD Workshop, June 2nd, 2018
Executive Summary
• Shift from monoliths to microservices:
  ◦ Modularity, specialization, simplicity, accelerated development
  ◦ Changes assumptions about datacenter server design
  ◦ Complicates scheduling and resource management
  ◦ Amplifies tail-at-scale effects
• Revisit architectural design decisions for microservices
• Highlight management challenges of microservices
• Motivate the need for data-driven approaches for systems whose scale & complexity keep increasing
From Monoliths to Microservices
Motivation
• Advantages of microservices:
  ◦ Ease & speed of code development & deployment
  ◦ Security, error isolation
  ◦ PL/framework heterogeneity
• Challenges of microservices:
  ◦ Change server design assumptions
  ◦ Complicate resource management → dependencies
  ◦ Amplify tail-at-scale effects
  ◦ More sensitive to performance unpredictability
  ◦ No representative end-to-end apps with microservices
An End-to-End Suite for Cloud & IoT Microservices
• 4 end-to-end applications using popular open-source microservices → ~30-40 microservices per app
  ◦ Social Network
  ◦ Movie Reviewing/Renting/Streaming
  ◦ E-commerce
  ◦ Drone control service
• Programming languages and frameworks:
  ◦ node.js, Python, C/C++, Java/Javascript, Scala, PHP, and Go
  ◦ Nginx, memcached, MongoDB, CockroachDB, Mahout, Xapian
  ◦ Apache Thrift RPC, RESTful APIs
  ◦ Docker containers
  ◦ Lightweight RPC-level distributed tracing
Movie Streaming
[Figure: Movie Streaming architecture. Client → Load Balancer (http) → nginx front end (http) → php-fpm (fastcgi). Microservices shown: text, uniqueID, movieID, AssignRating, ComposeReview, Login, UpdateUser, ReviewStorage, UserReviewDB, MovieReviewDB, UpdateMovie, with a Compose phase and a Store phase, backed by memcached and mongoDB instances. All arrows are Thrift RPCs.]
Movie Streaming
• Browse movie info (movie plot, photos, videos, cast, stats, etc.)
• ML widgets:
  ◦ Recommender for movies to watch
  ◦ Recommender for ads
• User authentication/Payment
• Search:
  ◦ Xapian: search movie DB
• Analytics:
  ◦ Mahout: user analytics based on input stored in HDFS
  ◦ Spark MLlib: in-memory ML analytics
Architectural Implications [CAL’18]
• Big vs. small servers:
  ◦ Power management using RAPL
  ◦ More pressure on single-thread performance, low tail latency
Architectural Implications
[Figure: tail latency (msec or sec, depending on panel) vs. QPS on Xeon and ThunderX for Movie Streaming, Social Network, E-commerce, and IoT-Cloud]
• Big vs. small servers:
  ◦ Power management using RAPL
  ◦ More pressure on single-thread performance, low tail latency
  ◦ Low-power SoCs, e.g., Cavium ThunderX2
  ◦ Similar latency, but earlier saturation
Architectural Implications
[Figure: Movie Streaming, tail latency (msec) split into application processing and TCP processing (RPCs) for ProcText, nginx, AssignR, MovieID, ReviewID, Compose, RevStorage, UserReview, MovReview, memcached, mongoDB, End-to-End, and Monolith]
• Computation:Communication ratio:
  ◦ Monolithic service → 70:30 @ high load
  ◦ Microservices → 50:50 @ high load
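To make the computation:communication ratio concrete, here is a minimal Python sketch that derives it from the two time components in the latency breakdown (application processing and TCP processing). The function name and the per-request timings are hypothetical, chosen only so the output matches the 70:30 and 50:50 ratios quoted on the slide; they are not measurements from the talk.

```python
def comp_comm_ratio(app_ms, tcp_ms):
    """Return (computation %, communication %) of total processing time."""
    total = app_ms + tcp_ms
    if total <= 0:
        raise ValueError("total processing time must be positive")
    return 100.0 * app_ms / total, 100.0 * tcp_ms / total


# Hypothetical per-request timings in msec (assumed, not measured).
monolith = comp_comm_ratio(app_ms=7.0, tcp_ms=3.0)
microservices = comp_comm_ratio(app_ms=5.0, tcp_ms=5.0)
print("monolith      %.0f:%.0f" % monolith)
print("microservices %.0f:%.0f" % microservices)
```

With half of the time going to RPC processing, speeding up only the application logic can at best halve the latency, which is one way to read the case for RPC acceleration.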
Architectural Implications
[Figure: server block diagram with labels DRAM, CPU (×2), QPI, PCIe Gen3, Virtex7, QSFP, NIC, 10Gbps]
• Computation:Communication ratio:
  ◦ Monolithic service → 70:30 @ high load
  ◦ Microservices → 50:50 @ high load
  ◦ RPC/REST acceleration → NIC offloads, FPGAs
Architectural Implications
[Figure: L1i MPKI per microservice. Social Network: nginx, text, image, msgID, tagUser, urlShorten, compose, video, msgStore, wrTimeline, wrGraph, mem$, mongodb. E-Commerce: frontend, login, orders, search, cart, wishlist, catalogue, recommend, shipping, payment, invoice, qMaster, mem$, mongodb.]
• L1-i cache pressure:
  ◦ Monoliths → Large code footprints → L1i thrashing
  ◦ Microservices → Small footprint/microservice
    ▪ Assuming dedicated cores
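For readers unfamiliar with the metric on the figure's y-axis, MPKI is misses per kilo-instruction. A minimal sketch of the arithmetic follows; the counter values are hypothetical and are not read from the figure.

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction: misses / (instructions / 1000)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


# Hypothetical hardware counter readings for one microservice (assumed).
example = mpki(misses=2_500_000, instructions=100_000_000)
print(example)
```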
End-to-End Latency Breakdown
• Post-rightsizing (resource ratios to avoid glaring bottlenecks)
• Bottlenecks shift with load
• Need online, dynamic decisions
Resource Management Implications
[Figure: panels labeled Netflix, Twitter, Amazon, and Movie Streaming]
• Challenges of microservices:
  ◦ Change server design assumptions
  ◦ Dependencies complicate resource management
Dependencies & Backpressure
[Figure: http1 requests, a pool of nginx instances, read <k,v>, mem$]
Dependencies & Backpressure
[Figure: http1 and http2 requests, pools of nginx and mem$ instances, read <k,v>, TX and RX queues]
• Traditional techniques like autoscaling may help/penalize the wrong microservice
• Dependencies change at runtime → difficult to infer impact
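One way to see why autoscaling can target the wrong tier is Little's law (average in-flight requests = arrival rate × time in system). The sketch below assumes nginx workers block synchronously on memcached; the function name and all numbers are hypothetical and only illustrate the backpressure effect.

```python
def busy_workers(arrival_rate, own_time_s, downstream_time_s):
    """Average number of occupied front-end workers, by Little's law.

    Assumes each worker blocks until the downstream call returns.
    """
    return arrival_rate * (own_time_s + downstream_time_s)


# Hypothetical numbers: 1000 req/s, nginx does 1 ms of its own work.
healthy = busy_workers(1000.0, own_time_s=0.001, downstream_time_s=0.001)
# memcached slows from 1 ms to 9 ms; nginx's own work is unchanged.
backpressured = busy_workers(1000.0, own_time_s=0.001, downstream_time_s=0.009)
print(healthy, backpressured)
```

Under these assumptions nginx's worker occupancy rises fivefold although the slowdown is in memcached, so a scaler keyed on nginx occupancy would add nginx instances without removing the bottleneck.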
Determine Per-Tier QoS
[Figure: nginx → mem$, with QoS 1 and QoS 2 marked]
• Queueing models
• Queueing network simulation
  ◦ Complex microservices graphs, blocking, cyclic dependencies, etc.
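As a rough illustration of the queueing-model approach, the sketch below treats each tier as an independent M/M/1 queue in a two-tier chain and sums the mean response times. The M/M/1 assumption, the tier service rates, and the function names are assumptions made for this example; the slide does not specify the model, and blocking or cyclic dependencies would break it.

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # the tier is saturated
    return 1.0 / (service_rate - arrival_rate)


def per_tier_latency(arrival_rate, service_rates):
    """Mean latency of each tier, assuming every request visits every tier."""
    return {tier: mm1_latency(arrival_rate, mu)
            for tier, mu in service_rates.items()}


# Hypothetical service rates in requests/sec (assumed, not measured).
tiers = {"nginx": 2000.0, "memcached": 5000.0}
lat = per_tier_latency(1000.0, tiers)
end_to_end = sum(lat.values())
print(lat, end_to_end)
```

Comparing `end_to_end` against the end-to-end target shows how much latency budget each tier can be given; a simulator is needed once the graph stops being a simple chain.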
Power Management for Microservices
• Two types of latency slack:
  ◦ Microservices off the critical path
  ◦ Microservices on the critical path with relaxed QoS
[Figure: three panels plotting frequency (up to 2.2GHz), end-to-end latency against the QoS target, and utilization (0 to 100) over time]
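The first kind of slack can be sketched as a critical-path computation over the call graph: a microservice's slack is the critical-path latency minus the longest path that passes through it. The sketch assumes a DAG in which a caller invokes its callees in parallel; the graph, latencies, and function name are hypothetical.

```python
from functools import lru_cache


def latency_slack(latency_ms, children, root):
    """Slack per microservice = critical path - longest path through it."""
    parents = {n: [] for n in latency_ms}
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].append(parent)

    @lru_cache(maxsize=None)
    def down(n):
        # Longest latency from n to any leaf, including n itself.
        return latency_ms[n] + max(
            (down(c) for c in children.get(n, ())), default=0.0)

    @lru_cache(maxsize=None)
    def up(n):
        # Longest latency from the root to n, excluding n itself.
        return max((up(p) + latency_ms[p] for p in parents[n]), default=0.0)

    critical = down(root)
    return {n: critical - (up(n) + down(n)) for n in latency_ms}


# Hypothetical call graph and per-service latencies in msec (assumed).
latency_ms = {"frontend": 1.0, "compose": 4.0, "storage": 3.0, "login": 2.0}
children = {"frontend": ["compose", "login"], "compose": ["storage"]}
slack = latency_slack(latency_ms, children, "frontend")
print(slack)
```

In this example only `login` is off the critical path, so it is the one candidate for a lower frequency without affecting end-to-end latency.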
Scalability Challenges
• Determine per-tier QoS for 1000s of microservices → intractable
• Put visceral graph here…
Tail at Scale Effects
• Microservices add an extra dimension to tail-at-scale effects
  ◦ A single slow microservice affects end-to-end latency
  ◦ Much more pressure on performance predictability & availability
  ◦ Monitoring at the edge
• Determining per-tier QoS for 10000s of microservices is intractable
  ◦ Scalable data-driven approach
• Need for online performance debugging
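The amplification can be quantified with a standard back-of-the-envelope calculation: if each microservice is slow with probability p, independently, a request that touches n of them hits at least one slow tier with probability 1 - (1 - p)^n. The independence assumption and the function name are ours, not the slide's.

```python
def prob_hits_slow_tier(p_slow, n_tiers):
    """P(at least one of n independent tiers is slow) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_slow) ** n_tiers


# Each tier is slow 1% of the time (hypothetical figure).
for n in (1, 10, 100):
    print(n, round(prob_hits_slow_tier(0.01, n), 3))
```

With a 1% per-tier slow rate, a request that fans out to 100 tiers is more likely than not to hit a slow one, which is why per-tier predictability matters more as the graph grows.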
Proactive Performance Debugging
• Dependencies between microservices → propagate & amplify QoS violations
  ◦ Finding the culprit of a QoS violation is difficult
  ◦ Post-QoS violation, returning to nominal operation is hard
• Anticipating QoS violations & identifying culprits
• Seer: Data-driven Performance Debugging for Microservices
  ◦ Combines lightweight RPC-level distributed tracing with hardware monitoring
  ◦ Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action
Performance Implications
[Figure: per-microservice Queue, CPU, Mem, Net, and Disk metrics]
Seer: Data-Driven Performance Debugging [HotCloud’18]
• Leverage the massive amount of traces collected over time
  1. Apply online, practical data mining techniques that identify the culprit of an upcoming QoS violation
  2. Use per-server hardware monitoring to determine the cause of the QoS violation
  3. Take corrective action to prevent the QoS violation from occurring
• Need to predict from 100s of msec to a few sec into the future
Tracing Framework
• RPC-level tracing […]
• Based on Apache Thrift
• Timestamp start-end for each microservice
• Store in centralized DB (Cassandra)
• Record all requests → No sampling
• Overhead: <0.1% in throughput and <0.2% in tail latency […]
[Figure: Client (http) → uService K → uService K+1, each with TCP proc RX, App proc, and TCP proc TX stages and a zTracer process recording RPC time TX/RX; a Tracing Collector stores traces in Cassandra, which feeds a QueryEngine and a WebUI showing Gantt charts of latency across microservices]
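The start-end timestamping idea can be sketched in a few lines of Python. This is not the zTracer or Apache Thrift API: `TraceCollector` and its methods are hypothetical, and an in-memory list stands in for the centralized Cassandra store.

```python
import time
from contextlib import contextmanager


class TraceCollector:
    """Hypothetical collector; a list stands in for the central DB."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, request_id, microservice):
        # Timestamp the start and end of one RPC for one microservice.
        start = time.perf_counter()
        try:
            yield
        finally:
            end = time.perf_counter()
            self.spans.append({"request_id": request_id,
                               "microservice": microservice,
                               "start": start, "end": end})

    def latency_by_microservice(self, request_id):
        return {s["microservice"]: s["end"] - s["start"]
                for s in self.spans if s["request_id"] == request_id}


collector = TraceCollector()
with collector.span("req-1", "nginx"):          # outer RPC
    with collector.span("req-1", "memcached"):  # nested downstream RPC
        time.sleep(0.001)
latencies = collector.latency_by_microservice("req-1")
print(latencies)
```

Every request is recorded, matching the no-sampling choice on the slide; the nested spans carry enough information to draw a Gantt chart per request.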
Deep Learning to the Rescue
• Why?
  ◦ Architecture-agnostic
  ◦ Adjusts to changes in dependencies over time
  ◦ High accuracy, good scalability
  ◦ Inference within the required window
DNN Configuration
Input signal:
• Container utilization
• Latency
• Queue depth
Output signal:
• Which microservice will cause a QoS violation in the near future?
DNN Configuration
• Training once: slow (hours to days)
  ◦ Across load levels, load distributions, request types
  ◦ Distributed queue traces, annotated with QoS violations
  ◦ Weight/bias inference with SGD
  ◦ Retraining in the background
• Inference continuously: streaming trace data
93% accuracy in signaling upcoming QoS violations
91% accuracy in attributing QoS violation to correct microservice
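To illustrate what a single SGD update on an annotated trace sample looks like, here is a minimal pure-Python sketch. It uses a single-layer softmax classifier as a deliberately simplified stand-in; it is not Seer's network, and the feature layout, microservice names, and values are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(weights, x):
    """Probability that each microservice is the upcoming culprit."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def cross_entropy(weights, x, label):
    return -math.log(predict(weights, x)[label])


def sgd_step(weights, x, label, lr=0.1):
    """One in-place SGD update on a single annotated sample."""
    probs = predict(weights, x)
    for k, row in enumerate(weights):
        grad_k = probs[k] - (1.0 if k == label else 0.0)
        for j, xj in enumerate(x):
            row[j] -= lr * grad_k * xj


# Hypothetical sample: (utilization, latency, queue depth), normalized,
# for nginx, memcached, mongodb; annotated culprit is memcached (index 1).
x = [0.3, 0.2, 0.1, 0.9, 0.8, 0.95, 0.4, 0.3, 0.2]
label = 1
weights = [[0.0] * len(x) for _ in range(3)]
before = cross_entropy(weights, x, label)
sgd_step(weights, x, label)
after = cross_entropy(weights, x, label)
print(before, after)
```

The update lowers the loss on the sample and shifts probability toward the annotated culprit; training repeats this over many annotated traces, while inference only calls `predict` on streaming data.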
DNN Configuration
Accuracy stable or increasing with cluster size
• Challenges:
  ◦ In large clusters, inference is too slow to prevent QoS violations
  ◦ Offload to TPUs: 10-100x improvement; 10ms for 90th-percentile inference
  ◦ Fast enough for most corrective actions to take effect (net bw partitioning, RAPL, cache partitioning, scale-up/out, etc.)