Seer: Leveraging Big Data to Navigate the Increasing Complexity of Cloud Debugging

Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou
Cornell University
HotCloud '18, July 9, 2018
Executive Summary

- Microservices put more pressure on performance predictability
  - Microservice dependencies propagate & amplify QoS violations
  - Finding the culprit of a QoS violation is difficult
  - Once a QoS violation occurs, returning to nominal operation is hard
- Goal: anticipate QoS violations & identify culprits ahead of time
- Seer: data-driven performance debugging for microservices
  - Combines lightweight RPC-level distributed tracing with hardware monitoring
  - Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action
From Monoliths to Microservices
Motivation

- Advantages of microservices:
  - Ease & speed of code development & deployment
  - Security, error isolation
  - PL/framework heterogeneity
- Challenges of microservices:
  - Change server design assumptions
  - Complicate resource management due to dependencies
  - Amplify tail-at-scale effects
  - More sensitive to performance unpredictability
  - No representative end-to-end apps with microservices
An End-to-End Suite for Cloud & IoT Microservices

- 4 end-to-end applications using popular open-source microservices (~30-40 microservices per app):
  - Social Network
  - Movie Reviewing/Renting/Streaming
  - E-commerce
  - Drone control service
- Programming languages and frameworks:
  - Node.js, Python, C/C++, Java/JavaScript, Scala, PHP, and Go
  - NGINX, memcached, MongoDB, CockroachDB, Mahout, Xapian
  - Apache Thrift RPC, RESTful APIs
  - Docker containers
  - Lightweight RPC-level distributed tracing
Resource Management Implications

[Figure: microservice dependency graphs of Netflix, Twitter, Amazon, and the Movie Streaming app]

- Challenges of microservices:
  - Dependencies complicate resource management
  - Dependencies change over time and are difficult for users to express
  - Amplify tail-at-scale effects
The Need for Proactive Performance Debugging

- Detecting QoS violations after they occur:
  - Unpredictable performance propagates through the system
  - Long time until return to nominal operation
  - Does not scale
Performance Implications

[Figure: per-microservice queue depth, CPU, memory, network, and disk utilization]
Seer: Data-Driven Performance Debugging

- Leverage the massive amount of traces collected over time
- Apply online, practical data mining techniques that:
  1. Identify the culprit of an upcoming QoS violation
  2. Use per-server hardware monitoring to determine the cause of the QoS violation
  3. Take corrective action to prevent the QoS violation from occurring
- Need to predict 100s of msec to a few sec into the future
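A minimal Python sketch of this detect, diagnose, act loop is shown below; `tracer`, `model`, `monitor`, and `actuator` are hypothetical stand-ins for the components the slides describe, not part of Seer's actual codebase.

```python
# Minimal sketch of Seer's three steps (hypothetical API names).
import time

PREDICTION_HORIZON_S = 0.5   # predict 100s of msec to a few sec ahead

def seer_loop(tracer, model, monitor, actuator):
    """Continuously scan streaming traces for upcoming QoS violations."""
    while True:
        window = tracer.latest_window()           # streaming per-service traces
        culprit = model.predict_culprit(window)   # 1. who will violate QoS?
        if culprit is not None:
            cause = monitor.diagnose(culprit)     # 2. which resource is contended
                                                  #    (per-server HW monitoring)?
            actuator.mitigate(culprit, cause)     # 3. corrective action before
                                                  #    the violation occurs
        time.sleep(PREDICTION_HORIZON_S / 5)      # poll well inside the horizon
```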
Tracing Framework

[Figure: tracing architecture - per-microservice Gantt charts of client HTTP latency; a zTracer module in each microservice (uService K, uService K+1) timestamping TCP RX, application processing, and TCP TX; a Tracing Collector; a Cassandra trace store; and a WebUI/QueryEngine]

- RPC-level tracing, based on Apache Thrift
- Timestamps the start and end of each microservice (RPC time at TX and RX)
- Stores traces in a centralized DB (Cassandra)
- Records all requests: no sampling
- Overhead: <0.1% in throughput and <0.2% in tail latency
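As an illustration of the trace record described above, here is a hedged sketch of a zTracer-style wrapper that timestamps one RPC and persists it; the schema, keyspace, and host name are invented for the example, and the cassandra-driver calls are one standard way to write rows.

```python
# Sketch only: per-RPC start/end timestamping with central storage.
import time
import uuid
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["tracing-db"]).connect("traces")  # placeholder host/keyspace

def traced_rpc(service, rpc_name, handler, *args):
    """Record RX (request off the wire) and TX (response out) around one RPC."""
    rx_ts = time.time()          # TCP RX: request received
    result = handler(*args)      # application processing
    tx_ts = time.time()          # TCP TX: response sent back
    session.execute(             # every request is recorded: no sampling
        "INSERT INTO rpc_trace (id, service, rpc, rx_ts, tx_ts) "
        "VALUES (%s, %s, %s, %s, %s)",
        (uuid.uuid4(), service, rpc_name, rx_ts, tx_ts))
    return result
```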
Deep Learning to the Rescue

- Why deep learning?
  - Architecture-agnostic
  - Adjusts to changes in dependencies over time
  - High accuracy, good scalability
  - Inference within the required window
DNN Configuration

- Input signals: per-microservice container utilization, latency, and queue depth
- Output signal: which microservice will cause a QoS violation in the near future?
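To make the input and output signals concrete, here is a sketch of the tensor shapes involved; the layer choice (an LSTM over a window of per-service signals) and the window length are assumptions for illustration, since the slides do not specify the architecture.

```python
# Shape sketch only; the real Seer architecture may differ.
import tensorflow as tf

N_SERVICES = 40   # ~30-40 microservices per app (per the suite slide)
T_STEPS    = 100  # assumed window of trace history
N_SIGNALS  = 3    # container utilization, latency, queue depth

model = tf.keras.Sequential([
    tf.keras.Input(shape=(T_STEPS, N_SERVICES * N_SIGNALS)),
    tf.keras.layers.LSTM(128),        # assumed recurrent layer over the window
    tf.keras.layers.Dense(N_SERVICES, activation="softmax"),
])
# Output: per-microservice probability of being the one that will cause
# a QoS violation in the near future.
```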
DNN Configuration

- Training happens once and is slow (hours to days)
  - Across load levels, load distributions, and request types
  - Uses distributed queue traces annotated with QoS violations
  - Weight/bias inference with SGD
  - Retraining happens in the background
- Inference runs continuously on streaming trace data
- 93% accuracy in signaling upcoming QoS violations
- 91% accuracy in attributing QoS violations to the correct microservice
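Continuing the shape sketch above, training could look roughly like this; `load_annotated_traces()` is a hypothetical loader for the distributed queue traces annotated with QoS violations, and the hyperparameters are placeholders.

```python
# Sketch: SGD training over annotated traces (continues the model above).
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # weight/bias inference with SGD
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

X, y = load_annotated_traces()   # hypothetical: traces labeled with the
                                 # microservice that caused each violation
model.fit(X, y, epochs=20, batch_size=64)   # initial training: hours to days

# At runtime, inference runs continuously on streaming trace data while
# retraining proceeds in the background on newly collected traces.
```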
DNN Configuration

- Accuracy is stable or increasing with cluster size
- Challenges:
  - In large clusters, inference is too slow to prevent QoS violations
  - Offloading inference to TPUs yields a 10-100x improvement: 10ms for 90th percentile inference
  - Fast enough for most corrective actions to take effect (network bandwidth partitioning, RAPL, cache partitioning, scale-up/out, etc.)
Experimental Setup

- 40 dedicated servers
- ~1000 single-concerned containers
- Machine utilization: 80-85%
- Inject interference to cause QoS violations
  - Using microbenchmarks (CPU, cache, memory, network, disk I/O)
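As one example of the injected interference, below is a hedged sketch of a CPU contention microbenchmark; the slides only name the targeted resources, so the specifics (core count, duration) are made up for illustration.

```python
# Sketch: burn cores to create CPU contention with colocated containers.
import multiprocessing
import time

def spin(duration_s):
    """Busy-loop on one core for duration_s seconds."""
    end = time.time() + duration_s
    while time.time() < end:
        pass  # pure compute: steals CPU cycles from the victim service

def inject_cpu_interference(cores=4, duration_s=30):
    procs = [multiprocessing.Process(target=spin, args=(duration_s,))
             for _ in range(cores)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    inject_cpu_interference()
```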
Restoring QoS

- Identify the cause of the QoS violation
  - Private cluster: performance counters & utilization monitors
  - Public cluster: contentious microbenchmarks
- Adjust resource allocations
  - RAPL (fine-grain DVFS) & scale-up for CPU contention
  - Cache partitioning (CAT) for cache contention
  - Memory capacity partitioning for memory contention
  - Network bandwidth partitioning (HTB) for network contention
  - Storage bandwidth partitioning for I/O contention
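For a flavor of one corrective action, here is a sketch of network bandwidth partitioning with Linux HTB driven through `tc`; the device name and rate are placeholders, and this is a generic HTB recipe rather than Seer's actual actuation code (it requires root).

```python
# Sketch: cap a service's egress bandwidth with an HTB qdisc via tc.
import subprocess

def cap_egress_bandwidth(dev="eth0", rate="100mbit"):
    """Install an HTB root qdisc on dev and cap its default class at rate."""
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root",
                    "handle", "1:", "htb", "default", "10"], check=True)
    subprocess.run(["tc", "class", "replace", "dev", dev, "parent", "1:",
                    "classid", "1:10", "htb", "rate", rate], check=True)

cap_egress_bandwidth("eth0", "100mbit")  # throttle the contending service
```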
Restoring QoS

- Post-detection, the baseline system drops requests
- Post-detection, Seer maintains nominal performance
Demo

[Live demo: per-microservice queue depth, CPU, memory, network, and disk metrics]
Challenges Ahead

- Serverless microservices
- IoT swarms
- Security implications of data-driven approaches
- Fall-back mechanisms when ML goes wrong
- Not a single-layer solution: predictability needs vertical approaches

Thank you!