MicroMon: A Monitoring Framework for Tackling Distributed Heterogeneity Babar Khalid*+, Nolan Rudolph†+, Ramakrishnan Durairajan†, Sudarsun Kannan* *Rutgers University, †University of Oregon (+co-primary authors)
Background
• Modern applications are increasingly geo-distributed
  - e.g., Cassandra, Apache Spark
• Geo-distributed datacenters (DCs) use heterogeneous resources
  - storage heterogeneity (e.g., SSD, NVMe, hard disk)
  - WAN heterogeneity (e.g., fiber optics, InfiniBand)
• Hardware heterogeneity in DCs avoids vendor lock-in and reduces operational cost (by combining older/cheaper and newer/more expensive hardware)
• Careful provisioning can provide high performance at lower cost
2
Problem With Current Systems
• Current monitoring frameworks for geo-distributed applications are unidimensional
  - can only monitor hosts, storage devices, or networks in isolation
• Lack hardware heterogeneity awareness
  - e.g., no awareness of storage heterogeneity
  - could impact I/O-intensive applications
• Coarse-grained monitoring
  - unaware of host-level micrometrics in software and hardware
  - e.g., page cache, node-level I/O traffic, node's network queue delays
3
Our Solution - MicroMon
• MicroMon is a fine-grained monitoring, dissemination, and inference framework
• Collects fine-grained software and hardware metrics (micrometrics) in end hosts and the network
  - e.g., page cache utilization, disk read/write throughput in the end host
• Filters micrometrics into anomaly reports for efficient dissemination
• Enables heterogeneity-aware replica selection for geo-distributed Cassandra
• Preliminary study of MicroMon integrated with geo-distributed Cassandra shows substantial throughput gains
4
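As a rough end-to-end illustration of the stages this slide lists (collect, filter, disseminate, infer), here is a minimal Python sketch; the function names and threshold values are hypothetical, not MicroMon's actual API.

```python
# Illustrative sketch of MicroMon's stages; names and values are hypothetical.

def collect_micrometrics():
    """Stage 1: gather fine-grained host/network metrics (micrometrics)."""
    return {"page_cache_miss_ratio": 0.35, "disk_read_mbps": 110.0}

def filter_anomalies(metrics, thresholds):
    """Stage 2: keep only metrics that crossed a pre-set threshold (anomaly report)."""
    return {m: v for m, v in metrics.items() if v > thresholds.get(m, float("inf"))}

def disseminate(anomalies):
    """Stage 3: ship the (small) anomaly report to the decision agent."""
    print("sending anomaly report:", anomalies)

def infer(anomaly_reports_per_replica):
    """Stage 4: pick the replica with the fewest reported anomalies."""
    return min(anomaly_reports_per_replica,
               key=lambda r: len(anomaly_reports_per_replica[r]))

if __name__ == "__main__":
    thresholds = {"page_cache_miss_ratio": 0.20, "disk_read_mbps": 500.0}
    report = filter_anomalies(collect_micrometrics(), thresholds)
    disseminate(report)
    print("best replica:", infer({"replica-ssd": {}, "replica-hdd": report}))
```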
Outline • Background • Case Study • Design • Evaluation • Conclusion 5
Case Study - Cassandra
• Distributed NoSQL database system deployed geographically
• Manages large amounts of structured data on commodity servers
• Provides a highly available service with no single point of failure
• Typically prioritizes availability and partition tolerance
6
Cassandra – Replication
[Diagram: a client issues Update(key) to a five-node Cassandra cluster (Nodes 1–5); the update is replicated across nodes]
7
Cassandra – Replication: Rack Awareness
[Diagram: the same cluster with nodes grouped into Rack 1 and Rack 2; replicas of Update(key) are placed across racks]
8
Cassandra – Replication: DC Awareness
[Diagram: two datacenters (DC: US and DC: Europe), each with racks of nodes; replicas of Update(key) are spread across both DCs]
9
Cassandra's Snitch Monitoring
• Cassandra uses a snitch to monitor network topology and route requests across replicas
• Also provides the capability to spread replicas across DCs to avoid correlated failures
• The snitch monitors (read) latencies to avoid non-responsive replicas
• Different types: Gossiping, MultiRegionSnitch
  - the gossiping snitch uses rack and datacenter information to gossip across nodes and collect latency information
• Problem: No hardware heterogeneity awareness
10
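For contrast with MicroMon's multi-metric scoring later in the talk, a toy Python sketch of what latency-only monitoring boils down to is shown below; it illustrates the single dimension a snitch observes and is not Cassandra's actual snitch implementation.

```python
# Toy illustration of latency-only replica ranking (snitch-style view);
# not Cassandra's actual snitch code.

def rank_by_latency(replica_latencies_ms):
    """Sort replicas by observed read latency only; storage speed is invisible here."""
    return sorted(replica_latencies_ms, key=replica_latencies_ms.get)

if __name__ == "__main__":
    # Both replicas look identical to a latency-only monitor,
    # even if one is backed by an HDD and the other by an NVMe SSD.
    print(rank_by_latency({"replica-hdd": 0.6, "replica-nvme": 0.6}))
```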
Analysis Goal and Methodology
Goal: Highlight the lack of heterogeneity awareness
• Replica configuration
  - SSD replica: sequential storage bandwidth 600 MB/s, random bandwidth 180 MB/s
  - HDD replica: sequential storage bandwidth 120 MB/s, random bandwidth 10 MB/s
  - network latency across replicas is the same (for this analysis)
• Workload – YCSB benchmark
  - workload A (50% reads, 50% writes)
  - workload B (95% reads)
  - workload C (100% reads)
11
Impact of Storage Heterogeneity Awareness
[Chart: throughput (ops/sec, 0–50,000) of HDD-only, SSD-only, and Snitch configurations for YCSB workloads A, B, and C]
• Snitch suffers a significant performance loss relative to the optimal SSD-only configuration
• Cause: Snitch's lack of awareness of storage hardware heterogeneity
12
Outline • Background • Case Study • Design • Evaluation • Conclusion 13
Our Design: MicroMon
• Monitoring and inference framework for geo-distributed applications
• Performs micrometric monitoring at the host and network level
  - micrometrics include fine-grained software and hardware metrics
• Efficiently disseminates the collected micrometrics
• Ongoing: distributed inference engines to guide application requests to the best replica
14
MicroMon Challenges
• Selection problem: Which micrometrics should be considered?
• Dissemination problem: How can all micrometrics be sent efficiently?
• Inference problem: How can decisions be inferred quickly from micrometrics?
15
Design - Micrometrics Selection
• Huge number of micrometric combinations across the application, host OS, and network
• Micrometrics can vary with the application-level metric of interest
  - e.g., micrometrics for latency differ from those for throughput
• Our approach: start with storage and network micrometrics
• Identify hardware and software micrometrics using resource usage (see the sketch below)
  - e.g., high storage usage -> monitor page cache, read/write latency
16
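As a concrete example of host-side collection, the sketch below samples two storage micrometrics from standard Linux /proc files (page cache size from /proc/meminfo, sectors read from /proc/diskstats); which metrics MicroMon actually samples, and how, is an assumption here.

```python
# Sketch: sample storage micrometrics from standard Linux /proc files.
# The choice of metrics is illustrative, not MicroMon's exact list.

def page_cache_kb():
    """Size of the page cache (the "Cached" line in /proc/meminfo), in KiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1])
    return 0

def sectors_read(device="sda"):
    """Cumulative sectors read for a block device (6th field of /proc/diskstats)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[5])
    return 0

if __name__ == "__main__":
    print("page cache (KiB):", page_cache_kb())
    print("sectors read on sda:", sectors_read("sda"))
```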
MicroMon High-level Design
[Diagram: Enterprise DC A and Enterprise DC B connected over an enterprise backbone; micrometrics are collected at servers and switches]
• Storage stack micrometrics at the DC: page cache (SW), file system (SW), block device driver (SW), hard disk (HW)
• Networking stack micrometrics at the DC
  - Transport: flags (SYN, ACK, etc.), window size, goodput, round-trip time
  - Application: throughput, congestion status
• Networking stack micrometrics at switches
  - Ingress/Egress: port, packet count, byte count, drop count
  - Buffer: avg. queue length, queue drop count
  - Switch: bytes transmitted/received, utilization
17
Reducing Dissemination – Anomaly Reports
• Problem: Prohibitive cost of disseminating micrometrics across thousands of nodes
  - cost increases with the number of hardware and software components
  - e.g., an SSD's SMART log alone contains close to 32 counters
• Observation: OSes already expose anomalies (indirectly)
  - e.g., high I/O wait time of a process -> higher page cache misses
  - e.g., sustained storage bandwidth close to the maximum hardware bandwidth -> storage saturation
  - e.g., network I/O queue wait time alludes to TCP congestion
• Proposed idea: Instead of sending thousands of micrometrics to the decision agent, only report OS-perceived anomalies (see the sketch below)
18
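A minimal sketch of this reduction, assuming hypothetical threshold values: raw micrometrics are compared against pre-established limits, and only the metrics that crossed them (the OS-perceived anomalies) are shipped to the decision agent.

```python
# Sketch: condense raw micrometrics into a small anomaly report.
# Thresholds are hypothetical; real values would need per-host calibration.

THRESHOLDS = {
    "iowait_fraction": 0.30,       # high I/O wait -> suspect page cache misses
    "disk_bw_utilization": 0.90,   # sustained BW near hardware max -> storage saturation
    "nic_queue_wait_ms": 5.0,      # long network queue wait -> suspect TCP congestion
}

def anomaly_report(micrometrics, thresholds=THRESHOLDS):
    """Return only the (metric, value) pairs that crossed their threshold."""
    return {m: v for m, v in micrometrics.items()
            if m in thresholds and v >= thresholds[m]}

if __name__ == "__main__":
    sample = {"iowait_fraction": 0.42, "disk_bw_utilization": 0.55,
              "nic_queue_wait_ms": 1.2}
    # Only iowait_fraction is reported; the remaining counters stay local.
    print(anomaly_report(sample))
```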
Reducing Dissemination - Network Telemetry
• Network telemetry offers aggregated statistics about the state of the network
• Idea: co-design in-band network telemetry (INT) with the end-host OS (see the sketch below)
  - monitor packets at the end host, carrying anomaly reports as the payload
  - obtain network anomaly reports using INT
• Pre-established anomaly thresholds further reduce the total aggregated statistics
[Figure: a monitoring packet carries an INT header (network anomalies) and an INT payload (end-host anomalies) between the network and the end host]
19
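To make the co-design idea concrete, the sketch below packs an end-host anomaly report into a compact binary payload that a monitoring packet could carry alongside an INT header; the 6-byte record layout (metric id, scaled value, scaled threshold) is purely an assumption, not a standardized INT format.

```python
# Sketch: encode end-host anomalies as a compact payload for an INT-carrying packet.
# The 6-byte record layout and metric id mapping are hypothetical.

import struct

METRIC_IDS = {"iowait_fraction": 1, "disk_bw_utilization": 2, "nic_queue_wait_ms": 3}

def pack_anomalies(anomalies):
    """Each record: 2-byte metric id, 2-byte value, 2-byte threshold (values x100)."""
    payload = b""
    for metric, (value, threshold) in anomalies.items():
        payload += struct.pack("!HHH", METRIC_IDS[metric],
                               int(value * 100), int(threshold * 100))
    return payload

def unpack_anomalies(payload):
    """Inverse of pack_anomalies, as the decision agent would decode it."""
    ids = {v: k for k, v in METRIC_IDS.items()}
    records = struct.iter_unpack("!HHH", payload)
    return {ids[mid]: (val / 100, thr / 100) for mid, val, thr in records}

if __name__ == "__main__":
    report = {"iowait_fraction": (0.42, 0.30)}
    wire = pack_anomalies(report)
    print(len(wire), "bytes on the wire:", unpack_anomalies(wire))
```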
Scalable Inference - Scoring-based Inference
• Cassandra's existing inference is simple and scoring-based
  - replicas are sorted and ranked by network latency
• Problem: bandwidth-sensitive applications need higher weights for WAN-based micrometrics than for host-level micrometrics
• Our approach (sketched below):
  - assign equal weights to all software and hardware micrometrics
  - use the collected micrometrics to calculate a replica score
  - route requests to the replicas with higher scores
  - flexibility to assign higher weights to WAN-based micrometrics
• Ongoing: Designing a generic, self-adaptive inference engine
20
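A minimal sketch of the scoring described here, with equal weights by default and an optional weight map to up-weight WAN micrometrics; the metric names, normalization, and weight values are illustrative assumptions.

```python
# Sketch: score replicas from normalized micrometrics and route to the best one.
# Metric names and weights are illustrative; equal weights are the default.

def replica_score(metrics, weights=None):
    """Weighted sum of normalized metrics (higher value = healthier, by assumption)."""
    weights = weights or {m: 1.0 for m in metrics}   # equal weights by default
    return sum(weights.get(m, 1.0) * v for m, v in metrics.items())

def pick_replica(replica_metrics, weights=None):
    """Route the request to the replica with the highest score."""
    return max(replica_metrics, key=lambda r: replica_score(replica_metrics[r], weights))

if __name__ == "__main__":
    replicas = {
        # metrics normalized to [0, 1]: 1.0 = best observed value for that metric
        "replica-nvme": {"storage_bw": 1.0, "wan_latency": 0.6},
        "replica-hdd":  {"storage_bw": 0.2, "wan_latency": 0.9},
    }
    print("equal weights:", pick_replica(replicas))
    print("WAN-weighted :", pick_replica(replicas, {"wan_latency": 3.0, "storage_bw": 1.0}))
```

With equal weights the NVMe-backed replica wins; tripling the weight on the WAN metric flips the choice, which is the flexibility the slide mentions.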
Outline • Background • Case Study • Design • Evaluation • Conclusion 21
Evaluation Goals
• Understand the impact of storage heterogeneity with MicroMon
• Understand the impact of storage heterogeneity + network latency
• Analyze the page cache impact (see paper for details)
22
Analysis Methodology
• Multiple DCs from the CloudLab infrastructure
  - three nodes located in the Utah, APT, and Emulab DCs
• Replica configuration
  - Utah replica: NVMe storage (seq. bw: 600 MB/s, rand. bw: 180 MB/s)
  - APT replica: HDD (seq. bw: 120 MB/s, rand. bw: 10 MB/s)
  - Emulab master node: HDD (same as above)
• Network latencies
  - 400 us between the Utah (NVMe) replica and the master node
  - 600 us between the APT (HDD) replica and the master node
• Workload – YCSB benchmark
  - workload A (50% reads, 50% writes)
  - workload B (95% reads)
  - workload C (100% reads)
23
MicroMon – Storage Heterogeneity
[Chart: throughput (ops/sec, 0–50,000) of HDD-only, SSD-only, Snitch, and MicroMon with 32, 64, and 128 clients under YCSB workloads A, B, and C]
• Snitch lacks storage heterogeneity awareness
• MicroMon's storage heterogeneity awareness provides performance close to the SSD-only (optimal) configuration
• Performance improves by up to 49% for the large client-count configuration
24
Storage Heterogeneity + Network Latency
• Introduce network latency for the SSD-only node
[Chart: throughput (ops/sec) of Snitch and MicroMon as the injected network latency grows from 0 ms to 25 ms]
• For high network latencies (e.g., beyond 10 ms), the benefit of the SSD replica diminishes
25
Conclusion
• Datacenter systems are becoming increasingly heterogeneous
• Deploying geo-distributed applications in heterogeneous datacenters requires a redesign of monitoring mechanisms
• We propose MicroMon, a fine-grained micrometric monitoring, dissemination, and inference framework
• Our ongoing work focuses on efficient dissemination and self-adaptive inference mechanisms
26
Thanks! Questions? Contact: sudarsun.kannan@rutgers.edu ram@cs.uoregon.edu 27