Staghorn An Automated Large-Scale Distributed System Analysis - PowerPoint PPT Presentation

Staghorn
An Automated Large-Scale Distributed System Analysis Platform
Kasimir Gabert (5638), Ian Burns (9526), Steven Elliott (5634), Jenna Kallaher (5632), Adam Vail (5634)
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2017-8419 C

Problem
• Large, distributed systems have become ubiquitous
• A common method for understanding their behavior is to simply run them and observe / experiment (Emulytics)
• Online analysis necessarily competes with the model for CPU time, and both the model and the analysis must run at clock rate
• We built a way to “stop time” within a model, opening the door to the larger world of offline model analysis

A Few Use Cases
• Vulnerability analysis
• Debugging systems
• Optimizing tests
• Experimental repeatability
• Training

Key Contributions
• A full-system snapshot and restore capability for Sandia’s large-scale emulation-based model environments which preserves network and I/O state
• A network modification system that allows for modification of Ethernet frame contents and delivery, or the introduction and removal of frames, during a snapshot
• The evaluation of this capability on real-world use cases

Design Requirements
• The system must not perceive that a snapshot has occurred
• Staghorn must preserve machine and network state
• Staghorn must snapshot quickly so that each virtual machine is snapshotted within a tight time window

Firewheel
• Staghorn is built on top of Firewheel, a Sandia-developed tool for automating the challenging parts of Emulytics
• Two big technologies Firewheel brings:
    • Graphs to represent models
    • Plugin architecture to make automation extensible
• Firewheel is scalable: it has booted 75,000 VMs in 13 minutes

Staging Architecture

VM State Snapshots
• Currently using QEMU migration-based snapshots
    • Straightforward to implement because they utilize existing QEMU mechanisms
• Explored two other approaches:
    • Process-level snapshots
    • QEMU fork-based snapshots

Network Snapshots
• Design decisions:
    • Should we prioritize packet latency or packet ordering?
        • We chose packet ordering, while minimizing queuing delay as much as possible
    • How to pass information to/from the kernel?
        • Netlink: it is quick, asynchronous, and easy to implement
    • Where should we place our modifications?
        • Open vSwitch
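The ordering-over-latency decision can be illustrated with a small user-space sketch. This is not Staghorn's kernel code: the `HoldQueue` class and its method names are hypothetical, and it only shows the idea of holding frames during a pause and releasing them in arrival order.

```python
from collections import deque


class HoldQueue:
    """Hypothetical frame buffer that preserves arrival order across a pause."""

    def __init__(self, deliver):
        self.deliver = deliver    # callback that forwards a frame onward
        self.paused = False
        self.pending = deque()    # frames held while paused, oldest first

    def receive(self, frame):
        # While paused, hold the frame instead of forwarding it.
        if self.paused:
            self.pending.append(frame)
        else:
            self.deliver(frame)

    def pause(self):
        self.paused = True

    def resume(self):
        # Drain held frames in FIFO order before any new frame is forwarded,
        # so no later frame can overtake an earlier one.
        self.paused = False
        while self.pending:
            self.deliver(self.pending.popleft())
```

Frames received during the pause pay a queuing delay, but the delivery order seen by the receiver is unchanged.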

Why OVS
• Can capture packets between cohosted VMs
• Easy to install and actively developed
• Compatible with virtualization platforms (KVM, Xen, etc.)
• Already works with both Minimega and Firewheel

Network Snapshot Architecture
[Diagram: packet receive path, with labels in the order they appear on the slide]
• Linux NIC: netif_rx, ksoftirqd, do_softirq, net_rx_action, netif_receive_skb
• Open vSwitch Datapath: rx_handler, netdev_port_receive, ovs_vport_receive, Staghorn, ovs_dp_process_received_packet, execute_actions, do_output, ovs_vport_send, vport->ops->send(vport, skb)

Evaluation – precisetimer.so
• Tried to sleep 1 second into the future 60 times and measured how close each wake-up was to the desired time
• Errors ranged from 1 to 55 ns, with a mean of 28.05 ns
[Figure: precisetimer.so Error Measurement. Sleep error (in nanoseconds) versus iteration.]
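The measurement procedure can be sketched as follows. This is not precisetimer.so: it is a plain Python illustration of the same experiment (sleep to a target time, record the difference), and the function name `measure_sleep_error` is hypothetical. An ordinary `time.sleep` call should be expected to show far larger errors than the nanosecond-scale figures reported on the slide.

```python
import time


def measure_sleep_error(interval_s=1.0, iterations=60):
    """Sleep `interval_s` seconds `iterations` times; return wake-up errors in ns."""
    errors_ns = []
    for _ in range(iterations):
        # Desired wake-up time on the monotonic clock.
        target_ns = time.monotonic_ns() + int(interval_s * 1e9)
        time.sleep(interval_s)
        # Positive values mean the sleep overshot the target.
        errors_ns.append(time.monotonic_ns() - target_ns)
    return errors_ns
```

Calling `measure_sleep_error()` with the defaults mirrors the slide's setup of 60 one-second sleeps; the minimum, maximum, and mean of the returned list correspond to the statistics quoted above.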

Evaluation – RabbitMQ
[Figure: RabbitMQ Delay Measurement. Delay error (in ms) versus time (seconds), by type: remote-host and same-host.]

Evaluation – Snapshot Timing
• One of the most critical timing aspects of Staghorn is the performance of quiescing the virtual CPUs on each VM
[Figure: CPU Quiesce Time. Time to quiesce CPUs (in ms) versus VM density (1, 10, 14).]

Use Cases – Distributed Fuzzer
• Fork execution by taking a snapshot and returning to it
• Evaluate a metric after different message modifications
• Greedily choose the message modification with the largest metric
• After many greedy message choices, an issue is found
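The greedy search described on this slide can be sketched on a toy model. Everything here is an assumption for illustration: `greedy_fuzz` and its parameters are hypothetical names, and `copy.deepcopy` stands in for Staghorn's full-system snapshot and restore.

```python
import copy


def greedy_fuzz(state, candidates, apply_mod, metric, is_issue, max_rounds=100):
    """Greedy search: try each modification from a snapshot, keep the best."""
    path = []
    for _ in range(max_rounds):
        snapshot = copy.deepcopy(state)        # stand-in for taking a snapshot
        best_mod, best_score, best_state = None, None, None
        for mod in candidates:
            trial = copy.deepcopy(snapshot)    # stand-in for restoring the snapshot
            apply_mod(trial, mod)              # modify the message in this fork
            score = metric(trial)              # evaluate the metric afterwards
            if best_score is None or score > best_score:
                best_mod, best_score, best_state = mod, score, trial
        if best_state is None:                 # no candidates to try
            break
        state = best_state                     # continue from the best fork
        path.append(best_mod)
        if is_issue(state):
            break
    return path, state
```

For example, with a state of `{"queue_depth": 0}`, candidates `[1, 2, 5]` that are added to the depth, the depth itself as the metric, and an issue defined as a depth of at least 12, the search picks 5 three times and stops.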

Use Cases – Distributed Fuzzer
[Figure sequence showing VM 1 and VM 2 across five frames: message A; message A with both VMs paused; message X; message A with both VMs paused; message Y.]

Use Cases – Distributed Debugger
1. Set breakpoint
2. Install Staghorn Trigger
3. Staghorn will wait until the breakpoint is hit to snapshot the system

Use Cases – Debug Experiments
• A Firewheel user’s experiment failed after about 8 hours
• An 8-hour debug cycle is unacceptable
• Staghorn was used to snapshot before the crash, enabling the user to quickly test various fixes

Conclusion/Future Work
• Conclusion
    • We have opened the door to offline analysis and modification for our large-scale emulation-based models
• Follow-on work:
    • Improve our performance
    • Implement/productize more use cases
    • Better identify how long it takes for CPUs to quiesce and improve this time
    • Improve the stability of process-level snapshots and QEMU fork-based snapshots

Any Questions?
• Paper: www.sandia.gov/emulytics/staghorn-report.pdf
• Contact info: Steven Elliott (selliot@sandia.gov)
