Runtime Analysis and Testing in the Cloud
Dr. Wolfgang Grieskamp, Staff Software Engineer, Google USA
CREST Workshop, May 20th, 2012
About me
Before 2000: Researcher and Lecturer at Technical University of Berlin. 2000-2006: Senior Researcher, Microsoft Research. 2007-2011: Principal Architect, Microsoft Windows Interoperability Team, Server and Cloud division. Since 4/2011: Staff Engineer, Google+ platform and tools, Google.
DISCLAIMER: This talk does not necessarily represent Google's opinion or direction.
About this talk
Will talk about: how Google monitors and tests Cloud software, plus a quick pitch on how Google uses the Cloud itself for development.
Will assume: you know something about software engineering and about Cloud computing.
My Viewpoint
As a researcher who tries to identify open problems (and non-problems!). As an engineer who tries to understand and improve the process.
What is Cloud Computing? From Wikipedia, the free encyclopedia Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet).
Cloud Stack
SaaS: Software as a Service
PaaS: Platform as a Service
IaaS: Infrastructure as a Service
Runtime Analysis and Testing @ Google
- Production Level: monitoring, using monitoring techniques
- Staging / Load Testing Level: automated testing against a simulation of the production environment with faked identities etc.
- Integration Level: end-to-end testing of every code change over the component dependency closure, with partial isolation
- Unit Level: extensive unit testing with super-strict component isolation, e.g. using mock-based dependency injection
Monitoring and Testing: what the heck is the difference?
In testing we simulate (mock) the environment (aka the user), and we don't care as much about performance overhead. In monitoring we are interested mostly in general health, not detailed functionality (assumed to be already tested), and we use stochastic methods more frequently. Otherwise many things are similar.
Anatomy of a Data Center
[Diagram: data centers A and B, each with a controller, a set of servers, and storage. Note: abstracted and simplified.]
Anatomy of a Server
[Diagram: a server (VM) runs jobs; each job has a monitor that emits alerts and logs. Note: abstracted and simplified.]
Anatomy of a Service
[Diagram: a service consists of jobs spread across servers in multiple data centers, backed by storage. Note: abstracted and simplified.]
Monitoring Types @ Google: Black Box Monitoring, White Box Monitoring, [Log Analysis]
Black Box Monitoring: how it's done @ Google
A monitor frequently sends requests to the job and analyzes the responses. This is possible because server jobs are 'stateless' and always input-enabled. If the failure rate over a certain time interval exceeds a given ratio, an alert is raised and an engineer is paged. Engineers aim to minimize paging and to avoid false positives.
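The loop can be illustrated with a minimal sketch in Python. This is not Google's actual prober: the URL, expected pattern, window size, interval, and alerting hook are all made-up placeholders.

```python
import re
import time
import urllib.request

# Hypothetical probe target and thresholds (placeholders, not Google's
# actual prober configuration).
PROBE_URL = "http://job.example.internal:8080/healthz"
EXPECTED = re.compile(r"ok")
WINDOW = 60                 # number of probes in the evaluation window
MAX_FAILURE_RATIO = 0.2     # allowed fraction of failed probes
PROBE_INTERVAL_SEC = 5

def page_oncall_engineer(ratio: float) -> None:
    # Stand-in for the real alerting/paging pipeline.
    print(f"ALERT: failure ratio {ratio:.0%} exceeds {MAX_FAILURE_RATIO:.0%}")

def probe_once() -> bool:
    """Send one synthetic request and check the response with a regex."""
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=2) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and EXPECTED.search(body) is not None
    except Exception:
        return False

def monitor_loop() -> None:
    results = []
    while True:
        results.append(probe_once())
        results = results[-WINDOW:]                     # sliding window
        failures = results.count(False) / len(results)
        if len(results) == WINDOW and failures > MAX_FAILURE_RATIO:
            page_oncall_engineer(failures)
        time.sleep(PROBE_INTERVAL_SEC)
```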
Black Box Monitoring: how it's done @ Google (cont.)
There are rule-based languages for defining requests and responses. Each rule synthesizes an HTTP request, analyzes the response using a regular expression, and specifies the probing frequency and the allowed failure ratio. Rules are like tests: a simple trigger and a simple response analysis. Monitors can also be custom code.
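As an illustration of the shape of such rules (the actual rule language and field names used at Google are different; everything below is hypothetical), a rule bundles the synthesized request, the response regex, the frequency, and the allowed failure ratio:

```python
# Hypothetical rule format, for illustration only.
PROBE_RULES = [
    {
        "name": "frontend_healthz",
        "request": {"method": "GET", "url": "http://frontend.internal/healthz"},
        "response_regex": r"^ok$",       # analyze the response with a regex
        "period_seconds": 10,            # probing frequency
        "max_failure_ratio": 0.05,       # allowed failure ratio before alerting
    },
    {
        "name": "search_smoke_query",
        "request": {"method": "GET",
                    "url": "http://search.internal/q?text=hello"},
        "response_regex": r"results",
        "period_seconds": 60,
        "max_failure_ratio": 0.10,
    },
]
```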
Black Box Monitoring: how is it doing?
Is the 'stateless' hypothesis feasible? Yes; as these are health tests, state can be ignored. What is the relation to testing? In theory very similar, except that the environment is not mocked; in practice quite different frameworks and languages are used. What about service/system-level monitoring? It is only about one job, and it doesn't give the failure's root cause (it only measures a symptom).
White-Box Monitoring: how it's done @ Google
The server exports a collection of probe points (variables): memory, # RPCs, # failures, etc. The monitor collects time series of those values and computes functions over them. Dashboards present the information graphically. Mostly used for diagnosis by humans.
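A minimal sketch of the server side of this scheme, assuming a plain HTTP endpoint that serves the exported variables as JSON; the variable names and the /varz path are illustrative, not the actual Google export mechanism.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Exported probe points of this job (illustrative names).
EXPORTED_VARS = {"rpcs_total": 0, "rpc_failures_total": 0, "heap_bytes": 0}
_lock = threading.Lock()

def increment(name: str, delta: int = 1) -> None:
    """Called from the serving code to update an exported counter."""
    with _lock:
        EXPORTED_VARS[name] += delta

class VarzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/varz":
            with _lock:
                payload = json.dumps(EXPORTED_VARS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # The monitor scrapes this endpoint periodically and builds time series.
    HTTPServer(("", 8080), VarzHandler).serve_forever()
```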
White-Box Monitoring: how it's done @ Google (cont.)
A declarative language is used for time-series computations. Samples are collected from the server by memory scraping, and similar data from multiple servers running the same job is merged. There is rich support for diagram rendering in the browser.
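A sketch of the monitor side, spelling out imperatively what the declarative language would express as a rule: samples of the same variable from all servers running one job are merged and aggregates are computed. The sample tuple format and the choice of aggregates are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def merge_samples(samples):
    """samples: iterable of (job, server, var_name, timestamp, value)."""
    by_var = defaultdict(list)
    for job, server, var, ts, value in samples:
        by_var[(job, var)].append(value)     # merge across servers of one job
    return {key: {"sum": sum(vals), "mean": mean(vals)}
            for key, vals in by_var.items()}

samples = [
    ("frontend", "server-1", "rpc_failures_total", 1000, 3),
    ("frontend", "server-2", "rpc_failures_total", 1000, 7),
    ("frontend", "server-3", "rpc_failures_total", 1000, 2),
]
print(merge_samples(samples))
# -> {('frontend', 'rpc_failures_total'): {'sum': 12, 'mean': 4}}
```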
White-Box Monitoring: how is it doing?
Design for monitorability/testability? It is already ubiquitous throughout, since software engineers are themselves on-call. Distributed collection/network load? Not really an issue because it is sample-based. Relation to testing? Same as with black-box: there should be a common framework. Automatic root-cause analysis and self-repair? Current systems are mostly built for human analysis and repair; self-repair would be a big thing.
Integration Testing: how it's done @ Google
Two or more components are plugged together with a partially mocked environment. The environment provides stimuli and checks expectations. The test usually runs on a single machine, but can be deployed to the cloud for large-scale testing.
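A toy sketch of the pattern using Python's unittest: two real components (a hypothetical Frontend and Cache) are wired together, while the storage dependency is replaced by a fake with just enough behavior for the test. All names are invented for illustration.

```python
import unittest

class FakeStorage:
    """Partially mocked environment: just enough storage behavior for the test."""
    def __init__(self, data):
        self._data = data
    def read(self, key):
        return self._data.get(key)

class Cache:
    def __init__(self, storage):
        self._storage, self._cache = storage, {}
    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._storage.read(key)
        return self._cache[key]

class Frontend:
    def __init__(self, cache):
        self._cache = cache
    def handle(self, key):
        value = self._cache.get(key)
        return f"value={value}" if value is not None else "not found"

class FrontendCacheIntegrationTest(unittest.TestCase):
    def test_request_served_through_cache_and_storage(self):
        # Two real components plugged together, storage faked.
        frontend = Frontend(Cache(FakeStorage({"greeting": "hello"})))
        self.assertEqual(frontend.handle("greeting"), "value=hello")  # stimulus + expectation
        self.assertEqual(frontend.handle("missing"), "not found")

if __name__ == "__main__":
    unittest.main()
```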
Integration Testing: how is it doing?
Integration tests are often 'flaky' (unreliable). It is difficult to construct a mocked component's precise behavior (it is more than a simple mock in a unit test), and difficult to synthesize a mocked component's initial state (it may have a complex state). Potential solution: model-based testing and simulation.
Exploiting the Cloud for Development
Idle Resources
Peak-demand problem: as with other utilities, the cloud must have enough capacity to deal with peak times (7am, 7pm, etc.). Huge amounts of idle computing resources are therefore available in the data centers outside of those peak times. Literally hundreds of VMs may be available to a single engineer on a low-priority-job basis. This is a game changer for software development tools.
Using the Cloud for Dev @ Google
Distributed/parallel build: every engineer can build all of Google's code plus third-party open-source code in a matter of minutes (a sequential build would take days). It works by constructing the dependency graph and then using map/reduce technology.
Distributed/parallel test: changes to the code base are continuously tested against all dependent targets once submitted, and failures can be tracked down very precisely to the change that introduced them.
Check out http://google-engtools.blogspot.com/ for details.
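The build idea can be sketched in a few lines of Python: construct the target dependency graph, then build targets in parallel as soon as all of their dependencies are done. The targets, the build action, and the local thread pool are stand-ins; the real system farms the work out to many machines in the cloud.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter   # Python 3.9+

# Made-up targets and dependencies, for illustration only.
DEPS = {
    "//server:main":  {"//base:lib", "//net:rpc"},
    "//net:rpc":      {"//base:lib"},
    "//base:lib":     set(),
    "//server:tests": {"//server:main"},
}

def build(target: str) -> None:
    print(f"building {target}")           # stand-in for compiling the target

sorter = TopologicalSorter(DEPS)
sorter.prepare()
with ThreadPoolExecutor(max_workers=8) as pool:
    while sorter.is_active():
        ready = list(sorter.get_ready())   # all deps of these targets are done
        list(pool.map(build, ready))       # build the whole wave in parallel
        sorter.done(*ready)
```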
Conclusions
The Cloud brings new challenges for runtime analysis and testing. Many of them are adequately solved; others await improvements. The Cloud brings new opportunities for software development tools.