Massive Scale Magdalena Balazinska University of Washington - PowerPoint PPT Presentation

Interactive Data Processing at Massive Scale
Magdalena Balazinska
University of Washington
http://www.cs.washington.edu/homes/magda

Nuage Project
• Science is becoming a data management problem
• Existing database management systems are insufficient
  – Wrong data model, wrong features, insufficient scalability
• Nuage project goals (http://nuage.cs.washington.edu/)
  – Focus on scientific applications
  – Massive-scale parallel query processing
  – Cloud computing: DBMS as a service for science
• Current collaborators/applications:
  – Astronomy: Jeff Gardner, Andrew Connolly
  – Oceanography: Bill Howe, UW eScience

Astronomy Simulation Use Case
• Evolution of large-scale structure in the universe
  – Universe is a set of particles (gas, dark matter, stars)
  – Particles interact through gravity and hydrodynamics
  – Output snapshot every few simulation timesteps

  Simulation    No. Particles    Snapshot Size
  dbtest128g    4.2 million      169 MB
  cosmo50       33.6 million     1.4 GB
  cosmo25       916.8 million    36 GB

  Few dozen to few hundred snapshots per run
• Analysis needs:
  – Select-project-join (SPJ) queries over snapshot data
  – Data clustering within snapshot
  – SPJ and recursive queries over clustered data

Astronomy Simulation Use Case
• Implemented SPJ queries over raw data
  – Relational DBMS (single site and distributed)
  – Pig/Hadoop
  – IDL: State-of-the-art in astronomy

Friends-of-Friends Clustering
• Efficient clustering algorithm
• Implemented in Pig/Hadoop and Dryad/DryadLINQ
• Best total runtime was 70 min
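
The slide does not spell out the algorithm, so the following is only a minimal single-machine sketch of the general friends-of-friends idea: two particles are "friends" if they lie within a linking length of each other, and a cluster is a connected component of the friendship graph. The function name `friends_of_friends`, the `linking_length` parameter, and the brute-force pair scan are assumptions made for illustration; this is not the Pig/Hadoop or Dryad/DryadLINQ implementation mentioned above.

```python
from itertools import combinations
from math import dist


def friends_of_friends(points, linking_length):
    """Return clusters as sorted lists of point indices.

    Hypothetical helper: two points are friends when their Euclidean
    distance is at most linking_length; clusters are connected components.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(n^2) scan over all pairs; a scalable version would
    # partition space so that only nearby particles are compared.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


particles = [(0, 0, 0), (0.5, 0, 0), (1.0, 0, 0), (5, 5, 5), (5.4, 5, 5), (9, 9, 9)]
print(friends_of_friends(particles, linking_length=0.6))
```

Particles 0 and 2 are farther apart than the linking length, yet they end up in the same cluster because both are friends of particle 1. That transitive step is what makes the clustering a connected-components problem.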

Problem Statement
• Given magnitude of data and queries
• Need more than efficient query processing
• Users need tools for managing queries at runtime:
  – Accurate, time-based progress indicators
  – The ability to see representative partial results
  – The ability to suspend and resume queries
  – Intra-query fault-tolerance
  – Agile query scheduling and resource management
• All this without too much runtime overhead

Parallax: Progress Indicator for Parallel Queries
• Accurate time-remaining estimates for parallel queries
• Why is accurate progress important?
  – Users need to plan their time
  – Users need to know when to stop queries
• Implementation: MapReduce DAGs in Pig
  – Pig scripts that translate into MapReduce DAGs

Accuracy is a Challenge
Query: Script1 from Pig tutorial, translates into 5 MR jobs
Input data: 5X Excite search log, 210 MB of data
[Figure: Pig estimate compared with perfect estimate]

Parallax Approach
MapReduce job instrumentation: Map Task
[Figure: annotated time-remaining estimate, with labels]
  – Expected processing speed
  – Tuples remaining
  – Slowdown factor
  – For all pipelines in all jobs
  – Parallelism accounting for skew and variations
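
One plausible way to combine the terms named on this slide is sketched below. The slide gives only the labels, so the exact formula is an assumption: here each pipeline contributes its remaining tuples divided by its expected speed and effective parallelism, scaled by its slowdown factor, and contributions are summed over all pipelines in all jobs as if they ran one after another. The `Pipeline` class, its field names, and `estimate_time_remaining` are hypothetical and are not taken from the Parallax code.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    """Hypothetical per-pipeline statistics gathered by instrumentation."""
    tuples_remaining: int        # tuples the pipeline still has to process
    tuples_per_second: float     # expected processing speed of one task
    slowdown: float = 1.0        # > 1.0 when running slower than expected
    parallelism: float = 1.0     # effective concurrent tasks (after skew)


def estimate_time_remaining(pipelines):
    """Sum the estimated remaining seconds over all pipelines.

    Assumes pipelines execute sequentially; overlapping pipelines would
    need a different combination rule.
    """
    total = 0.0
    for p in pipelines:
        if p.tuples_per_second <= 0 or p.parallelism <= 0:
            raise ValueError("speed and parallelism must be positive")
        total += p.slowdown * p.tuples_remaining / (
            p.tuples_per_second * p.parallelism
        )
    return total


query = [
    Pipeline(tuples_remaining=1_000_000, tuples_per_second=10_000,
             slowdown=1.5, parallelism=4),
    Pipeline(tuples_remaining=200_000, tuples_per_second=2_000,
             slowdown=1.0, parallelism=2),
]
print(f"{estimate_time_remaining(query):.1f} s remaining")
```

With the made-up numbers above, the first pipeline contributes 37.5 s and the second 50 s. Lowering `parallelism` to reflect skew, or raising `slowdown` under load, lengthens the estimate accordingly.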

Experimental Results
[Figure panels: "Script 1 + UDF - Serial" and "Script 1 - Serial"]
[Configurations: 8 nodes, 32 maps, 32 reduces, zipf; 8 nodes, 32 maps, 17 reduces, uniform]

Intra-Query Fault Tolerance
• Existing intra-query fault-tolerance methods are limited
  – Parallel DBMSs restart queries when failures occur
  – MapReduce-style systems materialize all intermediate results
  – Result: either high runtime overhead or costly failure recovery!
• FTOpt: We have developed a fault-tolerance optimizer
  – Automatically picks the best fault-tolerance strategy per operator
  – Strategies: None, Materialize, Checkpoint

Nuage Project
• Nuage project goals (http://nuage.cs.washington.edu/)
  – Massive-scale parallel query processing
  – With focus on scientific applications
  – Cloud computing: DBMS as a service for science

DBMS As a Service for Science
• SciFlex: A Cross-scale Cross-domain Scientific Data Management Service
  – Schema recommendation & data upload utilities
  – Query, archive, and visualization services
  – Data intensive computing!
  – Data, schema, and tool sharing + tool recommendation
  – Annotations, tagging, disagreement, discussions
  – Security: need to share safely
  – SLAs for science
• Interesting systems issues involved in building SciFlex
• In collaboration with Microsoft Research

Conclusion
• Sciences are increasingly data rich
• Need efficient, large-scale query processing
• Need other data management services too
• Nuage/SciFlex project strives to address these needs

Acknowledgments
• Students: Nodira Khoussainova, YongChul Kwon, Kristi Morton, Emad Soroush, and Prasang Upadhyaya
• Collaborators: Jeff Gardner, Dan Grossman, Bill Howe, Dan Suciu, and the SciDB team

Acknowledgments
• This research is partially supported by
  – NSF CAREER award IIS-0845397
  – NSF CRI grant CNS-0454425
  – An HP Labs Innovation Research Award
  – Gifts from Microsoft Research
  – Balazinska's Microsoft Research Faculty Fellowship
