Ground: A Data Context Service Joe Hellerstein, Vikram Sreekanti, Joey Gonzalez, et al. CIDR 2017 https://github.com/ground-context/ground
Open Source Big Data Community Health Long-term Data L Data Analysis Data Wrangling I A Management F
What was the big data revolution really all about?
Database
A DECOUPLED STACK (Big Data) [Diagram layers: API / Query Language • Scheduler • Query Optimizer • Ingest/PubSub • Dataflow Engine • Workflow • Storage]
A DECOUPLED STACK The Good: Agility [Diagram layers as above, with the labels SQL and GP ORCA added]
A DECOUPLED STACK The Bad: Dis-integration. [Diagram labels: SQL, GP ORCA]
CRISIS: HOW DO WE SHARE INFORMATION?
WHAT IS METADATA?
WHAT IS METADATA? • Data about data • This used to be so simple! • But … schema on use • One of many changes
OPPORTUNITY: A BIGGER CONTEXT Don’t just fill a metadata-sized hole in the big data stack. Lay the groundwork for rich data context.
WHAT IS DATA CONTEXT? All the information surrounding the use of data.
The ABCs of Data Context Application Context: Views, models, code Behavioral Context: Data lineage & usage Change Over Time: Version histories Generated by—and useful to—many applications and components.
Janet: “I bet social media content can predict which customers might cancel their accounts!” ground: “Hey Janet! We already paid for a full Gnip feed from Twitter; you can find it here. By the way: Sue used the following related table and script.”
Janet: “I bet social media content can predict which customers might cancel their accounts!” ground: “Hey Janet! This looks like Twitter JSON. Many people use this script to turn it into a table. Be careful: When people store outputs from this script, the following fields are often flagged by IT as PII. BTW, have you tried the sentiment analysis package?”
Janet: “It looks true! Tweets predict churn!” [Chart; Janet shares the result with Sue via ground]
Sue: “I wonder if Janet’s sentiment analysis will help with my discount targeting pipeline.” [Chart, followed by two views of Janet’s output]
TweetId  Text           Sentiment
47       “sad!”         negative
53       “awesome!”     positive
57       “go packers!”  neutral
64       “fleek!”       positive
TweetId  Text           neg  pos  neut
47       “sad!”         1    0    0
53       “awesome!”     0    1    0
57       “go packers!”  0    0    1
64       “fleek!”       0    1    0
Time passes… Sue: “Uh oh, prediction accuracy metrics are down!” “Oh dear. I better call a meeting to introduce better governance on sentiment labeler.” [Chart: Prediction Accuracy (0 to 100) from 1/1/2017 00:00 to 1/2/17 12:00] ground: “FYI: Janet’s wrangling script changed!” VERSION HISTORY 12/31/2016 00:00 -800 hash: 6dda491064bcce14f558bf83867b8c247027c423 user: will
TweetId  Text           Sentiment
47       “sad!”         sadness
53       “awesome!”     elation
57       “go packers!”  sports
64       “fleek!”       trendy
TweetId  Text           neg  pos  neut
47       “sad!”         0    0    0
53       “awesome!”     0    0    0
57       “go packers!”  0    0    0
64       “fleek!”       0    0    0
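A minimal sketch of why the new labels could zero out Sue's features. It assumes, which the slide does not state, that the downstream step one-hot encodes a fixed label set; `KNOWN_LABELS` and `one_hot` are hypothetical names, and only the tweet IDs and labels come from the slides.

```python
# Hypothetical downstream featurizer: it only knows the original three labels.
KNOWN_LABELS = ("negative", "positive", "neutral")


def one_hot(label):
    """Return (neg, pos, neut) flags; an unrecognized label yields all zeros."""
    return tuple(int(label == known) for known in KNOWN_LABELS)


# Labels before and after the wrangling script changed (values from the slides).
old_labels = {47: "negative", 53: "positive", 57: "neutral", 64: "positive"}
new_labels = {47: "sadness", 53: "elation", 57: "sports", 64: "trendy"}

old_features = {tweet_id: one_hot(label) for tweet_id, label in old_labels.items()}
new_features = {tweet_id: one_hot(label) for tweet_id, label in new_labels.items()}
```

Under that assumption every row in `new_features` is `(0, 0, 0)`, which matches the all-zero table on the slide.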
WHAT DID CONTEXT ENABLE? • Self-service catalog, wrangling and analytics. • Collective governance of data. • Fueling our model accuracy monitor. • Figuring out which changes introduced the error. • Determining who made the change to help us resolve the issue.
THE BIG CONTEXT Where are the interesting technical challenges? All over! Our goal is not to solve all these challenges. It’s to provide an environment to enable solutions.
ground [Architecture diagram] Aboveground applications: Analytics & Vis • Reference Data • Wrangling • Data Quality • Catalog & Discovery • Time Travel • Parsing & Featurization • Model Serving. COMMON GROUND: ABOVEGROUND API TO APPLICATIONS • METAMODEL • UNDERGROUND API TO SERVICES. Underground services: Scavenging and Ingestion • Versioned Storage • Search & Query • Scheduling & Workflow • ID & Auth
[Architecture diagram] Aboveground applications: Analytics & Vis • Reference Data • Wrangling • Data Quality • Catalog & Discovery • Time Machine • Parsing & Featurization • Model Serving. COMMON GROUND: ABOVEGROUND API TO APPLICATIONS • METAMODEL (COMMON GROUND CONTEXT MODEL) • UNDERGROUND API TO SERVICES. Underground services: Scavenging and Ingestion • Versioned Storage • Search & Query • Scheduling & Workflow • ID & Auth. Example services shown: Pachyderm, Chronos
DESIGN REQUIREMENTS • Model-agnostic • Immutable • Scalable • Politically Neutral
Postel’s Law Be conservative in what you do, be liberal in what you accept from others
COMMON GROUND The metamodel A. Model Graphs
[Figure: two model graph examples. RELATIONAL SCHEMA: Schema 1 with Table 1 … Table t, their columns (Column 1 … Column c, Column d), and a foreign key edge. JSON DOCUMENT: Root and Object 2 nodes with members k1, k2, k11, k12 (string and number values) and elements 1 to 3.]
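One way to picture model graphs is to load both a relational schema and a JSON document into the same node-and-edge structure. The sketch below is an illustration under assumed names (`ModelGraph`, `relational_to_graph`, `json_to_graph` and the edge labels are hypothetical); it is not Ground's API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ModelGraph:
    """A generic graph: node id -> kind, plus labeled directed edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))


def relational_to_graph(schema_name, tables):
    """tables maps each table name to a list of its column names."""
    graph = ModelGraph()
    graph.add_node(schema_name, "schema")
    for table, columns in tables.items():
        graph.add_node(table, "table")
        graph.add_edge(schema_name, table, "has_table")
        for column in columns:
            column_id = f"{table}.{column}"
            graph.add_node(column_id, "column")
            graph.add_edge(table, column_id, "has_column")
    return graph


def json_to_graph(value, graph=None, node_id="root"):
    """Walk a parsed JSON value, adding one node per object, array, or scalar."""
    if graph is None:
        graph = ModelGraph()
    if isinstance(value, dict):
        graph.add_node(node_id, "object")
        for key, child in value.items():
            child_id = f"{node_id}.{key}"
            graph.add_edge(node_id, child_id, "member")
            json_to_graph(child, graph, child_id)
    elif isinstance(value, list):
        graph.add_node(node_id, "array")
        for index, child in enumerate(value):
            child_id = f"{node_id}[{index}]"
            graph.add_edge(node_id, child_id, "element")
            json_to_graph(child, graph, child_id)
    else:
        graph.add_node(node_id, type(value).__name__)
    return graph


# Illustrative inputs loosely mirroring the figure's shapes.
schema_graph = relational_to_graph(
    "schema_1", {"table_1": ["column_1"], "table_t": ["column_c", "column_d"]}
)
schema_graph.add_edge("table_t.column_c", "table_1.column_1", "foreign_key")
doc_graph = json_to_graph(json.loads('{"k1": "a", "k2": [1, 2, 3]}'))
```

Both inputs end up in the same structure, so one traversal routine can serve either model. That is the sense in which this sketch is model-agnostic.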
COMMON GROUND The versioning model A. Model Graphs B. Version Graphs
COMMON GROUND The versioning model A. Model Graphs B. Version Graphs
COMMON GROUND The usage model C. Lineage Graphs A. Model Graphs B. Version Graphs
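Version graphs and lineage graphs can be sketched as an append-only store: versions point at their parents, lineage edges connect a version to what was derived from it, and nothing is overwritten. All names here (`Version`, `ContextStore`, the version IDs) are hypothetical and chosen for illustration; this is not Ground's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """An immutable version node; parent_ids form the version graph."""
    version_id: str
    item: str
    parent_ids: tuple = ()


class ContextStore:
    """Append-only store of versions and lineage edges (illustrative only)."""

    def __init__(self):
        self._versions = {}
        self._lineage = []  # (source_version_id, derived_version_id)

    def commit(self, version_id, item, parent_ids=()):
        if version_id in self._versions:
            raise ValueError(f"{version_id} exists; versions are never overwritten")
        missing = [p for p in parent_ids if p not in self._versions]
        if missing:
            raise KeyError(f"unknown parent versions: {missing}")
        version = Version(version_id, item, tuple(parent_ids))
        self._versions[version_id] = version
        return version

    def record_lineage(self, source_id, derived_id):
        for version_id in (source_id, derived_id):
            if version_id not in self._versions:
                raise KeyError(f"unknown version: {version_id}")
        self._lineage.append((source_id, derived_id))

    def history(self, version_id):
        """Return version_id followed by its ancestors, found via parent edges."""
        ordered, stack, seen = [], [version_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            ordered.append(current)
            stack.extend(self._versions[current].parent_ids)
        return ordered

    def sources_of(self, version_id):
        """Return the versions that version_id was directly derived from."""
        return [src for src, dst in self._lineage if dst == version_id]


store = ContextStore()
store.commit("script@v1", "wrangling_script")
store.commit("script@v2", "wrangling_script", parent_ids=["script@v1"])
store.commit("table@v1", "sentiment_table")
store.record_lineage("script@v1", "table@v1")
```

Because each lineage edge names specific versions, a question like "which script version produced this table?" can still be answered after the script changes.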
SCALABLE, IMMUTABLE BACKEND Longstanding open problem. Workloads? • Graph queries for metamodel traversal • Log analysis queries for usage. [Figure 8: Dwell time analysis.] [Figure 9: Impact analysis.] Room for improvement • Goal: compete with in-memory performance (“the McSherry baseline”). [Figure 10: PostgreSQL transitive closure variants.]
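Impact analysis comes down to a transitive closure over lineage edges: everything reachable downstream of a changed item. The sketch below uses an in-memory breadth-first search in Python rather than the PostgreSQL variants the slide refers to. The function name and the edge list are hypothetical, loosely modeled on the Janet and Sue scenario.

```python
from collections import deque


def impacted(edges, source):
    """Return every node reachable from source by following (src, dst) edges."""
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)

    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, ()):
            if nxt not in seen:  # the seen set also guards against cycles
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(source)
    return seen


# Illustrative lineage: feed -> script -> table -> two consumers.
lineage_edges = [
    ("gnip_feed", "wrangling_script"),
    ("wrangling_script", "sentiment_table"),
    ("sentiment_table", "churn_model"),
    ("sentiment_table", "discount_pipeline"),
]
affected = impacted(lineage_edges, "wrangling_script")
```

Here `affected` holds the table and both consumers but not the upstream feed. A relational backend would typically express the same traversal as a recursive query.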
NEUTRALITY Reminder: There will be k competing solutions for: • Data wrangling • Data cataloging • Schema extraction • Feature extraction • Social network analysis • Etc. This will consolidate somewhat, but only over time. Goal: foster the ecosystem
NEUTRALITY YOU
MANY OPEN RESEARCH QUESTIONS
Underground: • Workloads • Common Ground representations • No-overwrite versioned DB • Time travel queries: point and trend • Graph queries + log analysis • Consistency
Aboveground: • Content extraction • Analytic user exhaust • Socio-technical networks • Collective governance • Reproducibility • Lifecycle of systems that learn
CURRENT STATUS Alpha Release • Integrated with LinkedIn Gobblin, Kafka, Hive Metastore, GitHub • All components have Docker images on Docker Hub • We’d love feedback! www.ground-context.org