
Ground: A Data Context Service
Joe Hellerstein, Vikram Sreekanti, Joey Gonzalez, et al. CIDR 2017
https://github.com/ground-context/ground


  1. Ground: A Data Context Service. Joe Hellerstein, Vikram Sreekanti, Joey Gonzalez, et al. CIDR 2017. https://github.com/ground-context/ground

  2. Open Source Big Data • Community Health • Long-term Data Management • Data Analysis • Data Wrangling

  3. What was the big data revolution really all about?

  4. Database

  5. A DECOUPLED STACK (Big Data): API / Query Language • Scheduler • Query Optimizer • Ingest / PubSub • Dataflow Engine • Workflow • Storage

  6. A DECOUPLED STACK: API / Query Language (SQL) • Scheduler • Query Optimizer (GP ORCA) • Ingest / PubSub • Dataflow Engine • Workflow • Storage. The Good: Agility.

  7. A DECOUPLED STACK: SQL • GP ORCA. The Bad: Dis-integration.

  8. CRISIS: HOW DO WE SHARE INFORMATION?

  9. WHAT IS METADATA?

  10. WHAT IS METADATA? • Data about data • This used to be so simple! • But … schema on use • One of many changes

  11. OPPORTUNITY: A BIGGER CONTEXT Don’t just fill a metadata-sized hole in the big data stack. Lay the groundwork for rich data context.

  12. WHAT IS DATA CONTEXT? All the information surrounding the use of data.

  13. The ABCs of Data Context • Application Context: Views, models, code • Behavioral Context: Data lineage & usage • Change Over Time: Version histories. Generated by, and useful to, many applications and components.

  14. Janet: “I bet social media content can predict which customers might cancel their accounts!”
      Reply (ground): “Hey Janet! We already paid for a full Gnip feed from Twitter; you can find it here. By the way: Sue used the following related table and script.”

  15. Janet: “I bet social media content can predict which customers might cancel their accounts!”
      Replies (ground):
      “Hey Janet! This looks like Twitter JSON. Many people use this script to turn it into a table.”
      “Be careful: When people store outputs from this script, the following fields are often flagged by IT as PII.”
      “BTW, have you tried the sentiment analysis package?”

  16. Janet: “It looks true! Tweets predict churn!” [Chart] Janet shares with Sue.

  17. Sue: “I wonder if Janet’s sentiment analysis will help with my discount targeting pipeline.” [Chart]
      TweetId  Text           Sentiment
      47       “sad!”         negative
      53       “awesome!”     positive
      57       “go packers!”  neutral
      64       “fleek!”       positive

      TweetId  Text           neg  pos  neut
      47       “sad!”         1    0    0
      53       “awesome!”     0    1    0
      57       “go packers!”  0    0    1
      64       “fleek!”       0    1    0

  18. Time passes…
      Sue: “Uh oh, prediction accuracy metrics are down!” “Oh dear. I better call a meeting to introduce better governance on sentiment labeler.”
      [Chart: Prediction Accuracy (0 to 100) over time, 1/1/2017 00:00 to 1/2/17 12:00]
      FYI: Janet’s wrangling script changed!
      VERSION HISTORY: 12/31/2016 00:00 -800 • hash: 6dda491064bcce14f558bf83867b8c247027c423 • user: will
      TweetId  Text           Sentiment
      47       “sad!”         sadness
      53       “awesome!”     elation
      57       “go packers!”  sports
      64       “fleek!”       trendy

      TweetId  Text           neg  pos  neut
      47       “sad!”         0    0    0
      53       “awesome!”     0    0    0
      57       “go packers!”  0    0    0
      64       “fleek!”       0    0    0
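The all-zero neg/pos/neut columns on slide 18 are what a fixed-vocabulary one-hot encoding produces once the upstream labels change. The following is a minimal sketch of that failure mode, assuming (the slides do not say so) that the neg/pos/neut table is derived by one-hot encoding the Sentiment column. `VOCAB` and `one_hot` are hypothetical names, not part of Ground.

```python
# Hypothetical illustration: one-hot encoding against a fixed label vocabulary.
VOCAB = ("negative", "positive", "neutral")  # assumed mapping to columns neg, pos, neut


def one_hot(label, vocab=VOCAB):
    """Return one 0/1 flag per vocabulary entry; unknown labels yield all zeros."""
    return [1 if label == v else 0 for v in vocab]


# Sentiment labels as they appear on slides 17 and 18, keyed by TweetId.
before = {47: "negative", 53: "positive", 57: "neutral", 64: "positive"}
after = {47: "sadness", 53: "elation", 57: "sports", 64: "trendy"}

encoded_before = {tid: one_hot(label) for tid, label in before.items()}
encoded_after = {tid: one_hot(label) for tid, label in after.items()}

# Labels the encoder has never seen: a cheap signal that an upstream script changed.
unseen = sorted(set(after.values()) - set(VOCAB))
```

Checking for unseen labels is only one possible guard. The version history on the slide is what ties the symptom back to a specific change and user.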

  19. WHAT DID CONTEXT ENABLE? • Self-service catalog, wrangling and analytics. • Collective governance of data. • Fueling our model accuracy monitor. [Chart: accuracy (0 to 100), 1/1/2017 00:00 to 1/2/17 00:00] • Figuring out which changes introduced the error. [VERSION HISTORY] • Determining who made the change to help us resolve the issue. [user: will]

  20. THE BIG CONTEXT Where are the interesting technical challenges? All over! Our goal is not to solve all these challenges. It’s to provide an environment to enable solutions.

  21. ground
      ABOVEGROUND API TO APPLICATIONS: Analytics & Vis • Reference Data • Data Quality • Wrangling • Catalog & Discovery • Time Travel • Parsing & Featurization • Model Serving
      COMMON GROUND: METAMODEL
      UNDERGROUND API TO SERVICES: Scavenging and Ingestion • Versioned Storage • Search & Query • Scheduling & Workflow • ID & Auth

  22. ABOVEGROUND API TO APPLICATIONS: Analytics & Vis • Reference Data • Data Quality • Wrangling • Catalog & Discovery • Time Machine • Parsing & Featurization • Model Serving
      COMMON GROUND: METAMODEL / CONTEXT MODEL
      UNDERGROUND API TO SERVICES: Scavenging and Ingestion • Versioned Storage • Search & Query • Scheduling & Workflow • ID & Auth
      Also shown: Pachyderm, Chronos

  23. DESIGN REQUIREMENTS • Model-agnostic • Immutable • Scalable • Politically Neutral

  24. Postel’s Law Be conservative in what you do, 
 be liberal in what you accept from others

  25. COMMON GROUND The metamodel A: Model Graphs

  26. [Figure: two example model graphs. RELATIONAL SCHEMA: Schema 1; Table 1 … Table t; Column 1 … Column c, Column d; foreign key. JSON DOCUMENT: Root; Object 2; members k1, k2, k11, k12 with string and number values; element 1, element 2, element 3.]
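The idea behind slide 26, that a relational schema and a JSON document can both be described as a graph, can be sketched as follows. This is only an illustration under stated assumptions: `ModelGraph`, its methods, the edge labels, and the member types are hypothetical and are not Ground's actual API or the exact contents of the figure.

```python
from dataclasses import dataclass, field


@dataclass
class ModelGraph:
    """Hypothetical minimal model graph: tagged nodes plus labeled edges."""

    nodes: dict = field(default_factory=dict)  # node name -> dict of tags
    edges: list = field(default_factory=list)  # (source, label, target) triples

    def add_node(self, name, **tags):
        self.nodes[name] = tags

    def add_edge(self, src, label, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, label, dst))

    def children(self, name):
        """Targets of all edges leaving the given node, in insertion order."""
        return [dst for src, _, dst in self.edges if src == name]


# A relational schema: schema -> table -> column.
rel = ModelGraph()
rel.add_node("Schema 1", kind="schema")
rel.add_node("Table 1", kind="table")
rel.add_node("Column 1", kind="column")
rel.add_edge("Schema 1", "contains", "Table 1")
rel.add_edge("Table 1", "contains", "Column 1")

# A JSON document: root object -> members (types here are illustrative).
doc = ModelGraph()
doc.add_node("Root", kind="object")
doc.add_node("member k1", kind="member", type="string")
doc.add_node("member k2", kind="member", type="number")
doc.add_edge("Root", "member", "member k1")
doc.add_edge("Root", "member", "member k2")
```

Both structures use the same two primitives, which is the sense in which a metamodel can stay model-agnostic.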

  27. COMMON GROUND The versioning model A: Model Graphs B. Version Graphs

  28. COMMON GROUND The versioning model A. Model Graphs B. Version Graphs
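A version graph can be pictured as an append-only DAG in which each version points at its parents. The sketch below assumes that reading, in line with the "Immutable" design requirement on slide 23. `VersionGraph`, `commit`, and `ancestors` are hypothetical names, not Ground's API.

```python
class VersionGraph:
    """Hypothetical append-only version DAG: versions are added, never overwritten."""

    def __init__(self):
        self._parents = {}  # version id -> tuple of parent version ids

    def commit(self, version_id, parents=()):
        if version_id in self._parents:
            raise ValueError(f"version {version_id!r} already exists (no overwrite)")
        missing = [p for p in parents if p not in self._parents]
        if missing:
            raise KeyError(f"unknown parent versions: {missing}")
        self._parents[version_id] = tuple(parents)

    def ancestors(self, version_id):
        """All versions reachable through parent links, excluding the start."""
        seen, stack = set(), list(self._parents[version_id])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self._parents[v])
        return seen


vg = VersionGraph()
vg.commit("v1")
vg.commit("v2", parents=["v1"])
vg.commit("v3", parents=["v1"])        # a branch
vg.commit("v4", parents=["v2", "v3"])  # a merge
```

Because parents must already exist when a version is committed, the structure cannot contain cycles.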

  29. COMMON GROUND The usage model C. Lineage Graphs A. Model Graphs B. Version Graphs
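Lineage graphs record which artifacts were derived from which. One use is the kind of question raised in the Janet and Sue story: what is affected downstream when a script changes? The following is a minimal sketch, assuming lineage is stored as (source, derived) pairs. The artifact names and the `downstream` helper are hypothetical.

```python
from collections import defaultdict, deque


def downstream(lineage_edges, start):
    """Breadth-first walk over (source, derived) pairs.

    Returns every artifact derived, directly or indirectly, from `start`.
    """
    derived_from = defaultdict(list)
    for src, dst in lineage_edges:
        derived_from[src].append(dst)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in derived_from[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical lineage loosely modeled on the story in slides 14 to 18.
edges = [
    ("wrangling_script@v1", "sentiment_table@v1"),
    ("sentiment_table@v1", "churn_model@v1"),
    ("sentiment_table@v1", "discount_pipeline@v1"),
]
impacted = downstream(edges, "wrangling_script@v1")
```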

  30. SCALABLE, IMMUTABLE BACKEND Longstanding open problem. Workloads? • Graph queries for metamodel traversal • Log analysis queries for usage. Room for improvement • Goal: compete with in-memory performance (“the McSherry baseline”). Figure 8: Dwell time analysis. Figure 9: Impact analysis. Figure 10: PostgreSQL transitive closure variants.
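A graph query for metamodel traversal often comes down to transitive closure over an edge table. The sketch below shows one common way to express it, a recursive common table expression. It uses SQLite from the Python standard library purely as a stand-in: it does not reproduce the PostgreSQL variants of Figure 10, and the `edge` table and `reachable` helper are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")],
)


def reachable(conn, start):
    """Return all nodes reachable from `start`, sorted by name."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edge WHERE src = ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT e.dst FROM edge AS e JOIN reach AS r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, `reachable(conn, "a")` walks a → b → c → d. An in-memory traversal of the same edges is the kind of baseline the slide says a backend should compete with.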

  31. NEUTRALITY Reminder: There will be k competing solutions for: • Data wrangling • Data cataloging • Schema extraction • Feature extraction • Social network analysis • Etc. This will consolidate somewhat, but only over time. Goal: foster the ecosystem.

  32. NEUTRALITY YOU

  33. MANY OPEN RESEARCH QUESTIONS
      Underground: • Workloads • Common Ground representations • No-overwrite versioned DB • Time travel queries: point and trend • Graph queries + log analysis • Consistency
      Aboveground: • Content extraction • Analytic user exhaust • Socio-technical networks • Collective governance • Reproducibility • Lifecycle of systems that learn

  34. CURRENT STATUS Alpha Release • Integrated with LinkedIn Gobblin, Kafka, Hive Metastore, GitHub • All components have Docker images on DockerHub • We’d love feedback! www.ground-context.org
