SCHEMA INFERENCE FOR MASSIVE JSON DATASETS
ParisBD 2017

Mohamed-Amine Baazizi (1), Houssem Ben Lahmar (2), Dario Colazzo (3), Giorgio Ghelli (4), Carlo Sartiani (5)

(1) Université Pierre et Marie Curie, France
(2) University of Stuttgart, Germany
(3) Université Paris-Dauphine, France
(4) Università di Pisa, Italy
(5) Università della Basilicata, Italy

JSON IN A NUTSHELL
• Acronym for JavaScript Object Notation
• Very popular format for data exchange (API services)
• Predominant data model for NoSQL systems (AsterixDB, Mongo, Arango, Couchbase, Elastic, etc.)
• A current candidate schema language (IETF JSON-Schema) and several query languages (AQL, SQL++, N1QL, etc.)

JSON AND SCHEMAS
No a priori prescriptive schema: flexible data management
Problem: without a schema, there is no opportunity to:
1) understand the structure of potentially large data
2) reason about the structural properties of data
3) apply schema-based optimizations
Goal: infer a posteriori descriptive schemas for JSON

RELATED WORK
Semistructured data:
- approximate/optimal schemas [Nestorov et al. 97, Nestorov et al. 98]
- data guides [Goldman et al. 97]
- expressive type language [Buneman et al. 99]
XML and RDF:
- concise DTDs [Garofalakis et al. 00, Bex et al. 06]
- summary of large XML collections [Hegewald et al. 06]
- summary of ontology properties [Cebiric et al. 15]
JSON:
- schema inference in MapReduce (sketch) [Colazzo et al. 12]
- summarization [Wang et al. 15, Klettke et al. 15]
- extraction of normalized schemas [DiScala et al. 16], adaptive schemas [Spoth et al. 17]
Limitations of these approaches: different data model, no account for structural variations, no scalability

SCHEMA INFERENCE FEATURES
1. Captures complex data and its structural variability
2. Produces succinct schemas
3. Processes large datasets

AGENDA
• Context and Motivation
• JSON data model and schema language
• Schema inference mechanism
• Experimental study
• Conclusion

JSON DATA MODEL
• Null, True, False, Numbers, Strings
• Records: { l_1 : v_1, ..., l_n : v_n }, where each l_i is unique in a record
• Arrays: [ v_1, ..., v_n ]

A JSON value:

    {
      "person": {
        "firstname": "John",
        "lastname": "Smith",
        "coordinates": [120, 10]
      }
    }
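The uniqueness condition on record labels can be checked at parse time. The following is a minimal sketch in Python, assuming the standard `json` module, whose default behaviour is to keep the last value silently when a label repeats. The helper names `unique_labels` and `parse_strict` are hypothetical and not part of the talk.

```python
import json


def unique_labels(pairs):
    """Hook for json.loads: build a record, rejecting repeated labels."""
    record = {}
    for label, value in pairs:
        if label in record:
            raise ValueError(f"duplicate label in record: {label!r}")
        record[label] = value
    return record


def parse_strict(text):
    """Parse JSON text, enforcing that labels are unique within each record."""
    return json.loads(text, object_pairs_hook=unique_labels)


value = parse_strict(
    '{"person": {"firstname": "John", "lastname": "Smith", "coordinates": [120, 10]}}'
)
```

The hook is called once per record, innermost first, so the check applies at every nesting level.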

A SCHEMA LANGUAGE FOR JSON
• Basic types: Null, Bool, Num, Str
• Record types: { l_1 : T_1 q, ..., l_n : T_n q }, with q ∈ {!, ?}
• Union types: T + U
• Array types: [ T* ]

A JSON schema:

    {
      "person": {
        "firstname": Null + Str,
        ("lastname": Str)?,
        "coordinates": [ Num* ]
      }
    }

Note: the JSON-Schema proposal, formalized by [Pezoa et al. 16], considers neither union nor compact arrays.
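One way to make the schema language concrete is to encode it as a small syntax tree. The sketch below is an assumed encoding, not the authors' implementation: the class names `Basic`, `Field`, `Record`, `Array`, `UnionT` and the printer `show` are all hypothetical. It builds the example schema from this slide and renders it in the slide's notation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Basic:
    name: str  # one of "Null", "Bool", "Num", "Str"


@dataclass(frozen=True)
class Field:
    label: str
    type: object
    optional: bool = False  # True renders as ("label": T)?


@dataclass(frozen=True)
class Record:
    fields: Tuple[Field, ...]


@dataclass(frozen=True)
class Array:
    elem: object  # type of every element, rendered as [T*]


@dataclass(frozen=True)
class UnionT:
    branches: Tuple[object, ...]  # rendered as T + U + ...


def show(t) -> str:
    """Render a type in the notation used on the slide."""
    if isinstance(t, Basic):
        return t.name
    if isinstance(t, UnionT):
        return " + ".join(show(b) for b in t.branches)
    if isinstance(t, Array):
        inner = show(t.elem)
        if isinstance(t.elem, UnionT):
            inner = f"({inner})"  # parenthesize unions under the star
        return f"[{inner}*]"
    if isinstance(t, Record):
        parts = []
        for f in t.fields:
            item = f'"{f.label}": {show(f.type)}'
            parts.append(f"({item})?" if f.optional else item)
        return "{" + ", ".join(parts) + "}"
    raise TypeError(f"not a schema type: {t!r}")


person = Record((
    Field("firstname", UnionT((Basic("Null"), Basic("Str")))),
    Field("lastname", Basic("Str"), optional=True),
    Field("coordinates", Array(Basic("Num"))),
))
schema = Record((Field("person", person),))
print(show(schema))
# {"person": {"firstname": Null + Str, ("lastname": Str)?, "coordinates": [Num*]}}
```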

SCHEMA INFERENCE MECHANISM
Input collection:

    J1 = [ ... 123, "abc" ... ]
    J2 = [ ... 879, Null ... ]
    J3 = [ ... {"lab": 758} ]

Map (initial schema inference):

    T1 = [ (Num + Str)* ]
    T2 = [ (Num + Null)* ]
    T3 = [ {"lab": Num}* ]

Reduce (schema fusion), yielding the global schema:

    [ (Num + Str + Null + {"lab": Num})* ]

SCHEMA INFERENCE MECHANISM
Initial schema inference:
• generalizes values
• compacts array content
Schema fusion, Merge(T,U):
• collapses identical types
• detects optional fields
• captures irregularities
Merge is sound, commutative and associative
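The two steps of initial schema inference (generalizing values and compacting array content) can be sketched for a single parsed value as follows. This is an illustrative simplification, not the authors' Scala code: the function name `infer` is hypothetical, types are produced directly as strings in the slide notation, array compaction only deduplicates textually identical element types, and empty arrays get no special treatment. Only the visible elements of J1, J2 and J3 are used, since the rest of each array is elided on the slides.

```python
import json


def infer(value) -> str:
    """Return the type of one parsed JSON value, written in the slide notation."""
    if value is None:
        return "Null"
    if isinstance(value, bool):  # test bool before int: bool is a subclass of int
        return "Bool"
    if isinstance(value, (int, float)):
        return "Num"
    if isinstance(value, str):
        return "Str"
    if isinstance(value, dict):
        fields = ", ".join(f'"{k}": {infer(v)}' for k, v in value.items())
        return "{" + fields + "}"
    if isinstance(value, list):
        # Compact array content: keep each distinct element type once.
        branches = sorted({infer(v) for v in value})
        body = " + ".join(branches)
        return f"[({body})*]" if len(branches) > 1 else f"[{body}*]"
    raise TypeError(f"not a JSON value: {value!r}")


print(infer(json.loads('[123, "abc"]')))     # [(Num + Str)*]
print(infer(json.loads('[879, null]')))      # [(Null + Num)*]
print(infer(json.loads('[{"lab": 758}]')))   # [{"lab": Num}*]
```

Union branches are sorted alphabetically here, so `Null + Num` appears where the slides write `Num + Null`; the two denote the same type because union is commutative.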

FUSION ILLUSTRATED
T:

    {
      "person": {
        "firstname": Null,
        "lastname": Str,
        "coordinates": [ Num* ]
      }
    }

U:

    {
      "person": {
        "firstname": Str,
        "coordinates": [ Bool* ]
      }
    }

Merge(T,U):

    {
      "person": {
        "firstname": Null + Str,
        ("lastname": Str)?,
        "coordinates": [ (Num + Bool)* ]
      }
    }
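The fusion of T and U can be reproduced with a short sketch. It assumes an encoding that is not from the talk: a type is a dict keyed by kind, so a union holds at most one record type, one array type, and each basic type once. All helper names (`basic`, `arr`, `rec`, `req`, `opt`, `merge`, `merge_records`) are hypothetical, and the code follows the behaviour described on the slides rather than the authors' formal specification.

```python
from functools import reduce

# Assumed encoding: a type is a dict keyed by kind.
#   basic kind -> True                       {"Null": True, "Str": True} is Null + Str
#   "array"    -> element type               {"array": {"Num": True}} is [Num*]
#   "record"   -> {label: (type, optional)}


def basic(*names):
    return {name: True for name in names}


def arr(elem):
    return {"array": elem}


def rec(**fields):
    return {"record": fields}


def req(t):
    return (t, False)  # mandatory field


def opt(t):
    return (t, True)   # optional field


def merge(t, u):
    """Fuse two types kind by kind."""
    out = {}
    for kind in set(t) | set(u):
        if kind in t and kind in u:
            if kind == "record":
                out[kind] = merge_records(t[kind], u[kind])
            elif kind == "array":
                out[kind] = merge(t[kind], u[kind])
            else:
                out[kind] = True  # identical basic types collapse
        else:
            out[kind] = t[kind] if kind in t else u[kind]
    return out


def merge_records(r, s):
    """Fuse two record types; a label missing on one side becomes optional."""
    out = {}
    for label in set(r) | set(s):
        if label in r and label in s:
            (t, t_opt), (u, u_opt) = r[label], s[label]
            out[label] = (merge(t, u), t_opt or u_opt)
        else:
            t, _ = r[label] if label in r else s[label]
            out[label] = (t, True)
    return out


T = rec(person=req(rec(firstname=req(basic("Null")),
                       lastname=req(basic("Str")),
                       coordinates=req(arr(basic("Num"))))))
U = rec(person=req(rec(firstname=req(basic("Str")),
                       coordinates=req(arr(basic("Bool"))))))
expected = rec(person=req(rec(firstname=req(basic("Null", "Str")),
                              lastname=opt(basic("Str")),
                              coordinates=req(arr(basic("Num", "Bool"))))))

assert merge(T, U) == expected
assert merge(U, T) == expected  # same result with the arguments swapped
```

Because the result does not depend on argument order or grouping, a whole collection of inferred types can be folded in any order, for example with `reduce(merge, types)`, which is what makes the fusion suitable for a distributed reduce step.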

EXPERIMENTAL STUDY
• Main goal: assess succinctness and efficiency
• Scala-based implementation
  • initial schema inference: extends the Json4s parser [json4s]
  • schema fusion: follows the formal specification
• Settings: 6 nodes, 10 dual core, 64GB RAM, Spark 1.6.1
• Datasets: Github, Twitter, and NYTimes, stored on HDFS

EXPERIMENTAL RESULTS

    Dataset                     Github       Twitter        NYTimes
    Input data
      Size                      13 GB        21 GB          21.3 GB
      # objects                 1 million    9.9 million    1.2 million
      avg. AST size             495          142            1,238
      avg. AST height           4            3              7
    Initial schema inference
      avg. AST size             495          135            109
    Schema fusion
      AST size                  655          559            139
      Execution time            0.7 min      1.7 min        2.8 min

CONCLUSION
Inference of a descriptive schema for a JSON dataset
• mitigates the lack of schema and incomplete data description
A simple, yet informative schema language
• captures the global structure of data and its variations
A distributed and incremental inference mechanism
• processes large datasets, tackles dynamicity

FUTURE DIRECTIONS
Succinctness vs. precision
• recover information loss (e.g. field correlation)
Schemas enriched with statistics
• cardinality of fields / union branches, typical array size
Impact on storage and query optimization
Analysis of other use cases, visualization of schemas

THANK YOU

REFERENCES (1/2)
• G. J. Bex, F. Neven, T. Schwentick, K. Tuyls. Inference of Concise DTDs from XML Data. VLDB'06
• P. Buneman, B. Pierce. Union Types for Semistructured Data. DBPL'99
• S. Cebiric, F. Goasdoué, I. Manolescu. Query-Oriented Summarization of RDF Graphs. PVLDB'15
• D. Colazzo, G. Ghelli, C. Sartiani. Typing Massive JSON Datasets. XLDI'12
• M. DiScala, D. Abadi. Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data. SIGMOD'16
• M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. SIGMOD'00
• R. Goldman, J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB'97

REFERENCES (2/2)
• J. Hegewald, F. Naumann, M. Weis. XStruct: Efficient Schema Extraction from Multiple and Large XML Documents. ICDE'06
• M. Klettke, U. Störl, S. Scherzinger. Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores. BTW'15
• S. Nestorov, S. Abiteboul, R. Motwani. Inferring Structure in Semistructured Data. SIGMOD'97
• S. Nestorov, S. Abiteboul, R. Motwani. Extracting Schema from Semistructured Data. SIGMOD'98
• F. Pezoa, J. Reutter, F. Suarez, M. Ugarte, D. Vrgoc. Foundations of JSON Schema. WWW'16
• W. Spoth, B. Sadat Arab, E.S. Chan, D. Gawlick, … Adaptive Schema Databases. CIDR'17
• L. Wang, O. Hassanzadeh, S. Zhang, J. Shi, L. Jiao, J. Zou, C. Wang. Schema Management for Document Stores. VLDB'15
