
Towards Schema-independent Querying on Document Data Stores

H. BEN HAMADOU (1), F. GHOZZI (2), A. PENINOU (1), O. TESTE (1)
(1) IRIT, Université de Toulouse, France (UT3, UT2J)
(2) MIRACL, Université de Sfax, Tunisie (ISIMS)
hamdi.ben-hamadou@irit.fr


  1. Title slide (slide 1 / 28)
     Towards Schema-independent Querying on Document Data Stores
     26-03-2018, DOLAP'18
     Running footer: H. BEN HAMADOU et al. (IRIT), Schema-independent Querying, 26-03-2018, DOLAP'18

     Introduction: Document-oriented Database (slide 2 / 28)
     Data format: semi-structured documents (JSON, BSON, ...)
     Data model: schema-less
     Advantages: big data support, scalability, availability
     Examples: MongoDB, CouchDB
     Applications: Web, IoT, social media, ...
     Interrogation: JDBC, drivers, API, command line, ...

  2. Introduction: Backgrounds. Modeling Multi-structured Data (slide 3 / 28)
     Collection: $C = \{d_1, \ldots, d_c\}$
     Document: $d_i = (k_i, v_i)$, where
       $k_i$ is the document's identifier;
       $v_i = \{a_{i,1} : v_{i,1}, \ldots, a_{i,n_i} : v_{i,n_i}\}$ is the document's value.
     Document schema: $s_i = \{p_1, \ldots, p_m\}$, where each $p_j$ is a path leading to a leaf node in document $d_i$.
     Collection schema: $S = \bigcup_{i=1}^{|C|} s_i$

     Introduction: Backgrounds. Structural Heterogeneity (slide 4 / 28)

     Document 1
       {
         "_id": 1,
         "title": "Fast and furious",
         "year": 2017,
         "language": "English"
       }

     Document 2
       {
         "_id": 2,
         "title": "Titanic",
         "details": {
           "year": 1997,
           "language": "English"
         }
       }

     Document 3
       {
         "_id": 3,
         "title": "Despicable Me 3",
         "year": 2017
       }

     Document 4
       {
         "_id": 4,
         "title": "The Hobbit",
         "versions": [
           { "year": 2012, "language": "English" },
           { "year": 2013, "language": "French" }
         ]
       }
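The schema definitions above can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names are hypothetical, the whole JSON object (including "_id") is treated as the document value, and array positions are numbered from 1 only because the slides write paths such as "versions.1.year".

```python
from typing import Any, Iterator

# The four example documents of the collection C (slide 4).
COLLECTION = [
    {"_id": 1, "title": "Fast and furious", "year": 2017, "language": "English"},
    {"_id": 2, "title": "Titanic",
     "details": {"year": 1997, "language": "English"}},
    {"_id": 3, "title": "Despicable Me 3", "year": 2017},
    {"_id": 4, "title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]


def leaf_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dot-separated path of every leaf node under `value`."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # Assumption: 1-based positions, following the slides' path notation.
        for pos, child in enumerate(value, start=1):
            yield from leaf_paths(child, f"{prefix}.{pos}" if prefix else str(pos))
    else:
        yield prefix


def document_schema(doc: dict) -> set:
    """s_i: the set of paths leading to leaf nodes in one document."""
    return set(leaf_paths(doc))


def collection_schema(collection: list) -> set:
    """S: the union of the document schemas over the whole collection."""
    schema = set()
    for doc in collection:
        schema |= document_schema(doc)
    return schema
```

Under these assumptions, `document_schema(COLLECTION[1])` contains "details.year" but not "year", which is the structural heterogeneity the slide points at.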

  3. Introduction: Querying Semi-structured Documents. Query Operators (slide 5 / 28)
     Kernel of unary operators: $k = \{\pi, \sigma\}$
     Projection operator: $\pi(A)(C_{in}) = C_{out}$
       The project operator reduces the initial schemas of documents to a finite subset of attributes $A$.
     Selection operator: $\sigma(P)(C_{in}) = C_{out}$
       The select operator retrieves only the documents that match the selection condition $P$, expressed in normal form ($Norm_p$).

     Introduction: Querying Multi-structured Data, Problem (slide 6 / 28)
     Query: π("title", "year")(C), over the same four documents as on slide 4.
     "year" sits at the root of documents 1 and 3, under "details" in document 2, and inside each element of the "versions" array in document 4.

  4. Introduction: Querying Multi-structured Data, Problem (slide 6 / 28, continued)
     Query: π("title", "year")(C), over the same four documents as on slide 4.
     Extended query: π("title", "year", "details.year", "versions.1.year", "versions.2.year")(C)
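The difference between the two projections can be made concrete with a small sketch. It is an illustration only, not the operators as implemented by the authors: `get_path` and `project` are hypothetical helpers, the result keeps "_id" and uses the path itself as the output key, and array positions are 1-based to match the slides' notation.

```python
# The four example documents of the collection C (slide 4).
COLLECTION = [
    {"_id": 1, "title": "Fast and furious", "year": 2017, "language": "English"},
    {"_id": 2, "title": "Titanic",
     "details": {"year": 1997, "language": "English"}},
    {"_id": 3, "title": "Despicable Me 3", "year": 2017},
    {"_id": 4, "title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]


def get_path(doc, path):
    """Return (True, value) if the absolute path resolves in doc, else (False, None)."""
    current = doc
    for step in path.split("."):
        if isinstance(current, dict) and step in current:
            current = current[step]
        elif (isinstance(current, list) and step.isdigit()
              and 1 <= int(step) <= len(current)):
            current = current[int(step) - 1]  # 1-based position, as in the slides
        else:
            return False, None
    return True, current


def project(collection, attributes):
    """Keep, for each document, only the listed paths that actually resolve."""
    result = []
    for doc in collection:
        projected = {"_id": doc["_id"]}  # assumption: the identifier is always kept
        for path in attributes:
            found, value = get_path(doc, path)
            if found:
                projected[path] = value
        result.append(projected)
    return result


simple = project(COLLECTION, ["title", "year"])
extended = project(COLLECTION, ["title", "year", "details.year",
                                "versions.1.year", "versions.2.year"])
```

With the simple query, documents 2 and 4 come back with a title but no year. The extended query retrieves every year, but only because the user listed each structural variant by hand.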

  5. Introduction: Querying Multi-structured Data, Problem (slide 6 / 28, continued)
     Query: π("title", "year", "details.year", "versions.1.year", "versions.2.year")(C), over the same four documents as on slide 4.

     Plan (slide 7 / 28)
     1 Introduction
     2 Querying Heterogeneous Documents
     3 Experiments
     4 Conclusion & perspectives

  6. Querying Heterogeneous Documents: State of the Art (slide 8 / 28)
     Physical data transformation
     Flattening data. Using additional databases. Introducing new structures.
     [(Chasseur et al., 2013), (Tahara et al., 2014)]
     ⇒ Need to learn new schemas.
     ⇒ Loss of the initial document schemas / structures.
     ⇒ Need to rebuild the schemas when structures change.

     Querying Heterogeneous Documents: State of the Art (slide 9 / 28)
     Virtual data transformation
     Inferring existing schemas. Building a unified schema. Tracking different schema versions.
     [(Baazizi et al., 2017), (Ruiz et al., 2015), (Wang et al., 2015)]
     ⇒ Need to learn new structures.
     ⇒ Querying is limited to the structural level.
     ⇒ Heterogeneity is managed manually when formulating application queries.

  7. Querying Heterogeneous Documents: Our Approach, EasyQ (slide 10 / 28)
     Figure: EasyQ Architecture

     Querying Heterogeneous Documents: Dictionary (slide 11 / 28)
     The dictionary $dict_C$ constructed from a collection $C$ is defined by
       $dict_C = \{(p_k, \triangle_k)\}, \ \forall p_k \in S_C$
     where
       $p_k \in S_C$ is a path leading to a leaf node that is present in at least one document;
       $\triangle_k = \{p_{p_k,1}, \ldots, p_{p_k,q}\} \subseteq S_C$ is the set of navigational paths leading to $p_k$.

  8. Querying Heterogeneous Documents: Dictionary Construction Process (slide 12 / 28)
     Attribute considered: "year", over the same four documents as on slide 4.
     Resulting dictionary entry:
       dict = {("year", {"year", "details.year", "versions.1.year", "versions.2.year"})}
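The construction shown for "year" can be sketched in a few lines of Python. This is a simplified illustration rather than the EasyQ implementation: entries are keyed by the last step of each leaf path (which matches the "year" example but is an assumption about the general case), array positions are 1-based as in the slides, and `build_dictionary` and `expand_projection` are hypothetical names. The second helper shows one way such a dictionary could yield the extended projection of slide 6.

```python
from collections import defaultdict

# The four example documents of the collection C (slide 4).
COLLECTION = [
    {"_id": 1, "title": "Fast and furious", "year": 2017, "language": "English"},
    {"_id": 2, "title": "Titanic",
     "details": {"year": 1997, "language": "English"}},
    {"_id": 3, "title": "Despicable Me 3", "year": 2017},
    {"_id": 4, "title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]


def leaf_paths(value, prefix=""):
    """Yield the dot-separated path of every leaf node under `value`."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for pos, child in enumerate(value, start=1):  # 1-based, as in the slides
            yield from leaf_paths(child, f"{prefix}.{pos}" if prefix else str(pos))
    else:
        yield prefix


def build_dictionary(collection):
    """Map each leaf attribute name to the set of absolute paths that reach it."""
    dictionary = defaultdict(set)
    for doc in collection:
        for path in leaf_paths(doc):
            # Assumption: the dictionary key is the last step of the path.
            key = path.rsplit(".", 1)[-1]
            dictionary[key].add(path)
    return dict(dictionary)


def expand_projection(attributes, dictionary):
    """Replace each attribute by all absolute paths recorded for it."""
    expanded = []
    for attribute in attributes:
        for path in sorted(dictionary.get(attribute, {attribute})):
            if path not in expanded:
                expanded.append(path)
    return expanded


dictionary = build_dictionary(COLLECTION)
paths = expand_projection(["title", "year"], dictionary)
```

Here `dictionary["year"]` holds the same four paths as the entry on the slide, and `paths` lists the same five attributes as the extended query, in a different order.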
