
Data-Transformation on historical data using the RDF Data Cube Vocabulary Sebastian Bayerl, Michael Granitzer Department of Media Computer Science University of Passau SWIB15 – Semantic Web in Libraries 22.10.2015

2 Overview • Motivation • Vocabulary and Dataset • Problem Setting and Approach • Workflow • Contributions

3 Motivation • Statistical and historical data source • Statistics of the German Reich (digitized) • Access the encapsulated knowledge • Data Analytics and Recommendation • Using Linked Data (RDF Data Cube Vocabulary) • But first: Data Integration • Data Cleaning, Transformation and Fusion

4 Source structure

D1      a   b   c   d
D2      a   a   b   b
D3  a   1   2   3   4     (cells: F)
D3  b   5   6   7   8     (cells: F)

Target structure

F   D1  D2  D3
1   a   a   a
2   b   a   a
3   c   b   a
…   …   …   …
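The step from the source layout to the target layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory representation (two header lists plus labelled body rows) and the function name `normalize` are hypothetical and not taken from the prototype. The values are the ones shown on the slide.

```python
# Hypothetical in-memory form of the source layout: two header rows carry
# the values of dimensions D1 and D2, the first column carries D3, and the
# remaining cells carry the facts F.
d1_header = ["a", "b", "c", "d"]
d2_header = ["a", "a", "b", "b"]
body = [
    ("a", [1, 2, 3, 4]),
    ("b", [5, 6, 7, 8]),
]


def normalize(d1_header, d2_header, body):
    """Turn the cross-tabulated layout into one row per fact."""
    rows = []
    for d3, facts in body:
        for d1, d2, fact in zip(d1_header, d2_header, facts):
            rows.append({"F": fact, "D1": d1, "D2": d2, "D3": d3})
    return rows


observations = normalize(d1_header, d2_header, body)
for row in observations[:3]:
    print(row)
```

Each resulting row holds exactly one fact together with the dimension values that describe it, which is the shape an observation in a data cube needs.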

6 Data Cubes and OLAP • Cube: Multi-dimensional data structure • Observation: measures and dimensions • Measure: numerical fact • Dimension: describes the fact(s) • Enables Data Analytics • OLAP: Online Analytical Processing • Slicing, Dicing, Roll-Up, …
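As a rough illustration of slicing and roll-up, the sketch below treats a cube as a plain list of observations. The dimension names, the measure name and all numbers are invented for the example, and the helper functions `slice_cube` and `roll_up` are hypothetical, not part of any OLAP library.

```python
from collections import defaultdict

# Invented observations: three dimensions (region, year, sex), one measure.
cube = [
    {"region": "north", "year": 1880, "sex": "f", "population": 10},
    {"region": "north", "year": 1880, "sex": "m", "population": 12},
    {"region": "south", "year": 1880, "sex": "f", "population": 7},
    {"region": "south", "year": 1880, "sex": "m", "population": 9},
    {"region": "north", "year": 1881, "sex": "f", "population": 11},
    {"region": "north", "year": 1881, "sex": "m", "population": 13},
]


def slice_cube(observations, dimension, value):
    """Slicing: fix one dimension to a single value and drop it."""
    return [
        {k: v for k, v in obs.items() if k != dimension}
        for obs in observations
        if obs[dimension] == value
    ]


def roll_up(observations, keep, measure):
    """Roll-up: sum the measure over all dimensions not listed in keep."""
    totals = defaultdict(int)
    for obs in observations:
        key = tuple(obs[d] for d in keep)
        totals[key] += obs[measure]
    return dict(totals)


year_1880 = slice_cube(cube, "year", 1880)
per_region_1880 = roll_up(year_1880, ["region"], "population")
per_region_all = roll_up(cube, ["region"], "population")
print(per_region_1880)
print(per_region_all)
```

Dicing would follow the same pattern as `slice_cube`, filtering on several dimensions at once instead of one.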

The RDF Data Cube Vocabulary • RDF based vocabulary • Models an OLAP Data Cube • Interlink components with existing concepts http://www.w3.org/TR/vocab-data-cube/
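To make the vocabulary more concrete, the following sketch writes a tiny cube as Turtle using plain string formatting. The `qb:` terms (`qb:DataSet`, `qb:DataStructureDefinition`, `qb:component`, `qb:dimension`, `qb:measure`, `qb:DimensionProperty`, `qb:MeasureProperty`, `qb:Observation`, `qb:dataSet`, `qb:structure`) come from the W3C vocabulary. Everything in the `ex:` namespace, the function name and the sample rows are placeholders chosen for illustration, not the prototype's actual output.

```python
QB = "http://purl.org/linked-data/cube#"
EX = "http://example.org/cube/"  # placeholder namespace, not from the slides

HEADER = f"""@prefix qb: <{QB}> .
@prefix ex: <{EX}> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:dsd a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:D1 ] ,
                 [ qb:dimension ex:D2 ] ,
                 [ qb:dimension ex:D3 ] ,
                 [ qb:measure ex:F ] .

ex:D1 a qb:DimensionProperty .
ex:D2 a qb:DimensionProperty .
ex:D3 a qb:DimensionProperty .
ex:F a qb:MeasureProperty .

ex:dataset a qb:DataSet ;
    qb:structure ex:dsd .
"""


def observation_to_turtle(index, row):
    """Serialize one normalized row as a qb:Observation in Turtle."""
    dimensions = "".join(
        f'    ex:{name} "{row[name]}" ;\n' for name in ("D1", "D2", "D3")
    )
    fact = row["F"]
    return (
        f"ex:obs{index} a qb:Observation ;\n"
        f"    qb:dataSet ex:dataset ;\n"
        f"{dimensions}"
        f'    ex:F "{fact}"^^xsd:integer .\n'
    )


# Sample rows in the normalized layout (one fact per row).
rows = [
    {"F": 1, "D1": "a", "D2": "a", "D3": "a"},
    {"F": 2, "D1": "b", "D2": "a", "D3": "a"},
]
turtle = HEADER + "\n" + "\n".join(
    observation_to_turtle(i, row) for i, row in enumerate(rows, start=1)
)
print(turtle)
```

The dimension values are written as plain literals here to keep the sketch short. Interlinking with existing concepts would instead point the dimension properties and their values at URIs from shared vocabularies or code lists.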

8 Examples 1

9 Examples 2

11 Problem Setting • Data is encapsulated in multiple files • Unusable for sophisticated Data Analysis • Normalization of complex structured data • Dirty and faulty data, structure or annotations • Lots of similar problems in a huge dataset

12 Approach • Use the RDF Data Cube Vocabulary • Enables: Interlinking, merging and analytics • Use an incremental workflow • Identify fine-granular transformations • Implement the research prototype with GUI • Select, configure and chain transformations (save/load) • HTML preview

13 Workflow
1. Load link group
2. Parse files
3. Merge into single table
4. Iterate transformations
5. Apply transformation
6. Produce HTML
7. Convert to RDF
8. Persist RDF
[Workflow diagram; further labels: TEI, Java objects, HTML, RDF, Transformations. Additional branches: Convert to SQL statements and persist in a relational database or data warehouse; data visualisation.]
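The idea of selecting, configuring, chaining and saving transformations, and then iterating over them, might look roughly like the sketch below. It assumes that a table is a list of rows and that a chain is a JSON-serializable list of step descriptions. The registry, the two transformations and the sample table are hypothetical and only stand in for whatever the prototype actually does.

```python
import json


def delete_row(table, index):
    """Return a copy of the table without the row at the given index."""
    return [row for i, row in enumerate(table) if i != index]


def delete_column(table, index):
    """Return a copy of the table without the column at the given index."""
    return [[cell for j, cell in enumerate(row) if j != index] for row in table]


# Hypothetical registry mapping transformation names to functions.
REGISTRY = {"delete_row": delete_row, "delete_column": delete_column}


def apply_chain(table, chain):
    """Iterate over the configured steps and apply them one after another."""
    for step in chain:
        transform = REGISTRY[step["name"]]
        table = transform(table, **step["params"])
    return table


# A configured chain is plain data, so it can be saved and loaded again.
chain = [
    {"name": "delete_row", "params": {"index": 2}},
    {"name": "delete_column", "params": {"index": 0}},
]
saved = json.dumps(chain)
restored = json.loads(saved)

table = [
    ["Nr", "Region", "Count"],
    ["1", "A", "10"],
    ["Nr", "Region", "Count"],
    ["2", "B", "20"],
]
result = apply_chain(table, restored)
print(result)
```

Keeping the chain as data rather than code is one way to make a conversion chain reusable for other tables with the same structural problems.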

14 Transformations
1. Pre-Normalization
• Sanity checks
• Data Cleaning
• Fix structure (e.g. spans), data and annotations
• Delete row (e.g. repeating headers)
• 30 more …
2. Normalization
• Normalization
• Compound normalization: Horizontal or vertical partitions
3. Post-Normalization
• Add/merge/delete columns
• Add headers/disambiguation
• Add metadata
• …
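Two of the pre-normalization steps named above, resolving spans and deleting repeated header rows, can be sketched as small functions. The cell representation as `(value, colspan)` pairs, the function names and the sample table are assumptions made for this example only.

```python
def resolve_column_spans(row):
    """Expand (value, colspan) cells so every column holds its own value."""
    expanded = []
    for value, colspan in row:
        expanded.extend([value] * colspan)
    return expanded


def delete_repeating_headers(table):
    """Keep the first row as header and drop later rows identical to it."""
    header = table[0]
    return [header] + [row for row in table[1:] if row != header]


# Invented sample: the header cell "Population" spans two columns and the
# header row is repeated in the middle of the table.
raw = [
    [("Region", 1), ("Population", 2)],
    [("A", 1), ("10", 1), ("12", 1)],
    [("Region", 1), ("Population", 2)],
    [("B", 1), ("7", 1), ("9", 1)],
]

table = [resolve_column_spans(row) for row in raw]
cleaned = delete_repeating_headers(table)
print(cleaned)
```

After both steps every row has the same number of columns and the header occurs only once, which is a precondition for the normalization step that follows.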

15 Advanced transformations • Compound transformations • Combine multiple transformations • Fix more complex problems • E.g. find problematic cells and fix with existing transformation • Transformation suggestions • Find common problems: Repeat symbol, annotation patterns • A step towards automation
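A compound transformation and a simple suggestion rule could be combined as in the sketch below. It assumes that a repeat symbol means "same value as the cell above"; the set of symbols, the function names and the sample table are hypothetical and not taken from the prototype.

```python
# Assumed repeat symbols; the symbols used in the real dataset may differ.
REPEAT_SYMBOLS = {'"', "〃", "do."}


def find_repeat_cells(table):
    """Locate problematic cells: (row, column) positions of repeat symbols."""
    return [
        (r, c)
        for r, row in enumerate(table)
        for c, cell in enumerate(row)
        if cell.strip() in REPEAT_SYMBOLS
    ]


def copy_from_above(table, r, c):
    """Existing basic transformation: copy the value of the cell above."""
    fixed = [list(row) for row in table]
    fixed[r][c] = fixed[r - 1][c]
    return fixed


def resolve_repeat_symbols(table):
    """Compound transformation: find the cells, then fix each one."""
    for r, c in find_repeat_cells(table):
        if r > 0:
            table = copy_from_above(table, r, c)
    return table


def suggest(table):
    """Suggest transformations based on problems detected in the table."""
    return ["resolve_repeat_symbols"] if find_repeat_cells(table) else []


table = [
    ["Prussia", "1880", "100"],
    ['"', "1881", "110"],
    ['"', "1882", "120"],
]
suggestions = suggest(table)
fixed = resolve_repeat_symbols(table)
print(suggestions)
print(fixed)
```

Because the cells are processed top to bottom, a run of consecutive repeat symbols in one column resolves to the last real value above the run.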

16 Contributions • Modular workflow for the Data Integration process • Definition of fine-granular transformation steps • Reusable within the same or for other data sources • Lift and enrich historical statistical data • Ready for Data Analytics • Current dataset contains 32169 files • > 10% converted • 10 conversion chains https://github.com/bayerls/statistics2cubes

17 Thank you for your attention! Questions?
Sebastian Bayerl, Department of Media Computer Science, University of Passau, bayerl@dimis.fim.uni-passau.de
RDF Data Cube Vocabulary: http://www.w3.org/TR/vocab-data-cube/
https://github.com/bayerls/statistics2cubes

BACKUP

Publication • Bayerl, Sebastian, and Michael Granitzer. "Data-transformation on historical data using the RDF data cube vocabulary." Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business. ACM, 2015.

Abstract This work describes how XML-based TEI documents containing statistical data can be normalized, converted and enriched using the RDF Data Cube Vocabulary. In particular, we focus on a real-world statistical data set, namely the statistics of the German Reich around the year 1880, which are available in the TEI format. The data is embedded in complex structured tables, which are relatively easy for humans to understand but, because of their varying structural properties and differing table layouts, are not suitable for automated processing and data analysis without heavy pre-processing. Therefore, the complex structured tables must be validated, modified and transformed until they are suitable for the standardized multi-dimensional data structure, the data cube. This work especially focuses on the transformations necessary to normalize the structure of the tables. Available transformations include performing validation and cleaning steps, resolving row and column spans, and reordering slices, among many others. By combining existing transformations, compound operators are implemented that can handle specific and complex problems. The identification of structural similarities or properties can be used to automatically suggest sequences of transformations. A second focus is on the advantages that come with using the RDF Data Cube Vocabulary. In addition, a research prototype was implemented to execute the workflow and convert the statistical data into data cubes.
