Online and Scalable Semantic Data Analytics
Themis Palpanas
Paris Descartes University
Institut Universitaire de France
Federated Semantic Data Management Seminar
Dagstuhl, June 2017
Our Work
• large scale data
  ▫ Managing and Analyzing Very Large Scientific Data
  ▫ infrastructure monitoring, motion capture, genome sequences, fMRI (neuroscience), astronomy
• streaming data
  ▫ Real Time Analysis of Data Streams
  ▫ continuous monitoring, online pattern identification
• heterogeneous data
  ▫ Fuse Data from Different Sources
  ▫ entity resolution, query answering using knowledge graphs, subjectivity analysis
• private data
• uncertain data
  ▫ Processing and Mining Uncertain Data
  ▫ uncertain data series (e.g., sensor measurements)
  ▫ uncertain graphs (e.g., biological networks)
funded by: European Commission, CNRS, Facebook, IBM Research, FMJH, Inria, Hewlett-Packard Labs, Telecom Italia, Autonomous Province of Trento
Themis Palpanas - June 2017
Entity Resolution in Large, Heterogeneous Data Spaces
problem
• develop a framework and techniques for entity resolution in very large and highly heterogeneous data spaces (i.e., loose schema binding, high levels of heterogeneity and noise, missing attribute names or values)
• scale to web size
applications
• web-scale data integration
  ▫ “which entities in these two web datasets are the same?”
  ▫ entity resolution for heterogeneous web data
• query answering
  ▫ return a set of unique entities in response to a user query
  ▫ produce high-quality results
Our Work
• novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution
• develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pair-wise entity comparisons (high efficiency)
we propose
• a framework for entity resolution in heterogeneous data spaces at web scale
• efficient and effective algorithms for:
  ▫ blocking
  ▫ block purging
  ▫ duplicates propagation
  ▫ block scheduling
  ▫ block pruning
  ▫ comparisons propagation
  ▫ comparisons pruning
Tutorial, with links to papers, demo, code, and datasets:
http://www.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialScaDS-LeipsigSummerSchool2016v2.pptx
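To make the recall/efficiency trade-off of block building concrete, here is a minimal sketch of schema-agnostic token blocking. It is an illustration under stated assumptions, not the authors' implementation: the function names (`token_blocking`, `candidate_pairs`), the whitespace tokenization, and the toy profiles are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Create one block per token found in any attribute value.

    Attribute names are ignored, so profiles with different schemas
    can still end up in the same block.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block holding a single entity yields no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy profiles with deliberately different attribute names (hypothetical data).
profiles = {
    "e1": {"name": "Themis Palpanas", "affiliation": "Paris Descartes"},
    "e2": {"fullName": "T. Palpanas", "org": "Paris Descartes University"},
    "e3": {"title": "JedAI Toolkit"},
}

blocks = token_blocking(profiles)
pairs = candidate_pairs(blocks)
print(sorted(blocks))  # tokens shared by at least two profiles
print(pairs)           # pairs that still need to be compared
```

In this toy input only e1 and e2 share tokens, so one comparison remains instead of the three that a brute-force approach would execute.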
What is the JedAI Toolkit?
JedAI can be used in three ways:
1. As an open source library that implements numerous state-of-the-art methods for all steps of an established end-to-end ER workflow.
2. As a desktop application for ER with an intuitive Graphical User Interface that is suitable for both expert and lay users.
3. As a workbench for comparing all performance aspects of various (configurations of) end-to-end ER workflows.
How does the JedAI Toolkit work?
JedAI implements the following schema-agnostic, end-to-end workflow for both Clean-Clean and Dirty ER:
• Step 1, Data Reading: reads files containing entity profiles and the golden standard.
• Step 2, Block Building: creates overlapping blocks.
• Step 3, Block Cleaning: optional step that cleans blocks from useless comparisons (repeated, superfluous).
• Step 4, Comparison Cleaning: optional step that operates on the level of individual comparisons to remove the useless ones.
• Step 5, Entity Matching: executes all retained comparisons.
• Step 6, Entity Clustering: partitions the similarity graph into equivalence clusters.
• Step 7, Evaluation & Storing: stores and presents the performance results w.r.t. numerous measures.
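As an illustration of the entity clustering step, the sketch below partitions a similarity graph into clusters by taking connected components over the edges whose similarity reaches a threshold. Connected components is only one possible clustering method and is an assumption here; the function name, the threshold of 0.5, and the toy edges are hypothetical and do not reflect JedAI's API.

```python
def cluster_by_connected_components(entity_ids, weighted_edges, threshold):
    """Group entities linked by edges with similarity >= threshold (union-find)."""
    parent = {entity: entity for entity in entity_ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in weighted_edges:
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for entity in entity_ids:
        clusters.setdefault(find(entity), set()).add(entity)
    return sorted(sorted(members) for members in clusters.values())


# Toy similarity graph (hypothetical output of the entity matching step).
entities = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2)]

clusters = cluster_by_connected_components(entities, edges, threshold=0.5)
print(clusters)  # [['a', 'b', 'c'], ['d']]
```

The weak edge between c and d falls below the threshold, so d remains a singleton cluster.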
How is the JedAI Toolkit structured?
• Modular architecture: one module per workflow step.
• Extensible architecture (e.g., ontology matching) ???
How can I build an ER workflow?
JedAI supports several established methods for each workflow step:
• Step 1, Data Reading: possible to read CSV, RDF/XML files & relational DBs in any combination!
• Step 2, Block Building: choose 1 out of 8 methods.
• Step 3, Block Cleaning: specify any combination of 3 (4) complementary methods for Dirty (Clean-Clean) ER.
• Step 4, Comparison Cleaning: choose 1 out of 7 methods (including Meta-blocking).
• Step 5, Entity Matching: combine 1 out of 2 methods with 12 textual representation models and 10 similarity measures.
• Step 6, Entity Clustering: choose 1 out of 6 methods for Dirty ER. For Clean-Clean ER, 1 method is available.
• Step 7, Evaluation & Storing: store results as a CSV file.
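To show what combining a textual representation model with a similarity measure can look like in the entity matching step, here is a minimal sketch that represents each profile as a set of tokens and compares two profiles with Jaccard similarity. This particular pairing, the function names, and the toy profiles are assumptions chosen for illustration; the slides do not list which models and measures JedAI provides.

```python
def token_set(profile):
    """Represent a profile as the set of lower-cased tokens of all its values."""
    tokens = set()
    for value in profile.values():
        tokens.update(str(value).lower().split())
    return tokens


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two toy profiles with different attribute names (hypothetical data).
p1 = {"name": "Themis Palpanas", "affiliation": "Paris Descartes"}
p2 = {"fullName": "T. Palpanas", "org": "Paris Descartes University"}

similarity = jaccard(token_set(p1), token_set(p2))
print(similarity)  # 3 shared tokens out of 6 distinct ones -> 0.5
```

The resulting scores are the edge weights of the similarity graph that the clustering step then partitions.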
Which Blocking Methods are included?
Block Building
• Token Blocking
• Sorted Neighborhood
• Extended Sorted Neighborhood
• Attribute Clustering
• Q-Grams Blocking
• Extended Q-Grams Blocking
• Suffix Arrays
• Extended Suffix Arrays
Block Cleaning
• Block Filtering
• Size-based Block Purging
• Cardinality-based Block Purging
• Block Scheduling
Comparison Cleaning
• Comparison Propagation
• Cardinality Edge Pruning (CEP)
• Cardinality Node Pruning (CNP)
• Weighted Edge Pruning (WEP)
• Weighted Node Pruning (WNP)
• Reciprocal CNP
• Reciprocal WNP
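As a rough illustration of comparison cleaning, the sketch below follows the general idea of Weighted Edge Pruning: build a graph whose edges are the candidate pairs, weight each edge, and keep only the edges whose weight reaches the average. Weighting by the number of shared blocks and using the mean as the threshold are assumptions made for this sketch; the function name and the toy blocks are hypothetical, and JedAI's implementation may differ in its details.

```python
from collections import Counter
from itertools import combinations


def weighted_edge_pruning(blocks):
    """Keep the pairs whose number of shared blocks is at least the mean weight."""
    weights = Counter()
    for ids in blocks.values():
        # Counting per pair also removes repeated comparisons:
        # each pair appears once in the graph, however many blocks it shares.
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    if not weights:
        return {}
    mean_weight = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean_weight}


# Toy blocks keyed by blocking token (hypothetical data).
blocks = {
    "palpanas": {"e1", "e2"},
    "paris": {"e1", "e2", "e3"},
    "descartes": {"e1", "e2"},
    "toolkit": {"e3", "e4"},
}

retained = weighted_edge_pruning(blocks)
print(retained)  # {('e1', 'e2'): 3}
```

Here four distinct pairs have weights 3, 1, 1, and 1, so the mean is 1.5 and only the pair sharing three blocks is retained.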
Where can I find the JedAI Toolkit?
• Project website: http://jedai.scify.org .
• GitHub repositories:
  – JedAI Library: https://github.com/scify/JedAIToolkit .
  – JedAI Desktop Application and Workbench: https://github.com/scify/jedai-ui .
  – All code is implemented using Java 8.
  – All code is publicly available under Apache License V2.0.
• Documentation (slides, videos, etc.) available at https://github.com/scify/JedAIToolkit/tree/master/documentation .
• When using JedAI, please cite: George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "JedAI: The Force behind Entity Resolution", in ESWC 2017.
Which datasets are available for testing?
Several datasets are available for testing at https://github.com/scify/JedAIToolkit .

Clean-Clean ER (real); can be used for Dirty ER, as well.
Dataset        D1 Entities   D2 Entities
Abt-Buy              1,076         1,076
DBLP-ACM             2,616         2,294
DBLP-Scholar         2,516        61,353
Amazon-GP            1,354         3,039
Movies              27,615        23,182
DBPedia          1,190,733     2,164,040

Dirty ER (synthetic)
Dataset     Entities
10K           10,000
50K           50,000
100K         100,000
200K         200,000
300K         300,000
1M         1,000,000
2M         2,000,000
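The dataset sizes above show why blocking is needed. The short calculation below counts the comparisons a brute-force approach would execute: every D1 entity against every D2 entity for Clean-Clean ER, and every unordered pair for Dirty ER. The function names are hypothetical; the entity counts are taken from the table.

```python
def clean_clean_comparisons(n1, n2):
    """Brute force for Clean-Clean ER: each D1 entity against each D2 entity."""
    return n1 * n2


def dirty_comparisons(n):
    """Brute force for Dirty ER: every unordered pair within one dataset."""
    return n * (n - 1) // 2


abt_buy = clean_clean_comparisons(1_076, 1_076)
dbpedia = clean_clean_comparisons(1_190_733, 2_164_040)
dirty_10k = dirty_comparisons(10_000)
dirty_2m = dirty_comparisons(2_000_000)

print(f"Abt-Buy:  {abt_buy:,}")
print(f"DBPedia:  {dbpedia:,}")
print(f"10K:      {dirty_10k:,}")
print(f"2M:       {dirty_2m:,}")
```

Even the smallest dataset needs more than a million comparisons without blocking, and the largest ones need on the order of trillions.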
Exemplar Queries: Query Answering Using Examples and Knowledge Graphs
problem
• given an example element (subgraph) of interest, return a ranked set of similar elements
• scale to full-size knowledge graphs, provide answers in real-time
applications
• data exploration for non-expert users
  ▫ “find company acquisitions like the one of YouTube by Google”
  ▫ fast and easy discovery of facts with same semantics
• complex similarity queries made easy
  ▫ “find other legal cases where the actors had relationships similar to this”
  ▫ pain-free information search for specialized users
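The YouTube/Google example can be sketched on a toy knowledge graph. The code below handles only the simplest case, a single-edge example, and returns the other edges that carry the same relationship label between nodes of the same types. It is a deliberately simplified assumption: it does not cover general subgraph examples, it omits ranking, and the function name, the type labels, and the toy graph are hypothetical.

```python
def exemplar_matches(triples, node_types, example):
    """Return (subject, object) pairs structurally similar to a one-edge example.

    A pair matches when it is connected by the same predicate as the example
    and its subject and object have the same types as the example's.
    """
    ex_subject, ex_predicate, ex_object = example
    wanted = (node_types[ex_subject], node_types[ex_object])
    matches = []
    for subject, predicate, obj in triples:
        if (subject, predicate, obj) == example:
            continue  # skip the example itself
        if predicate != ex_predicate:
            continue
        if (node_types.get(subject), node_types.get(obj)) == wanted:
            matches.append((subject, obj))
    return sorted(matches)


# Toy knowledge graph; the last triple is a made-up distractor.
triples = [
    ("Google", "acquired", "YouTube"),
    ("Facebook", "acquired", "Instagram"),
    ("Microsoft", "acquired", "LinkedIn"),
    ("Larry Page", "founded", "Google"),
    ("Alice", "acquired", "Painting"),
]
node_types = {
    "Google": "Company", "YouTube": "Company", "Facebook": "Company",
    "Instagram": "Company", "Microsoft": "Company", "LinkedIn": "Company",
    "Larry Page": "Person", "Alice": "Person", "Painting": "Artwork",
}

answers = exemplar_matches(triples, node_types, ("Google", "acquired", "YouTube"))
print(answers)  # [('Facebook', 'Instagram'), ('Microsoft', 'LinkedIn')]
```

The user supplies only the example, never a formal query; the structure and labels of the example act as the query.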