Minimal Project Join REQ [Shen et al., 2014]
Main idea: Find the set of queries that approximately return a set of examples
Input: partial query table
     A      B          C
  1  Mike   ThinkPad   Office
  2  Mary   iPad       (empty)
  3  Bob    (empty)    Dropbox
Output: Minimal PJ Queries Q'
• valid: every tuple is present in the query results
• minimal: any removal in the query tree yields an invalid query
28 VLDB 2017 tutorial D. Mottin, M. Lissandrini, T. Palpanas, Y. Velegrakis
Candidate Query Generation [Shen et al., 2014]
● Use the candidate network generation algorithm (Hristidis 2002)
1. Generate join tree $J$
2. Generate mapping $\phi$
3. Check minimal: every leaf node contains a column that is mapped by an input column
[Figure: the partial query table (Mike, ThinkPad, Office), (Mary, iPad), (Bob, Dropbox) and candidate queries CQ 1 to CQ 5, join trees over Owner, Sales, ESR, Employee, Customer, Device and App with the input columns A, B, C mapped to their columns]
Validity verification [Shen et al., 2014]
Naïve: check each candidate query individually to see whether it returns ALL examples
Better: exploit substructures in candidate queries for pruning
Best: adaptively select the substructures to have the minimum number of evaluations (NP-hard)
[Figure: a candidate query and its substructures. Sub 1: Owner joined with Employee (A) and Device (B). Sub 2: Owner joined with Employee (A), Device (B) and App (C). Sub 1 fails => the candidate query is invalid; Sub 1 fails => Sub 2 fails]
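The naive validity check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the verification procedure of Shen et al.: cell matching is simplified to exact equality, `None` stands for an empty example cell, and the query results `result_ok` and `result_bad` are invented for the example.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def covers(result_row: Row, example_row: Row) -> bool:
    """True if result_row agrees with example_row on every non-empty cell."""
    return all(e is None or e == r for r, e in zip(result_row, example_row))


def is_valid(query_result: Sequence[Row], examples: Sequence[Row]) -> bool:
    """A candidate query is valid if every example tuple appears in its result."""
    return all(any(covers(r, e) for r in query_result) for e in examples)


# Example tuples from the partial query table (None = empty cell).
examples = [
    ("Mike", "ThinkPad", "Office"),
    ("Mary", "iPad", None),
    ("Bob", None, "Dropbox"),
]

# Hypothetical results of two candidate queries projected on (A, B, C).
result_ok = [
    ("Mike", "ThinkPad", "Office"),
    ("Mary", "iPad", "Skype"),
    ("Bob", "Nexus", "Dropbox"),
    ("Ann", "Kindle", "Evernote"),
]
result_bad = [row for row in result_ok if row[0] != "Bob"]

print(is_valid(result_ok, examples))   # every example is covered
print(is_valid(result_bad, examples))  # the Bob example is missing
```

The "Better" and "Best" strategies avoid running this check on every candidate by evaluating shared substructures first; the sketch only covers the naive baseline.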
Minimal Project Join REQ [Psallidas et al., 2015]
Main idea: Allow missing rows/columns and rank the k best queries (S4)
Input: partial query table
     A      B       C
  1  John   Smith   Xbox
  2  Jill   Hans    Surface
Output: Top-k PJ Queries
[Figure: S4 returns candidate PJ queries over Sales, Products and Customers, projecting columns such as Name, First Name, Last Name and City]
Ranking score [Psallidas et al., 2015]
Linear combination of row score and column score:
$score(Q) = \alpha \cdot score_{row}(Q) + (1 - \alpha) \cdot score_{col}(Q)$
• $\alpha = 1$ penalizes missing rows
• $\alpha = 0$ penalizes missing columns
[Figure: row scores and column scores of two candidate queries over Sales, Products and Customers, computed against the example rows (John, Smith, Xbox) and (Jill, Hans, Surface)]
S4 Optimizations [Psallidas et al., 2015]
• Upper bound: the row score is always bounded by the column score (row containment is more restrictive); exploit inverted indexes on columns/rows
• Early termination: scan queries in decreasing order of upper bound; stop when the current upper bound is less than the score of the k-th ranked evaluated query
• Caching: reuse common subparts in the candidate queries
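The early-termination idea can be sketched as a generic top-k scan. This is a hedged sketch, not the S4 implementation: the function `top_k`, the `evaluate` callback, and the candidate scores are hypothetical, and each upper bound is assumed to be a cheap value (for instance the column score) that is never below the exact score.

```python
import heapq


def top_k(candidates, k, evaluate):
    """Scan candidates by decreasing upper bound and stop early.

    candidates: list of (query_id, upper_bound) pairs
    evaluate:   callable returning the exact score of a query_id,
                assumed to be <= its upper bound
    Returns the top-k (score, query_id) pairs and the number of evaluations.
    """
    top = []  # min-heap holding the best k (score, query_id) pairs so far
    evaluated = 0
    for qid, upper in sorted(candidates, key=lambda c: c[1], reverse=True):
        # No remaining query can beat the current k-th score: stop.
        if len(top) == k and upper < top[0][0]:
            break
        score = evaluate(qid)
        evaluated += 1
        if len(top) < k:
            heapq.heappush(top, (score, qid))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, qid))
    return sorted(top, reverse=True), evaluated


# Hypothetical upper bounds and exact scores for four candidate queries.
candidates = [("q1", 0.9), ("q2", 0.8), ("q3", 0.5), ("q4", 0.3)]
exact = {"q1": 0.7, "q2": 0.8, "q3": 0.4, "q4": 0.3}

ranked, evaluated = top_k(candidates, k=2, evaluate=exact.get)
print(ranked, evaluated)
```

In this toy run only two of the four queries need an exact evaluation, because the upper bound of the third is already below the second-best exact score.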
Reverse engineering queries (REQ)
Lack of user models!
REQ, Exact:
• One-shot: Query by output (TALOS); REQ SPJ queries from examples
• Interactive: Query From examples (QFE); Interactive inference of join queries
REQ, Approximate:
• Minimal: Discovering Queries based on Examples
• Top-k: S4: Top-k Spreadsheet style
Examples for query suggestion: Blaeu [Sellam et al., 2016]
Main idea: Allow interactive navigation of the query space in a hierarchy
[Figure: Blaeu takes a query or query results as input and returns query navigations]
Examples for query suggestion: Blaeu [Sellam et al., 2016]
Given the result of an example query Q, explore the data through data maps (= partitions)
Output: Set of query refinements
Problem: User utility is unknown
User utility (user model): $u: DB \to [-1, 1]$, $U(Q) = \sum_{t \in Q} u(t)$
• Cluster analysis for result exploration
• Zoom and projection operations
[Figure: query results plotted over Attribute 1 and Attribute 2]
Examples for query suggestion: Blaeu [Sellam et al., 2016]
Find the partition $\mathcal{C} = \{C_1, \dots, C_n\}$ of the results of Q such that there exists $C_i \in \mathcal{C}$ with $U(C_i) > U(Q)$
User utility (unknown): $u: DB \to [-1, 1]$, $U(C) = \sum_{t \in C} u(t)$
Solution: interesting tuples are close to each other, within a maximum separation threshold $\theta(\mathcal{C})$
Inference: detect clusters (k-medoid), then organize clusters (decision tree)
Where we are
Relational databases · Machine learning · Textual data · Graphs and networks · Challenges and Remarks
Examples for textual data
Few methods for textual data using examples
• Entity Extraction: Snowball [Agichtein 2000], DIPRE [Brin 1999], [Hanafi 2017]
• Web table completion [Yakout 2012]
• Search by example: Serendipitous search [Bordino 2013]; Using example queries [Zhu 2014]
Entity extraction by-example (SEER) [Hanafi et al., 2017]
Main idea: Create rules to extract wanted information from documents using examples
Output: Extraction rules, each with a score
• P: Percentage = 1.0 (rule score 1.0)
• D: {5, 6} = 0.4, D: {percent, %} = 0.4 (rule score 0.4)
• R: [0-9]+ = 0.2, D: {percent, %} = 0.4 (rule score 0.3)
Learning rules [Hanafi et al., 2017]
Example: 5 percent up
1. Enumerate possible primitives per example token
   • token "5": P: Number, P: Integer, L: '5', R: [0-9]+
   • token "percent": L: 'percent', R: [A-Za-z]+
   • token gap: T: 0-1
   • …
2. Assign scores to primitives on a scale from 0 to 1, over the primitive types Token gap, Regex, Dictionary, Literal and Pre-builts
   • e.g. for "Dubai": T: 0-1 scores lowest, L: 'Dubai' in between, P: City highest
Learning rules (cont'd) [Hanafi et al., 2017]
3. Generate rules
   Example "5 percent" (tokens: 5, percent)
   • Tree: P: Percentage = 1.0; for "5": L: '5' = 0.4, R: [0-9]+ = 0.2; for "percent": L: 'percent' = 0.4, R: [A-Za-z]+ = 0.2
   • Rule (one path in the tree): R: [0-9]+ = 0.2, L: 'percent' = 0.4
   Example "6%"
   • Tree: P: Percentage = 1.0; for "6": L: '6' = 0.4, R: [0-9]+ = 0.2; for "%": L: '%' = 0.4, R: symbols = 0.2
4. Merge
   Intersection over [5 percent, 6%]:
   • P: Percentage = 1.0
   • D: {5, 6} = 0.4, D: {percent, %} = 0.4
   • R: [0-9]+ = 0.2, D: {percent, %} = 0.4
Web tables completion (InfoGather) [Yakout et al., 2012]
Main idea: Complete tables using partial information about tuples
Incomplete table          Complete table (after InfoGather)
  Model    Brand            Model    Brand
  S80                       S80      Benq
  A10                       A10      Innostream
  GX-1S                     GX-1S    Samsung
  T1460                     T1460    Benq
[Figure: web tables with (Part No, Mfg) or (Model, Brand) columns, containing pairs such as DSC W570 / Sony, S80 / Nikon, T1460 / Benq, Easyshare CD44 / Kodak, Optio E60 / Pentax, S8100 / Nikon]
Augmentation framework [Yakout et al., 2012]
Direct Match Approach (DMA)
● Traditional schema matching techniques using the attribute names and the values in the column
● Score of a web table $T$ against the input table $Q$:
$S_{DMA}(T) = \frac{|T \cap_K Q|}{\min(|Q|, |T|)}$ if $Q.A \approx T.B$, and $0$ otherwise
Indirect matching
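As a worked illustration of the DMA score, the sketch below computes the key overlap between an input table and one web table, normalised by the smaller of the two. The function name `dma_score`, the boolean `columns_match` (standing in for the schema-matching test Q.A ≈ T.B), and the sample key values are assumptions made for the example.

```python
def dma_score(query_keys, table_keys, columns_match=True):
    """Overlap of key values, normalised by the smaller table.

    Returns 0.0 when the columns are not considered a match
    or when either side has no keys.
    """
    query, table = set(query_keys), set(table_keys)
    if not columns_match or not query or not table:
        return 0.0
    return len(query & table) / min(len(query), len(table))


# Input table keys (Model column) and a hypothetical web table.
query_models = ["S80", "A10", "GX-1S", "T1460"]
web_table_models = ["S80", "T1460", "Easyshare CD44"]

score = dma_score(query_models, web_table_models)
print(round(score, 3))
```

Two of the three web-table keys occur in the input table, so the score is 2 / min(4, 3).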
Ranking tables using PageRank
• Nodes → Web Tables
• Edges → Tables Similarity
• Personalized PageRank (PPR), with adjacency matrix $\alpha$ and query table $u$:
$\pi_u(v) = \epsilon\,\delta_u(v) + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_u(w)\,\alpha_{w,v}$
• Topic Sensitive PageRank (TSP), with topic vector $\vec{\beta}$:
$\pi_{\vec{\beta}}(v) = \epsilon\,\beta_v + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_{\vec{\beta}}(w)\,\alpha_{w,v}$
• Topic weight → DMA score
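Personalized PageRank can be approximated by simple power iteration. The sketch below is a minimal version under stated assumptions: it reads the recurrence on this slide as restart probability `eps` times an indicator on the source node plus `1 - eps` times the weighted mass flowing in from neighbours, it assumes the outgoing weights of every node sum to 1, and the three-table graph is invented for illustration.

```python
def personalized_pagerank(edges, source, eps=0.15, iters=100):
    """Power iteration for PageRank personalised on a single source node.

    edges: dict mapping node w to a dict {v: weight of arc (w, v)};
           outgoing weights of each node are assumed to sum to 1.
    """
    nodes = set(edges) | {v for outs in edges.values() for v in outs}
    pi = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iters):
        # Restart mass goes to the source only.
        nxt = {v: eps if v == source else 0.0 for v in nodes}
        # The remaining mass follows the weighted arcs.
        for w, outs in edges.items():
            for v, weight in outs.items():
                nxt[v] += (1 - eps) * pi[w] * weight
        pi = nxt
    return pi


# Hypothetical similarity graph over three web tables.
edges = {
    "T1": {"T2": 1.0},
    "T2": {"T1": 0.5, "T3": 0.5},
    "T3": {"T2": 1.0},
}
pi = personalized_pagerank(edges, source="T1")
for table in sorted(pi, key=pi.get, reverse=True):
    print(table, round(pi[table], 4))
```

The topic-sensitive variant only changes the restart step: instead of sending all restart mass to one source, it spreads it according to the topic vector.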
Serendipitous search [Bordino et al., 2013]
Main idea: Use related entities and query logs to find serendipitous searches
[Figure: from the document content, connected entities (Francisco Pizarro, Peru, America, Rafting, Amazon, Machu Picchu, ...) are combined with query logs by Serendipitous Search to produce searches related to the document: "rafting excursion down the urubamba river", "el dorado", "temple of sun", "indios quechuas", "map of peru", "sapa inca"]
Find queries using entity-query graph [Bordino et al., 2013]
Query-flow graph with entity nodes
Three types of arcs:
1. query to query
2. entity to query
3. entity to entity: frequency-based approach; the more queries two entities share, the higher the probability
Idea: Run Personalized PageRank on entity-query graphs
Search by multiple examples [Zhu et al., 2014]
Main idea: Document examples are used to find topics
[Figure: search by examples over documents on Chuck Norris and Arnold Schwarzenegger returns related topics and documents, e.g. Action Movies (Mission impossible, Die Hard, …) and Action Actors (Bruce Willis, Tom Cruise, …)]
Nearest neighbor approach [Zhu et al., 2014]
Main Idea: The similarity is an aggregation over the distances between document $D_i$ and its nearest query example
[Figure: query examples A and B, their centroid, documents D1, D2, D3 and topics Ta, Tb, Tc]
Where we are
Relational databases · Machine learning · Textual data · Graphs and networks · Challenges and Remarks
Graphs
Fact Graph and Ontology Tree
[Figure: Arnold Schwarzenegger, Person, Actor and Terminator connected through isA, subClassOf and actedIn edges; Terminator has the attributes Release 1984, Budget $6.4M, Length 1h 48m]
Graphs
Fact Graph and Ontology Tree
RDF (subject, predicate, object):
• (Arnold_Schwarzenegger, isA, Person)
• (Actor, subClassOf, Person)
• (Arnold_Schwarzenegger, actedIn, Terminator)
Exemplar Queries [Mottin et al., 2014]
Input: $Q_e$, an example element of interest (Nodes/Entities, Edges/Facts, Structures)
Output: set of elements in the desired result set
Exemplar Query Evaluation
• evaluate $Q_e$ in a database D, finding a sample S
• find the set of elements A similar to S given a similarity relation
Exemplar Queries [Mottin et al., 2014]
Input: $Q_e$, an example element of interest (Nodes/Entities, Edges/Facts, Structures)
Output: set of elements in the desired result set
Exemplar Query Evaluation
• evaluate $Q_e$ in a database D, finding a sample S
• find the set of elements A similar to S given a similarity relation
• [OPTIONAL] return only the subset $A^R$ of elements that are relevant
SIMILARITY
Nodes
• Connectivity: Mediator Nodes [Ruchansky'15]; Clusters [Perozzi'14]
• Properties: Entity Search [Metzger'13, Sobczak'15]
• Queries: Path Queries [Bonifati'15]; SPARQL [Arenas'16]
Structures
• (Edge-)Labels: Entity Tuples [Jayaram'15]; Graph Structures [Mottin'14]
CHALLENGE: DISCOVER USER PREFERENCE
CHALLENGE: EFFICIENT SEARCH
The Minimum Wiener Connector Problem [Ruchansky et al., 2015]
Model: Unlabeled Undirected Graph
Query: A set of Nodes Q
Similarity: Shortest-Path distance
Output: A Set of Connector Nodes H that "explains" connections in Q
Connectors: nodes with HIGH closeness to ALL the inputs. Similar to a Steiner Tree, but overall pairwise distances are optimized
Case: Infected Patients → Culprit/Other Infected
Case: Target Audience → Influencers
The Minimum Wiener Connector Problem [Ruchansky et al., 2015]
Model: Unlabeled Undirected Graph
Query: A set of Nodes Q
Similarity: Shortest-Path distance
Output: A Set of Connector Nodes H (NP-Hard)
Minimize the sum of pairwise shortest-path distances between nodes in the connector H, called the Wiener Index:
$\min_H \sum_{(u,v) \in H} d(u, v)$, where $d(u, v)$ is the shortest-path distance
This is a tradeoff between size and average distance
Sometimes the best solution is NOT a tree: a tree connector has W = 1 + 2 + 1 = 4, a non-tree connector on the same nodes has W = 1 + 1 + 1 = 3
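The Wiener index of a candidate connector can be computed with one breadth-first search per node. The sketch below is illustrative only: `wiener_index` is a hypothetical helper that measures distances inside the given edge set, and the three-node example reproduces the W = 4 versus W = 3 comparison from the slide.

```python
from collections import deque
from itertools import combinations


def wiener_index(nodes, edges):
    """Sum of pairwise shortest-path distances among `nodes`,
    using only the undirected, unweighted `edges` between them."""
    nodes = set(nodes)
    adj = {u: set() for u in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)

    def bfs(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    dist_from = {u: bfs(u) for u in nodes}
    total = 0
    for u, v in combinations(sorted(nodes), 2):
        if v not in dist_from[u]:
            return float("inf")  # disconnected: not a valid connector
        total += dist_from[u][v]
    return total


nodes = ["a", "b", "c"]
tree_edges = [("a", "b"), ("b", "c")]           # path a - b - c
triangle_edges = tree_edges + [("a", "c")]      # extra edge closes the cycle

print(wiener_index(nodes, tree_edges))      # 1 + 2 + 1
print(wiener_index(nodes, triangle_edges))  # 1 + 1 + 1
```

The extra edge lowers the index without adding a node, which is why a Steiner tree over the query nodes is not always the best connector.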
Approximate minimum Wiener Index [Ruchansky et al., 2015]
• Connector: approximated with an Edge-Weighted Steiner Tree
• All pairwise distances: approximated with distances from a root r
• Measure distance in H: approximated with the precomputed distance in G
CHOOSE r and $\lambda \in [1, \log_{(1+\beta)} |V|]$
Enumerate candidate solutions for $r \in Q$ and $\lambda$, and keep the best
Edge weights: $w(u, v) = \lambda + \frac{\max\{ d_G(r, u), d_G(r, v) \}}{\lambda}$
Focused Clustering and Outlier Detection [Perozzi et al., 2014]
Model: Unlabeled Undirected Graph with Node Attributes
Query: A set of Nodes Q
Similarity: Attribute Values & Connectivity (to be inferred)
Output: Clusters of Nodes (Dense & Coherent) + Cluster Outliers
Case: Target Users → Community with same interests
Case: Products → Co-purchased products with similar features
[Figure: graph whose nodes carry attributes such as degree (PhD, College), city (NYC, Paris), language (English, Greek, Dutch, Italian, French) and employer (Google, SAP, IBM)]
Focused Clustering and Outlier Detection [Perozzi et al., 2014]
TASK: Infer the "FOCUS", i.e. the important attributes, as attribute weights β
1. Set of similar pairs, PS (from Q)
2. Set of dissimilar pairs, PD (random sample)
3. Learn a distance metric between PS and PD (Distance Metric Learning, inverse Mahalanobis distance: Xing et al., 2002)
[Figure: example graph with node attributes and the inferred attribute weights β: 0.5 for PhD, 0.5 for NYC, 0 for the remaining attributes]
Focused Clustering and Outlier Detection [Perozzi et al., 2014]
TASK: Extract Clusters on the Focused Graph (LOCAL clusters)
Attribute weights β -> Edge Weight
1. Find Starting Set of Candidates
   1.a Drop low-weight edges
   1.b Extract Strongly Connected Components C1, C2, …
2. Grow Clusters around Candidate Seeds
   2.a Compute the conductance of C: $\varphi^{(w)}(C, G)$
   2.b Select the node to add to C': best improvement $\Delta\varphi^{(w)}(C, C')$ (greedy)
   2.c Prune underperforming nodes
3. Detect Outliers: high unweighted conductance
SIMILARITY (recap of the taxonomy)
Nodes
• Connectivity: Mediator Nodes [Ruchansky'15]; Clusters [Perozzi'14] ✓
• Properties: Entity Search [Metzger'13, Sobczak'15]
• Queries: Path Queries [Bonifati'15]; SPARQL [Arenas'16]
Structures
• (Edge-)Labels: Entity Tuples [Jayaram'15]; Graph Structures [Mottin'14]
iQBEES: Entity Search by Example [Metzger et al., 2013, Sobczak et al., 2015]
Model: Knowledge Graph
Query: A set of Entities Q
Similarity: shared semantic properties
Output: A ranked Set of Similar Entities
Case: Products → Find Similar Products
Case: Social Media → User recommendation
[Figure: example entities Entity 1 and Entity 2 with unknown similar entities marked "?"]
Maximal Aspects [Metzger et al., 2013, Sobczak et al., 2015]
Maximal set of aspects A: adding any aspect → E(A) = {Arnold}
Set of aspects, for example:
• ?x type BodyBuilder
• ?x type AmericanActor
• ?x type GovernorCalifornia
• ?x hasHeight 1.88m
• ?x type Entity
• ?x actedIn TheExpendables
• ?x type ActionActor
Steps (REPEATABLE): include Typical Types; prune generic aspects (use the most specific type); Rank; Update Q
SIMILARITY (recap of the taxonomy)
Nodes
• Connectivity: Mediator Nodes [Ruchansky'15]; Clusters [Perozzi'14] ✓
• Properties: Entity Search [Metzger'13, Sobczak'15] ✓
• Queries: Path Queries [Bonifati'15]; SPARQL [Arenas'16]
Structures
• (Edge-)Labels: Entity Tuples [Jayaram'15]; Graph Structures [Mottin'14]
Learning Path Queries on Graphs [Bonifati et al., 2015]
Model: Edge Labeled Graph
Query: 2 sets of Entities, $Q^+$ (Positive) and $Q^-$ (Negative)
Similarity: common path query (RegExp), e.g. (bus|tram)*Cinema
Output: A Set of Nodes satisfying some paths($Q^+$) but NOT paths($Q^-$)
Case: Proteins → Similar interactions/co-expression
Case: Tasks Initiator → Similar Processes/Behaviours
MONADIC: only starting nodes; extensible to BINARY/N-ARY: path from X to Y
[Figure: transport graph with Tram and Bus edges leading to Cinema; positive nodes marked ✓ / +, negative nodes marked X / -]
Learnability of Path Queries [Bonifati et al., 2015]
Query: $Q^+$ & $Q^-$ (Positive & Negative examples)
Consistency Check: PSPACE-complete
Consistency: $\forall v \in Q^+ .\; paths_G(v) \nsubseteq paths_G(Q^-)$
1. Select the Smallest Consistent Paths (SCP)
   • Enumerate paths up to a fixed distance
   • Infinite paths? Fix a maximal length K, but… for paths of length N: $K = 2 \times N + 1$
   • When to use the Kleene star *? e.g. C | (A · B · C) → (A · B)* · C
2. Generalize the SCP
   a. Construct a Prefix-Tree Acceptor (PTA)
   b. Generalize into a DFA with Merge (PTA → DFA)
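Step 1 can be sketched by enumerating label sequences up to a fixed length K. This is a simplified illustration, not the learning algorithm of Bonifati et al.: the helpers `paths_up_to` and `smallest_consistent_paths`, the ordering by (length, lexicographic order), and the toy transport graph are all assumptions.

```python
def paths_up_to(graph, start, k):
    """Label sequences (as tuples) of all paths of length 1..k from `start`.

    graph: dict mapping a node to a list of (edge_label, target) pairs.
    """
    words = set()
    frontier = {(start, ())}
    for _ in range(k):
        nxt = set()
        for node, word in frontier:
            for label, target in graph.get(node, ()):
                nxt.add((target, word + (label,)))
        words |= {word for _, word in nxt}
        frontier = nxt
    return words


def smallest_consistent_paths(graph, positives, negatives, k):
    """For each positive node, pick the smallest path word that no
    negative node has; None means the sample is inconsistent up to k."""
    forbidden = set()
    for v in negatives:
        forbidden |= paths_up_to(graph, v, k)
    result = {}
    for v in positives:
        allowed = paths_up_to(graph, v, k) - forbidden
        result[v] = min(allowed, key=lambda w: (len(w), w)) if allowed else None
    return result


# Hypothetical transport graph: only N1 and N3 can reach a cinema.
graph = {
    "N1": [("tram", "N2")],
    "N2": [("bus", "N3")],
    "N3": [("cinema", "C1")],
    "N4": [("tram", "N5")],
    "N5": [("bus", "N6")],
}
scp = smallest_consistent_paths(graph, ["N1", "N3"], ["N4"], k=3)
print(scp)
```

The selected words would then seed the prefix-tree acceptor of step 2, which is generalized by state merging.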
Reverse engineering SPARQL queries [Arenas et al., 2016]
Model: Knowledge Graph
Query: Set of ANSWERS
Similarity: common AND/OPT/FILTER query
Output: A SPARQL QUERY/RESULT
Case: Open Data → Query Unknown Schema
Case: Novice User → Avoid SPARQL
Example answers:
       ?e1       ?e2
  M1   Mexico    Spanish
  M2   Haiti
  M3   Jamaica   English
Reverse engineering SPARQL queries [Arenas et al., 2016]
Query: Set of Variable Mappings
       ?X     ?Y               ?Z
  M1   John
  M2   Mary   mary@email.eu
  M3   Lucy                    Roses Street
Enumerate all possible SPARQL queries satisfied by the mappings: INTRACTABLE
Build tree-shaped SPARQL queries IMPLIED by the mappings
Reverse engineering SPARQL queries [Arenas et al., 2016]
Query: Set of Variable Mappings Ω
Greedy: keep just enough mappings to cover all variables
[Figure: mappings M1 to M4 and a tree over the mapping subsets {M1, M2, M3, M4}, {M3, M4}, {M2, M4}, {M4}]
SIMILARITY (recap of the taxonomy)
Nodes
• Connectivity: Mediator Nodes [Ruchansky'15]; Clusters [Perozzi'14] ✓
• Properties: Entity Search [Metzger'13, Sobczak'15] ✓
• Queries: Path Queries [Bonifati'15]; SPARQL [Arenas'16] ✓
Structures
• (Edge-)Labels: Entity Tuples [Jayaram'15]; Graph Structures [Mottin'14]
Exemplar Queries [Mottin et al., 2014]
Model: Knowledge Graph
Input: Example Structure
Similarity: Isomorphism/Simulation
Output: A set of Graphs
[Figure: a query, its sample S in the knowledge graph, and the answers A1 and A2]
Computing exemplar queries [Mottin et al., 2014]
Complexity: NP-complete (subgraph isomorphism), $O(V^4)$ (simulation)
Pruning technique:
• Compute the neighbor labels of each node: $W_{n,a,i} = \{ n_1 \mid \ell(n_1, n_2) = a \wedge n_2 \in N_{i-1}(n) \}$
• Prune nodes not matching the neighborhood labels of the query nodes
• Apply iteratively on the query nodes
[Figure: query Q with node u, sample and answers A1, A2, and a graph node v. Labels at distance 1: u neighborhood = {(A,1)}, v neighborhood = {(B,1)}; {(A,1)} ⊈ {(B,1)}, so No Match]
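The pruning step can be sketched as a containment test on neighborhood labels. This is a simplification made for illustration: it only records which edge labels occur at each distance, as in the slide's {(A,1)} notation, and ignores how many neighbors carry them; `neighbor_labels`, `prune_candidates`, and the toy graphs are hypothetical.

```python
def neighbor_labels(adj, node, max_dist):
    """Set of (edge_label, distance) pairs seen within max_dist hops of node.

    adj: dict mapping a node to a list of (edge_label, neighbor) pairs;
         undirected edges are listed in both directions.
    """
    seen = {node}
    frontier = {node}
    labels = set()
    for dist in range(1, max_dist + 1):
        nxt = set()
        for n in frontier:
            for label, m in adj.get(n, ()):
                labels.add((label, dist))
                if m not in seen:
                    nxt.add(m)
        seen |= nxt
        frontier = nxt
    return labels


def prune_candidates(query, q_node, graph, max_dist=1):
    """Keep the graph nodes whose neighborhood labels contain
    all neighborhood labels of the query node."""
    required = neighbor_labels(query, q_node, max_dist)
    return {n for n in graph if required <= neighbor_labels(graph, n, max_dist)}


# Query: q1 -A- q2.  Graph: u -A- x and v -B- y.
query = {"q1": [("A", "q2")], "q2": [("A", "q1")]}
graph = {
    "u": [("A", "x")], "x": [("A", "u")],
    "v": [("B", "y")], "y": [("B", "v")],
}
print(sorted(prune_candidates(query, "q1", graph)))
```

Node v is discarded because its only label at distance 1 is B, matching the "No Match" case on the slide.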
Computing exemplar queries [Mottin et al., 2014]
Complexity: NP-complete (subgraph isomorphism), $O(V^4)$ (simulation)
Approximation:
• Nodes close to the sample are more important
• Use Personalized PageRank with a weighted matrix
• Weight edges: frequency of the edge-label
[Figure: PageRank vector v spreading from the sample towards the answers A1 and A2]
Ranking results [Mottin et al., 2014]
$\rho(n_s, n) = \lambda\, S(n_s, n) + (1 - \lambda)\, v[n]$
Combination of two factors
1. Structural: similarity of two nodes in terms of neighbor relationships
2. Distance-based: the PageRank already computed
[Figure: user query with sample S and answers A1, A2 (entities such as CBS, Google, Yahoo!)]
Graph query by example (GQBE) [Jayaram et al., 2015]
In GQBE the input is a set of (disconnected) entity mention tuples
Model: Knowledge Graph
Input: Entity Tuples
Similarity: Isomorphism
Output: A set of Tuples
Example: Q = (Google, S. Mateo); Results = (Yahoo, S. Clara), (CBS, New York)
GQBE: Maximum Query Graph [Jayaram et al., 2015]
1. Find the maximum query graph
   • Graph with M edges having the maximum weight
2. Find answers subgraph-isomorphic to the query graph (NP-hard)
3. Return top-k
Answer score:
• Sum of query graph weights
• Similarity match between edges in the answer and the query (shared nodes take extra credit)
[Figure: weighted neighborhood of the query tuple Q = (v1, v2), the maximum query graph extracted from it, and an answer graph over (u1, u2)]
Multiple query tuples [Jayaram et al., 2015]
The maximum query graph is very large
• Find answers using a lattice obtained by removing edges from the maximum query graph
• Preserve the query connectivity
GQBE finds answers for multiple query tuples
1. Compute a re-weighted union graph of the individual query graphs
2. Find answers using the lattice obtained by removing edges from the union graph
[Figure: lattice of subgraphs of the maximum query graph, each connecting v1 and v2]
SIMILARITY (recap of the taxonomy)
Nodes
• Connectivity: Mediator Nodes [Ruchansky'15]; Clusters [Perozzi'14]
• Properties: Entity Search [Metzger'13, Sobczak'15]
• Queries: Path Queries [Bonifati'15]; SPARQL [Arenas'16]
Structures
• Structures: Entity Tuples [Jayaram'15]; Graph Structures [Mottin'14]
These methods Do not Include User Feedback
Where we are
Relational databases · Machine learning · Textual data · Graphs and networks · Challenges and Remarks
Online exploration of datasets
Main idea: Learn the items to show online as more points are acquired
Two ways of learning: passive and active
[Figure: Passive: the system learns from the items $v_t$ it receives. Active: the system asks the user about an item $v_t$ and learns from the answer]
MindReader [Ishikawa et al., 1999]
Main idea: learn an implicit query from user examples and optional scores
Searching "mildly overweight" patients
• The doctor selects examples by browsing the patient database
• The examples have an "oblique" correlation
• We can "guess" the implied query q
[Figure: examples plotted over Height and Weight, marked as good or very good, with the implied query point q]
Learning an ellipsoid distance [Ishikawa et al., 1999]
Distances around the implicit query q: Euclidean, weighted Euclidean, generalized ellipsoid distance
Generalized ellipsoid distance with weighted distance matrix M:
$D(x, q) = (x - q)^T M (x - q) = \sum_{j=1}^{n} \sum_{k=1}^{n} m_{jk} (x_j - q_j)(x_k - q_k)$
Learn the query minimizing the penalty = weighted sum of distances between the query point and the sample vectors:
minimize $\sum_i (x_i - q)^T M (x_i - q)$ subject to $\det(M) = 1$
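The three distance families differ only in the matrix M. The sketch below evaluates the quadratic form for an identity matrix (Euclidean), a diagonal matrix (weighted Euclidean) and a full matrix with off-diagonal terms (generalized ellipsoid). The matrices and points are invented for illustration; each has determinant 1, as the constraint requires.

```python
def ellipsoid_distance(x, q, M):
    """Quadratic-form distance (x - q)^T M (x - q), written as a double sum."""
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))


x, q = (3.0, 4.0), (1.0, 1.0)

euclidean = [[1.0, 0.0], [0.0, 1.0]]       # circles around q
weighted = [[2.0, 0.0], [0.0, 0.5]]        # axis-aligned ellipses
oblique = [[1.25, -0.75], [-0.75, 1.25]]   # rotated ("oblique") ellipses

for name, M in [("euclidean", euclidean), ("weighted", weighted), ("oblique", oblique)]:
    print(name, ellipsoid_distance(x, q, M))
```

With the oblique matrix, a point lying along the diagonal direction of the examples ends up closer to q than under the plain Euclidean distance, which is the effect the full matrix is meant to capture.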
Learning the distance [Ishikawa et al., 1999]
• The query point is moved towards "good" examples (Rocchio formula in IR)
• Learning can be done online!
[Figure: query point Q0, retrieved data, relevance judgments, and the new query point Q1]
Active learning for online query systems [Vanchinathan et al., 2015]
Main idea: the system "queries" the user to understand her preferences
Learn unknown preferences and minimize the number of questions to the user
[Figure: loop between System and user: get item, ask user, get preference]
Learning unknown preferences [Vanchinathan et al., 2015]
Problem: Find a set S (the intended user set) that maximizes the user preference within a budget (e.g., number of interactions)
$\arg\max_S \sum_{v \in S} pref(v)$ subject to $Cost(S) \le budget$
where $pref(v)$ are the user preferences and $Cost(S)$ is the cost for the set S
Background: Gaussian processes [Bishop et al., 2006]
Idea: Model the user preferences as a Gaussian Process
A Gaussian Process (GP) is an infinite set of variables, any finite subset of which is Gaussian
Gaussian prior, specified only by mean and covariance:
$P(\mathbf{f} \mid \Sigma, \mu) = \frac{1}{|2\pi\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{f} - \mu)^T \Sigma^{-1} (\mathbf{f} - \mu)\right)$
Given observations $(x_i, y_i)_{i=1}^{n}$ over an unknown function f drawn from a Gaussian prior, the posterior is Gaussian:
$P(\mathbf{f} \mid \mathbf{y}) \propto \int P(\mathbf{f}, \mathbf{x}, \mathbf{y})\, d\mathbf{x}$
GP-Select [Vanchinathan et al., 2015]
Loop: learn the posterior, ask for user feedback
Trades off exploration and exploitation
• Exploration: select items with high variance
• Exploitation: select items with high value
Active learning on graphs: which prior? [Ma et al., 2015]
Idea: Use the graph structure to infer the node classes
Use the graph Laplacian as prior: $L = D - A$, where A is the adjacency matrix and D is the diagonal degree matrix
Laplacian: higher probability of having the same class if two nodes are connected
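A small worked example shows why the Laplacian favours equal classes on connected nodes: for a label vector f, the quadratic form f^T L f equals the sum of (f_i - f_j)^2 over the edges, so labelings that change across few edges get a low value. The helper names and the three-node path graph below are assumptions made for illustration.

```python
def graph_laplacian(A):
    """L = D - A for an undirected graph given as an adjacency matrix."""
    n = len(A)
    return [
        [(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness(f, L):
    """Quadratic form f^T L f: the sum of squared label differences over edges."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


# Path graph 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
L = graph_laplacian(A)
print(L)

print(smoothness([1, 1, 1], L))    # same class everywhere
print(smoothness([1, 1, -1], L))   # one edge separates the classes
print(smoothness([1, -1, 1], L))   # both edges separate the classes
```

A prior that decreases with this quantity therefore gives higher probability to labelings in which connected nodes share a class.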
Explore-by-Example: AIDE [Dimitriadou et al., 2015]
[Figure: the AIDE loop. Space Exploration issues sampling queries; Data Extraction returns samples; the user gives Relevance Feedback (relevant and irrelevant samples); Data Classification builds the User Model; Query Formulation turns the User Model into the Data Extraction Query]
The AIDE algorithm [Dimitriadou et al., 2015]
1. Divide the space into d-dimensional cubes
2. Find the sample points in the cubes (medoids)
3. Train the classifier
4. Refine the training by sampling from the neighbors of misclassified points
5. Boundary refinement
Classification & Query Formulation [Dimitriadou et al., 2015]
Samples:
  Sample     Red     Green   Relevant
  Object A   13.67   12.34   Yes
  Object B   15.32   14.50   No
  ...        ...     ...     ...
  Object X   14.21   13.57   Yes
Decision Tree Classifier:
• red > 14.82: Irrelevant
• red <= 14.82:
   • red < 13.55: Irrelevant
   • red >= 13.55:
      • green > 13.74: Irrelevant
      • green <= 13.74: Relevant
Query: SELECT * FROM galaxy WHERE red <= 14.82 AND red >= 13.55 AND green <= 13.74
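Query formulation can be sketched as collecting the conditions on every root-to-leaf path that ends in a "Relevant" leaf. The nested-list encoding of the tree and the helpers `relevant_paths` and `to_sql` are assumptions made for this sketch; the thresholds are the ones shown in the decision tree on the slide.

```python
# Each internal node is a list of (condition, subtree) branches;
# each leaf is the string "Relevant" or "Irrelevant".
tree = [
    ("red > 14.82", "Irrelevant"),
    ("red <= 14.82", [
        ("red < 13.55", "Irrelevant"),
        ("red >= 13.55", [
            ("green > 13.74", "Irrelevant"),
            ("green <= 13.74", "Relevant"),
        ]),
    ]),
]


def relevant_paths(node, prefix=()):
    """Return the condition tuples of all paths ending in a Relevant leaf."""
    if isinstance(node, str):
        return [prefix] if node == "Relevant" else []
    paths = []
    for condition, child in node:
        paths.extend(relevant_paths(child, prefix + (condition,)))
    return paths


def to_sql(tree, table):
    """One conjunction per relevant leaf, joined by OR."""
    disjuncts = ["(" + " AND ".join(p) + ")" for p in relevant_paths(tree)]
    where = " OR ".join(disjuncts) if disjuncts else "FALSE"
    return f"SELECT * FROM {table} WHERE {where}"


sql = to_sql(tree, "galaxy")
print(sql)
```

A tree with several relevant leaves yields a disjunction of conjunctions, one per relevant region of the attribute space.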
Misclassified Sample Exploitation [Dimitriadou et al., 2015]
[Figure: relevant (√) and irrelevant (x) samples plotted over Green Wavelength and Red wavelength, with sampling areas placed around the misclassified samples]
Clustering-based Sampling [Dimitriadou et al., 2015]
Idea: Use a k-medoid approach to find sampling areas
[Figure: relevant (√) and irrelevant (x) samples plotted over Green Wavelength and Red wavelength; the clusters are used as sampling areas]
Where we are
Relational databases · Machine learning · Textual data · Graphs and networks · Challenges and Remarks
Example-based methods
Relational
• Query suggestion using examples
• Reverse engineering queries
Textual
• Entity extraction by example text
• Web table completion using examples
• Search by example
Graph
• Community-based Node-retrieval
• Entity Search
• Path and SPARQL queries
• Graph structures as Examples
Example-based methods: takeaways
Relational
• Complex search space
• Exact and approximate
• Reverse engineering: good approximations
• Limited to query inference
Textual
• Allows serendipitous search
• Easier document finding
• Speed up entity matching
Graph
• Exploit locality
• Entity attributes are expressive
• Interactivity can improve the quality
• Large result-sets require ranking
The use of examples
Examples can ease data exploration
• … reduce need for complex queries / simplify user input
• … require no schema knowledge
• … allow uncertainty in search conditions
• … require little data analytics expertise
Where should we invest time
• Approximate Methods
• Machine learning
• User models
• Scalability
ADOPT HETEROGENEITY
Need for solutions that
• operate across different models
• operate on heterogeneous datastores