Mayank Kejriwal Information Sciences Institute/USC kejriwal@isi.edu http://usc-isi-i2.github.io/kejriwal/
Given one or more attribute-rich graphs, a training set of linked node pairs, how do we avoid evaluating all node pairs (O|V| 2 ) ?
Idea: Candidate Generation via blocking ? Apply blocking key e.g. Tokens(LastName) ? 4 Blocks 3 2 1 ? Generate candidate set (7 pairs), apply ? similarity function Dataset 1 5 on each pair ? ? Dataset 2 ? ‘Exhaustive’ set: 4 X 6=24 pairs
Even better … learn candidate generation function • Doing it efficiently without losing (much) expressive power: Disjunctive Normal Form (DNF) blocking keys • Example: • CharTriGrams(Last_Name) U (Numbers(Address) X Last4Chars(SSN)) • Use functional elements like CharTriGrams to construct complex blocking keys • Optimal search is NP-Complete, use greedy approximation with guarantees
Some results DNF blocking for RDF Attribute Clustering (AC) Name Recall Reduction FMeasure Recall Reduction FMeasure Persons 1 100 99.75 99.88 100 98.86 99.43 Persons 2 99.00 99.79 99.39 99.75 99.02 99.38 Restaurants 100 99.73 99.87 100 95.57 99.79 Eprints-Rexa 98.16 99.28 98.72 99.60 99.37 99.48 IM-Similarity 100 98.14 99.06 100 62.79 77.14 IIMB-059 99.76 93.35 96.45 97.33 73.09 83.49 IIMB-062 47.73 98.11 64.22 77.27 90.80 83.49 Libraries 97.96 99.99 98.96 99.99 99.87 99.93 Parks 95.96 94.41 95.18 99.07 88.27 93.36 Video Game 98.73 99.96 99.34 99.72 99.85 99.79 Average 93.73 98.25 95.11 97.27 91.15 93.53
Recommend
More recommend