Measuring the Structural Similarity of Semistructured Documents

Measuring the Structural Similarity of Semistructured Documents Using Entropy
Sven Helmer
University of London, Birkbeck
London, UK

Introduction
- XML is everywhere ...
- In traditional IR, detecting similarities is used widely:
  - for querying
  - for clustering
- Consequently, there are lots of similarity measures for text documents

Introduction (2)
- New challenges with semistructured documents:
  - measuring structural similarity
  - semistructured documents show great structural diversity
- Measuring structural similarity is used for:
  - entity resolution in data cleaning
  - clustering documents before extracting DTD or schema information
  - integrating heterogeneous data sources
  - as a query tool for inexperienced users (query-by-example)

Measuring Entropy
- Bennett et al. introduced the concept of a universal information metric
- Based on Kolmogorov complexity: given a data object x, the Kolmogorov complexity K(x) is the length of the shortest program that outputs x
- Generalized form is the conditional Kolmogorov complexity K(x|y): the length of the shortest program with input y that outputs x

Information Distance
- Similarity of two data objects can be measured by the normalized information distance:

  $$\mathrm{NID}(x, y) = \frac{\max(K(x \mid y),\, K(y \mid x))}{\max(K(x),\, K(y))}$$

- Has some nice properties: it’s “almost” a metric, lower bound for admissible distances
- So what’s the catch?

Information Distance (2)
- Unfortunately, Kolmogorov complexity is not computable in general
- However, it can be approximated by compression (Cilibrasi and Vitányi), where C(x) is the compressed length of x and xy is the concatenation of x and y:

  $$\mathrm{NCD}(x, y) = \frac{C(xy) - \min(C(x),\, C(y))}{\max(C(x),\, C(y))}$$
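
The NCD formula can be tried out with any off-the-shelf compressor. The following is a minimal sketch, assuming Python's standard `zlib` module (DEFLATE, the algorithm behind gzip) as the compressor C; the function names `compressed_len` and `ncd` are illustrative and do not come from the slides.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """C(x): length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    repetitive = b"<a><b/><c/></a>" * 50
    unrelated = bytes(range(256))
    print(ncd(repetitive, repetitive))  # small: the second copy adds little
    print(ncd(repetitive, unrelated))   # larger: little shared information
```

Because a real compressor only approximates Kolmogorov complexity, the values are not guaranteed to lie exactly in [0, 1], and `ncd(x, x)` is typically small but not zero.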

Measuring Structural Similarity
- Just compressing XML files does not get the job done
- Extract structural information first:
  - Tags: list element/attribute names in document order
  - Pairwise: like tags, but with names of parents
  - Path: like tags, but with full path to root
  - Family order: family-order traversal of document
- Except Path, all extractions can be done in linear time
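
The first three extraction variants can be sketched in a few lines. This is an illustrative sketch only, assuming Python's standard `xml.etree.ElementTree` parser; the function name `extract_structure` and the `parent/child` and `/a/b` output notations are assumptions, attribute names are left out for brevity, and the family-order variant is omitted because the slides do not define the traversal.

```python
import xml.etree.ElementTree as ET


def extract_structure(xml_text: str):
    """Return (tags, pairwise, paths) for the elements in document order."""
    root = ET.fromstring(xml_text)
    tags, pairwise, paths = [], [], []

    def walk(node, parent_tag, parent_path):
        full_path = parent_path + "/" + node.tag
        tags.append(node.tag)                      # Tags: element name only
        if parent_tag is None:
            pairwise.append(node.tag)              # the root has no parent
        else:
            pairwise.append(parent_tag + "/" + node.tag)  # Pairwise: parent + name
        paths.append(full_path)                    # Path: full path from the root
        for child in node:
            walk(child, node.tag, full_path)

    walk(root, None, "")
    return tags, pairwise, paths


if __name__ == "__main__":
    for variant in extract_structure("<a><b><c/></b><d/></a>"):
        print(variant)
```

The path variant writes out one full path per element, so its output can grow faster than the document itself, which fits the slide's remark that Path is the one extraction that is not linear.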

Measuring Structural Similarity (2)
- After extracting structural information, we use either
  - NCD with gzip, or
  - Ziv-Merhav crossparsing
  to come up with a similarity measure
- Can be done in linear time (with suffix trees)
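
Ziv-Merhav cross-parsing splits one sequence into phrases, each being the longest prefix of the remaining input that occurs somewhere in the other sequence; the fewer phrases needed, the more similar the two sequences are. The sketch below is a naive version for illustration, not the linear-time suffix-tree variant mentioned on the slide. The names `contains` and `cross_parse_count` are hypothetical, and the slides do not say how the phrase count is normalized into a similarity value, so that step is left out.

```python
def contains(haystack, needle) -> bool:
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    n, m = len(haystack), len(needle)
    return any(haystack[i:i + m] == needle for i in range(n - m + 1))


def cross_parse_count(z, x) -> int:
    """Number of phrases when parsing sequence z with respect to sequence x.

    Each phrase is the longest prefix of the unparsed rest of z that occurs
    in x; a symbol that does not occur in x at all forms a phrase on its own.
    """
    count = 0
    i = 0
    while i < len(z):
        length = 0
        # Extend the phrase while it still occurs somewhere in x.
        while i + length < len(z) and contains(x, z[i:i + length + 1]):
            length += 1
        i += max(length, 1)  # always consume at least one symbol
        count += 1
    return count


if __name__ == "__main__":
    tags_a = ["doc", "title", "author", "sec", "par", "par"]
    tags_b = ["doc", "title", "sec", "par", "par", "par"]
    print(cross_parse_count(tags_a, tags_b))
    print(cross_parse_count(tags_a, ["x", "y", "z"]))
```

The functions work on strings as well as on lists of tag names, since both support slicing and equality comparison.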

Competitors
- Tree-editing distance (Nierman and Jagadish):
  - measuring the minimum editing distance
  - five different edit operations: relabel, insert & delete node, insert & delete (sub-)tree
- Quadratic runtime

Competitors (2)
- Discrete Fourier Transformation (Flesca et al.):
  - encode XML document as a time series
  - rotate document by 90°, interpret indentations as time series
  - use DFT to compute similarity
- Runtime: N log N (N: size of larger document)
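
To make the "indentation as time series" idea concrete, here is a rough sketch under several assumptions: the signal is just the nesting depth of each element in document order, the shorter signal is zero-padded, and documents are compared by the Euclidean distance between DFT magnitudes. None of these choices is confirmed by the slides, the encodings actually evaluated (direct ML, pairwise ML) are not reproduced, and all function names are illustrative. The DFT is written out naively in quadratic time for clarity; an FFT would give the N log N runtime quoted above.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def depth_series(xml_text: str):
    """Nesting depth of every element in document order (the 'indentation')."""
    series = []

    def walk(node, depth):
        series.append(depth)
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return series


def dft_magnitudes(signal, n: int):
    """Magnitudes of the n-point DFT of `signal`, zero-padded to length n."""
    padded = list(signal) + [0] * (n - len(signal))
    return [
        abs(sum(padded[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def dft_distance(xml_a: str, xml_b: str) -> float:
    """Euclidean distance between the DFT magnitude spectra of two documents."""
    a, b = depth_series(xml_a), depth_series(xml_b)
    n = max(len(a), len(b))
    mags_a, mags_b = dft_magnitudes(a, n), dft_magnitudes(b, n)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(mags_a, mags_b)))


if __name__ == "__main__":
    flat = "<a><b/><c/></a>"
    nested = "<a><b><c/></b></a>"
    print(dft_distance(flat, flat))
    print(dft_distance(flat, nested))
```

Since this depth-only encoding ignores element names entirely, two documents that differ only in their tag names get distance zero.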

Competitors (3)
- Path shingles (Buttler):
  - extract structural information using the Full Path variant
  - compute a hash value h_j for each path
  - a shingle of width w is the combination of w consecutive hash values
  - compute similarity between two documents using Dice coefficient on the two sets of shingles
- Original version is not linear, can be made linear by using different extraction technique
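
The shingling steps listed above can be sketched as follows. This is a minimal illustration, assuming CRC32 from Python's standard `zlib` module as the hash function (the slides do not name one) and taking the list of full paths as given; the function names are hypothetical.

```python
import zlib


def path_hashes(paths):
    """One hash value h_j per path (CRC32 is an assumed choice of hash)."""
    return [zlib.crc32(p.encode("utf-8")) for p in paths]


def shingles(hashes, w: int):
    """Set of shingles: every run of w consecutive hash values."""
    return {tuple(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two shingle sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def shingle_similarity(paths_a, paths_b, w: int = 2) -> float:
    """Dice coefficient on the shingle sets of two path lists."""
    return dice(shingles(path_hashes(paths_a), w),
                shingles(path_hashes(paths_b), w))


if __name__ == "__main__":
    doc_a = ["/a", "/a/b", "/a/c", "/a/d"]
    doc_b = ["/a", "/a/b", "/a/c", "/a/e"]
    print(shingle_similarity(doc_a, doc_b, w=2))
```

In the usage example the two documents share two of their three width-2 shingles, so the Dice coefficient is 2·2 / (3 + 3) = 2/3.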

Clustering Quality
- Measure quality of similarity measure by clustering
- We used hierarchical agglomerative clustering
- Quality expressed in number of misclusterings in dendrogram
[Figure: dendrogram whose leaves are labelled doc-DTD1 and doc-DTD2]

Document Collections
- We used three different document collections for experimental evaluation:
  - Real data sets: SIGMOD record, INEX 2005, music sheets encoded in XML
  - Synthetically generated data sets from the DFT paper
  - Own synthetically generated data sets, varying:
    - element names
    - element frequencies
    - element positions
    - element depths

Overall Results

  Method      Variant        Result
  ----------  -------------  ------
  gzip        simple         26.1%
  gzip        tags           17.7%
  gzip        pairwise       20.8%
  gzip        full path      16.9%
  gzip        family order   18.9%
  tree-edit                  15.3%
  DFT         direct ML      22.4%
  DFT         pairwise ML    19.7%
  Shingles    tags           20.4%
  Shingles    pairwise       17.8%
  Shingles    full path      15.3%
  Ziv-Merhav  tags           11.7%
  Ziv-Merhav  pairwise       13.8%
  Ziv-Merhav  full path      11.3%
  Ziv-Merhav  family order   10.6%

More Detailed Results
- Different methods have different strengths and weaknesses:
  - tree-edit: generally good, has problems with largely varying document sizes
  - DFT: good at frequencies, bad at element names, position, and depth
  - gzip/Ziv-Merhav: bad at frequencies, good at element names, position, and depth
- DFT and gzip/Ziv-Merhav are complementary to each other; idea: combine them

Hybrid Version

  Hybrid (DFT/Ziv-Merhav)  Result
  -----------------------  ------
  pairw. ML/tags            8.8%
  pairw. ML/pairw.         12.4%
  pairw. ML/path            9.7%
  pairw. ML/family         21.4%

- Clustering performance becomes even better (except family order)
- Hybrid approach does not have linear run time

Conclusion and Outlook
- Our approach is totally different from previous approaches
- Can be done in linear time (important for large document collections)
- Possible future work:
  - more sophisticated ways of encoding document structure?
  - different entropy measure better suited for structural information?
