A Semantic Similarity Measure for Formal Ontologies
Mark Hall
Final presentation for the master thesis, 17.03.2006

Overview I
- A Semantic Similarity Model and Algorithm
  - Motivation - Heterogeneous Data
  - Ontologies
  - Similarity Measures
  - Hybrid Model & Similarity Calculation
  - Application
- Evaluation of the Model and Algorithm
- Summary

Heterogeneous Data
[Figure: speech bubbles in different languages: "Hello, is this the train station?", "Je m’appelle Jane et je t’emmerde", "Frische Tomaten! Frische Tomaten!"]
- Heterogeneous data sources are the norm
- Integration poses two main problems
  - Syntactic differences
  - Semantic differences

Integration = Matching
[Figure: syntactic mismatch: "I have a thousand CDs" vs. "J’ai 1000 disques"; semantic mismatch: "Forest: A wooded area" vs. "Forest: Land that belongs to the forestry commission"]
- Integration depends on finding matches
  - Syntactic problems
  - Semantic problems
- Matching requires similarity and a similarity measure

Ontologies
[Figure: example ontology with the concepts Houses, Shack, Villa, Industry, Iron Foundry, Textiles]
- Ontology is the study of the existence of entities
- Shared specification of domain knowledge
- Ontologies are one way to encode semantics
- Tool of choice for the Semantic Web

Similarity Measure
- Provide a means of comparing two entities to determine how similar they are
- Not based on a cognitive model
  - Description Logics, Word based, Structure based
- Based on a cognitive model
  - Feature, Network, Cognitive Spaces
- Hybrid Semantic Similarity Measure

Hybrid Cognitive Model I
[Figure: example features: Red, Filled, Round; Blue, Filled, Square]
- Combines the approaches of the feature and the network model
- Basis is the feature model, but each feature has an inner structure in the form of the network model

Hybrid Cognitive Model II
[Figure: class Forest with the properties Has Surface, Has Vegetation, Has Use]
- Every class is represented by a set of properties
- Shared vocabulary is structured hierarchically
- Property values reference a shared vocabulary
- Property value ranges are sets of shared vocabulary terms

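One way to picture the hybrid model as a data structure is sketched below. This is only an illustration under stated assumptions: the vocabulary terms ("Forest vegetation", "Coniferous trees", and so on), the parent-pointer representation, and the names `VOCABULARY`, `ancestors`, `Property` and `OntologyClass` are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field

# Hypothetical shared vocabulary: each term points to its broader term
# (None marks the root). The terms themselves are invented for illustration.
VOCABULARY = {
    "Vegetation": None,
    "Forest vegetation": "Vegetation",
    "Coniferous trees": "Forest vegetation",
    "Broad-leaved trees": "Forest vegetation",
}


def ancestors(term, vocabulary=VOCABULARY):
    """Return the broader terms above `term`, nearest first."""
    chain = []
    parent = vocabulary[term]
    while parent is not None:
        chain.append(parent)
        parent = vocabulary[parent]
    return chain


@dataclass(frozen=True)
class Property:
    name: str               # e.g. "Has Vegetation"
    value_range: frozenset  # set of shared-vocabulary terms


@dataclass
class OntologyClass:
    name: str
    properties: list = field(default_factory=list)


# A class is just a named set of properties whose ranges reference the vocabulary.
coniferous_forest = OntologyClass(
    "Coniferous Forest",
    [Property("Has Vegetation", frozenset({"Coniferous trees"}))],
)
```

The hierarchy lives in the vocabulary rather than in the classes, which mirrors the slide's point that each feature has an inner network structure.
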
Similarity Calculation I
[Figure: Coniferous Forest and Broad-leaved Forest, each with the properties Has Surface, Has Vegetation, Has Use]
- Similarity of two classes is the aggregate of the similarities of their properties
- Property similarities can be weighted to emphasise certain aspects

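The weighted aggregation can be sketched as follows. The slide only says "aggregate", so the weighted mean used here is an assumption, as are the function name `class_similarity`, the default weight of 1.0, and the example similarity values.

```python
def class_similarity(property_similarities, weights=None):
    """Weighted mean of per-property similarities, each expected in [0, 1].

    Assumption: the aggregate is a weighted mean; properties without an
    explicit weight count with weight 1.0.
    """
    if not property_similarities:
        return 0.0
    weights = weights or {}
    weighted_sum = 0.0
    total_weight = 0.0
    for name, similarity in property_similarities.items():
        weight = weights.get(name, 1.0)
        weighted_sum += weight * similarity
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical property similarities for Coniferous vs. Broad-leaved Forest.
sims = {"Has Surface": 1.0, "Has Vegetation": 0.5, "Has Use": 1.0}

unweighted = class_similarity(sims)
# Doubling the weight of vegetation pulls the result towards its lower score.
emphasised = class_similarity(sims, {"Has Vegetation": 2.0})
```

With these made-up values, emphasising "Has Vegetation" lowers the class similarity because that property is where the two classes differ.
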
Similarity Calculation II
[Figure: matching properties Has Surface and Has Vegetation of two classes]
- Properties are matched based on their quantifier and name
- Similarity for two matching properties is the similarity of their ranges

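A minimal sketch of the matching step follows. The slides do not say how range similarity is computed, so a plain Jaccard overlap is swapped in as a stand-in. It ignores the hierarchical structure of the shared vocabulary, which the actual measure presumably uses. The quantifier labels "some" and "all", the range values, and the function names are hypothetical.

```python
def range_similarity(range_a, range_b):
    """Jaccard overlap of two value ranges (a stand-in, not the thesis measure)."""
    if not range_a and not range_b:
        return 1.0
    return len(range_a & range_b) / len(range_a | range_b)


def match_properties(props_a, props_b):
    """Pair properties that share the same (quantifier, name) key and score each pair."""
    scores = {}
    for key, range_a in props_a.items():
        if key in props_b:
            scores[key] = range_similarity(range_a, props_b[key])
    return scores


# Hypothetical class descriptions: (quantifier, property name) -> value range.
coniferous = {
    ("some", "Has Surface"): {"Soil"},
    ("some", "Has Vegetation"): {"Coniferous trees"},
}
broad_leaved = {
    ("some", "Has Surface"): {"Soil"},
    ("some", "Has Vegetation"): {"Broad-leaved trees"},
    ("all", "Has Use"): {"Forestry"},
}

scores = match_properties(coniferous, broad_leaved)
```

Here "Has Use" produces no score because only one class defines it. How unmatched properties should affect the overall result is left open in this sketch.
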
Application - HarmonISA
[Figure: land-use data sets: Austria - Realraumanalyse, Slovenia - Corine, Italy - Moland]

Overview II
- A Semantic Similarity Model and Algorithm
- Evaluation of the Model and Algorithm
  - Expert evaluation
  - Shortcomings of the model
  - Modelling errors
  - Performance analysis
- Summary

Expert Evaluation I
- Mappings evaluated by domain experts
  - Realraumanalyse => Corine: 136 total / 116 correct / 20 incorrect
  - Corine => Realraumanalyse: 64 total / 34 correct / 30 incorrect
- Incorrect mappings grouped by reasons
  - Shortcomings of the model
  - Modelling errors
  - Correct but reclassified

Model Shortcomings I
[Figure: Knee timber example with the labels Vegetation, Surface, "refers to", Knee timber, Alpine turf, Rocks, 90% : 10%, 80% : 20%]
- Non built-up areas belonging to the public administration
  - No negation possible
- Knee timber partially with rocks and alpine turf
  - Internal structure and relations between properties can’t be defined

Model Shortcomings II
[Figure: Alluvial Forest and River]
- No relations between concepts in the land-use ontologies
- Workaround via special properties such as “Lies next to”

Model Shortcomings III
[Figure: skeleton ontology excerpt with the labels "Elevation", "Greenland", "Mountain", "Sub alpine", "Alpine", "and Woods", "Pasture", "higher than"]
- No relations between concepts in the skeleton ontology

Modelling Errors
- Additional incorrect knowledge specified
  - Bare Rocks, which included a value for the property Vegetation
- Knowledge left out or none specified
  - Green Urban Areas, which somehow managed to have only one property specified
- Incorrect metadata
  - Incomplete settlement along a road, which the metadata specified as belonging to the continuous urban fabric and which was thus modelled as such

Reclassification of concepts
- Correctly mapped to the most similar concept, but would be handled differently by the experts
  - Sea and Ocean, Olive Groves, Annual crops associated with permanent crops
- Suggested strategy for dealing with these
  - Leave them out. Create no mapping
  - Reclassify based on additional knowledge
    - Some knowledge could be added to the system
    - Some knowledge is basically a hunch

Expert Evaluation II
- Initial evaluation result not too good
  - Realraumanalyse => Corine: 85% correct
  - Corine => Realraumanalyse: 53% correct
- Analysis of errors revealed (out of a total of 200 mappings):
  - 3 erroneous mappings due to model shortcomings
  - 17 erroneous mappings due to modelling errors
  - 30 reclassifications of correct mappings

Expert Evaluation III
- Modelling errors can be corrected
- Reclassifications are not actual errors but differing methodologies
- Updated number of correct mappings
  - Realraumanalyse => Corine: 134 out of 136 (98%)
  - Corine => Realraumanalyse: 63 out of 64 (98%)
- Analysis of the evaluations of the other mappings reveals an average error rate between 0 and 5%

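The percentages on the evaluation slides can be rechecked from the reported counts. The snippet below uses only numbers from the slides; the variable and function names are illustrative.

```python
# Counts as reported on the evaluation slides.
totals = {"Realraumanalyse => Corine": 136, "Corine => Realraumanalyse": 64}
initially_correct = {"Realraumanalyse => Corine": 116, "Corine => Realraumanalyse": 34}
corrected = {"Realraumanalyse => Corine": 134, "Corine => Realraumanalyse": 63}


def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


for direction, total in totals.items():
    before = percent(initially_correct[direction], total)
    after = percent(corrected[direction], total)
    print(f"{direction}: {before:.1f}% -> {after:.1f}%")

# Mappings still counted as wrong after the corrections.
remaining_errors = sum(totals.values()) - sum(corrected.values())
```

The three mappings that remain incorrect out of 200 match the three errors attributed to model shortcomings.
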
Performance Analysis I
[Figure: Corine and Realraum concept hierarchies labelled Source and Target, with concepts such as Agricultural, Artificial, Arable, Pastures, Settlement, Dense, Transport]
- Every source concept is matched to each target concept and then the best is selected.

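The exhaustive matching strategy can be sketched like this. The function name `best_mappings`, the word-overlap `toy_similarity`, and the concept names are placeholders for illustration. The real system would plug in the semantic similarity measure instead.

```python
def best_mappings(source_concepts, target_concepts, similarity):
    """Compare every source concept with every target concept and keep the best target."""
    mappings = {}
    for source in source_concepts:
        best = max(target_concepts, key=lambda target: similarity(source, target))
        mappings[source] = (best, similarity(source, best))
    return mappings


def toy_similarity(a, b):
    """Placeholder measure: share of words the two concept names have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical concept names, loosely following the labels in the figure.
source = ["Arable land", "Pastures"]
target = ["Arable", "Dense settlement", "Pastures"]

mappings = best_mappings(source, target, toy_similarity)
```

The number of comparisons is the product of the two ontology sizes, so mapping the 136 Realraumanalyse concepts to the 64 Corine concepts takes 136 x 64 = 8704 similarity calculations.
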
Performance Analysis II
[Figure: processing pipeline with the labels Heuristics, OWL, Hierarchy, Mapping, Static, DL Reasoning, Similarity Calculation, Hierarchy]
- Total complexity of the similarity calculation: polynomial time (O(N^5))
- Loading and hierarchy calculation in Description Logics: exponential time
- Optimisation required for larger ontologies
  - Removing the Description Logics reasoning
  - Heuristics / Parallelisation for the similarity calculation

Performance Analysis III

Ontology          # Concepts   Avg # Prop.   Load Time
Corine            64           3             31sec
Moland            96           5             3min 19sec
Realraumanalyse   136          6             5min 16sec

From / To         Corine   Moland   Realraumanalyse
Corine            5sec     10sec    15sec
Moland            11sec    20sec    31sec
Realraumanalyse   18sec    34sec    52sec

Overview III
- A Semantic Similarity Model and Algorithm
- Evaluation of the Model and Algorithm
- Summary

Summary
- Cognitive model is capable of describing most real-world situations
- Similarity algorithm works sufficiently well to be used in real-world situations (average correctness above 95%)
- Performance is the major bottleneck. Without improvement it is unusable for larger ontologies
- Cognitive model needs to be extended in some areas

Statistics
- 101 pages (a nice prime number)
  - 77 pages with actual content
  - 24 pages of structural padding
- 6 Chapters (average 12.8 pages per chapter)
- 29208 words
  - Average of 379 words per page
  - Most frequent word: similarity (239x)
- 62 Figures and 3 Tables
- 65 References
- Total size: Source: 1.5MB, PDF: 1.4MB

Questions, Comments, Praise
Thank you for listening