
COMA – A system for flexible combination of schema matching approaches

Hong-Hai Do, Erhard Rahm, University of Leipzig, Germany (dbs.uni-leipzig.de)


  1. COMA – A system for flexible combination of schema matching approaches. Hong-Hai Do, Erhard Rahm, University of Leipzig, Germany, dbs.uni-leipzig.de

  2. Content
     - Motivation
     - The COMA approach
     - Comprehensive matcher library
     - Flexible combination scheme
     - Novel reuse-oriented match approach
     - Evaluation setup and results
     - Conclusions and future work

  3. Motivation
     - Schema matching: finding semantic correspondences between two schemas
     - Crucial step in many applications
       - Data integration: mediators, data warehouses
       - E-business: XML message mapping
       - ...
     - Currently manual, time-consuming, and tedious
     - Need for approaches that automate the task as much as possible
     - Example schemas (figure):
       - PO1: ShipTo (shipToStreet, shipToCity, shipToZip), Customer (custName, custStreet, custCity, custZip)
       - PO2: DeliverTo, BillTo, Address (Street, City, Zip)
     - Example correspondence: PO1.ShipTo.shipToCity ↔ PO2.DeliverTo.Address.City

  4. Individual Match Approaches
     - Schema-based
       - Element, linguistic: names, descriptions
       - Element, constraint-based: types, keys
       - Structure, constraint-based: parents, children, leaves
     - Instance-based
       - Element, linguistic: IR (word frequencies, key terms)
       - Element, constraint-based: value pattern and ranges
     - Reuse-oriented
       - Element: dictionaries, thesauri
       - Structure: previous match results
     - Survey paper [Rahm, Bernstein - VLDB Journal'01]

  5. Combining Match Approaches
     - Combination of match algorithms
       - Hybrid: fixed combination, difficult to extend and improve
         - currently most common: Cupid, SemInt, SimilarityFlooding, DIKE, MOMIS, TranScm
       - Composite: combination of the results of independently executed matchers
         - currently only for machine learning-based techniques: LSD, GLUE
     - COMA: framework for flexible COmbination of MAtch algorithms
       - Extensible matcher library
       - Combination scheme with various combination strategies

  6. System Architecture
     [Figure: system architecture. Components shown: schema import (S1, S2); matcher library (Matcher 1, Matcher 2, Matcher 3, UserFeedback); match iteration consisting of matcher execution, the similarity cube, and combination of match results (S1 → S2, S2 → S1) according to the combination scheme; resulting mapping; optional user interaction.]

  7. Combination Scheme
     Similarity cube (matchers × S1 × S2) → similarity matrix → match results → combined similarity, e.g. [S1, S2, 0.7]
     1. Aggregation of matcher-specific results: Average, Max, Min, Weighted
     2. Match direction: SmallLarge, LargeSmall, Both (S1 → S2, S2 → S1)
     3. Selection of match candidates: MaxN (Max1), Threshold, MaxDelta, Threshold+MaxN, Threshold+MaxDelta
     4. Computation of combined similarity: Dice, Average
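
The aggregation step of the combination scheme can be illustrated with a minimal Python sketch. It is not COMA's implementation: the function name `aggregate`, the dictionary layout of the similarity cube, the treatment of missing values as 0.0, and the normalization of weights are all assumptions made for illustration. The input values are the ones used on the example slide.

```python
def aggregate(cube, strategy="Average", weights=None):
    """Collapse {matcher: {(s1_elem, s2_elem): sim}} into one similarity matrix."""
    matchers = sorted(cube)
    pairs = sorted({pair for name in matchers for pair in cube[name]})
    matrix = {}
    for pair in pairs:
        # Assumption: a matcher that reports nothing for a pair contributes 0.0.
        values = [cube[name].get(pair, 0.0) for name in matchers]
        if strategy == "Max":
            matrix[pair] = max(values)
        elif strategy == "Min":
            matrix[pair] = min(values)
        elif strategy == "Average":
            matrix[pair] = sum(values) / len(values)
        elif strategy == "Weighted":
            # Assumption: weights are given per matcher and normalized by their sum.
            total = sum(weights[name] for name in matchers)
            matrix[pair] = sum(weights[n] * v for n, v in zip(matchers, values)) / total
        else:
            raise ValueError(f"unknown aggregation strategy: {strategy}")
    return matrix


# Similarity cube with two matchers and two element pairs (values from the example slide).
cube = {
    "Matcher1": {("shipToCity", "City"): 0.6, ("shipToStreet", "City"): 0.8},
    "Matcher2": {("shipToCity", "City"): 0.8, ("shipToStreet", "City"): 0.4},
}
for strategy in ("Max", "Min", "Average"):
    print(strategy, aggregate(cube, strategy))
```

Max is optimistic and Min pessimistic, while Average lets a low score from one matcher be offset by a high score from another.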

  8. Match Processing: Example
     1. Aggregation (Average)
        - shipToCity ↔ City: Matcher1 0.6, Matcher2 0.8, Average 0.7
        - shipToStreet ↔ City: Matcher1 0.8, Matcher2 0.4, Average 0.6
     2. Direction (LargeSmall, |S1| > |S2|): match candidates for the smaller schema S2
        - City: shipToCity 0.7, shipToStreet 0.6
     3. Selection
        - Max1: City ↔ shipToCity 0.7
        - Threshold (0.5): City ↔ shipToCity 0.7, City ↔ shipToStreet 0.6
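
The direction and selection steps of this example can be replayed in a few lines of Python. This is a sketch under assumptions, not COMA code: the helper names (`candidates_for`, `select_max_n`, `select_threshold`) and the data layout are hypothetical, and the threshold is treated as a strict lower bound.

```python
def candidates_for(matrix, side):
    """Rank match candidates per element of one schema (side 0 = S1, side 1 = S2)."""
    ranked = {}
    for (e1, e2), sim in matrix.items():
        target, candidate = (e1, e2) if side == 0 else (e2, e1)
        ranked.setdefault(target, []).append((candidate, sim))
    for target in ranked:
        ranked[target].sort(key=lambda c: c[1], reverse=True)
    return ranked


def select_max_n(ranked, n=1):
    """MaxN: keep the n best candidates per element."""
    return {target: cands[:n] for target, cands in ranked.items()}


def select_threshold(ranked, threshold=0.5):
    """Threshold: keep candidates above the threshold (strictness is an assumption)."""
    return {t: [c for c in cands if c[1] > threshold] for t, cands in ranked.items()}


# Aggregated (Average) similarities from the slide; keys are (S1 element, S2 element).
matrix = {("shipToCity", "City"): 0.7, ("shipToStreet", "City"): 0.6}

# LargeSmall with |S1| > |S2|: find candidates for the elements of the smaller schema S2.
ranked = candidates_for(matrix, side=1)
print(select_max_n(ranked, 1))
print(select_threshold(ranked, 0.5))
```

With Max1 only shipToCity survives for City, whereas Threshold (0.5) keeps both candidates, matching the two result tables on the slide.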

  9. Matcher Library

     | Type | Matcher | Schema Info | Auxiliary Info | Constituent Matchers |
     |---|---|---|---|---|
     | Simple | Affix | Element names | – | – |
     | Simple | n-gram | Element names | – | – |
     | Simple | Soundex | Element names | – | – |
     | Simple | EditDistance | Element names | – | – |
     | Simple | Synonym | Element names | External dictionaries | – |
     | Simple | DataType | Data types | Data type compatibility table | – |
     | Simple | UserFeedback | – | User-specified (mis-)matches | – |
     | Hybrid | Name | Element names | – | Affix, 3-Gram, Synonym |
     | Hybrid | TypeName | Data types + names | – | DataType, Name |
     | Hybrid | NamePath | Names + paths | – | Name |
     | Hybrid | Children | Child elements | – | TypeName |
     | Hybrid | Leaves | Leaf elements | – | TypeName |
     | Reuse-oriented | Schema | – | Existing schema-level match results | – |
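
To make one of the simple matchers concrete, the following Python sketch compares element names by their sets of 3-grams. The slide only names the n-gram matcher; scoring the overlap with the Dice coefficient, lower-casing the names, and the function names are assumptions for illustration and may differ from what COMA does.

```python
def ngrams(name, n=3):
    """Return the set of character n-grams of a lower-cased element name."""
    s = name.lower()
    if len(s) < n:
        return {s}  # assumption: names shorter than n count as a single gram
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two n-gram sets, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


for s1_name in ("shipToCity", "shipToStreet", "custCity"):
    print(s1_name, "City", round(ngram_similarity(s1_name, "City"), 2))
```

A purely name-based matcher like this cannot tell shipToCity from custCity when matching against City, which is one reason the library combines it with other matchers.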

  10. Reuse-oriented Matching
     [Figure: match results m1 and m2 connect the S1 elements firstName and lastName to the S2 element Name via the intermediate schema S (FName, LName), with similarities 0.8, 0.6, 0.7, and 0.6. m = MatchCompose(m1, m2) yields firstName ↔ Name with 0.7 and lastName ↔ Name with 0.65.]
     - The MatchCompose operation: transitivity of element similarity
       - Composition of similarity relationships
     - Reuse of multiple match correspondences
       - vs. reuse of single element-level correspondences from synonym tables and thesauri
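
A small Python sketch shows how such a composition could be computed. It assumes that the similarity along a path is the average of its two steps, which reproduces the 0.7 and 0.65 on the slide; the exact assignment of the four input values to m1 and m2 is one reading of the flattened figure, and taking the best value when several intermediate elements connect the same pair is a further assumption. The function name `match_compose` is hypothetical.

```python
def match_compose(m1, m2):
    """Compose m1: {(a, x): sim} and m2: {(x, b): sim} into {(a, b): sim}."""
    paths = {}
    for (a, x1), sim1 in m1.items():
        for (x2, b), sim2 in m2.items():
            if x1 == x2:  # both results share the intermediate element
                paths.setdefault((a, b), []).append((sim1 + sim2) / 2)
    # Assumption: with several connecting paths, keep the most similar one.
    return {pair: max(sims) for pair, sims in paths.items()}


# One reading of the slide: S1 -> S in m1, S -> S2 in m2.
m1 = {("firstName", "FName"): 0.8, ("lastName", "LName"): 0.6}
m2 = {("FName", "Name"): 0.6, ("LName", "Name"): 0.7}

for pair, sim in match_compose(m1, m2).items():
    print(pair, round(sim, 2))
```

Averaging rather than multiplying keeps composed similarities from decaying quickly, but transitivity can still produce wrong correspondences, which the schema-level reuse approach tries to limit.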

  11. Schema-level Reuse
     - The Schema matcher
       [Figure: for a match problem S1 ↔ S2, the matcher searches the repository for existing match results (S1 ↔ Si, S2 ↔ Si; S1 ↔ Sj, Sj ↔ S2; Sk ↔ S1, S2 ↔ Sk), applies MatchCompose to obtain S1 ↔ S2 results, stores them in a similarity cube, and derives the match result via Aggregation, Direction, and Selection.]
     - Reuse complete match results at the schema level
     - Exploit all possible reuse opportunities
     - Limit negative effects of transitivity

  12. Real-world Evaluation
     - 5 real-world schemas (XML, purchase order), 10 match tasks
       - CIDX, Excel, Noris, Paragon, Apertum from biztalk.org
       - 40-145 elements
     - Systematic evaluation (automatic mode)
       - 1 series = 10 experiments: test of 1 configuration of (Matcher, Aggregation, Direction, Selection, Combined similarity) with 10 match tasks
       - 12,312 series = 123,120 experiments

     | | Matchers | Aggregation | Direction | Selection | Combined Sim |
     |---|---|---|---|---|---|
     | No reuse | 5 single, 11 combinations | Max, Average, Min | LargeSmall, SmallLarge, Both | MaxN (1-4); Delta (0.01-0.1); Threshold (0.3-1.0); Threshold (0.5)+MaxN (1-4); Threshold (0.5)+Delta (0.01-0.1) | Average, Dice |
     | Reuse | 2 single, 12 combinations | Max, Average, Min | (shared with the row above) | (shared with the row above) | (shared with the row above) |
     | Σ | 16 + 14 | 3 | 3 | 36 | 2 |

  13. Match Quality Measures
     - Comparison of automatically derived with manually derived (i.e. real) match correspondences
       - A: missed matches (real, but not suggested)
       - B: correct matches (real and suggested)
       - C: false matches (suggested, but not real)
     - Quality measures:

       $$\mathit{Precision} = \frac{B}{B + C}, \qquad \mathit{Recall} = \frac{B}{A + B}$$

       $$\mathit{Overall} = 1 - \frac{A + C}{A + B} = \frac{B - C}{A + B} = \mathit{Recall} \cdot \left(2 - \frac{1}{\mathit{Precision}}\right)$$

       (Overall as defined for SimilarityFlooding [ICDE02])
     - Overall: post-match effort to add missed matches and to remove false matches; negative Overall → no gain
     - Computed for single experiments and averaged over the 10 experiments of each series (average Overall, etc.)
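
The three measures follow directly from the set sizes A, B, and C, as the Python sketch below shows. The function name `match_quality` and the sample correspondences are made up for illustration; returning 0.0 when a denominator is zero is an assumption, since the slide does not cover that case.

```python
def match_quality(real, suggested):
    """Return (precision, recall, overall) for sets of match correspondences."""
    real, suggested = set(real), set(suggested)
    a = len(real - suggested)  # A: missed matches
    b = len(real & suggested)  # B: correct matches
    c = len(suggested - real)  # C: false matches
    # Assumption: measures default to 0.0 when their denominator is zero.
    precision = b / (b + c) if b + c else 0.0
    recall = b / (a + b) if a + b else 0.0
    overall = (b - c) / (a + b) if a + b else 0.0
    return precision, recall, overall


# Hypothetical correspondences, not taken from the evaluation.
real = {("shipToCity", "City"), ("shipToStreet", "Street"),
        ("shipToZip", "Zip"), ("custName", "Name")}
good = {("shipToCity", "City"), ("shipToStreet", "Street"),
        ("shipToZip", "Zip"), ("custCity", "City")}
poor = {("shipToCity", "City"), ("custCity", "City"), ("custZip", "Zip")}

print(match_quality(real, good))  # A=1, B=3, C=1
print(match_quality(real, poor))  # A=3, B=1, C=2
```

In the second case more false matches than correct ones are suggested (C > B), so Overall becomes negative: cleaning up the suggestions costs more than matching by hand.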

  14. Results: Combination Strategies (1)
     [Figure: histogram of #Series per average Overall range (Min, 0.0 to 0.8) for the no-reuse series; #All Series = 8208, one bar labeled 7077.]
     [Figure: Aggregation (2376 series/strategy): series share (0% to 100%) of Max, Min, and Average per average Overall range (Min, 0.0 to 0.8). Annotation: Average is used by all series with average Overall > 0.6.]
     - Most no-reuse series have a negative average Overall
     - "Good" matcher/strategy:
       - Positive average Overall
       - High presence in higher Overall ranges
     - Aggregation: Average (compensating)

  15. Results: Combination Strategies (2)
     [Figure: Direction (2736 series/strategy): series share of SmallLarge, LargeSmall, and Both per average Overall range (Min, 0.0 to 0.8).]
     [Figure: Computation of combined similarity (4104 series/strategy): series share of Dice and Average per average Overall range.]
     [Figure: Best selection (228 series/strategy): series share of Thr(0.8), MaxN(1), Thr(0.5)+MaxN(1), Delta(0.02), and Thr(0.5)+Delta(0.02) per average Overall range.]
     - Direction: Both (considering both directions)
     - Selection: Threshold+Delta (above threshold + within tolerance)
     - Combined similarity: Average (pessimistic)
     - Matcher: All (combination of all hybrid matchers)
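
The Threshold+Delta selection ("above threshold + within tolerance") can be sketched in Python as follows. The slide does not say whether the tolerance is absolute or relative to the best similarity; this sketch assumes an absolute tolerance, and the function name and candidate values are hypothetical.

```python
def select_threshold_delta(candidates, threshold=0.5, delta=0.02):
    """Keep candidates above the threshold and within `delta` of the best one."""
    above = [(name, sim) for name, sim in candidates if sim > threshold]
    if not above:
        return []
    best = max(sim for _, sim in above)
    # Assumption: delta is an absolute difference to the best similarity.
    return [(name, sim) for name, sim in above if best - sim <= delta]


# Hypothetical candidates for the S2 element City.
candidates = [("shipToCity", 0.70), ("custCity", 0.69),
              ("shipToStreet", 0.60), ("custName", 0.30)]

print(select_threshold_delta(candidates, threshold=0.5, delta=0.02))
print(select_threshold_delta(candidates, threshold=0.8, delta=0.02))
```

Unlike MaxN(1), this keeps several near-equal candidates, and unlike a plain threshold it drops clearly weaker ones such as shipToStreet.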

  16. Results: Single Matchers
     [Figure: a) Single matchers. Average Precision, average Recall, and average Overall for the no-reuse matchers Name, NamePath, TypeName, Children, and Leaves and for the reuse matchers SchemaM and SchemaA.]
     - SchemaM: Schema with manually derived (real) match results
     - SchemaA: Schema with match results automatically derived using the default match operation
     - Instability of some single (hybrid) matchers (negative Overall) because of shared elements
       - E.g. DeliverTo.Address and BillTo.Address
       - Considering hierarchical names (NamePath) is more accurate
     - Schema-level reuse is very effective: essential improvement over no-reuse hybrid matchers
       - Reusing approved match results is better than reusing automatically derived match results
