multi column substring matching for database schema
play

Multi-column Substring Matching For Database Schema Translation - PowerPoint PPT Presentation

Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Multi-column Substring Matching For Database Schema Translation Robert H. Warren 1 Dr. Frank Wm Tompa 1 1 David R. Cheriton School of Computer Science


  1. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Multi-column Substring Matching For Database Schema Translation Robert H. Warren 1 Dr. Frank Wm Tompa 1 1 David R. Cheriton School of Computer Science University of Waterloo Waterloo, Canada The 32nd Very Large Database Conference, 2006 Warren & Tompa Multi-column Substring Matching

  2. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Database Integration Objective A generalisable method capable of resolving complex schema matches and the translation required to convert the instance data using substrings concatenation. Example 1 leftmost characters of column lastname + 2 rightmost characters of column birthdate → column userid Name in database D → First + Last in database D ′ 2005/05/29 in database D → 05/29/2005 in database D ′ PartNumber in database D → Number + PlantId + 2 rightmost digits in Year. Warren & Tompa Multi-column Substring Matching

  3. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Database Integration Why is this an important problem? Issues: ...the number and size of databases growing. (+10,000 tables, +1,600 columns) ...integration is an every day issue. (Semantic web, smart clients, dynamic data sources...) ...multiple standards in use. (22 Locales) ...previously we have used top-down approaches. Here we use a data-driven, bottom up approach. ⇒ Need automation to deal with this problem. Warren & Tompa Multi-column Substring Matching

  4. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Database Integration Previous work Rahm and Bernstein present a good taxonomy and discussion of matching problem. [RB01b, RB01a] Basic support for concatenating complete columns in the CUPID system. [MBR01] Embley et al. made use of ontologies to discover such translations. [EXD04] To deal with complex cases, Doan et al. proposed “format learners”. [DDH01] The IMAP system makes use of specific matchers for mathematical relationships. [DLD + 04] ⇒ No known generalisable solution to high-cardinality ( n -to-1), substring concatenations schema translations. Warren & Tompa Multi-column Substring Matching

  5. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Requirements and concerns Automated, un-supervised and data driven. 1 Offload as much of the work to the databases. [KMS04] 2 Client side discovery process, bandwidth << database 3 contents. Part of a larger, automated, database integration system: 4 partial notion of what could/should be a match. “Entity” overlap between database tables unknown but 5 present. Warren & Tompa Multi-column Substring Matching

  6. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Problem formalization Definition For a given target database table T 2 with a target column A ...and a source table T 1 with a set of likely source columns ( B 1 , B 2 , ..., B n ) Find a transformation such that: A = ω 1 + ω 2 + · · · + ω ν Where ω i represents a substring of column B i Translation model t ′ = t [ β x 1 ... y 1 + β x 2 ... y 2 + · · · + β x ν ... y ν ] (chars x ν . . . y ν of col B ν ) 1 2 ν Warren & Tompa Multi-column Substring Matching

  7. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . h kerry robert nawisema . . . kyle s norma jlmalton . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  8. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . h kerry robert nawisema . . . kyle s norma jlmalton . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  9. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 How to infer translation? select substring(first from 1 for 1) || first middle last ??? Login substring(middle from 1 for 1) || last as login . . . h kerry robert nawisema into target_table from source_table. . . . kyle s norma jlmalton . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  10. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 Solution: first middle last ??? Login Iteratively select substrings from “best-fit” . . . h kerry robert nawisema columns while performing a simple form of . . . kyle s norma jlmalton record linkage. . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  11. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (1) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton . . . norma a wiseman rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  12. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 1 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . match . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  13. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 4 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . matches . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  14. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 2 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . matches . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  15. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 4 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . matches . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  16. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 Column Scoring Formula q first middle last ??? Login   t HitCount ( j ) ScoreCol = � . . . robert h kerry nawisema (1) t ∗ length ( key j )   . . . kyle s norma j = 1 jlmalton . . . norma a wiseman rhkerry Where: . . . . . . . . . . . . . . . t is the number of sampled values from source column. . . . amy l case alcase key j is the j -th sampled value. . . . a alderman josh ksokmoan HitCount is the number of q -gram matches for key j . . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

Recommend


More recommend