Schema Mapping Discovery From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt Data and Web Science Group University of Mannheim
Problem Description Target DB Source DB Transformation ? f T = f(S) S Learn Algorithm
Example 1 (GAV Mappings) S T Vegetable ID, Name, NID โง NutrionalValue NID, E, P, C, F โ VegetarianIngredient(ID, Name, E, C, F, P) AnimalProduct ID, Name, ๐๐ก๐ก, NID โง NutrionalValue NID, E, P, C, F โ VegetarianIngredient(ID, Name, E, C, F, P)
Problem In ILP Framework Components โช Background Knowledge B : Tuples from relations of source DB โช (Constraints like Key Relationships, Column Data Types/Domain) โช Positive Examples E + : Tuples from relation(s) in target DB โช Negative Examples E - : Any Tuples not in relations of target DB (CWA) โช Hypothesis H: Set of Schema Mapping Rules B = { Vegetable(โ017โ, โMushroom (white)โ, โ11โ), NutritionalValue(โ11โ, โ16โ, โ2.7gโ, โ0.2gโ, โ2.7gโ), โฆ} E + = { VegetableIngredient(โ017โ, โMushroom (white)โ, โ16โ, โ0.6gโ, โ0.2gโ, โ2.7gโ), โฆ} E - = { VegetableIngredient(โ017โ, โBeef (minced)โ, โ208โ, โ20.5gโ, โ0gโ, โ14gโ), VegetableIngredient(โ032โ, โMushroom (white)โ, โ16โ, โ0.6gโ, โ0.2gโ, โ2.7gโ), โฆ}
Problem In ILP Framework Components โช Background Knowledge B : Tuples from relations of source DB โช Positive Examples E + : Tuples from relation(s) in target DB โช Negative Examples E - : Any Tuples not in relations of target DB โช Hypothesis H: Set of Schema Mapping Rules B = { Vegetable(โ017โ, โMushroom (white)โ, โ11โ), NutritionalValue(โ11โ, โ16โ, โ2.7gโ, โ0.2gโ, โ2.7gโ), โฆ} E + = { VegetableIngredient(โ017โ, โMushroom (white)โ, โ16โ, โ0.6gโ, โ0.2gโ, โ2.7gโ), โฆ} E - = { VegetableIngredient(โ017โ, โBeef (minced)โ, โ208โ, โ20.5gโ, โ0gโ, โ14gโ), VegetableIngredient(โ032โ, โMushroom (white)โ, โ16โ, โ0.6gโ, โ0.2gโ, โ2.7gโ), โฆ} Solution Constraints Requirements โ Posterior Sufficiency: ๐ถ โ H โจ ๐น + โ Prior Satisfiability: ๐ถ โ ๐น โ โญ false (Produce all target tuples) โ Prior Necessity: B โญ ๐น + โ Posterior Satisfiability: ๐ถ โ H โ ๐น โ โญ false (Produce only target tuples)
าง Problem In ILP Framework Components โช Background Knowledge B : Tuples from relations of source DB โช Positive Examples E + : Tuples from relation(s) in target DB โช Negative Examples E - : Any Tuples not in relations of target DB โช Hypothesis H: Set of Schema Mapping Rules Language Bias โช GLAV (Global-And-Local-As-View) Schema Mappings (Source-to-Target Tuple-Generating Dependencies) โเดฅ ๐ฆ ๐ ๐ฆ โ โเดค ๐ง ๐( าง ๐ฆ, เดค ๐ง)
Example 2 (GLAV Mapping) T Person (ID,PName,CID1,EID) โง Employer (EID,EName,CID2 โง City (CID1,CName1,Loc1) โง City (CID2,CName2,Loc2) โ โ CoID,GID1,GID2 : Commute (CoID,PName,GID1,GID2) S โง GeoLocation (GID1,Loc1) โง GeoLocation (GID2,Loc2)
Interesting Problem Dimensions (Future Work) โช Learn Mappings with Constants in Body: โข AnimalProduct ID, Name, ๐๐ก๐ก, NID โง NutrionalValue NID, E, P, C, F โ VegetarianIngredient(ID, Name, E, C, F, P) โช Learn GLAV (not GAV) Mappings: โข Person(ID,PName,CID1,EID) โง Employer(EID,EName,CID2 โง City(CID1,CName1,Loc1) โง City(CID2,CName2,Loc2) โ โ CoID,GID1,GID2 : Commute(CoID,PName,GID1,GID2) โง GeoLocation(GID1,Loc1) โง GeoLocation(GID2,Loc2) โช Learn Mappings with Functions: โข Vegetable ID, Name, NID โง NutrionalValue NID, E, P, C, F โ VegetarianIngredient ID, Name, ๐ ๐ , C, F, P , ๐: ๐๐๐๐ ๐๐ง โฆ ๐๐๐๐ ๐๐ง โ โฒ๐๐๐๐โฒ
Challenges and Aspects โช Scalability โช At least O(100) different Tables on both sides โช Broad Tables/Predicates (e.g. 80 Columnsโฆ) โช Long Tables (GigaBytes of Data) โช Instance Ambiguity โช Instances are String/Number constants โช Language Bias โช Atypical for ILP โช Existential Quantifier in Head โช Multiple Predicates in Head โช Functions?
Work So Far Variation of Problem โช Fixed Schema โข โ๐ฆ 1 โฆ ๐ฆ ๐ : ๐ ๐๐๐ฃ๐ ๐๐ (๐ฆ 1 , โฆ , ๐ฆ ๐ ) โฆ ๐ ๐๐๐ ๐๐๐ข (๐ 1 ๐ฆ 1 , โฆ , ๐ ๐ (๐ฆ ๐ )) โช Transformation Functions on Values (Prefixing, Scaling, Trimming etc.) source target key key key user user user flag flag flag key key user user flag flag 0 000 mfink B 903 900 fink dehn C B 000 000 mfink mfink B B 0 1 001 apeter A 901 peter A 001 001 apeter apeter A A 904 fink B 1 2 002 002 002 fschn fschn fschn X X X 902 901 schn peter A X 2 3 003 003 003 cdehn cdehn cdehn C C C 903 900 fink dehn B C 3 4 004 004 004 hfink hfink hfink B B B 904 902 schn fink X B 4
Work So Far Variation of Problem Table Source โช ID TABNAME TABKEY FLAG Fixed Schema Domain Rules: โช Offset Columns 38500 VBAP 8000000001908000001 V [MANDT] 800 -> 904 (Violation of First Normal-Form) 42100 VBAP 8000000001908000003 V [VBELN] .{2}x -> 77y [KUNNR] .{2}x -> 99x 1780 MARA 8000000054765 X 179 MARA 800000000000000000015 Y Field Rules: Table.ID x{0}* -> x Offset Rules: Table Target Table.TABKEY[TABNAME/FLAG] VBAP / V: ID TABNAME TABKEY FLAG MANDT[0-2]|VBELN[3-12] 385 VBAP 9047700001908000001 V MARA / X: MANDT[0-2]|KUNNR[3-12] 421 VBAP 9047700001908000003 V MARA / X: MANDT[0-2] 178 MARA 90499000054675 X 179 MARA 904000000000000000015 Y
Work So Far Data Set โช Artifical Transformation โช 2 different Transformations defined on 12 domains โช 168MB โช 84 Tables with up to 28000 lines โช Solvable in ~1 Minute โช 26 Domain rules โช 14 Key Mappings โช 4 Fixed Value Transformations โช 1 Trimming Transformation โช 7 Mask Overlay Transformations โช 1 Field rule โช Disable domain rule โช 5 Offset Rules โช 1 Control Column โช Up to 4 different Segmentations with up to 2 transformed domains
Work So Far Example Solution โ Domain Rules: KNUMB: IdTableTransformation KUNNR: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] TDOBNAME: FrontMaskTransformation.{NUMERIC} : xxy" -> "99y" ATINN: IdTableTransformation CLINT: IdTableTransformation ADRNR: IdTableTransformation SYBIN2: IdTableTransformation SYBIN1: FixedValueTransformation.{NUMERIC,TEXT} : x -> "0" USNAM: FixedValueTransformation.{NUMERIC,TEXT} : x -> "ANONYM" J_OBJNR: FrontMaskTransformation.{NUMERIC,TEXT} : xxxxy" - > "xx99yโ AD_SO_KEY: IdTableTransformation FPLNR: IdTableTransformation
Work So Far Example Solution โ Domain Rules (cont): INT2: RearTrimTransformation.{NUMERIC,TEXT} : xxx[.]{3} -> xxx AD_ADDRNUM: IdTableTransformation VBELN: FrontMaskTransformation.{NUMERIC} : xxy" -> "99y" PARNR: IdTableTransformation CK_KALNR: IdTableTransformation PERNR: IdTableTransformation XUBNAME: FixedValueTransformation.{NUMERIC,TEXT} : x -> "ANONYM" MANDT: FixedValueTransformation.{NUMERIC,TEXT} : x -> "904" NA_OBJKEY: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] AD_PERSNUM: IdTableTransformation QMNUM: IdTableTransformation NA_PARNR: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] KUNDE: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000]
Work So Far Example Solution (Offset Rules) โช CDHDR.OBJECTID: [OBJECTCLAS] -> OBJECTID: [DEBI] KUNNR[0-10) | UnchangesSegments(11,..) [VERKBELEG] VBELN[0-10) | UnchangesSegments(11,..) [BANK] MANDT[0-3) | UnchangesSegments(4,..) [MATERIAL] UnchangedSegments
Questions / Ideas?
Recommend
More recommend