from example data using ilp
play

From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt - PowerPoint PPT Presentation

Schema Mapping Discovery From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt Data and Web Science Group University of Mannheim Problem Description Target DB Source DB Transformation ? f T = f(S) S Learn Algorithm Example


  1. Schema Mapping Discovery From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt Data and Web Science Group University of Mannheim

  2. Problem Description Target DB Source DB Transformation ? f T = f(S) S Learn Algorithm

  3. Example 1 (GAV Mappings) S T Vegetable ID, Name, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient(ID, Name, E, C, F, P) AnimalProduct ID, Name, ๐Ÿ๐ก๐ก, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient(ID, Name, E, C, F, P)

  4. Problem In ILP Framework Components โ–ช Background Knowledge B : Tuples from relations of source DB โ–ช (Constraints like Key Relationships, Column Data Types/Domain) โ–ช Positive Examples E + : Tuples from relation(s) in target DB โ–ช Negative Examples E - : Any Tuples not in relations of target DB (CWA) โ–ช Hypothesis H: Set of Schema Mapping Rules B = { Vegetable(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž11โ€œ), NutritionalValue(โ€ž11โ€œ, โ€ž16โ€œ, โ€ž2.7gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} E + = { VegetableIngredient(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} E - = { VegetableIngredient(โ€ž017โ€œ, โ€žBeef (minced)โ€œ, โ€ž208โ€œ, โ€ž20.5gโ€œ, โ€ž0gโ€œ, โ€ž14gโ€œ), VegetableIngredient(โ€ž032โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ}

  5. Problem In ILP Framework Components โ–ช Background Knowledge B : Tuples from relations of source DB โ–ช Positive Examples E + : Tuples from relation(s) in target DB โ–ช Negative Examples E - : Any Tuples not in relations of target DB โ–ช Hypothesis H: Set of Schema Mapping Rules B = { Vegetable(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž11โ€œ), NutritionalValue(โ€ž11โ€œ, โ€ž16โ€œ, โ€ž2.7gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} E + = { VegetableIngredient(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} E - = { VegetableIngredient(โ€ž017โ€œ, โ€žBeef (minced)โ€œ, โ€ž208โ€œ, โ€ž20.5gโ€œ, โ€ž0gโ€œ, โ€ž14gโ€œ), VegetableIngredient(โ€ž032โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} Solution Constraints Requirements โœ“ Posterior Sufficiency: ๐ถ โ‹€ H โŠจ ๐น + โœ“ Prior Satisfiability: ๐ถ โ‹€ ๐น โˆ’ โŠญ false (Produce all target tuples) โœ“ Prior Necessity: B โŠญ ๐น + โœ“ Posterior Satisfiability: ๐ถ โ‹€ H โ‹€ ๐น โˆ’ โŠญ false (Produce only target tuples)

  6. าง Problem In ILP Framework Components โ–ช Background Knowledge B : Tuples from relations of source DB โ–ช Positive Examples E + : Tuples from relation(s) in target DB โ–ช Negative Examples E - : Any Tuples not in relations of target DB โ–ช Hypothesis H: Set of Schema Mapping Rules Language Bias โ–ช GLAV (Global-And-Local-As-View) Schema Mappings (Source-to-Target Tuple-Generating Dependencies) โˆ€เดฅ ๐‘ฆ ๐œ’ ๐‘ฆ โ†’ โˆƒเดค ๐‘ง ๐œ”( าง ๐‘ฆ, เดค ๐‘ง)

  7. Example 2 (GLAV Mapping) T Person (ID,PName,CID1,EID) โˆง Employer (EID,EName,CID2 โˆง City (CID1,CName1,Loc1) โˆง City (CID2,CName2,Loc2) โ†’ โˆƒ CoID,GID1,GID2 : Commute (CoID,PName,GID1,GID2) S โˆง GeoLocation (GID1,Loc1) โˆง GeoLocation (GID2,Loc2)

  8. Interesting Problem Dimensions (Future Work) โ–ช Learn Mappings with Constants in Body: โžข AnimalProduct ID, Name, ๐Ÿ๐ก๐ก, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient(ID, Name, E, C, F, P) โ–ช Learn GLAV (not GAV) Mappings: โžข Person(ID,PName,CID1,EID) โˆง Employer(EID,EName,CID2 โˆง City(CID1,CName1,Loc1) โˆง City(CID2,CName2,Loc2) โ†’ โˆƒ CoID,GID1,GID2 : Commute(CoID,PName,GID1,GID2) โˆง GeoLocation(GID1,Loc1) โˆง GeoLocation(GID2,Loc2) โ–ช Learn Mappings with Functions: โžข Vegetable ID, Name, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient ID, Name, ๐  ๐… , C, F, P , ๐‘”: ๐‘“๐‘œ๐‘“๐‘ ๐‘•๐‘ง โ†ฆ ๐‘“๐‘œ๐‘“๐‘ ๐‘•๐‘ง โˆ˜ โ€ฒ๐‘™๐‘‘๐‘๐‘šโ€ฒ

  9. Challenges and Aspects โ–ช Scalability โ–ช At least O(100) different Tables on both sides โ–ช Broad Tables/Predicates (e.g. 80 Columnsโ€ฆ) โ–ช Long Tables (GigaBytes of Data) โ–ช Instance Ambiguity โ–ช Instances are String/Number constants โ–ช Language Bias โ–ช Atypical for ILP โ–ช Existential Quantifier in Head โ–ช Multiple Predicates in Head โ–ช Functions?

  10. Work So Far Variation of Problem โ–ช Fixed Schema โžข โˆ€๐‘ฆ 1 โ€ฆ ๐‘ฆ ๐‘œ : ๐‘ˆ ๐‘‡๐‘๐‘ฃ๐‘ ๐‘‘๐‘“ (๐‘ฆ 1 , โ€ฆ , ๐‘ฆ ๐‘œ ) โ†ฆ ๐‘ˆ ๐‘ˆ๐‘๐‘ ๐‘•๐‘“๐‘ข (๐‘” 1 ๐‘ฆ 1 , โ€ฆ , ๐‘” ๐‘œ (๐‘ฆ ๐‘œ )) โ–ช Transformation Functions on Values (Prefixing, Scaling, Trimming etc.) source target key key key user user user flag flag flag key key user user flag flag 0 000 mfink B 903 900 fink dehn C B 000 000 mfink mfink B B 0 1 001 apeter A 901 peter A 001 001 apeter apeter A A 904 fink B 1 2 002 002 002 fschn fschn fschn X X X 902 901 schn peter A X 2 3 003 003 003 cdehn cdehn cdehn C C C 903 900 fink dehn B C 3 4 004 004 004 hfink hfink hfink B B B 904 902 schn fink X B 4

  11. Work So Far Variation of Problem Table Source โ–ช ID TABNAME TABKEY FLAG Fixed Schema Domain Rules: โ–ช Offset Columns 38500 VBAP 8000000001908000001 V [MANDT] 800 -> 904 (Violation of First Normal-Form) 42100 VBAP 8000000001908000003 V [VBELN] .{2}x -> 77y [KUNNR] .{2}x -> 99x 1780 MARA 8000000054765 X 179 MARA 800000000000000000015 Y Field Rules: Table.ID x{0}* -> x Offset Rules: Table Target Table.TABKEY[TABNAME/FLAG] VBAP / V: ID TABNAME TABKEY FLAG MANDT[0-2]|VBELN[3-12] 385 VBAP 9047700001908000001 V MARA / X: MANDT[0-2]|KUNNR[3-12] 421 VBAP 9047700001908000003 V MARA / X: MANDT[0-2] 178 MARA 90499000054675 X 179 MARA 904000000000000000015 Y

  12. Work So Far Data Set โ–ช Artifical Transformation โ–ช 2 different Transformations defined on 12 domains โ–ช 168MB โ–ช 84 Tables with up to 28000 lines โ–ช Solvable in ~1 Minute โ–ช 26 Domain rules โ–ช 14 Key Mappings โ–ช 4 Fixed Value Transformations โ–ช 1 Trimming Transformation โ–ช 7 Mask Overlay Transformations โ–ช 1 Field rule โ–ช Disable domain rule โ–ช 5 Offset Rules โ–ช 1 Control Column โ–ช Up to 4 different Segmentations with up to 2 transformed domains

  13. Work So Far Example Solution โ€“ Domain Rules: KNUMB: IdTableTransformation KUNNR: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] TDOBNAME: FrontMaskTransformation.{NUMERIC} : xxy" -> "99y" ATINN: IdTableTransformation CLINT: IdTableTransformation ADRNR: IdTableTransformation SYBIN2: IdTableTransformation SYBIN1: FixedValueTransformation.{NUMERIC,TEXT} : x -> "0" USNAM: FixedValueTransformation.{NUMERIC,TEXT} : x -> "ANONYM" J_OBJNR: FrontMaskTransformation.{NUMERIC,TEXT} : xxxxy" - > "xx99yโ€ž AD_SO_KEY: IdTableTransformation FPLNR: IdTableTransformation

  14. Work So Far Example Solution โ€“ Domain Rules (cont): INT2: RearTrimTransformation.{NUMERIC,TEXT} : xxx[.]{3} -> xxx AD_ADDRNUM: IdTableTransformation VBELN: FrontMaskTransformation.{NUMERIC} : xxy" -> "99y" PARNR: IdTableTransformation CK_KALNR: IdTableTransformation PERNR: IdTableTransformation XUBNAME: FixedValueTransformation.{NUMERIC,TEXT} : x -> "ANONYM" MANDT: FixedValueTransformation.{NUMERIC,TEXT} : x -> "904" NA_OBJKEY: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] AD_PERSNUM: IdTableTransformation QMNUM: IdTableTransformation NA_PARNR: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] KUNDE: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000]

  15. Work So Far Example Solution (Offset Rules) โ–ช CDHDR.OBJECTID: [OBJECTCLAS] -> OBJECTID: [DEBI] KUNNR[0-10) | UnchangesSegments(11,..) [VERKBELEG] VBELN[0-10) | UnchangesSegments(11,..) [BANK] MANDT[0-3) | UnchangesSegments(4,..) [MATERIAL] UnchangedSegments

  16. Questions / Ideas?

Recommend


More recommend