
Pattern Markup-Language: A tool for simplifying data extraction from semi-structured sources


  1. Pattern Markup-Language: A tool for simplifying data extraction from semi-structured sources. Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley

  2. Many Sites with Genealogical Data


  5. Structural Patterns


  10. Programmer Defined Regular Expressions: Regular Expression A

  11. Programmer Defined Regular Expressions: Regular Expression B

  12. Programmer Defined Regular Expressions: Regular Expression C

  13. Which Relationships Found? Death Date, Birth Date, Given Name, Aliases

  14. Simple Schema Represents Relationships. Person: Birth Date, Death Date, Names (Given, Aliases)

  15. Combine Schema and Regular Expressions. Person: Birth Date, Death Date, Names (Given, Aliases), combined with Regular Expressions A, B, C, and D. Tree represented by XML = PatML
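
The idea on slides 14 and 15 (a schema tree whose nodes carry regular expressions, serialized as XML) can be sketched as follows. This is only an illustration: the slides do not show actual PatML syntax, so the `node` element, the `name` and `regex` attributes, the patterns, and the sample page are all assumptions made for the sketch, not the authors' format.

```python
import re
import xml.etree.ElementTree as ET


def build_pattern_tree() -> ET.Element:
    """Build the Person schema tree with a regex on each leaf (hypothetical format)."""
    person = ET.Element("node", name="Person")
    ET.SubElement(person, "node", name="BirthDate", regex=r"Born:\s*([^<]+)")
    ET.SubElement(person, "node", name="DeathDate", regex=r"Died:\s*([^<]+)")
    names = ET.SubElement(person, "node", name="Names")
    ET.SubElement(names, "node", name="Given", regex=r"Name:\s*([^<]+)")
    ET.SubElement(names, "node", name="Aliases", regex=r"Also known as:\s*([^<]+)")
    return person


def extract(node: ET.Element, page: str):
    """Walk the tree: leaves apply their regex, inner nodes collect their children."""
    regex = node.get("regex")
    if regex is not None:
        match = re.search(regex, page)
        return match.group(1).strip() if match else None
    return {child.get("name"): extract(child, page) for child in node}


# Invented sample page, not taken from any real site.
PAGE = (
    "Born: 1 Jan 1850<br>Died: 2 Feb 1900<br>"
    "Name: Jane Example<br>Also known as: J. Example<br>"
)

if __name__ == "__main__":
    tree = build_pattern_tree()
    print(ET.tostring(tree, encoding="unicode"))
    print(extract(tree, PAGE))
```

Keeping the expressions in the tree rather than in parser code means a new site needs only a new pattern file, which is the kind of saving the deck attributes to PatML.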


  20. PatML Generation Tools: Schema Generator. Establishes relationships. Diagram: the Person schema tree with Regular Expressions A, B, C, and D

  21. PatML Generation Tools: PatML Editor. Helps write the regular expressions and establish which facts they match. Diagram: the Person schema tree with Regular Expressions A, B, C, and D


  23. Using PatML Editor
      - Get your schema file
      - Browse for sample page
      - Add nodes
      - Add expressions
      - See the highlights in source
      - Adjust
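
The "add expressions, see the highlights in source, adjust" loop can be approximated with a small helper that reports where each expression matches in the page source. This is a rough sketch, not the PatML Editor's implementation; the function name `highlight_spans`, the labels, and the sample text are hypothetical.

```python
import re


def highlight_spans(source: str, expressions: dict) -> list:
    """Return sorted (start, end, label) spans for every match of every expression."""
    spans = []
    for label, expr in expressions.items():
        for match in re.finditer(expr, source):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


# Invented sample source and expressions.
SOURCE = "Born: 1 Jan 1850 Died: 2 Feb 1900"
EXPRESSIONS = {
    "birth": r"Born: \d+ \w+ \d{4}",
    "death": r"Died: \d+ \w+ \d{4}",
}

if __name__ == "__main__":
    for start, end, label in highlight_spans(SOURCE, EXPRESSIONS):
        print(label, repr(SOURCE[start:end]))
```

An expression that produces no span, or a span in the wrong place, is the cue to adjust it and run the check again.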

  24. PatML Editor Interface
      - Tree representing PatML structure
      - Text area with sample page source
      - Browser with rendered sample page


  26. Fast and Versatile
      - Regular sites can be integrated in hours
      - Adaptable to any type of information

  27. Implementation to Date
      - Genesis uses PatML files to search a variety of sites
      - Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers
      - Standardizes information for a common data model
      - Simultaneously searches other sites (in different formats) for people with similar information
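
Standardizing records from differently formatted sites into one data model might look like the sketch below. The `Person` fields follow the schema on the earlier slides. The site keys (`site_a`, `site_b`) and their raw field names are invented for illustration and do not describe how Genesis or any of the listed sites work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """Common data model shared by all sources."""
    given_name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


# Hypothetical mapping from each site's raw field names to the common model.
FIELD_MAPS = {
    "site_a": {"Name": "given_name", "Born": "birth_date", "Died": "death_date"},
    "site_b": {"fullname": "given_name", "b": "birth_date", "d": "death_date"},
}


def standardize(site: str, raw: dict) -> Person:
    """Rename a site's raw fields to the common model and trim whitespace."""
    mapping = FIELD_MAPS[site]
    fields = {mapping[key]: value.strip() for key, value in raw.items() if key in mapping}
    return Person(**fields)


if __name__ == "__main__":
    a = standardize("site_a", {"Name": " Jane Example ", "Born": "1 Jan 1850"})
    b = standardize("site_b", {"fullname": "Jane Example", "b": "1 Jan 1850"})
    print(a == b)
```

Once records share one shape, looking for people with similar information across sites reduces to comparing `Person` values.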

  28. Results

  29. Results
      - Produced PatML that correctly extracts data from TNG, RGDS, GedCom sites, and Kansas Gunslingers
      - User interface provides an improved debugging environment
      - About 1/10 the coding time with PatML generation tools compared to similarly functioning hand-coded parsers

  30. Limitations
      - Sites must be recognizable with regular expressions
      - Even regular sites have page-to-page HTML variations
      - Regular expressions are prone to programmer error
      - Regular expression operations can be slow
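
Two of these limitations, page-to-page HTML variation and slow matching, are commonly softened by writing tolerant patterns and compiling them once. The sketch below is a general regex technique, not something the slides describe; the pattern and both sample rows are invented.

```python
import re

# Compiled once and reused; tolerates attributes, letter case, and stray whitespace.
BORN = re.compile(
    r"<td[^>]*>\s*Born\s*</td>\s*<td[^>]*>\s*([^<]+?)\s*</td>",
    re.IGNORECASE,
)

# Two invented variants of the same row, as might appear on different pages.
VARIANT_1 = "<td>Born</td><td>1 Jan 1850</td>"
VARIANT_2 = '<TD class="lbl"> Born </TD>\n<TD align="left">1 Jan 1850 </TD>'


def birth_date(page: str):
    """Return the text of the cell following the 'Born' label, or None."""
    match = BORN.search(page)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(birth_date(VARIANT_1), birth_date(VARIANT_2))
```

Tolerance has a cost: the looser the pattern, the more likely it is to match something unintended, so checking it against several sample pages remains necessary.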

  31. Future work
      - Automatic regular expression generation
      - Parsing links to extract data on connected pages
      - Use in other applications and fields
      - XPath approaches
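
As a hint of what an XPath approach could look like, the sketch below selects a value by document structure instead of by text pattern. It uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so real HTML pages would first need an HTML parser. The sample table and the helper name `cell_after_label` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup.
PAGE = (
    "<table>"
    "<tr><td>Born</td><td>1 Jan 1850</td></tr>"
    "<tr><td>Died</td><td>2 Feb 1900</td></tr>"
    "</table>"
)


def cell_after_label(root: ET.Element, label: str):
    """Find the row with a cell whose text equals `label`; return its second cell."""
    cell = root.find(f".//tr[td='{label}']/td[2]")
    return cell.text.strip() if cell is not None and cell.text else None


if __name__ == "__main__":
    root = ET.fromstring(PAGE)
    print(cell_after_label(root, "Born"))
```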
