Pattern Markup-Language
A tool for simplifying data extraction from semi-structured sources
Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley
Many Sites with Genealogical Data
Structural Patterns
Programmer Defined Regular Expressions
  Regular Expression A
  Regular Expression B
  Regular Expression C
Which Relationships Found?
  Death Date
  Birth Date
  Given Name
  Aliases
Simple Schema Represents Relationships
  Person
    Birth
      Date
    Death
      Date
    Names
      Given
      Aliases
Combine Schema and Regular Expressions
  Person
    Birth
      Date
    Death
      Date
    Names
      Given
      Aliases
  Regular Expression A
  Regular Expression B
  Regular Expression C
  Regular Expression D
  Tree Represented by XML = PatML
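The idea of a schema tree whose leaves carry regular expressions can be sketched in a few lines of Python. This is only an illustration: the element name `node`, the attributes `name` and `regex`, the sample text, and the patterns themselves are assumptions made for the example, not the actual PatML vocabulary.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical PatML-like document. Element and attribute names are
# illustrative assumptions; the real PatML syntax may differ.
PATML = r"""
<node name="Person">
  <node name="Birth"><node name="Date" regex="Born:\s*(.+)"/></node>
  <node name="Death"><node name="Date" regex="Died:\s*(.+)"/></node>
  <node name="Names">
    <node name="Given" regex="Name:\s*(.+)"/>
    <node name="Aliases" regex="Also known as:\s*(.+)"/>
  </node>
</node>
"""

# Made-up sample text standing in for a page from a genealogy site.
SAMPLE = """Name: Jane Ann
Also known as: Jennie
Born: 1 Jan 1850
Died: 2 Feb 1920
"""


def extract(node, text):
    """Return the captured text for a leaf, or a dict of children otherwise."""
    pattern = node.get("regex")
    if pattern is not None:
        match = re.search(pattern, text)
        return match.group(1).strip() if match else None
    # Inner node: recurse so the result mirrors the schema tree.
    return {child.get("name"): extract(child, text) for child in node}


root = ET.fromstring(PATML.strip())
record = {root.get("name"): extract(root, SAMPLE)}
print(record)
```

The result mirrors the schema: each inner node becomes a dictionary and each leaf holds whatever its regular expression captured, or `None` when nothing matched.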
PatML Generation Tools
  Schema Generator: establishes relationships
  PatML Editor: helps write the regular expressions and establish which facts they match
Using PatML Editor
  Get your schema file
  Browse for sample page
  Add nodes
  Add expressions
  See the highlights in source
  Adjust
PatML Editor Interface
  Tree representing PatML structure
  Text area with sample page source
  Browser with rendered sample page
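Highlighting matches in the page source, as the editor does, amounts to locating each match span and marking it. A minimal sketch follows, assuming a plain-text marker pair in place of real editor highlighting; the function name `highlight`, the markers, the sample page, and the date pattern are all made up for the example.

```python
import re


def highlight(source, pattern, open_mark="[[", close_mark="]]"):
    """Wrap every match of pattern in source with visible markers."""
    pieces = []
    last = 0
    for match in re.finditer(pattern, source):
        pieces.append(source[last:match.start()])  # text before the match
        pieces.append(open_mark + match.group(0) + close_mark)
        last = match.end()
    pieces.append(source[last:])  # trailing text after the final match
    return "".join(pieces)


# Made-up page fragment and a simple day-month-year date pattern.
page = "<td>Born: 1 Jan 1850</td><td>Died: 2 Feb 1920</td>"
marked = highlight(page, r"\d{1,2} [A-Z][a-z]{2} \d{4}")
print(marked)
```

Seeing exactly which characters an expression covers makes the "adjust" step concrete: the expression is edited until the markers surround the intended facts and nothing else.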
Fast and Versatile
  Regular sites can be integrated in hours
  Adaptable to any type of information
Implementation to Date
  Genesis uses PatML files to search a variety of sites
  Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers
  Standardizes information for a common data model
  Simultaneously searches other sites (in different formats) for people with similar information
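Standardizing records from differently formatted sites into one data model could look roughly like the sketch below. The `Person` fields, the site keys `site_a` and `site_b`, and the per-site field names are hypothetical; the slides do not describe how Genesis actually performs this step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    """Hypothetical common data model shared by all sites."""
    given_name: Optional[str] = None
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


# Assumed per-site field names mapped onto the common model.
FIELD_MAPS = {
    "site_a": {"Given": "given_name", "Born": "birth_date", "Died": "death_date"},
    "site_b": {"first_name": "given_name", "b": "birth_date", "d": "death_date"},
}


def standardize(site, raw):
    """Translate one site's extracted fields into a Person record."""
    mapping = FIELD_MAPS[site]
    fields = {mapping[key]: value.strip()
              for key, value in raw.items() if key in mapping}
    return Person(**fields)


a = standardize("site_a", {"Given": "Jane Ann ", "Born": "1 Jan 1850"})
b = standardize("site_b", {"first_name": "Jane Ann", "b": "1 Jan 1850",
                           "d": "2 Feb 1920"})
print(a)
print(b)
```

Once both records share the same fields, comparing people across sites reduces to comparing `Person` values, regardless of how each source page labelled its data.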
Results
  Produced PatML that correctly extracts data from TNG, RGDS, GedCom sites, and Kansas Gunslingers
  User interface provides an improved debugging environment
  About 1/10 the coding time with PatML generation tools compared to similarly functioning hand-coded parsers
Limitations
  Sites must be recognizable with regular expressions
  Even regular sites have page-to-page HTML variations
  Programmer error with regular expressions
  Regular expression operations can be slow
Future work
  Automatic regular expression generation
  Parsing links to extract data on connected pages
  Use in other applications and fields
  XPath approaches
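As a rough illustration of what an XPath approach might look like, the sketch below selects values by document structure instead of by text pattern. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree` and assumes a well-formed page; the table layout and the `class` values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment. Real HTML is often not valid XML
# and would need a tolerant parser before a query like this could run.
PAGE = """
<table>
  <tr><td class="label">Born</td><td class="value">1 Jan 1850</td></tr>
  <tr><td class="label">Died</td><td class="value">2 Feb 1920</td></tr>
</table>
"""

root = ET.fromstring(PAGE.strip())

# Select every <td> whose class attribute equals "value".
values = [cell.text for cell in root.findall(".//td[@class='value']")]
print(values)
```

A structural query like this does not depend on the exact text surrounding a value, which is one reason it could complement regular expressions on pages whose markup is consistent.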