Automatic Extraction From and Reasoning About Genealogical Records: A Prototype
By Charla J. Woodbury,* David W. Embley,* Stephen W. Liddle**
*Department of Computer Science
**Information Systems Department
Brigham Young University
April 28, 2010

Digital Images – Human Index
• Large number of competing family history websites
  - Digital images
  - Human indexes
• Researchers hunting through records and indexes to put families together

Problem
• Large amounts of primary genealogical data
• Big projects to index and extract records
• Two independent indexers and adjudication
• Millions of human hours used to index or match records for names and families

Automated Extraction Solution
• Create a specialized extraction ontology to interpret and label genealogical data
• Add rules and logic that
  - Label family roles: husband, daughter, etc.
  - Link family relationships: HUSBAND – WIFE, PARENT – CHILD

Outline
1. Data Preparation
2. Ontology Extraction System (OntoES)
3. OWL File and SWRL Rules
4. SPARQL Queries
5. Experimental Results
6. Conclusions

1. Data Preparation
• Collect machine-readable records from three different countries
• Format as HTML for extraction
• Prepare lexicons for names, places, etc.

New England Vital Records – Beverly, Massachusetts 1668-1849

Danish Parish – Maglebye, Praesto 1646-1813

English Parish – South Petherton, Somersetshire 1574-1901

SOUTH PETHERTON MARRIAGES (from genuki)
same day 1576 Nicholas Patch and Christian Denman
26 Jan 1605 Richard Patch and Joan Lavor
25-Sep 1613 John Elliott and Joan Woodbery
7-Aug 1615 Thomas Prime and Maria Parry
29-Jan 1616 William Woodbery and Elizabeth Patch
2-May 1620 William Hillerd and Fortu: Patch
17-Sep 1622 Nicholas Patch and Elizabeth Owsley
22-Jan 1627 Richard Patch and Mary White
15-Jan 1630 Andrew Elliott and Joan Patch
12-Feb 1639 Andrew Elliott and Joan Pitts

2. Ontology Extraction System
OntoES: automatically interpret and correctly label genealogical data using
• Data frames
• Regular expressions
• Lexicons
• Date conversion methods

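To make the combination of regular expressions and lexicons concrete, here is a minimal Python sketch that labels one marriage line of the South Petherton kind. It is not OntoES or its data-frame format: the pattern, the tiny name lexicons, and the function names `label_name` and `parse_marriage` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny name lexicons (illustration only).
MALE_NAMES = {"nicholas", "richard", "john", "thomas", "william", "andrew"}
FEMALE_NAMES = {"joan", "maria", "elizabeth", "mary"}

# Assumed line shape: "<day>[- ]<month> <year> <name> and <name>".
MARRIAGE_LINE = re.compile(
    r"^(?P<day>\d{1,2})[- ](?P<month>[A-Za-z]+)\s+(?P<year>\d{4})\s+"
    r"(?P<first>.+?)\s+and\s+(?P<second>.+)$"
)


def label_name(full_name):
    """Label a name as NameM, NameF, or NameU from its given name."""
    given = full_name.split()[0].lower()
    if given in MALE_NAMES:
        return "NameM"
    if given in FEMALE_NAMES:
        return "NameF"
    return "NameU"  # not in either lexicon: gender unknown


def parse_marriage(line):
    """Return the labeled fields of one line, or None if it does not match."""
    m = MARRIAGE_LINE.match(line.strip())
    if m is None:
        return None
    names = [m["first"], m["second"]]
    return {
        "MarDate": f"{int(m['day'])} {m['month'].upper()} {m['year']}",
        "names": [(label_name(n), n) for n in names],
    }
```

A line such as "same day 1576 ..." does not fit this pattern and returns None; handling it needs the previous entry's date, which the sketch leaves out.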
[Figure: Marriage Ontology]

[Figure: Data Frame Editor]

Sample MONTH LEXICON
10ber, 7ber, 8ber, 9ber
apr, april, aprilis
aug, august, augusti, augustus
avr, avril, avrilis
dec, december, decembr, decembre, decembri
feb, febr, februari, february
jan, januarij, january
jul, juli, julius, july
jun, june

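A lexicon like this is only useful once each spelling resolves to a single month. The sketch below shows one way to do that lookup in Python. The dictionary `MONTH_LEXICON` and the function `month_number` are hypothetical names, and the mapping covers only the sample entries; it does not reproduce the prototype's lexicon file. The numeric forms follow the traditional abbreviations, where 7ber through 10ber stand for September through December.

```python
# Hypothetical mapping from sample lexicon spellings to month numbers.
MONTH_LEXICON = {
    "jan": 1, "januarij": 1, "january": 1,
    "feb": 2, "febr": 2, "februari": 2, "february": 2,
    "apr": 4, "april": 4, "aprilis": 4, "avr": 4, "avril": 4, "avrilis": 4,
    "jun": 6, "june": 6,
    "jul": 7, "juli": 7, "julius": 7, "july": 7,
    "aug": 8, "august": 8, "augusti": 8, "augustus": 8,
    # Numeric abbreviations: 7ber = September ... 10ber = December.
    "7ber": 9, "8ber": 10, "9ber": 11, "10ber": 12,
    "dec": 12, "december": 12, "decembr": 12, "decembre": 12, "decembri": 12,
}


def month_number(token):
    """Return the month number for a spelling, or None if it is unknown."""
    key = token.strip().strip(".").lower()
    return MONTH_LEXICON.get(key)
```

Returning None for an unknown spelling keeps the caller in charge of deciding whether to skip the record or flag it for review.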
[Figure: Object Level]

CONVERSION METHODS inside the ontology
• Regularize date (Julian format: YYYYddd)
  1620 2-May → 1620093
• Display stored Julian format as DD MMM YYYY
  1620093 → 2 MAY 1620

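The two conversions can be sketched in Python as follows. This is an illustration, not the prototype's conversion method: the function names are hypothetical, and the sketch assumes ddd is the plain day of the year under the leap-year rules of Python's `datetime` (proleptic Gregorian). Under that assumption 2 May 1620 is day 123, so the numbering shown on the slide may follow a different convention.

```python
from datetime import date, timedelta

MONTH_ABBR = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def to_yyyyddd(year, month, day):
    """Return the year followed by the three-digit day of the year."""
    day_of_year = date(year, month, day).timetuple().tm_yday
    return f"{year:04d}{day_of_year:03d}"


def from_yyyyddd(stamp):
    """Turn a YYYYddd string back into 'D MMM YYYY' for display."""
    year, day_of_year = int(stamp[:4]), int(stamp[4:])
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return f"{d.day} {MONTH_ABBR[d.month - 1]} {d.year}"
```

Storing dates in this fixed-width form makes them sortable as plain strings, which is one plausible reason to regularize before display.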
Feast Dates
• Fixed Dates
  Christmas 1720 → 25 DEC 1720
• Moveable Dates around Easter (36 possible Easter dates with leap year variation)
  1723 Dnica Septuagesima → 24 JAN 1723
• Same day as previous entry

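A moveable feast can be resolved by computing Easter for the year and applying a fixed offset. The Python sketch below uses the widely published anonymous Gregorian Easter algorithm and the fact that Septuagesima falls 63 days before Easter. The tables and the name `feast_date` are assumptions for illustration, not the prototype's method, and records kept under the Julian calendar or another Easter reckoning would need a different computation.

```python
from datetime import date, timedelta


def gregorian_easter(year):
    """Easter Sunday in the Gregorian calendar (anonymous algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Hypothetical lookup tables; a real lexicon would hold many more feasts.
FIXED_FEASTS = {"christmas": (12, 25)}
MOVEABLE_OFFSETS = {"septuagesima": -63, "easter": 0, "pentecost": 49}


def feast_date(name, year):
    """Resolve a feast name and year to a calendar date, or None."""
    key = name.lower()
    if key in FIXED_FEASTS:
        month, day = FIXED_FEASTS[key]
        return date(year, month, day)
    if key in MOVEABLE_OFFSETS:
        return gregorian_easter(year) + timedelta(days=MOVEABLE_OFFSETS[key])
    return None
```

With these tables, Septuagesima 1723 resolves to 24 January 1723 and Christmas 1720 to 25 December 1720, matching the two slide examples.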
Run Ontology
Input
• Ontology (created with OntoES)
• HTML data (Hypertext Markup Language)
Output
• RDF database (Resource Description Framework)
• OWL file (Web Ontology Language)

[Figure: Ontology Workbench]

Extracted Marriages
(columns: Bet Date, MarDate, NameM, NameF, NameU)

MarDate          NameM               NameF / NameU
same day 1576    Nicholas Patch      Christian Denman
26 JAN 1605      Richard Patch       Joan Lavor
26 SEP 1613      John Elliott        Joan Woodbery
7 AUG 1615       Thomas Prime        Maria Parry
29 JAN 1616      William Woodbery    Elizabeth Patch
2 MAY 1620       William Hillerd     Fortu: Patch
17 SEP 1622      Nicholas Patch      Elizabeth Owlsey

Sample RDF Triples
Person_10 | sameAs | Person_10
Person_10 | type | Thing
Person_10 | type | Person
NameU_0 | NameUValue | “Christian Denman”
NameU_0 | sameAs | NameU_0
NameU_0 | type | Thing
NameU_0 | type | NameU
NameM_4 | NameMValue | “Nicholas Patch”
NameM_4 | sameAs | NameM_4
NameM_4 | type | Thing
NameM_4 | type | NameM

OWL File

OWL HEADER
<owl:Class rdf:ID="MarriageRecord"/>
<owl:Class rdf:ID="Person"/>
<owl:Class rdf:ID="NameU"/>
<owl:DatatypeProperty rdf:ID="NameUValue">
  <rdfs:domain rdf:resource="#NameU"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>

PERSON - NAMEU
<owl:ObjectProperty rdf:ID="Person-NameU">
  <rdfs:domain rdf:resource="#Person"/>
  <rdfs:range rdf:resource="#NameU"/>
  <owl:inverseOf>
    <owl:ObjectProperty rdf:ID="NameU-Person"/>
  </owl:inverseOf>
</owl:ObjectProperty>

3. OWL File and SWRL Rules
• Define OWL Class
  Example – Husband
  <owl:Class rdf:ID="Husband"/>
• Define Rule
  Example – Person with male name is a Husband
  Person-NameM(?x,?y) -> Husband(?x)

Related Rules
If NameF is populated in the same marriage record, then the person with the NameU value is the Husband
  Person-NameU(?x,?y) ∧ Person-NameF(?w,?v) ∧
  MarriageRecord-Person(?z,?x) ∧ MarriageRecord-Person(?z,?w)
  -> Husband(?x)

HusbandOf Rule
  Husband(?x) ∧ Wife(?y) ∧
  MarriageRecord-Person(?z,?x) ∧ MarriageRecord-Person(?z,?y)
  -> HusbandOf(?x,?y)

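To show how rules of this shape derive new facts, here is a small forward-chaining sketch in Python over facts stored as tuples. It is not a SWRL engine and not the prototype's reasoner. The function name `infer` and the identifiers in the usage note are hypothetical, and the Wife rule (a person with a female name is a Wife) is assumed by analogy with the Husband rule.

```python
def infer(facts):
    """Apply three rules repeatedly until no new fact is derived.

    Facts are tuples such as ("Person-NameM", "Person_1", "NameM_1").
    """
    facts = set(facts)
    while True:
        new = set()
        for fact in facts:
            # Person-NameM(?x,?y) -> Husband(?x)
            if fact[0] == "Person-NameM":
                new.add(("Husband", fact[1]))
            # Assumed analog: Person-NameF(?x,?y) -> Wife(?x)
            if fact[0] == "Person-NameF":
                new.add(("Wife", fact[1]))
        # Husband(?x) ^ Wife(?y) ^ MarriageRecord-Person(?z,?x)
        #   ^ MarriageRecord-Person(?z,?y) -> HusbandOf(?x,?y)
        members = [(f[1], f[2]) for f in facts
                   if f[0] == "MarriageRecord-Person"]
        for record_x, x in members:
            for record_y, y in members:
                if (record_x == record_y
                        and ("Husband", x) in facts
                        and ("Wife", y) in facts):
                    new.add(("HusbandOf", x, y))
        if new <= facts:  # fixpoint: nothing new was derived
            return facts
        facts |= new
```

Given one record `MR_1` linking `Person_1` (with a NameM) and `Person_2` (with a NameF), the first pass derives Husband and Wife and the second pass derives HusbandOf(Person_1, Person_2).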
Auxiliary Name Rules
  NameM(?x) -> Name(?x)
  NameF(?x) -> Name(?x)
  NameU(?x) -> Name(?x)
  NameMValue(?x) -> NameValue(?x)
  NameFValue(?x) -> NameValue(?x)
  NameUValue(?x) -> NameValue(?x)
  Person-NameM(?x,?y) -> Person-Name(?x,?y)
  Person-NameF(?x,?y) -> Person-Name(?x,?y)
  Person-NameU(?x,?y) -> Person-Name(?x,?y)