
Comparison between historical population archives and decentralized databases



  1. Introduction Matching Verification Application Conclusion
     Comparison between historical population archives and decentralized databases
     Marijn Schraagen, Dionysius Huijsmans
     Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands
     LaTeCH Workshop 2013

  2. Research subject
     - Historical databases have increasingly become digitized
       - Census data, civil registry, church records, trade records, ...
       - Millions of interrelated records → historical social networks
       - However, network structure is not given
     - Alternative data sources: personal and local archives
       - Family trees, legal archives, ...
       - Small amount of information
       - Relations between records generally indicated and verified
     - Research goal: combine the information from different sources

  3. Outline: 1. Introduction, 2. Matching, 3. Verification, 4. Application, 5. Conclusion

  4. Motivation
     Links between (historical) records are important for a wide range of applications:
     - Data mining: graph traversal algorithms, community detection
     - Humanities: migration patterns, family size, occupational development
     - Linguistics: stability of spelling, morphology, phonetics
     - Onomastics: name inheritance, geographical name distribution

  5. Overview
     First match records from databases X and Y, then identify complementary or conflicting links.
     [Diagram: birth record X1 is matched against birth record Y1, and death record X2 against death record Y2. Link La connects X1 and X2, link Lb connects Y1 and Y2, and the two links are compared.]
     Example: if X1 = Y1 but X2 ≠ Y2, then either La or Lb or both are wrong.
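The consistency rule in the example on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_links`, the record identifiers, and the result labels are hypothetical, and record matches are assumed to be available as a dictionary from records in X to records in Y.

```python
def classify_links(link_a, link_b, matches):
    """Compare link La = (x1, x2) in database X with link Lb = (y1, y2) in database Y.

    `matches` maps record ids of X to the ids of their matched records in Y
    (hypothetical representation of the matching step's output).
    """
    x1, x2 = link_a
    y1, y2 = link_b
    if matches.get(x1) != y1:
        return "unrelated"    # the two links do not start at the same record
    if x2 not in matches:
        return "undecided"    # X2 has no match in Y, so nothing can be compared
    if matches[x2] == y2:
        return "consistent"   # X1 = Y1 and X2 = Y2
    return "conflicting"      # X1 = Y1 but X2 != Y2: La or Lb (or both) is wrong


matches = {"X1": "Y1", "X2": "Y3"}
print(classify_links(("X1", "X2"), ("Y1", "Y2"), matches))
```

With the sample dictionary above, X1 is matched to Y1 but X2 is matched to a record other than Y2, so the two links are reported as conflicting.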

  6. Data formats
     - Large-scale historical databases
       - Syntax usually structured: XML, SQL, comma-separated
       - Occasionally structured natural language is used
       - Semantics generally based on events: birth, marriage, baptism, change of ownership
       - Exception: census records
     - Family databases
       - Syntax often the legacy Gedcom format: hierarchical level numbers and tags
       - Semantics generally based on individuals and families

  7. Example historical databases
     - Genlias civil certificate database
       - Official registration of birth, marriage and death
       - The Netherlands, ∼1811-1920
       - 15 million certificates (events)
     - Gedcom family archive
       - Hand-compiled from various sources
       - Mostly northern part of the Netherlands, ∼1600-now
       - 1750 records (individuals and families)
     - Overlap: ∼1100 events, of which ∼600 births

  8. Data formats example
     Civil certificate:
       Type: birth certificate
       Serial number: 176
       Date: 16-05-1883
       Place: Wonseradeel
       Child: Sierk Rolsma
       Father: Sjoerd Rolsma
       Mother: Agnes Weldring
     Family archive:
       0 @F294@ FAM
       1 HUSB @I840@
       1 WIFE @I787@
       1 CHIL @I848@
       1 CHIL @I849@
       ...
       0 @I787@ INDI
       1 NAME Agnes/Welderink/
       ...
       0 @I849@ INDI
       1 NAME Sierk/Rolsma/
       1 BIRT
       2 DATE 16 MAY 1883
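To make the Gedcom layout concrete, the following Python sketch reads lines of the form "level, tag, value" and groups them under their level-0 record. It is an illustration only: `parse_gedcom` and the output structure are hypothetical, and the sample contains just the lines visible on the slide (the elided parts are left out, not filled in).

```python
SAMPLE = """\
0 @F294@ FAM
1 HUSB @I840@
1 WIFE @I787@
1 CHIL @I848@
1 CHIL @I849@
0 @I787@ INDI
1 NAME Agnes/Welderink/
0 @I849@ INDI
1 NAME Sierk/Rolsma/
1 BIRT
2 DATE 16 MAY 1883
"""


def parse_gedcom(text):
    """Group Gedcom lines by record: {xref: {"type": ..., "fields": [(level, tag, value)]}}."""
    records = {}
    current = None
    for raw in text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level = int(parts[0])
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            # Level-0 lines look like "0 @F294@ FAM": cross-reference id, then record type.
            current = {"type": value, "fields": []}
            records[parts[1]] = current
        elif current is not None:
            current["fields"].append((level, parts[1], value))
    return records


records = parse_gedcom(SAMPLE)
print(records["@I849@"]["fields"])
```

The level numbers carry the hierarchy: the `DATE` line at level 2 belongs to the `BIRT` line at level 1 directly above it.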


  10. Parser
      Grammar:
        birth  → [FAM:CHIL]:child, father, mother.
        child  → bdate, bplace, name.
        father → [FAM:HUSB]:name.
        mother → [FAM:WIFE]:name.
        bdate  → [INDI:BIRT:DATE].
        bplace → [INDI:BIRT:PLAC].
        name   → [INDI:NAME].
      Family archive:
        0 @F294@ FAM
        1 HUSB @I840@
        1 WIFE @I787@
        1 CHIL @I848@
        1 CHIL @I849@
        ...
        0 @I787@ INDI
        1 NAME Agnes/Welderink/
        ...
        0 @I849@ INDI
        1 NAME Sierk/Rolsma/
        1 BIRT
        2 DATE 16 MAY 1883
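One possible reading of the grammar is that each `[FAM:CHIL]` pointer yields one birth event, with the child's fields read from its `INDI` record and the parents' names reached through the `HUSB` and `WIFE` pointers. The Python sketch below hard-codes that reading; it is not the authors' parser. The `RECORDS` dictionary, the `field` helper, and the `births` function are hypothetical, and individuals elided on the slide (`@I840@`, `@I848@`) are deliberately absent, so their fields come out as `None`.

```python
# Hypothetical pre-parsed archive: tag paths such as "BIRT:DATE" map to values.
RECORDS = {
    "@F294@": {"type": "FAM", "HUSB": "@I840@", "WIFE": "@I787@",
               "CHIL": ["@I848@", "@I849@"]},
    "@I787@": {"type": "INDI", "NAME": "Agnes/Welderink/"},
    "@I849@": {"type": "INDI", "NAME": "Sierk/Rolsma/", "BIRT:DATE": "16 MAY 1883"},
}


def field(records, pointer, path):
    """Follow a pointer to an individual and read one tag path, or return None."""
    return records.get(pointer, {}).get(path)


def births(records):
    """birth -> [FAM:CHIL]:child, father, mother (one event per child pointer)."""
    events = []
    for fam in records.values():
        if fam.get("type") != "FAM":
            continue
        for child in fam.get("CHIL", []):
            events.append({
                "name": field(records, child, "NAME"),          # name   -> [INDI:NAME]
                "bdate": field(records, child, "BIRT:DATE"),    # bdate  -> [INDI:BIRT:DATE]
                "bplace": field(records, child, "BIRT:PLAC"),   # bplace -> [INDI:BIRT:PLAC]
                "father": field(records, fam.get("HUSB"), "NAME"),  # [FAM:HUSB]:name
                "mother": field(records, fam.get("WIFE"), "NAME"),  # [FAM:WIFE]:name
            })
    return events


for event in births(RECORDS):
    print(event)
```

The output is a flat, event-based record per child, which is the same shape as a civil birth certificate and can therefore be compared field by field.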

  11. Record similarity measure
      The parser provides uniform data for matching two records using similarity requirements for selected fields.
      Example: birth certificate similarity
      - Out of the four names of child and mother, at least two names are exactly equal.
      - The year of birth is equal, or the difference in year of birth is within a small margin and the edit distance between the names is below some threshold.
      If multiple candidates for matching a record are found, the candidate with the smallest edit distance is selected.
      Note that the definition is domain-specific.

  12. Matching example
      Birth certificate similarity (definition as on slide 11), applied to one record pair:
                Civil certificate    Family archive
        Date:   16-05-1883           16 MAY 1883
        Child:  Sierk Rolsma         Sierk Rolsma
        Mother: Agnes Weldring       Agnes Welderink
      Three out of four names are equal (Sierk, Rolsma, Agnes) and the year of birth is equal (1883) → match

  13. Matching results
      Birth matches using the birth certificate similarity measure (slide 11): 361/611 (59%)
      - Civil certificate database still in digitization phase
      - Family database contains many peripheral individuals for which parent names and birth date are unknown
      - Similarity measure could be improved
      Cf. results for marriage certificate matching: 154/176 (88%)

  14. Verification
      - Ideal case: gold standard
        - Generally not available for historical databases
      - Large variation in domain and data quality
        - Performance of matching algorithms obtained on one database is not indicative of performance on other databases
        - Unlike, e.g., newspaper archives, e-mail archives, co-author networks, ...
      - Possible solution: internal verification

  15. Internal verification
      - A similarity measure does not necessarily use all record fields for matching
      - Unused fields can provide a support level for a match
      - Example: the birth similarity measure used person names and year of birth
        - Location, exact date of birth, and serial number can be used for verification
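A support level of this kind could be computed by comparing the fields that the similarity measure did not use. The Python sketch below is an assumption-laden illustration: `support_pattern`, the record layout, and the `near_days` tolerance are hypothetical, and the "~" mark is taken here to mean that the two dates differ by at most a few days, which the slides do not define.

```python
from datetime import date


def support_pattern(x, y, near_days=7):
    """Return (serial, location, date) support marks for an already matched record pair."""
    def mark(a, b):
        # A missing value gives no support.
        return "+" if a is not None and a == b else "-"

    gap = abs((x["date"] - y["date"]).days)
    # "+" exact date, "~" near date (assumed tolerance), "-" otherwise.
    date_mark = "+" if gap == 0 else "~" if gap <= near_days else "-"
    return mark(x["serial"], y["serial"]), mark(x["location"], y["location"]), date_mark


certificate = {"serial": 176, "location": "Wonseradeel", "date": date(1883, 5, 16)}
archive = {"serial": None, "location": "Wonseradeel", "date": date(1883, 5, 16)}
print(support_pattern(certificate, archive))
```

In this example the archive record has no serial number, so only location and exact date support the match.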

  16. Verification results
      serial  location  date  dist    birth  marriage
      +       +         +               177        69
      +       –         +                31         2
      –       +         +                21        41
      –       +         ∼                33         0
      –       +         –                 7         2
      –       –         +                 3        10
      –       –         ∼                 6         2
      –       –         –     ≤ 3         4        20
      –       –         –     > 3        79         8
      total                             361       154

  17. Interpretation of support categories
      serial  location  date  dist    birth  mean % unique  interpretation
      +       +         +               177  100            ok
      +       –         +                31  100            ok
      –       +         +                21  99.1           ok
      –       +         ∼                33  98.7           ok
      –       –         +                 3  98.1           ok
      –       –         ∼                 6  94.4           likely ok
      –       +         –                 7  90.0           manual check
      –       –         –     ≤ 3         4  74.0           manual check
      –       –         –     > 3        79  74.0           incorrect
      total                             361
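The interpretation column can be applied mechanically as a lookup on the support pattern. The Python sketch below transcribes the categories from this slide; the function name `interpret`, the ASCII marks "+", "-" and "~", and the "unlisted" fallback for patterns that do not appear in the table are hypothetical.

```python
# (serial, location, date) -> interpretation, transcribed from the slide.
INTERPRETATION = {
    ("+", "+", "+"): "ok",
    ("+", "-", "+"): "ok",
    ("-", "+", "+"): "ok",
    ("-", "+", "~"): "ok",
    ("-", "-", "+"): "ok",
    ("-", "-", "~"): "likely ok",
    ("-", "+", "-"): "manual check",
}


def interpret(pattern, dist):
    """Map a support pattern and a name edit distance to an interpretation label."""
    if pattern == ("-", "-", "-"):
        # Without any support, the edit distance decides (threshold 3 on the slide).
        return "manual check" if dist <= 3 else "incorrect"
    return INTERPRETATION.get(pattern, "unlisted")


print(interpret(("-", "-", "~"), 0))
print(interpret(("-", "-", "-"), 5))
```

Only the unsupported pattern consults the edit distance; for every other listed pattern the unused fields alone determine the label.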

  18. Application: link comparison
      First match records from databases X and Y, then identify complementary or conflicting links.
      [Diagram: record X1 is matched against record Y1, and record X2 against record Y2. Link La connects X1 and X2, link Lb connects Y1 and Y2, and the two links are compared.]
      Application: compare links from the Gedcom family archive (given) to links between civil certificates (computed).

  19. Visualization tool
      [Figure: link tree of family records such as @F100@, @F171@ and @F15@, each node showing a date and the names of the persons involved.]
      A tool is developed to explore the link tree.
      Red and blue: matched certificates have differences.
