Genealogical Place Name Normalization
Bob Leaman (bob.leaman@asu.edu)

What is meant by “Normalization”?
• Enforcing a standardized representation
• Increases accuracy
  • Data shared over e-mail can be very hard to correct
• Easier record linkage
  • Automated merging
  • Automated research
3 Apr 2003, Genealogical Place Name Normalization

What format to use?
• Fixed three-level
  • Mesa, Maricopa, Arizona
• Variable-level
  • Mesa, Maricopa, Arizona, United States
• Note absence of descriptors
  • “Of”, “Near”, etc.

The Problem
What kinds of deviations from the standard are common?
• Biographical notes
  • Johnsville, Arkansas. He had 6 children
• Addresses and e-mails
• Hospital, church and cemetery names
  • Bluff Cemetery, Elgin, Ill. → Elgin, Ill.
• Leaving out one or more of the levels
  • Vancouver, Washington → Vancouver, Clark, Washington, United States

The Problem
• Excluding the comma between two of the place names
  • San Leandro CA → San Leandro, CA
• Using an abbreviated, truncated, or alternate form of a place name
  • UT → Utah
  • Tenn → Tennessee
  • Holland → Netherlands
• Misspelling place names
  • Ypfilanti, Washtinaud, Michigan → Ypsilanti, Washtenaw, Michigan
• Algorithmic contractions such as removing all vowels after the first letter
  • Oxfrd → Oxford
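The vowel-removal contraction on this slide can be reversed by contracting every known name the same way and comparing the results. The sketch below assumes that approach; the function names and the list of known names are hypothetical and not taken from the presentation.

```python
VOWELS = set("aeiouAEIOU")


def contract(name: str) -> str:
    """Drop every vowel after the first character of each word."""
    words = []
    for word in name.split():
        words.append(word[:1] + "".join(c for c in word[1:] if c not in VOWELS))
    return " ".join(words)


def expand_contraction(piece: str, known_names: list[str]) -> list[str]:
    """Return the known names whose contraction equals the contracted input."""
    key = contract(piece).lower()
    return [name for name in known_names if contract(name).lower() == key]


# Hypothetical list of known names, for illustration only.
print(expand_contraction("Oxfrd", ["Oxford", "Orford", "Mesa"]))
```

Several names can share one contraction, so the function returns a list rather than a single answer.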

Strategy
• Preprocessing: remove everything that is not part of the place name
• Match against a name variations database (thesaurus)
• Match against standardized names database (gazetteer)

Preprocessing Place Names
• Use regular expressions to detect patterns
  • 38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas
    becomes
  • 38th year, Benedict, Kansas.
    becomes
  • Benedict, Kansas
• List of “note words” (e.g., occupations, causes of death)
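The two-step cleanup in the example above can be sketched as follows. This is a minimal illustration, not the system described in the talk: the regular expression, the rule that pieces containing digits are notes, and the contents of `NOTE_WORDS` are all assumptions made for the example.

```python
import re

# Hypothetical note words; the real list (occupations, causes of death, ...)
# is not given in the slides.
NOTE_WORDS = {"buried", "year", "children", "died", "farmer"}


def preprocess(raw: str) -> str:
    """Strip trailing notes and non-place pieces from a raw PLAC value."""
    # Step 1: treat everything after ". <Capital>" as a trailing note.
    first = re.split(r"\.\s+(?=[A-Z])", raw.strip(), maxsplit=1)[0]
    first = first.rstrip(". ")
    # Step 2: drop comma-separated pieces that contain digits or note words.
    kept = []
    for piece in (p.strip() for p in first.split(",")):
        words = set(re.findall(r"[a-z]+", piece.lower()))
        if piece and not re.search(r"\d", piece) and not (words & NOTE_WORDS):
            kept.append(piece)
    return ", ".join(kept)


print(preprocess("38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas"))
```

The sentence split is deliberately naive: a name such as "St. Louis" would be cut at the period, so a real rule set would need exceptions for abbreviations.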

Preprocessing Place Names
• Tested on 2450 randomly selected “PLAC” fields from 10 different GEDCOM files
• Each was preprocessed by hand: 58.4% required modification
• Preprocessing via the system matched preprocessing by hand 97.6% of the time

Handling Name Variations
• At this point all non-place name information has been removed
• Each place name is looked up in a database of alternate names (thesaurus)
  • Livonia, MI → {Livonia, MI & Livonia, Michigan}
• The original is included in case the wrong alternate was recorded originally
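The Livonia example can be sketched as a per-piece lookup followed by a cross product, with the original piece always kept as the first option. The `THESAURUS` dictionary here is a hypothetical stand-in for the alternate-names database.

```python
from itertools import product

# Hypothetical thesaurus entries, keyed by lower-cased variant.
THESAURUS = {"mi": ["Michigan"], "ut": ["Utah"], "tenn": ["Tennessee"]}


def expand(place: str) -> list[str]:
    """Return the original name plus every combination of known alternates."""
    pieces = [p.strip() for p in place.split(",")]
    # Each piece keeps its original form as the first option.
    options = [[p] + THESAURUS.get(p.lower().rstrip("."), []) for p in pieces]
    return [", ".join(combo) for combo in product(*options)]


print(expand("Livonia, MI"))
```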

Place Name Matching
• Created a place name database
  • Mostly GNIS data
  • Includes all of the United States and some of England and Canada
  • Nearly 160,000 places
• Database format
  • A single table was used to hold all place records
  • Utilized unique identifiers to point to the “parent” record
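A single self-referencing table of this kind might look like the sketch below, which uses an in-memory SQLite database. The table name, column names, and sample rows are assumptions for illustration; the slides do not give the actual schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a one-table gazetteer where each row points at its parent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE place ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " parent_id INTEGER REFERENCES place(id))"
    )
    conn.executemany(
        "INSERT INTO place VALUES (?, ?, ?)",
        [
            (1, "United States", None),
            (2, "Arizona", 1),
            (3, "Maricopa", 2),
            (4, "Mesa", 3),
        ],
    )
    return conn


def full_name(conn: sqlite3.Connection, place_id: int) -> str:
    """Follow parent pointers upward to build the variable-level name."""
    parts = []
    current = place_id
    while current is not None:
        row = conn.execute(
            "SELECT name, parent_id FROM place WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            break
        name, current = row
        parts.append(name)
    return ", ".join(parts)


print(full_name(build_demo_db(), 4))
```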

Place Name Matching
• Need to find the place name in the database that maximizes the “similarity” with respect to the input place name
  • 0 = no match
  • 1 = perfect match
• Calculated using the average “similarity” of the individual pieces of the place name
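The averaging step can be sketched as below. Two things are assumptions rather than the method from the talk: `difflib.SequenceMatcher` stands in for the per-piece similarity, and each input piece is paired with its best-scoring candidate piece, since the slides do not say how pieces are aligned.

```python
from difflib import SequenceMatcher
from statistics import mean


def piece_similarity(a: str, b: str) -> float:
    """Stand-in piece score in [0, 1]; identical pieces score 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def split_pieces(place: str) -> list[str]:
    return [p.strip() for p in place.split(",") if p.strip()]


def place_similarity(query: str, candidate: str) -> float:
    """Average, over the query pieces, of the best score against any candidate piece."""
    q, cand = split_pieces(query), split_pieces(candidate)
    if not q or not cand:
        return 0.0
    return mean([max(piece_similarity(p, c) for c in cand) for p in q])


def best_match(query: str, gazetteer: list[str]) -> tuple[str, float]:
    """Return the (name, score) pair with the highest place similarity."""
    scored = [(name, place_similarity(query, name)) for name in gazetteer]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical gazetteer entries, for illustration only.
GAZETTEER = [
    "Mesa, Maricopa, Arizona",
    "Salt Lake City, Salt Lake, Utah",
    "Vancouver, Clark, Washington",
]
print(best_match("Mesa, Arizna", GAZETTEER))
```

Because only the query pieces are averaged, an input that omits a level can still reach a score of 1, which suits the missing-level inputs described earlier.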

Place Name Matching
• Used the elements of the edit distance metric
  • Substitution, insertion, deletion
• Added transposition, length of the longest common substring & a measure of truncation
• Sorted through the several data points per potential match with a decision tree
  • Trained using the metric scores from a test set of place name pieces matched by hand
    • S Lk, Salt Lake, TRUE
  • Used the proportion of test cases that were matches in each leaf of the tree as the “similarity” score
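The sketch below shows one way the per-pair measurements and the leaf-proportion score could fit together. It is illustrative only: the tree in `leaf_of` is hand-written and hypothetical rather than learned, common-prefix length is an assumed stand-in for the unspecified truncation measure, and every training row except "S Lk, Salt Lake, TRUE" is made up.

```python
from collections import defaultdict


def _bump(cell, kind):
    """Copy a DP cell, adding one operation (1=sub, 2=ins, 3=del, 4=transpose)."""
    out = list(cell)
    out[0] += 1
    out[kind] += 1
    return tuple(out)


def edit_ops(src, dst):
    """Counts (subs, ins, dels, trans) of one minimal edit script from src to dst."""
    rows, cols = len(src) + 1, len(dst) + 1
    d = [[None] * cols for _ in range(rows)]  # cell = (total, subs, ins, dels, trans)
    for i in range(rows):
        d[i][0] = (i, 0, 0, i, 0)
    for j in range(cols):
        d[0][j] = (j, 0, j, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = src[i - 1] == dst[j - 1]
            options = [d[i - 1][j - 1] if same else _bump(d[i - 1][j - 1], 1)]
            options.append(_bump(d[i][j - 1], 2))
            options.append(_bump(d[i - 1][j], 3))
            if i > 1 and j > 1 and src[i - 1] == dst[j - 2] and src[i - 2] == dst[j - 1]:
                options.append(_bump(d[i - 2][j - 2], 4))
            d[i][j] = min(options)
    return d[-1][-1][1:]


def longest_common_substring(a, b):
    best, prev = 0, [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def common_prefix(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def features(src, dst):
    src, dst = src.lower(), dst.lower()
    subs, ins, dels, trans = edit_ops(src, dst)
    return {"subs": subs, "ins": ins, "dels": dels, "trans": trans,
            "lcs": longest_common_substring(src, dst),
            "prefix": common_prefix(src, dst)}


def leaf_of(f):
    """Hypothetical hand-written tree; a real tree would be learned from data."""
    if f["subs"] + f["dels"] + f["trans"] == 0:
        return "only-insertions"
    if f["subs"] + f["ins"] + f["dels"] + f["trans"] <= 2:
        return "few-edits"
    return "many-edits"


def train_leaf_scores(labelled):
    """Proportion of hand-labelled matches among the examples in each leaf."""
    hits, totals = defaultdict(int), defaultdict(int)
    for src, dst, is_match in labelled:
        leaf = leaf_of(features(src, dst))
        totals[leaf] += 1
        hits[leaf] += int(is_match)
    return {leaf: hits[leaf] / totals[leaf] for leaf in totals}


def similarity(src, dst, leaf_scores):
    return leaf_scores.get(leaf_of(features(src, dst)), 0.0)


# Only the first row comes from the slides; the rest are made-up examples.
TRAINING = [("S Lk", "Salt Lake", True), ("Tenn", "Tennessee", True),
            ("Oxfrd", "Oxford", True), ("La", "Salt Lake", False),
            ("Ypfilanti", "Ypsilanti", True), ("Utah", "Ohio", False)]
print(train_leaf_scores(TRAINING))
```

Scoring by leaf proportion means every pair that lands in the same leaf gets the same similarity, so the usefulness of the score depends on how finely the learned tree separates matches from non-matches.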

Place Name Matching
• Tested on 330 randomly selected “PLAC” fields from 10 different GEDCOM files
• Each was preprocessed and matched by hand: 99.1% required modification after preprocessing
• The first-ranked match was the same as the match found by hand 97.9% of the time
• The average rank of the match generated by hand was 1.21
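The two figures reported here, first-rank agreement and average rank of the hand match, could be computed as in this sketch. The data layout and the sample results are hypothetical, and the code assumes the hand match always appears somewhere in the ranked list.

```python
def evaluate(results):
    """results: list of (ranked_candidates, hand_match) pairs.

    Returns (fraction ranked first, average 1-based rank of the hand match).
    """
    ranks = [ranked.index(hand) + 1 for ranked, hand in results]
    top1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return top1, sum(ranks) / len(ranks)


# Made-up results for illustration; not the data behind the slide.
SAMPLE = [
    (["Mesa, Maricopa, Arizona", "Mesa, Mesa, Colorado"], "Mesa, Maricopa, Arizona"),
    (["Livonia, Wayne, Michigan"], "Livonia, Wayne, Michigan"),
    (["Elgin, Kane, Illinois", "Elgin, Cook, Illinois"], "Elgin, Kane, Illinois"),
    (["Oxford, Butler, Ohio", "Oxford, Lafayette, Mississippi"], "Oxford, Lafayette, Mississippi"),
]
print(evaluate(SAMPLE))
```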

Future Directions
• Recognize when the best match is not satisfactory
• Acquisition of a suitable thesaurus and gazetteer
  • Alexandria Digital Library Project
  • Historical place information
• Increased productization
  • Indexing scheme
  • Internationalization

Questions?
• Reference: K. Kukich. Techniques for Automatically Correcting Words in Text. Computing Surveys, 24(4):377-440, Dec. 1992.
