 
Handwriting Recognition for Genealogical Records
Luke Hutchison (lukeh@email.byu.edu)
[Advisor: Dr. Tom Sederberg]

Handwriting Recognition
• Two different fields:
  • Online Handwriting Recognition
    - The writer's pen movements are captured
    - Velocity, acceleration, stroke order available
  • Offline Handwriting Recognition
    - Page was previously written and scanned
    - Only pixel color information available
• Genealogical records are all offline
• Offline is harder (less information is available)
[Figure: handwritten sample, "Mary"]

Handwriting Recognition
• Can we just convert offline data into (simulated) online data?
• Yes, although difficult to do reliably:
  - What order were the strokes written in?
  - Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins?
• Especially difficult with genealogical records

Handwriting Recognition
• A successful approach must combine results from analysis of different domains, and at different levels of abstraction, e.g.
  • Discrete:
    - Stroke segmentation and ordering
    - Digraph frequency tables, lexicons
  • Continuous:
    - Letter shape analysis and matching

Handwriting Recognition
• An example of some common steps in the analysis process:
  • Contour extraction
  • Midline determination
  • Stroke ordering

Handwriting Recognition
• An example of some steps in the recognition process:
  • Handwriting style clustering
  • Letter recognition (e.g. "nr"? or "m"?)
  • Approximate string matching (e.g. Smith / Smythe)

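The slide does not say which approximate string matching method is used. As a minimal sketch, assuming plain Levenshtein edit distance (a common choice, not necessarily the author's), the Smith / Smythe pair can be compared like this. The function name `levenshtein` is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("Smith", "Smythe"))
```

A small distance relative to the name length suggests that two spellings may refer to the same surname; where to put the threshold is a tuning decision.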
HR for Genealogical Records
• Image quality is not always good with microfilms
  - Fading of documents / microfilm
  - Ink-well pens
• But documents were usually written meticulously
  - Older handwriting more regular; simpler to match
  - Different approach required

The Approach
• Outlines of word are traced and smoothed
• Some common sources of variation (e.g. differences in slope) are automatically corrected for.

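The slide does not describe how slope is corrected. One possible reading, sketched here under stated assumptions, is a horizontal shear applied to the traced outline points so that slanted strokes become upright. The function `deslant`, the coordinate convention (y measured upward from the baseline), and the idea that the slant angle is already known are all assumptions for illustration.

```python
import math


def deslant(points, slant_degrees):
    """Shear outline points horizontally to undo a known slant.

    points: iterable of (x, y) pairs, y measured up from the baseline.
    slant_degrees: stroke lean from vertical (positive = leaning right).
    """
    shear = math.tan(math.radians(slant_degrees))
    # Each point moves left in proportion to its height, so the baseline stays fixed.
    return [(x - y * shear, y) for x, y in points]


# A stroke leaning 45 degrees to the right becomes (approximately) vertical.
stroke = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(deslant(stroke, 45.0))
```

In practice the slant angle would have to be estimated from the image first; that step is not covered on the slide and is left out here.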
The Approach
• Robustly produce a characteristic "signature" for each letter

The Approach
• Find possible letter matches and determine possible readings (with accuracy of fit)
[Figure: candidate letters matched at each position of the handwritten name]
=> Williarw Suwkino (65%), ... , JiiUiom Oartums (1%)

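How the readings and their fit percentages are computed is not given on the slide. A minimal sketch, assuming each letter position has a short list of candidate letters with a fit score between 0 and 1 and that a reading's score is the product of its letters' scores, could enumerate and rank readings as follows. The function `rank_readings` and the sample scores are hypothetical.

```python
from itertools import product
from math import prod


def rank_readings(candidates, top_n=3):
    """candidates: one list per letter position of (letter, fit) pairs.
    Returns the top_n (reading, score) pairs, best first."""
    readings = []
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        score = prod(fit for _, fit in combo)  # assumed: fits multiply
        readings.append((word, score))
    readings.sort(key=lambda r: r[1], reverse=True)
    return readings[:top_n]


# Hypothetical fit scores for the first three letter positions of a surname.
positions = [
    [("S", 0.9), ("O", 0.1)],
    [("u", 0.6), ("a", 0.4)],
    [("w", 1.0)],
]
for word, score in rank_readings(positions):
    print(word, round(score, 2))
```

Exhaustive enumeration grows exponentially with word length, so a real system would likely prune unlikely paths; this sketch only illustrates the ranking idea.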
The Approach
• Error Correction: Letter digraph frequencies
  - E_   2.617%
  - ER   1.438%
  - N_   1.280%
  - AN   1.276%
  - _S   1.212%
  - ON   1.207%
  - IN   1.187%
  - EN   1.174%
  - [...]
  - AW   0.075%
  - NK   0.074%
  - TL   0.071%
  - [...]
  - UW   0.000%
Suwkino --> Sawkino

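The digraph table can be used to compare candidate readings. The sketch below is one way to do that, not necessarily the author's: it sums log-frequencies over a word's digraphs, using only the values listed on the slide. It assumes "_" marks a word boundary, and it uses a hypothetical default (`unseen`) for digraphs elided from the table and a small floor (`eps`) so a 0.000% entry does not produce log(0).

```python
import math

# Percentages copied from the slide; all other digraphs are elided there.
DIGRAPH_PERCENT = {
    "E_": 2.617, "ER": 1.438, "N_": 1.280, "AN": 1.276,
    "_S": 1.212, "ON": 1.207, "IN": 1.187, "EN": 1.174,
    "AW": 0.075, "NK": 0.074, "TL": 0.071, "UW": 0.000,
}


def digraph_log_score(word, table=DIGRAPH_PERCENT, unseen=0.01, eps=1e-6):
    """Sum of log digraph frequencies; higher means more plausible.

    unseen: assumed percentage for digraphs missing from the table.
    eps: floor applied to zero-frequency entries.
    """
    padded = "_" + word.upper() + "_"  # assumed: "_" is a word boundary
    total = 0.0
    for first, second in zip(padded, padded[1:]):
        pct = table.get(first + second, unseen)
        total += math.log(max(pct, eps))
    return total


for reading in ("Suwkino", "Sawkino"):
    print(reading, round(digraph_log_score(reading), 2))
```

Because UW has a frequency of 0.000% while AW is 0.075%, "Sawkino" scores higher than "Suwkino", which matches the correction shown on the slide.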
The Approach
• Error Correction: Name Lexicon
• Last names:
  - Smith      1.105%
  - Jones      0.817%
  - Williams   0.653%
  - Brown      0.371%
  - [...]
  - Sawkins    0.012%
• First names:
  - James      1.615%
  - John       1.203%
  - Robert     1.022%
  - Michael    0.971%
  - William    0.954%
=> William Sawkins (95%)

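The slide does not state how a reading is matched against the lexicon or how the 95% figure is derived. As a rough sketch, assuming similarity is measured with Python's `difflib.SequenceMatcher` and that name frequency is used only to break ties, a lookup could be written as below. The function `correct_with_lexicon` and the `cutoff` value are hypothetical; the lexicon entries are the ones listed on the slide.

```python
from difflib import SequenceMatcher

# Frequencies (percent) as listed on the slide; elided entries are omitted.
LAST_NAMES = {"Smith": 1.105, "Jones": 0.817, "Williams": 0.653,
              "Brown": 0.371, "Sawkins": 0.012}
FIRST_NAMES = {"James": 1.615, "John": 1.203, "Robert": 1.022,
               "Michael": 0.971, "William": 0.954}


def correct_with_lexicon(reading, lexicon, cutoff=0.6):
    """Return the lexicon name most similar to reading, or None if
    nothing reaches the cutoff. Ties go to the more frequent name."""
    def key(name):
        ratio = SequenceMatcher(None, name.lower(), reading.lower()).ratio()
        return (ratio, lexicon[name])

    best = max(lexicon, key=key)
    return best if key(best)[0] >= cutoff else None


print(correct_with_lexicon("Williarw", FIRST_NAMES),
      correct_with_lexicon("Sawkino", LAST_NAMES))
```

A fuller system would probably combine the similarity, the name frequency, and the letter-fit scores into a single confidence value; this sketch keeps them separate for clarity.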
Conclusions
• [Work in progress]
• (Semi-) Automated extraction system could dramatically reduce extraction time
• [Demo: Concept search engine...]