Handwriting Recognition for Genealogical Records
Luke Hutchison (lukeh@email.byu.edu)
[Advisor: Dr. Tom Sederberg]
Handwriting Recognition
• Two different fields:
  • Online Handwriting Recognition
    - The writer's pen movements are captured
    - Velocity, acceleration, stroke order available
  • Offline Handwriting Recognition
    - Page was previously written and scanned
    - Only pixel color information available
• Genealogical records are all offline
• Offline is harder (less information is available)
[Figure: handwriting sample "Mary"]
Handwriting Recognition
• Can we just convert offline data into (simulated) online data?
• Yes, although difficult to do reliably:
  - What order were the strokes written in?
  - Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins?
• Especially difficult with genealogical records
Handwriting Recognition
• A successful approach must combine results from analysis of different domains, and at different levels of abstraction, e.g.
  • Discrete:
    - Stroke segmentation and ordering
    - Digraph frequency tables, lexicons
  • Continuous:
    - Letter shape analysis and matching
Handwriting Recognition
• An example of some common steps in the analysis process:
  • Contour extraction
  • Midline determination
  • Stroke ordering
Handwriting Recognition
• An example of some steps in the recognition process:
  • Handwriting style clustering
  • Letter recognition (e.g. "nr"? or "m"?)
  • Approximate string matching (e.g. Smith / Smythe)
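The approximate string matching step can be illustrated with plain Levenshtein edit distance. This is only a sketch of one common measure; the slides do not say which matching method is actually used, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Smith", "Smythe"))  # one substitution plus one insertion
```

A small distance relative to the name length suggests that two spellings may refer to the same surname.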
HR for Genealogical Records
• Image quality is not always good with microfilms
  - Fading of documents / microfilm
  - Ink-well pens
• But documents were usually written meticulously
  - Older handwriting more regular; simpler to match
  - Different approach required
The Approach
• Outlines of word are traced and smoothed
• Some common sources of variation (e.g. differences in slope) are automatically corrected for.
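One way to picture slope correction is as a horizontal shear applied to the traced outline points. The sketch below assumes that "slope" means the slant of the letters away from vertical and that the slant angle has already been estimated; neither assumption is confirmed by the slides, and `deslant` is a hypothetical helper name.

```python
import math


def deslant(points, slant_degrees):
    """Shear (x, y) outline points horizontally so that strokes leaning
    slant_degrees away from vertical become upright.
    Assumes y is measured upward from the baseline."""
    shear = math.tan(math.radians(slant_degrees))
    return [(x - y * shear, y) for x, y in points]


# A stroke leaning 45 degrees to the right: its top point moves back over its base.
upright = deslant([(0.0, 0.0), (1.0, 1.0)], 45)
```

Points on the baseline (y = 0) stay where they are, so the word keeps its position while the lean is removed.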
The Approach
• Robustly produce a characteristic “signature” for each letter
The Approach
• Find possible letter matches and determine possible readings (with accuracy of fit)
[Figure: candidate letter matches for each position in the written name]
=> Williarw Suwkino (65%), ... , JiiUiom Oartums (1%)
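Turning per-letter candidates into ranked readings can be sketched as follows. The candidate letters and fit scores here are invented for illustration, and multiplying the per-letter scores is an assumption; the slides do not state how the accuracy of fit is combined.

```python
from itertools import product
from math import prod


def rank_readings(candidates, top=3):
    """candidates: one list per letter position, each holding
    (letter, fit_score) pairs. Returns the `top` best (reading, score)
    pairs, scoring a reading by the product of its letter scores."""
    readings = []
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        score = prod(score for _, score in combo)
        readings.append((word, score))
    readings.sort(key=lambda reading: reading[1], reverse=True)
    return readings[:top]


# Hypothetical candidates for a two-letter fragment.
example = [[("S", 0.9), ("O", 0.1)], [("u", 0.6), ("a", 0.4)]]
for word, score in rank_readings(example):
    print(word, round(score, 2))
```

Exhaustive enumeration grows exponentially with word length, so a real system would need to prune unlikely combinations early.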
The Approach
• Error Correction: Letter digraph frequencies
  E _   2.617%
  E R   1.438%
  N _   1.280%
  A N   1.276%
  _ S   1.212%
  O N   1.207%
  I N   1.187%
  E N   1.174%
  [...]
  A W   0.075%
  N K   0.074%
  T L   0.071%
  [...]
  U W   0.000%
Suwkino --> Sawkino
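The digraph table can be used to compare competing readings, as in this sketch. Only the frequencies shown on the slide are included; the log-sum scoring and the floor value are assumptions, and digraphs missing from this partial table are treated like zero-frequency ones, which a full table would avoid.

```python
import math

# Frequencies in percent, copied from the slide ("_" as shown there).
DIGRAPH_PERCENT = {
    "E_": 2.617, "ER": 1.438, "N_": 1.280, "AN": 1.276, "_S": 1.212,
    "ON": 1.207, "IN": 1.187, "EN": 1.174,
    "AW": 0.075, "NK": 0.074, "TL": 0.071, "UW": 0.000,
}
FLOOR_PERCENT = 1e-4  # hypothetical floor so that log() is always defined


def digraph_score(word, table=DIGRAPH_PERCENT, floor=FLOOR_PERCENT):
    """Sum of log frequencies over adjacent letter pairs; higher is more plausible."""
    letters = word.upper()
    pairs = [letters[i:i + 2] for i in range(len(letters) - 1)]
    return sum(math.log(max(table.get(pair, floor), floor)) for pair in pairs)


# "AW" is rare but attested while "UW" has frequency 0.000%,
# so the reading with "aw" scores higher.
print(digraph_score("Sawkino") > digraph_score("Suwkino"))
```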
The Approach
• Error Correction: Name Lexicon
  • Last names:
    Smith      1.105%
    Jones      0.817%
    Williams   0.653%
    Brown      0.371%
    [...]
    Sawkins    0.012%
  • First Names:
    James      1.615%
    John       1.203%
    Robert     1.022%
    Michael    0.971%
    William    0.954%
=> William Sawkins (95%)
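Lexicon-based correction can be sketched by snapping each recognized word to its most similar lexicon entry. The example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure and breaks ties by name frequency; both choices, the `correct` helper, and the 0.6 cutoff are assumptions rather than the method behind the slides. The lexicons contain only the entries listed above.

```python
from difflib import SequenceMatcher

# Name frequencies in percent, copied from the slide (partial lists).
LAST_NAMES = {"Smith": 1.105, "Jones": 0.817, "Williams": 0.653,
              "Brown": 0.371, "Sawkins": 0.012}
FIRST_NAMES = {"James": 1.615, "John": 1.203, "Robert": 1.022,
               "Michael": 0.971, "William": 0.954}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def correct(word, lexicon, cutoff=0.6):
    """Return the most similar lexicon name (ties broken by frequency),
    or the word unchanged if nothing is similar enough."""
    best = max(lexicon, key=lambda name: (similarity(word, name), lexicon[name]))
    return best if similarity(word, best) >= cutoff else word


print(correct("Williarw", FIRST_NAMES), correct("Sawkino", LAST_NAMES))
```

Keeping the word unchanged below the cutoff avoids forcing a rare but genuine name onto a common lexicon entry.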
Conclusions
• [Work in progress]
• (Semi-) Automated extraction system could dramatically reduce extraction time
• [Demo: Concept search engine...]