Handwriting Recognition for Genealogical Records - PowerPoint PPT Presentation

Handwriting Recognition for Genealogical Records. Luke Hutchison (lukeh@email.byu.edu) [Advisor: Dr. Tom Sederberg]


  1. Handwriting Recognition for Genealogical Records. Luke Hutchison (lukeh@email.byu.edu) [Advisor: Dr. Tom Sederberg]

  2. Handwriting Recognition
     • Two different fields:
     • Online Handwriting Recognition
       - The writer's pen movements are captured
       - Velocity, acceleration, stroke order available
     • Offline Handwriting Recognition
       - Page was previously-written and scanned
       - Only pixel color information available
     • Genealogical records are all offline
     • Offline is harder (less information is available)

  3. Handwriting Recognition
     • Can we just convert offline data into (simulated) online data?
     • Yes, although difficult to do reliably:
       - What order were the strokes written in?
       - Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins?
     • Especially difficult with genealogical records

  4. Handwriting Recognition
     • A successful approach must combine results from analysis of different domains, and at different levels of abstraction, e.g.
     • Discrete:
       - Stroke segmentation and ordering
       - Digraph frequency tables, lexicons
     • Continuous:
       - Letter shape analysis and matching

  5. Handwriting Recognition
     • An example of some common steps in the analysis process:
     • Contour extraction
     • Midline determination
     • Stroke ordering

  6. Handwriting Recognition
     • An example of some steps in the recognition process:
     • Handwriting style clustering
     • Letter recognition (nr? m?)
     • Approximate string matching (Smith, Smythe)
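
The slides do not say which approximate string matching method is used. As an illustration only, the sketch below uses the Levenshtein edit distance, a common choice, to measure how far apart two spellings such as Smith and Smythe are. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Smith", "Smythe"))  # one substitution plus one insertion
```

A small distance relative to the name length suggests that two readings may be variants of the same name.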

  7. HR for Genealogical Records
     • Image quality is not always good with microfilms
       - Fading of documents / microfilm
       - Ink-well pens
     • But documents were usually written meticulously
       - Older handwriting more regular; simpler to match
       - Different approach required

  8. The Approach
     • Outlines of word are traced and smoothed
     • Some common sources of variation (e.g. differences in slope) are automatically corrected for.
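
The slides do not describe how slope differences are corrected. One common way to remove letter slant is a horizontal shear applied to the traced outline points. The sketch below assumes that interpretation; the function name `deslant` is hypothetical, and the slant angle is supplied by the caller rather than estimated from the image.

```python
import math


def deslant(points, slant_degrees):
    """Shear outline points horizontally so that strokes leaning
    slant_degrees to the right of vertical become upright.

    points: iterable of (x, y) pairs, with y measured upward from the baseline.
    """
    k = math.tan(math.radians(slant_degrees))
    # A point at height y was displaced by k * y; subtract that displacement.
    return [(x - k * y, y) for x, y in points]


# A stroke leaning 45 degrees to the right is mapped onto a vertical line.
upright = deslant([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], 45.0)
```

Points on the baseline (y = 0) are left unchanged, so the word stays anchored where it was written.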

  9. The Approach
     • Robustly produce a characteristic “signature” for each letter

  10. The Approach
     • Find possible letter matches and determine possible readings (with accuracy of fit)
     • [Figure: candidate letters for each letter position]
     • => Williarw Suwkino (65%), ... , JiiUiom Oartums (1%)
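
One way to turn per-letter candidates into ranked readings is to multiply the fit scores of the chosen letters and sort the results. The sketch below assumes that scoring rule, which the slides do not state; the function `rank_readings` and the candidate scores are made up for illustration.

```python
from itertools import product
from math import prod


def rank_readings(candidates, top=3):
    """candidates: one list per letter position, each holding
    (letter, fit_score) pairs. Returns the `top` readings ordered
    by the product of their letter scores."""
    readings = []
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        score = prod(score for _, score in combo)
        readings.append((word, score))
    readings.sort(key=lambda r: r[1], reverse=True)
    return readings[:top]


# Hypothetical fit scores for a three-letter fragment.
candidates = [
    [("S", 0.9), ("O", 0.1)],
    [("u", 0.6), ("a", 0.4)],
    [("w", 1.0)],
]
for word, score in rank_readings(candidates):
    print(word, round(score, 2))
```

Enumerating every combination grows exponentially with word length, so a practical system would likely prune low-scoring candidates early.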

  11. The Approach
     • Error Correction: Letter digraph frequencies
       - E _  2.617%
       - E R  1.438%
       - N _  1.280%
       - A N  1.276%
       - _ S  1.212%
       - O N  1.207%
       - I N  1.187%
       - E N  1.174%
       - [...]
       - A W  0.075%
       - N K  0.074%
       - T L  0.071%
       - [...]
       - U W  0.000%
     • Suwkino --> Sawkino
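
The correction from Suwkino to Sawkino can be illustrated with a minimal sketch that compares digraph frequencies for competing letters at one position. The table holds only a few values copied from the slide. The helper names `best_letter` and `correct` are hypothetical, and looking only at the digraph formed with the following letter is a simplifying assumption.

```python
# Percent frequencies copied from the slide; a full table would cover every pair.
DIGRAPH_PCT = {"AW": 0.075, "UW": 0.000, "NK": 0.074, "TL": 0.071}


def best_letter(alternatives, next_letter, table=DIGRAPH_PCT):
    """Pick the alternative whose digraph with the following letter
    is most frequent. Digraphs missing from the table count as 0."""
    return max(alternatives,
               key=lambda c: table.get((c + next_letter).upper(), 0.0))


def correct(word, index, alternatives):
    """Replace word[index] with the best-scoring alternative letter.
    Assumes index is not the last position of the word."""
    chosen = best_letter(alternatives, word[index + 1])
    return word[:index] + chosen + word[index + 1:]


print(correct("Suwkino", 1, ["u", "a"]))  # "UW" scores 0.000, "AW" scores 0.075
```

A fuller version would also weigh the digraph formed with the preceding letter and combine the frequency with the letter's shape-matching score.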

  12. The Approach
     • Error Correction: Name Lexicon
     • Last names:
       - Smith     1.105%
       - Jones     0.817%
       - Williams  0.653%
       - Brown     0.371%
       - [...]
       - Sawkins   0.012%
     • First Names:
       - James     1.615%
       - John      1.203%
       - Robert    1.022%
       - Michael   0.971%
       - William   0.954%
     • => William Sawkins (95%)
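
A lexicon lookup of this kind can be sketched by picking the lexicon entry most similar to the raw reading, using name frequency to break ties. The similarity measure here is the ratio from Python's standard `difflib.SequenceMatcher`, which is a stand-in and not necessarily the method used in the presentation. The function `closest_name` is hypothetical, and the dictionaries contain only the entries shown on the slide.

```python
from difflib import SequenceMatcher

# Percent frequencies copied from the slide (partial lexicons).
LAST_NAMES = {"Smith": 1.105, "Jones": 0.817, "Williams": 0.653,
              "Brown": 0.371, "Sawkins": 0.012}
FIRST_NAMES = {"James": 1.615, "John": 1.203, "Robert": 1.022,
               "Michael": 0.971, "William": 0.954}


def closest_name(reading, lexicon):
    """Return the lexicon name most similar to the reading.
    Ties in similarity are broken in favor of the more frequent name."""
    def key(name):
        similarity = SequenceMatcher(None, reading.lower(), name.lower()).ratio()
        return (similarity, lexicon[name])
    return max(lexicon, key=key)


first = closest_name("Williarw", FIRST_NAMES)
last = closest_name("Sawkino", LAST_NAMES)
print(first, last)
```

This sketch does not compute a confidence figure such as the 95% on the slide; doing so would require combining the similarity, the name frequency, and the letter fit scores in a single model.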

  13. Conclusions
     • [Work in progress]
     • (Semi-) Automated extraction system could dramatically reduce extraction time
     • [Demo: Concept search engine...]
