Noisy text in Google Books OCR in Google Research Challenges
Methods compared against
• OCR-confidence output, using several versions of a commercial engine
• HEUR-T, HEUR-K: heuristics by Taghva ’01; Kulp ’07
• Dictionaries: extract vocabulary files from Web data. Use the most frequent N terms, where N ranges from 1K to 1M
• Hard Dictionary (HDL, HDM): penalize the passage by C1 for each OOV term; penalize by C2 for each punctuation-or-symbol tokenized as a singleton
• Soft Dictionary (SDL, SDM): for each term in a passage, find the edit distance to a dictionary word (or C2 for each punctuation-or-symbol tokenized as a singleton). The penalty for the passage is the total edit distance divided by the passage length in Unicode code points
Ashok C. Popat September 17, 2011 22 / 73
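The two dictionary penalties can be sketched roughly as follows. This is a minimal illustration, not the implementation behind the slides: the tokenization, the values of C1 and C2, the choice of the nearest dictionary word for the soft penalty, and the rule that a symbol singleton incurs only C2 are all assumptions, and every function name is hypothetical.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def is_symbol_singleton(token: str) -> bool:
    """True for a one-character token in a Unicode punctuation or symbol category."""
    return len(token) == 1 and unicodedata.category(token)[0] in "PS"


def hard_penalty(tokens, vocab, c1=1.0, c2=0.5):
    """Hard dictionary: C1 per OOV term, C2 per symbol singleton (values assumed)."""
    total = 0.0
    for tok in tokens:
        if is_symbol_singleton(tok):
            total += c2
        elif tok not in vocab:
            total += c1
    return total


def soft_penalty(tokens, vocab, passage, c2=0.5):
    """Soft dictionary: summed distance to the nearest vocabulary word (assumed),
    normalized by the passage length in code points (len() of a Python str)."""
    total = 0.0
    for tok in tokens:
        if is_symbol_singleton(tok):
            total += c2
        else:
            total += min(edit_distance(tok, w) for w in vocab)
    return total / len(passage)
```

The linear scan over the vocabulary in `soft_penalty` is only workable for toy dictionaries; at 50K to 1M terms some indexed nearest-word lookup would be needed.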
Comparison among methods
Table: All considered languages (approx. 30)

condition     τ̄       N      s²      95% CI
intra-rater   0.790   522    0.050   (0.770, 0.809)
inter-rater   0.668   3056   0.087   (0.658, 0.679)
OCR-conf      0.263   2610   0.216   (0.245, 0.280)
HEUR-T        0.339   2610   0.146   (0.325, 0.354)
HEUR-K        0.381   2610   0.149   (0.367, 0.396)
SEQ           0.600   2610   0.090   (0.589, 0.612)
SPA           0.665   2610   0.086   (0.654, 0.676)
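The reported intervals appear consistent with a normal-approximation confidence interval, mean ± 1.96·sqrt(s²/N), if s² is read as the sample variance of τ. The slides do not state the formula, so the check below is an inference from the numbers rather than the author's stated method; `normal_ci` is a hypothetical helper name.

```python
import math


def normal_ci(mean, variance, n, z=1.96):
    """Normal-approximation interval: mean +/- z * sqrt(variance / n)."""
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


# (mean tau, N, s^2) copied from two rows of the table.
rows = {
    "intra-rater": (0.790, 522, 0.050),
    "SPA": (0.665, 2610, 0.086),
}

for name, (mean, n, variance) in rows.items():
    low, high = normal_ci(mean, variance, n)
    print(f"{name}: ({low:.3f}, {high:.3f})")
```

Because the table values are already rounded to three decimals, the recomputed bounds can differ from the printed ones in the last digit.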
Comparison among methods (continued)
Table: Eleven intersection languages

condition     τ̄       N      s²      95% CI
intra-rater   0.803   291    0.052   (0.777, 0.829)
inter-rater   0.665   1895   0.093   (0.651, 0.679)
OCR-conf      0.251   1455   0.239   (0.226, 0.276)
HEUR-T        0.375   1455   0.135   (0.356, 0.394)
HEUR-K        0.428   1455   0.141   (0.408, 0.447)
HDM1M         0.516   1455   0.111   (0.499, 0.533)
SDM50K        0.586   1455   0.106   (0.570, 0.603)
SEQ           0.607   1455   0.094   (0.592, 0.623)
SPA           0.670   1455   0.087   (0.655, 0.686)
Application to e-book readers
• For a given paragraph in an e-book, is it better to render the text or swap in the image?
Application to mobile device OCR
• Can we select only the good OCR text from a given image region?
• Viterbi search:
  • Two states: garbage and clean
  • Scores computed as described, plus transition costs
  • Transitions discounted based on image distance between symbols
• About 30 languages enabled; language not set in advance
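A two-state Viterbi search of this kind might look like the sketch below. The per-symbol costs, the base switching cost, and the particular way the switching cost shrinks with image distance are all assumptions made for illustration; the slides give none of these details, and the function and parameter names are hypothetical.

```python
def viterbi_clean_garbage(scores, gaps, switch_cost=2.0, gap_scale=10.0):
    """Label each symbol "clean" or "garbage" by minimum total cost.

    scores[i] = (cost_if_clean, cost_if_garbage) for symbol i; lower is better.
    gaps[i]   = image distance between symbol i and symbol i + 1.
    Switching states costs switch_cost, discounted as the gap grows (assumed form).
    """
    states = ("clean", "garbage")
    if not scores:
        return []
    cost = [list(scores[0])]
    back = []
    for i in range(1, len(scores)):
        # A larger spatial gap makes a state change cheaper.
        trans = switch_cost / (1.0 + gaps[i - 1] / gap_scale)
        row, ptr = [], []
        for s in range(2):
            stay = cost[-1][s]
            switch = cost[-1][1 - s] + trans
            if stay <= switch:
                row.append(stay + scores[i][s])
                ptr.append(s)
            else:
                row.append(switch + scores[i][s])
                ptr.append(1 - s)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheaper final state.
    s = 0 if cost[-1][0] <= cost[-1][1] else 1
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return [states[s] for s in path]
```

With a high switching cost, an isolated doubtful symbol inside clean text stays labeled clean, while a long spatial gap lets the labeling change cheaply between neighboring regions.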
Examples 1 to 4
[Image slides, not reproduced. Each of Examples 1 through 4 is shown on three slides titled "Example n", "OCR Engine A", and "OCR Engine B".]
Summary
• Pan-lingual detector of noisy text
• Spatial and sequential versions
• Works well for most of the approx. 30 languages considered
• Works well relative to several plausible alternatives
• Application in books and beyond
...which brings us to OCR
• Joint work with: Eugene Ie, Mike Jahr, Dmitriy Genzel, Franz Och, Andrew Senior, Nemanja Spasojevic, Frank Tang, Remco Teunen, others
OCR in Google Research
• Organize the world’s information and make it universally accessible and useful
• OCR still unavailable for some important languages
• Take advantage of latest technologies
• Massive amounts of data available
• Goal: Best-in-the-world OCR for all scripts and languages
A non-trivial task...
[Image slides, not reproduced.]
Approach
• Entirely from scratch
• Multiple models and features
• MERT-optimized log-linear combination
• Latest algorithms, e.g., from speech
• Data-driven based on massive amounts of data
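A log-linear combination scores each recognition hypothesis as a weighted sum of feature values (typically log-probabilities from the individual models) and picks the highest-scoring one. The sketch below shows that scoring, plus a brute-force search over a single weight that minimizes dev-set errors. That search targets the same objective as MERT but does not reproduce MERT's exact line search. The feature names `optical` and `lm`, the n-best format, and all function names are hypothetical.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def decode(nbest, weights):
    """Return the text of the best-scoring hypothesis in an n-best list."""
    best = max(nbest, key=lambda h: loglinear_score(h["features"], weights))
    return best["text"]


def dev_errors(dev_set, weights):
    """Count dev items whose decoded text differs from the reference."""
    return sum(decode(nbest, weights) != ref for nbest, ref in dev_set)


def tune_one_weight(dev_set, weights, name, candidates):
    """Crude stand-in for MERT: try each candidate value for one weight
    and keep the one with the fewest dev-set errors."""
    best = min(candidates,
               key=lambda w: dev_errors(dev_set, {**weights, name: w}))
    return {**weights, name: best}


# Toy dev set: each item is (n-best list, reference transcription).
dev_set = [
    ([{"text": "cat", "features": {"optical": -1.0, "lm": -1.0}},
      {"text": "cot", "features": {"optical": -0.8, "lm": -3.0}}], "cat"),
    ([{"text": "dog", "features": {"optical": -0.5, "lm": -2.0}},
      {"text": "dug", "features": {"optical": -2.5, "lm": -1.9}}], "dog"),
]
initial = {"optical": 1.0, "lm": 0.0}
tuned = tune_one_weight(dev_set, initial, "lm", [0.0, 0.5, 1.0])
```

In the toy data, the optical model alone prefers "cot" for the first item; giving the language model a nonzero weight recovers "cat" without changing the second decision.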
Early results on book images
[Eleven image slides, not reproduced. Each pairs a book image with its transcription (Ref = human annotator; Rec = Google Research).]
Good progress so far...
...but by no means done
Example: Thai
[Two image slides, not reproduced.]
Bootstrapping a basic Thai-capable system
• Steps
  1 Download 25Mb of Thai text from Wikisource
  2 Generate synthetic training data from text
  3 Split data into training and dev set
  4 Train LM from training set
  5 Train optical models from training set
  6 Tune system on dev set (MERT)
  7 Run on images from Google Books
• Entire process: ∼ 12 hours!
• Crippled system: small LM, small optical models, few fonts, no real dev set, no ground-truth test set
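Steps 3 and 4 can be illustrated with a small sketch: shuffle and split the text lines, then estimate a language model from the training portion only. The slides do not say what kind of LM was used; the character bigram model with add-one smoothing below is a stand-in chosen for brevity, and the function names, the 10% dev fraction, and the boundary markers are assumptions.

```python
import math
import random
from collections import Counter

BOS, EOS = "\x02", "\x03"  # assumed start and end markers for each line


def split_train_dev(lines, dev_fraction=0.1, seed=0):
    """Shuffle the lines reproducibly and hold out a fraction as the dev set."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def train_char_bigram(lines):
    """Return a function giving the add-one-smoothed log-probability of a string."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for line in lines:
        padded = BOS + line + EOS
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            context[a] += 1
            bigrams[a, b] += 1
    v = len(vocab) + 1  # one extra slot for characters never seen in training

    def logprob(text):
        padded = BOS + text + EOS
        return sum(math.log((bigrams[a, b] + 1) / (context[a] + v))
                   for a, b in zip(padded, padded[1:]))

    return logprob
```

Working at the character level avoids the need for word segmentation, which matters for Thai because words are not separated by spaces.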
Current topics of interest
• Synthetic training data
• Unsupervised / discriminative training
• Discriminative feature extraction
• More languages
Challenges in Google Books
Joint work with: Dar-Shyang Lee, Jeff Breidenbach, Stavan Parikh, Viresh Ratnakar, Ray Smith, Ranjith Unnikrishnan, others
Challenges [image slides, not reproduced]
• Multiple scripts/languages on a page
• Per-word script and language variation
• Geometric and graylevel distortions
Other challenges
• Multiple languages in same or similar script
  • Arabic-Farsi, Marathi-Hindi-Nepali
  • Bad initial OCR can become a virtue
• The same language in multiple scripts
  • Chinese, Japanese, Azerbaijani, Mongolian, Punjabi, Hindi, Serbian, Pali
• Archaic and reformed orthographies
  • Fraktur, Imperial Russian, 18th-century English
• Dark matter: what scripts and languages are actually present?
More challenge examples
[Image slides, not reproduced.]