Multilingual and Noisy Data Challenges in Large-Scale Book Scanning

Outline: Noisy text in Google Books; OCR in Google Research; Challenges
Ashok C. Popat, September 17, 2011

Methods compared against
• OCR-confidence output, using several versions of a commercial engine
• HEUR-T, HEUR-K: heuristics by Taghva '01 and Kulp '07
• Dictionaries: extract vocabulary files from Web data. Use the most frequent N terms, where N ranges from 1K to 1M
• Hard Dictionary (HDL, HDM): penalize the passage by C1 for each out-of-vocabulary (OOV) term, and by C2 for each punctuation mark or symbol tokenized as a singleton
• Soft Dictionary (SDL, SDM): for each term in a passage, find the edit distance to the nearest dictionary word (or charge C2 for a punctuation mark or symbol tokenized as a singleton). The penalty for the passage is the total edit distance divided by the passage length in Unicode code points
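The two dictionary baselines above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the talk: whitespace tokenization, the values of C1 and C2, the tiny vocabulary, and all function names are assumptions made for the example.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def is_symbol_singleton(token):
    """True for a one-character token in a Unicode punctuation or symbol category."""
    return len(token) == 1 and unicodedata.category(token)[0] in ("P", "S")


def hard_penalty(passage, vocab, c1=1.0, c2=0.5):
    """Hard dictionary: C1 per OOV term, C2 per punctuation/symbol singleton."""
    total = 0.0
    for token in passage.split():  # assumption: whitespace tokenization
        if is_symbol_singleton(token):
            total += c2
        elif token not in vocab:
            total += c1
    return total


def soft_penalty(passage, vocab, c2=0.5):
    """Soft dictionary: summed distance to nearest word, per code point of passage."""
    total = 0.0
    for token in passage.split():
        if is_symbol_singleton(token):
            total += c2
        else:
            total += min(edit_distance(token, word) for word in vocab)
    # len() of a Python str counts Unicode code points.
    return total / max(len(passage), 1)


vocab = {"the", "cat", "sat"}  # hypothetical vocabulary
print(hard_penalty("thc cat ; sat", vocab))
print(soft_penalty("thc cat ; sat", vocab))
```

The brute-force nearest-word search is only workable for a toy vocabulary; at the 1K to 1M term sizes mentioned on the slide, some indexed lookup would be needed.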

Comparison among methods

Table: All considered languages (approx. 30)

condition     mean τ   N      s²      95% CI
intra-rater   0.790    522    0.050   (0.770, 0.809)
inter-rater   0.668    3056   0.087   (0.658, 0.679)
OCR-conf      0.263    2610   0.216   (0.245, 0.280)
HEUR-T        0.339    2610   0.146   (0.325, 0.354)
HEUR-K        0.381    2610   0.149   (0.367, 0.396)
SEQ           0.600    2610   0.090   (0.589, 0.612)
SPA           0.665    2610   0.086   (0.654, 0.676)

Comparison among methods (continued)

Table: Eleven intersection languages

condition     mean τ   N      s²      95% CI
intra-rater   0.803    291    0.052   (0.777, 0.829)
inter-rater   0.665    1895   0.093   (0.651, 0.679)
OCR-conf      0.251    1455   0.239   (0.226, 0.276)
HEUR-T        0.375    1455   0.135   (0.356, 0.394)
HEUR-K        0.428    1455   0.141   (0.408, 0.447)
HDM1M         0.516    1455   0.111   (0.499, 0.533)
SDM50K        0.586    1455   0.106   (0.570, 0.603)
SEQ           0.607    1455   0.094   (0.592, 0.623)
SPA           0.670    1455   0.087   (0.655, 0.686)

Application to e-book readers
• For a given paragraph in an e-book, is it better to render the text or swap in the image?

Application to mobile device OCR
• Can we select only the good OCR text from a given image region?
• Viterbi search:
  • Two states: garbage and clean
  • Scores computed as described, plus transition costs
  • Transitions discounted based on image distance between symbols
• About 30 languages enabled; language not set in advance
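The two-state search above can be sketched as a small Viterbi decoder. This is an illustration under stated assumptions, not the system from the talk: per-symbol costs are taken as given inputs, and the discount form `switch_cost / (1 + gap)` and all names are hypothetical choices that merely make a state change cheaper when adjacent symbols are far apart in the image.

```python
def viterbi_clean_garbage(clean_cost, garbage_cost, gaps, switch_cost=1.0):
    """Label each symbol 'clean' or 'garbage' by minimizing total cost.

    clean_cost[t], garbage_cost[t]: cost of labeling symbol t with each state.
    gaps[t]: image distance between symbol t and symbol t + 1 (assumed units).
    """
    n = len(clean_cost)
    if n == 0:
        return []
    costs = (clean_cost, garbage_cost)  # state 0 = clean, state 1 = garbage
    best = [costs[0][0], costs[1][0]]   # best total cost ending in each state
    back = []                           # back[t-1][s] = predecessor of state s at t
    for t in range(1, n):
        # Assumed discount: switching is cheaper across a large spatial gap.
        penalty = switch_cost / (1.0 + gaps[t - 1])
        new_best, pointers = [], []
        for s in (0, 1):
            stay = best[s]
            switch = best[1 - s] + penalty
            if stay <= switch:
                new_best.append(stay + costs[s][t])
                pointers.append(s)
            else:
                new_best.append(switch + costs[s][t])
                pointers.append(1 - s)
        best = new_best
        back.append(pointers)
    # Trace back from the cheaper final state.
    state = 0 if best[0] <= best[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return ["clean" if s == 0 else "garbage" for s in path]


labels = viterbi_clean_garbage(
    clean_cost=[0.1, 0.1, 0.1, 3.0, 3.0, 3.0],
    garbage_cost=[1.0, 1.0, 1.0, 0.2, 0.2, 0.2],
    gaps=[0.0] * 5,
)
print(labels)
```

The transition cost is what keeps the labeling from flipping on every symbol; without it, the decoder reduces to an independent per-symbol threshold.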

Example 1

OCR Engine A

OCR Engine B

Summary
• Pan-lingual detector of noisy text
• Spatial and sequential versions
• Works well for most of the approx. 30 languages considered
• Works well relative to several plausible alternatives
• Application in books and beyond

... which brings us to OCR
• Joint work with Eugene Ie, Mike Jahr, Dmitriy Genzel, Franz Och, Andrew Senior, Nemanja Spasojevic, Frank Tang, Remco Teunen, and others

OCR in Google Research
• Organize the world's information and make it universally accessible and useful
• OCR still unavailable for some important languages
• Take advantage of latest technologies
• Massive amounts of data available
• Goal: best-in-the-world OCR for all scripts and languages

A non-trivial task...


Approach
• Entirely from scratch
• Multiple models and features
• MERT-optimized log-linear combination
• Latest algorithms, e.g., from speech
• Data-driven based on massive amounts of data
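A log-linear combination scores each candidate transcription as a weighted sum of feature values (typically log-probabilities from the individual models) and picks the highest-scoring one. The sketch below shows only that scoring step; the feature names, weights, and hypotheses are invented for illustration, and the MERT step that would tune the weights against a dev set is not shown.

```python
def log_linear_score(features, weights):
    """Weighted sum of feature values for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: log_linear_score(h["features"], weights))


# Hypothetical weights; in a MERT setup these would be tuned on held-out data.
weights = {"optical": 1.0, "lm": 0.5}

# Hypothetical candidates with made-up log-probabilities from two models.
hypotheses = [
    {"text": "cat", "features": {"optical": -1.0, "lm": -2.0}},
    {"text": "cot", "features": {"optical": -0.8, "lm": -6.0}},
]

print(best_hypothesis(hypotheses, weights)["text"])
```

Here the optical feature alone slightly prefers "cot", but the language-model feature outweighs it, which is the kind of trade-off the tuned weights control.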

Early results on book images
• Image
• Transcription (Ref = human annotator; Rec = Google Research)


Good progress so far...

... but by no means done

Example: Thai


Bootstrapping a basic Thai-capable system
• Steps
  1. Download 25 MB of Thai text from Wikisource
  2. Generate synthetic training data from the text
  3. Split the data into training and dev sets
  4. Train the LM on the training set
  5. Train the optical models on the training set
  6. Tune the system on the dev set (MERT)
  7. Run on images from Google Books
• Entire process: ∼12 hours!
• Crippled system: small LM, small optical models, few fonts, no real dev set, no ground-truth test set
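Steps 3 and 4 above can be illustrated with a toy example. This is a sketch under heavy assumptions, not the pipeline from the talk: the language model here is a character bigram model with add-one smoothing, the 10% dev fraction is arbitrary, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def split_train_dev(lines, dev_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a fraction of lines as a dev set."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def train_char_bigram_lm(lines):
    """Return a function giving the add-one-smoothed log-probability of a line."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for line in lines:
        chars = ["<s>"] + list(line) + ["</s>"]
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    v = len(vocab)

    def log_prob(line):
        # Unseen characters still get a nonzero probability from the smoothing,
        # though the model is not strictly normalized over them.
        chars = ["<s>"] + list(line) + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
            for a, b in zip(chars, chars[1:])
        )

    return log_prob


corpus = ["abab"] * 20  # stand-in for downloaded text
train, dev = split_train_dev(corpus)
lm = train_char_bigram_lm(train)
print(len(train), len(dev))
print(lm("abab") > lm("bbbb"))
```

A real system would use a far larger model and corpus; the point is only that both steps can run unattended on raw text, which is what makes a roughly 12-hour bootstrap plausible.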

Current topics of interest
• Synthetic training data
• Unsupervised / discriminative training
• Discriminative feature extraction
• More languages

Challenges in Google Books

Joint work with Dar-Shyang Lee, Jeff Breidenbach, Stavan Parikh, Viresh Ratnakar, Ray Smith, Ranjith Unnikrishnan, and others

Challenges: Multiple scripts/languages on a page

Challenges: Per-word script and language variation

Challenges: Geometric and graylevel distortions


Other challenges
• Multiple languages in same or similar script
  • Arabic-Farsi, Marathi-Hindi-Nepali
  • Bad initial OCR can become a virtue
• The same language in multiple scripts
  • Chinese, Japanese, Azerbaijani, Mongolian, Punjabi, Hindi, Serbian, Pali
• Archaic and reformed orthographies
  • Fraktur, Imperial Russian, 18th-century English
• Dark matter: what scripts and languages are actually present?

More challenge examples


Multilingual and Noisy Data Challenges in Large-Scale Book Scanning
Ashok C. Popat, Staff Research Scientist, Google, Inc.
September 17, 2011
