OCR Post-Processing
Michal Richter
Noisy channel approach I
Scanning of the document and OCR introduce errors (noise). The post-processing step reduces the number of these errors.
Noisy channel approach II
Post-processing corrects one sentence at a time. The OCR output is modified by a small number of editing operations (a candidate-generation sketch follows below), including:
– single character insertion
– single character deletion
– single character substitution
– multiple character substitution (ab → ba)
– word split, word merge
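A minimal sketch of generating candidates with these editing operations, in the spirit of a Norvig-style spelling corrector; the restricted alphabet is an illustrative assumption, and including the space character lets substitution/insertion/deletion cover word split and word merge as well:

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def single_edit_candidates(word):
    # All ways to cut the word in two: ("", "cl0ck"), ("c", "l0ck"), ...
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    subs = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    # Adjacent transposition, the "ab -> ba" multiple character substitution
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    return set(deletes + inserts + subs + swaps)

print("clock" in single_edit_candidates("cl0ck"))  # True, via substitution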
Intuitive description
In post-processing we want to replace the input sequence of characters with another sequence that is graphically similar and forms a likely sentence of the given language. These two aspects are handled separately.
General form of the model
P(O, S) = P(O | S) · P(S)
O – output of the OCR system
S – candidate sequence of characters
P(O | S) – probability that the sequence S is recognized as O by the OCR system; corresponds to the optical similarity between O and S; usually called the error model
P(S) – probability of S; corresponds to the well-formedness of the sequence S; this quantity should be greater for well-formed sentences; called the language model
The corrected sentence is the candidate S maximizing P(O | S) · P(S).
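A minimal sketch of this noisy-channel scoring in log space; error_logprob and lm_logprob are hypothetical stand-ins for the error model and language model discussed on the following slides, and the toy models are made up for illustration:

import math

def correct(ocr_output, candidates, error_logprob, lm_logprob):
    # argmax over S of  log P(O | S) + log P(S)
    return max(candidates,
               key=lambda s: error_logprob(ocr_output, s) + lm_logprob(s))

# Toy error model: prefers candidates close to the OCR output.
def toy_error_logprob(o, s):
    return -abs(len(o) - len(s)) - sum(a != b for a, b in zip(o, s))

# Toy language model: prefers a known word.
def toy_lm_logprob(s):
    return math.log(0.9) if s == "kitten" else math.log(0.1)

print(correct("kitteh", ["kitten", "kitteh"],
              toy_error_logprob, toy_lm_logprob))  # -> kitten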
Language model – P(S): word based
– Uses a lexicon: each sequence of characters is identified with an item in the lexicon
– Smoothness of the sentence is ensured by a word-based n-gram model (usually a trigram); see the sketch below
– Problem: a high-coverage lexicon and a huge amount of online text are needed (for n-gram model estimation)
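A sketch of the word-based variant: trigram counts over a training corpus, with a stupid-backoff-style fallback to bigrams and unigrams. The backoff scheme is an assumption; the slide only specifies a trigram model:

import math
from collections import Counter

class WordTrigramLM:
    def __init__(self):
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        self.lexicon = set()

    def train(self, sentences):
        for s in sentences:
            words = ["<s>", "<s>"] + s.split() + ["</s>"]
            self.lexicon.update(s.split())
            for w in words:
                self.uni[w] += 1
            for i in range(1, len(words)):
                self.bi[tuple(words[i - 1:i + 1])] += 1
            for i in range(2, len(words)):
                self.tri[tuple(words[i - 2:i + 1])] += 1

    def logprob(self, sentence):
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for i in range(2, len(words)):
            t = tuple(words[i - 2:i + 1])
            if self.tri[t]:
                p = self.tri[t] / self.bi[t[:2]]
            elif self.bi[t[1:]]:
                p = 0.4 * self.bi[t[1:]] / self.uni[t[1]]
            else:  # unigram fallback with add-one smoothing
                p = 0.16 * (self.uni[t[2]] + 1) / \
                    (sum(self.uni.values()) + len(self.lexicon))
            lp += math.log(p)
        return lp

lm = WordTrigramLM()
lm.train(["the cat sat on the mat", "the dog sat"])
print(lm.logprob("the cat sat") > lm.logprob("cat the sat"))  # True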
Language model – P(S): character based
– Smoothness of the sentence is ensured on the character level
– No lexicon needed; a smaller amount of training data suffices for language model estimation
– A character-based n-gram language model is used (even a 6-gram is possible)
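A sketch of the character-based variant, here a character n-gram model with add-one smoothing; the smoothing choice and padding symbols are assumptions, the slide only says a character n-gram model is used:

import math
from collections import Counter

class CharNgramLM:
    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()

    def train(self, text):
        padded = "^" * (self.n - 1) + text + "$"
        self.vocab.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def logprob(self, text):
        padded = "^" * (self.n - 1) + text + "$"
        lp, v = 0.0, len(self.vocab)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            lp += math.log((self.ngrams[gram] + 1) /
                           (self.contexts[gram[:-1]] + v))
        return lp

lm = CharNgramLM(n=3)
lm.train("the cat sat on the mat")
print(lm.logprob("the cat") > lm.logprob("thx qat"))  # True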
Error model – P(O | S)
Levenshtein distance
– Number of insertions, deletions and substitutions needed to transform the input into the target
– Example: the LD between kitten and sitting is 3: kitten → sitten → sittin → sitting
Modified Levenshtein distance
– Editing operations have different costs according to their probability
– Example: low cost for in ↔ m, high cost for w ↔ R
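A sketch of such a weighted Levenshtein distance. Unit costs reproduce the kitten → sitting example; the cost values in the table are made up, following the slide's point that confusing in with m is cheap for an OCR engine while confusing w with R is expensive:

CHEAP, EXPENSIVE = 0.1, 5.0
PAIR_COST = {("in", "m"): CHEAP, ("m", "in"): CHEAP,
             ("w", "R"): EXPENSIVE, ("R", "w"): EXPENSIVE}

def weighted_distance(src, tgt, ins=1.0, dele=1.0, sub=1.0):
    n, m = len(src), len(tgt)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0.0 if src[i - 1] == tgt[j - 1] else \
                PAIR_COST.get((src[i - 1], tgt[j - 1]), sub)
            d[i][j] = min(d[i - 1][j] + dele,         # deletion
                          d[i][j - 1] + ins,          # insertion
                          d[i - 1][j - 1] + sub_cost) # substitution
            # multiple character substitution, e.g. "in" <-> "m"
            if i >= 2 and (src[i - 2:i], tgt[j - 1]) in PAIR_COST:
                d[i][j] = min(d[i][j], d[i - 2][j - 1] +
                              PAIR_COST[(src[i - 2:i], tgt[j - 1])])
            if j >= 2 and (src[i - 1], tgt[j - 2:j]) in PAIR_COST:
                d[i][j] = min(d[i][j], d[i - 1][j - 2] +
                              PAIR_COST[(src[i - 1], tgt[j - 2:j])])
    return d[n][m]

print(weighted_distance("kitten", "sitting"))  # 3.0
print(weighted_distance("modem", "inodein"))   # 0.2: two cheap m <-> in confusions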
Error model – P(O | S): word segmentation
– Can be treated by a word segmentation model: P(O, b, a, C) = P(O, b | a, C) P(a | C) P(C)
– Another possibility is to avoid special treatment of the space character: word segmentation errors are then corrected via insertion/deletion of the space character (sketched below)
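A minimal sketch of the second option: the space is treated like any other character, so split/merge errors become ordinary insertions and deletions, and candidates are scored by a language model. lm_logprob is a hypothetical stand-in, and the toy lexicon-based scorer is made up for illustration:

def space_edit_candidates(text):
    # Yield `text` plus every variant with one space removed or inserted.
    yield text
    for i in range(len(text)):
        if text[i] == " ":            # delete a space -> word merge
            yield text[:i] + text[i + 1:]
        elif i > 0:                   # insert a space -> word split
            yield text[:i] + " " + text[i:]

def best_segmentation(text, lm_logprob):
    return max(space_edit_candidates(text), key=lm_logprob)

# Toy scorer: rewards tokens found in a small lexicon, penalizes length.
LEX = {"the", "post", "processing", "step"}
toy_lm = lambda s: sum(w in LEX for w in s.split()) - 0.01 * len(s)

print(best_segmentation("thepost processing step", toy_lm))
# -> "the post processing step"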
Search for the correct sentence S
Viterbi decoding
Weighted finite state transducers
– The language model and the error model are represented as finite state transducers
– Compose the automaton representing the OCR output with the transducers representing the error model and the language model
– Find the shortest path in the composed transducer (example on the blackboard)
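The WFST composition itself is not reproduced here; the sketch below shows plain Viterbi decoding over a character lattice, where each OCR position carries candidate characters with error-model log probabilities and a toy bigram transition function plays the language model. All numbers are made up for illustration:

def viterbi(lattice, trans_logprob):
    # lattice: list of {char: emission_logprob} dicts, one per position.
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (lp, c) for c, lp in lattice[0].items()}
    for position in lattice[1:]:
        new_best = {}
        for c, emit_lp in position.items():
            new_best[c] = max(
                (prev_score + trans_logprob(prev_c, c) + emit_lp,
                 prev_path + c)
                for prev_c, (prev_score, prev_path) in best.items())
        best = new_best
    return max(best.values())[1]

# OCR read "cl0ck"; the third position is ambiguous between '0' and 'o'.
lattice = [{"c": -0.1}, {"l": -0.1}, {"0": -0.7, "o": -0.9},
           {"c": -0.1}, {"k": -0.1}]
# Toy bigram LM: digits inside a word are heavily penalized.
trans = lambda a, b: -10.0 if a.isdigit() or b.isdigit() else -1.0
print(viterbi(lattice, trans))  # -> clock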
Post-correction accuracy measure
Accuracy is measured with the word error rate (WER) metric.
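For concreteness, WER is the Levenshtein distance computed over words instead of characters, divided by the length of the reference; a minimal sketch:

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1,   # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sad"))  # 1 substitution / 3 words ~ 0.33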
Post-correction accuracy (Kolak, Resnik; 2005)
– WER reduction of up to 80%
– African language Igbo
– Character-based model
– Miniature-size training data: only 6,727 words!
Post-correction for the historical domain
Insufficient amount of training data (if any)
Usually no high-coverage lexicons available
→ This implies that the word-based approach is often impossible
References
Okan Kolak and Philip Resnik. OCR Post-Processing for Low Density Languages. EMNLP 2005.