OPTICAL CHARACTER RECOGNITION
Master in Computer Vision (Màster de Visió per Computador), academic year 2006-2007

Outline
• Introduction
• Pre-processing (document level)
  – Binarization
  – Skew correction
• Segmentation
  – Layout analysis
  – Character segmentation
• Pre-processing (character level)
• Feature extraction
  – Image-based features
  – Statistical features
  – Transform-based features
  – Structural features
• Classification
• Post-processing
  – Classifier combination
  – Exploitation of context information
• Examples of OCR systems
• Bibliography

Introduction

Optical Character Recognition
[Diagram relating the fields. Labels: Image Processing; Pattern Recognition (Statistical, Structural); Pattern Recognition Methods; Pattern Recognition Applications; Document Analysis; Optical Character Recognition]

Some examples
• Drawings, maps
• Books, journals, reports
• License plates
• PDAs
• Identity cards
• Old documents
• Cheques, bills
• Postal addresses
• Quality control

Document Image Analysis
G. Nagy: Twenty years of document image analysis in PAMI. IEEE Trans. on PAMI, vol. 22, no. 1, pp. 38-62, January 2000.
• Document image analysis is the subfield of digital image processing that aims at converting document images to symbolic form for modification, storage, retrieval, reuse and transmission.
• Document image analysis is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer.

What is a document?
Objects created expressly to convey information encoded as iconic symbols:
– Scanned images from paper documents
– Electronic documents
– Multimedia documents (video with text)
– …

Applications of DIA
• Document imaging
  – Digitization
  – Storage
  – Compression
  – Re-printing
• Document understanding
  – Recognition
  – Interpretation
  – Indexing
  – Retrieval

DIA tasks
• Mostly text documents: acquisition, binarization, filtering, skew correction, segmentation, layout analysis, OCR
• Mostly graphics documents: acquisition, binarization, filtering, vectorization, text-graphics separation, symbol recognition
• Acquisition, binarization and filtering belong to document imaging; symbol recognition, OCR and interpretation belong to document understanding.

Outline of the course
Focus: document understanding of mostly text documents
1. Acquisition
2. Pre-processing
   – Binarization
   – Skew correction
3. Layout analysis
4. Character segmentation
5. OCR
   – Feature extraction
   – Classification
   – Post-processing

Categorization of Character Recognition
• According to the type of writing
  – Machine-printed character recognition
  – Hand-written character recognition
• According to the type of acquisition
  – Off-line character recognition
  – On-line character recognition

Machine-printed character recognition
• Characters are totally defined by the font type:
  – Dimensions (segmentation)
    • Character width
    • Inter-character separation
    • Character height
  – Shape (recognition)
    • Typographic effects (boldface, italics, underline)
• Challenges:
  – Similar shapes among characters
  – Multiple fonts
  – Joined characters
  – Digitization noise: broken lines, random noise, heavy characters, etc.
  – Document degradation: old documents, photocopies, etc.

Machine-printed character recognition
• Classification of machine-printed OCR systems
  – Monofont:
    • One single type of font
  – Multifont:
    • Recognition of a fixed and known set of fonts
    • It is necessary to identify and learn the differences between characters of all the types of fonts
  – Omnifont:
    • Recognition of any arbitrary type of font, even if it has not been previously learned

Off-line hand-written character recognition
• Hand-written
• Off-line: acquisition by a scanner or a camera
• Challenges:
  – Shape variability among images of the same character
  – Character segmentation
• Subproblems:
  – Hand-written numeral recognition: digit recognition
  – Hand-printed character recognition: well-separated characters
  – Cursive character recognition: non-separated characters

On-line hand-written character recognition
• On-line acquisition
  – Digitizer tablets
  – Digital pen
  – Tablet PC
• Advantages with respect to off-line acquisition:
  – Image is acquired while the text is written
  – We can take advantage of dynamic information:
    • Temporal information: writing order, stroke segmentation, etc.
    • Writing speed
    • Pen pressure
• Subproblems:
  – Cursive script recognition
  – Signature verification/recognition

Levels of difficulty in character recognition
• S. Mori, H. Nishida, H. Yamada. Optical Character Recognition. John Wiley and Sons. 1999.
• S.V. Rice, G. Nagy, T.A. Nartker. Optical Character Recognition: An illustrated guide to the frontier. Kluwer Academic Publishers. 1999.

Level 0 (little shape variability, small number of characters, little noise)
  0.0. Printed characters. Specific font. Constant size. Roman alphabets.
  0.1. Constrained hand-printed characters. Arabic numerals.
Level 1 (medium variation in shape, medium noise)
  1.0. Printed characters. Multiple fonts. No. of characters < 100
  1.1. Loosely constrained hand-printed characters. No. of characters < 100
  1.2. Chinese characters of few fonts
  1.3. Loosely constrained hand-printed characters. No. of characters ≈ 1000
Level 2 (much variation in shape, heavy noise)
  2.0. Printed characters of multiple fonts
  2.1. Unconstrained hand-printed characters
  2.2. Affine transformed characters
Level 3 (nonsegmented strings of characters)
  3.0. Touching/broken characters
  3.1. Cursive handwriting characters
  3.2. Characters on a textured background

Levels of difficulty in character recognition: Level 0
0.0. Printed characters of a specific font with a constant size
  • Constant size
  • Connectivity of characters
  • Variation in the stroke thickness
  • Little noise
0.1. Constrained hand-printed characters
  • Characters are written according to some instructions or box guidelines
Solved problem

Levels of difficulty in character recognition: Level 1
1.0. Printed characters of multiple fonts
1.1. Loosely constrained hand-printed characters
1.2. Chinese characters of few fonts
1.3. Loosely constrained hand-printed characters. No. of characters ≈ 1000
Solved problem

Levels of difficulty in character recognition: Level 2
2.0. Printed characters of multiple fonts
2.1. Unconstrained hand-printed characters
2.2. Affine transformed characters

Levels of difficulty in character recognition: Level 3
3.0. Touching or broken characters
3.1. Cursive handwriting characters
3.2. Characters on a textured background

Databases for OCR
• Off-line hand-written characters
  – CEDAR
    • 50,000 segmented numerals from zip codes
    • 5,000 zip codes
    • 5,000 city names
    • 9,000 state names
  – CENPARMI
    • 17,000 manually segmented numerals from zip codes
  – NIST
    • More than 1,000,000 characters from forms
    • Several learning and test sets (more variability)
    • 91,500 sentences with a dictionary
• On-line characters
  – UNIPEN
    • Definition of a format to represent on-line data
    • 4,500,000 characters
    • Segmented characters, words and sentences
• Machine-printed characters
  – Univ. Washington
    • More than 1,500 pages of articles in English
    • More than 500 pages of articles in Japanese
    • Originals, photocopies and pages with artificially generated noise
    • Page segmentation into labeled zones

Performance evaluation of OCR systems
• Hand-printed character recognition
  – Institute for Posts and Telecommunications Policy (IPTP), Japan, 1996
  – 5,000 hand-written numerals from Japanese zip codes
  – Performance of the best system: 97.94% (human performance: 99.84%)
• Machine-printed character recognition
  – "The Fifth Annual Test of OCR Accuracy". Information Science Research Institute. TR-96-01. April 1996. http://www.isri.unlv.edu
  – 5,000,000 characters from 2,000 pages in journals, newspapers, letters and technical reports
  – Performance in good quality documents: 99.77% - 99.13%
  – Performance in medium quality documents: 99.27% - 98.21%
  – Performance in low quality documents: 97.01% - 89.34%
• An accuracy of 99% still means about 30 errors per page (at 3,000 characters per page)
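
The closing remark on errors per page is plain arithmetic: the error rate (100% minus the accuracy) times the number of characters on a page. A minimal sketch follows, assuming the 3,000 characters per page quoted on the slide; the function name `expected_errors_per_page` is made up for this illustration and is not part of any OCR toolkit.

```python
def expected_errors_per_page(accuracy_percent: float, chars_per_page: int = 3000) -> float:
    """Expected number of misrecognized characters on a single page.

    accuracy_percent: character-level accuracy in percent (e.g. 99.0).
    chars_per_page:   assumed page length; 3000 is the figure used on the slide.
    """
    error_rate = 1.0 - accuracy_percent / 100.0
    return error_rate * chars_per_page


if __name__ == "__main__":
    # Accuracies taken from the figures quoted above.
    examples = [
        ("slide example", 99.0),
        ("best system, good quality documents", 99.77),
        ("worst system, low quality documents", 89.34),
    ]
    for label, accuracy in examples:
        errors = expected_errors_per_page(accuracy)
        print(f"{label}: about {errors:.1f} errors per page")
```

Read this way, even the best reported accuracy on good quality documents leaves several wrong characters on every page, which is one reason the post-processing stage matters.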

Performance evaluation of OCR systems
C.Y. Suen, J. Tan: Analysis of errors of handwritten digits made by a multitude of classifiers. Pattern Recognition Letters 26, pp. 369-379. 2005.

Database     Classifier   No. of test samples   Recognition (%)   Error (%)
MNIST        GPR          10,000                99.06             0.94
MNIST        VSVMb        10,000                99.62             0.38
MNIST        VSV2         10,000                99.44             0.56
MNIST        LeNet-5      10,000                99.18             0.82
MNIST        POE          10,000                98.32             1.68
CENPARMI     VSVM         2,000                 98.7              1.3
USPS         VSVM         2,007                 97.66             2.34
NIST SD19    MLP          30,000                99.16             0.84

Performance evaluation of OCR systems: errors by category

Database     Classifier   No. of errors   Category 1   Category 2   Category 3
MNIST        GPR          94              24           11           59
MNIST        VSVMb        38              15           6            17
MNIST        VSV2         56              15           9            32
MNIST        LeNet-5      82              17           14           51
MNIST        POE          168             41           9            118
CENPARMI     VSVM         26              6            4            16
USPS         VSVM         47              13           13           21
NIST SD19    MLP          119             30           8            81
Sum                       630             161          74           395
Percentage (%)            100             25.56        11.75        62.70
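
The Sum and Percentage rows of the error table can be recomputed from the per-classifier counts. The sketch below does that, and also checks that each row's three category counts add up to its error total. The counts are copied from the table above; the names `ROWS`, `rows_are_consistent` and `category_shares` are hypothetical helpers introduced only for this example.

```python
# (database, classifier, errors, category 1, category 2, category 3)
ROWS = [
    ("MNIST", "GPR", 94, 24, 11, 59),
    ("MNIST", "VSVMb", 38, 15, 6, 17),
    ("MNIST", "VSV2", 56, 15, 9, 32),
    ("MNIST", "LeNet-5", 82, 17, 14, 51),
    ("MNIST", "POE", 168, 41, 9, 118),
    ("CENPARMI", "VSVM", 26, 6, 4, 16),
    ("USPS", "VSVM", 47, 13, 13, 21),
    ("NIST SD19", "MLP", 119, 30, 8, 81),
]


def rows_are_consistent(rows):
    """True if, in every row, the three category counts sum to the error total."""
    return all(errors == c1 + c2 + c3 for _, _, errors, c1, c2, c3 in rows)


def category_shares(rows):
    """Return (total errors, [share of each category in percent, 2 decimals])."""
    total = sum(row[2] for row in rows)
    sums = [sum(row[i] for row in rows) for i in (3, 4, 5)]
    shares = [round(100.0 * s / total, 2) for s in sums]
    return total, shares


if __name__ == "__main__":
    total, shares = category_shares(ROWS)
    print("rows consistent:", rows_are_consistent(ROWS))
    print("total errors:", total)
    for index, share in enumerate(shares, start=1):
        print(f"category {index}: {share:.2f}%")
```

Pooling the counts before dividing weights each classifier by its number of errors, which is how the Percentage row is built; averaging the per-classifier percentages would give different figures.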