Document Page Layout Analysis
Bhabatosh Chanda
Electronics and Communication Sciences Unit
Indian Statistical Institute
Kolkata 700108, India

Acknowledgement
• Amit Das, IIEST, Sibpur
• Sekhar Mandal, IIEST, Sibpur
• Sanjoy Kumar Saha, Jadavpur University
• Ranjan Mandal, Indian Statistical Institute
January 30, 2017 Indian Statistical Institute

Outline
• Introduction
• Projection method
  – Zone content classification
• Morphological operators
  – Skew correction
• Morphology based method
• Deep learning based method
• Performance evaluation
• Database: examples
• Conclusion

Introduction
• Problem description
• Motivation
• Improve performance of OCR
• Data compression
• Graphics recognition
• Browsing and navigation
• Physical and logical structure

Problem Description

Objective

Major Sources of Document Pages
1. Books
2. Journals
3. Magazines
4. Newspapers
5. Forms and leaflets
6. Reports

Types of document pages
Consider books and journals
• Title page
• Publisher's page
• Table of Contents
• Text page
• Index page

Different types of pages
[Figures: Title page; Publisher's page; Table of Contents pages; Text pages 1-3; Index page]

Issues in document page scanning
• Resolution
• Back page impression
• Granular noise
• Blotted text (especially in old documents)
• Bending of pages at the binding
• Skew (due to placement of the page in the scanner)

Entities of Document Page
• Text
  – Body text: Line, Word, Character
  – Heading
• Non-text
  – Half-tone
  – Table
  – Graphics or line drawing

Entities of Document Page
• Each detected zone or block must be homogeneous in terms of content or entity.
• Each zone will be input to one of the suitable modules based on entity.
  – OCR system
  – Image compressor
  – Vectorization system
• Output of these modules may be compiled and archived using a suitable structure.

Geometrical / Physical structure
[Diagram: hierarchy Document → Pages → Blocks (text and non-text) → Lines → Words → Characters]

Logical structure
Document
• Text
  – Normal: Body, Abstract
  – Highlighted: Heading, Sub-heading
• Non-text
  – Half-tone (image)
  – Line drawing: Graphics, Table

Logical structure
• Different entities:
  – Text (red box)
  – Halftone (green box)
  – Table (magenta box)
  – Line drawing (blue box)
• Reading direction (dark blue arrow)
• Link between entities (brown arrow)

Zone / block detection
• One of the simplest ways is the projection method.
• Algorithm
  – Take the horizontal (or vertical) projection of foreground pixels (may be implemented as a pixel count).
  – If there exists a characteristic change in the projection profile, put a horizontal (resp. vertical) separator.
  – Take the horizontal and vertical directions alternately.
  – Continue as long as the above condition is satisfied.
• Works well for structured documents, usually the pages of technical journals, books, etc.

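The algorithm above can be sketched as a recursive X-Y cut in plain Python. This is only an illustration under assumptions not stated on the slide: the page is a binary matrix (1 = foreground), a "characteristic change" is taken to mean a run of at least `min_gap` empty rows or columns, and the names `profile`, `runs` and `xy_cut` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # 1 = foreground (black), 0 = background
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), end-exclusive


def profile(img: Image, box: Box, horizontal: bool) -> List[int]:
    """Foreground pixel count per row (horizontal) or per column (vertical)."""
    top, bottom, left, right = box
    if horizontal:
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def runs(prof: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Non-empty segments of a profile separated by >= min_gap empty bins."""
    segments, start, gap = [], None, 0
    for i, v in enumerate(prof):
        if v > 0:
            if start is None:
                start = i
            elif gap >= min_gap:          # wide white gap: place a separator
                segments.append((start, i - gap))
                start = i
            gap = 0                       # narrow gaps are absorbed
        elif start is not None:
            gap += 1
    if start is not None:
        segments.append((start, len(prof) - gap))
    return segments


def xy_cut(img: Image, box: Box, min_gap: int = 2,
           horizontal: bool = True, other_failed: bool = False) -> List[Box]:
    """Alternate horizontal and vertical cuts until neither direction splits."""
    top, bottom, left, right = box
    segs = runs(profile(img, box, horizontal), min_gap)
    if not segs:
        return []                         # zone contains no foreground
    boxes = [(top + s, top + e, left, right) if horizontal
             else (top, bottom, left + s, left + e) for s, e in segs]
    if len(boxes) == 1:
        if other_failed:                  # no cut in either direction: leaf zone
            return boxes
        return xy_cut(img, boxes[0], min_gap, not horizontal, True)
    zones: List[Box] = []
    for sub in boxes:
        zones += xy_cut(img, sub, min_gap, not horizontal, False)
    return zones
```

Calling `xy_cut(img, (0, height, 0, width))` returns the leaf zones as bounding boxes. The gap threshold is the main tuning knob: too small and text lines are split apart, too large and neighbouring columns merge.
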
Projection Method: An Example
[Figures: example continued over four slides]

Problems of Projection method
• Cannot say what each block contains until further analysis.
  – Extract features from a zone
  – Recognize the zone content using a classifier
• Results are highly dependent even on small skew in the scanned page.

Zone content recognition
Features:
• Black pixel ratio (no. of black pixels / zone area)
• Horizontal transition (black to white) count
• Vertical transition (black to white) count
• Normalized mean length of horizontal black pixel run
• Normalized mean length of vertical black pixel run
• Connected component ratio
Classifier:
• Two-class (text and non-text) SVM with RBF kernel (accuracy 94.89%)
Duong, Emptoz, Côté: Features for Printed Document Image Analysis, ICPR 2002.

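Most of the features listed above can be computed directly from a binary zone. The sketch below is a hedged illustration, not the cited authors' implementation: it assumes the zone is a list of rows with 1 = black, normalizes run lengths by the zone width (or height), which is my assumption because the slide does not say how they are normalized, and leaves out the connected component ratio because the slide does not define it. The function names are hypothetical.

```python
from typing import Dict, List


def run_lengths(line: List[int]) -> List[int]:
    """Lengths of consecutive runs of black (1) pixels in one row or column."""
    found, n = [], 0
    for v in line:
        if v:
            n += 1
        elif n:
            found.append(n)
            n = 0
    if n:
        found.append(n)
    return found


def transitions(lines: List[List[int]]) -> int:
    """Number of black-to-white (1 followed by 0) transitions."""
    return sum(1 for ln in lines for a, b in zip(ln, ln[1:]) if a == 1 and b == 0)


def mean_run(lines: List[List[int]]) -> float:
    """Mean black run length over all lines (0.0 if there are no runs)."""
    all_runs = [r for ln in lines for r in run_lengths(ln)]
    return sum(all_runs) / len(all_runs) if all_runs else 0.0


def zone_features(zone: List[List[int]]) -> Dict[str, float]:
    """Feature vector for one zone; normalization choices are assumptions."""
    height, width = len(zone), len(zone[0])
    cols = [list(c) for c in zip(*zone)]      # transpose to scan columns
    return {
        "black_ratio": sum(map(sum, zone)) / (height * width),
        "h_transitions": transitions(zone),
        "v_transitions": transitions(cols),
        "h_mean_run": mean_run(zone) / width,   # assumed: divide by zone width
        "v_mean_run": mean_run(cols) / height,  # assumed: divide by zone height
    }
```

The resulting dictionary can be flattened into a vector and passed to any two-class classifier, for example an SVM with an RBF kernel as on the slide.
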
Zone content recognition
• Functional classification of text blocks
  – Title / Heading, Sub-heading, Body text …
• Features:
  – complexity (measured by entropy)
  – visibility values (or relative boldness)
  – directional compactness (horizontal and vertical)
  – geometric characteristics (block height, width, etc.)
• Classifier:
  – K-means clustering followed by min. distance classifier
Bres, Eglin, and Gafneux, Unsupervised Clustering of Text Entities in Heterogeneous Grey Level Documents, ICPR, 2002.

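The classifier stage named above, K-means followed by a minimum distance classifier, can be sketched as follows. This is a generic illustration rather than the cited authors' code: it assumes each text block is already described by a numeric feature vector, uses Euclidean distance, and seeds the centroids with the first `k` points, which is a naive choice made only to keep the example deterministic. The names `nearest` and `kmeans` are hypothetical.

```python
import math
from typing import List, Sequence


def nearest(x: Sequence[float], centroids: List[List[float]]) -> int:
    """Minimum distance classifier: index of the closest centroid."""
    return min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))


def kmeans(points: List[Sequence[float]], k: int, iters: int = 20) -> List[List[float]]:
    """Plain K-means; returns the final cluster centroids."""
    centroids = [list(p) for p in points[:k]]   # assumed init: first k points
    for _ in range(iters):
        groups: List[list] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:                # converged
            break
        centroids = updated
    return centroids
```

After clustering, a new block is labelled with `nearest(features, centroids)`; mapping cluster indices to names such as heading or body text is a separate step that this sketch does not cover.
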
Problems of Projection method
• Cannot say what each block contains until further analysis
  – Extract features from a zone
  – Recognize the zone content using a classifier
• Results are highly dependent even on small skew in the scanned page
  – Detecting the base line of each text line of the document
  – Determining the orientation (slope) angle of each base line
  – Estimating the overall skew of the document page

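The three skew-related steps above can be illustrated with a small sketch. It assumes the base line points of each text line have already been detected (for instance the bottom-most pixel of each character), fits a least-squares line to them, and takes the median angle as the page skew. The least-squares fit and the median are my choices for illustration, not necessarily the method used in the talk, and the function names are hypothetical.

```python
import math
from statistics import median
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


def baseline_angle(points: List[Point]) -> float:
    """Slope angle in degrees of the least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # atan2 equals atan(sxy / sxx) for sxx > 0 and avoids division by zero.
    return math.degrees(math.atan2(sxy, sxx))


def page_skew(lines: List[List[Point]]) -> float:
    """Overall skew: median of the per-line base line angles."""
    return median(baseline_angle(pts) for pts in lines if len(pts) >= 2)
```

The median is used here so that a few badly detected base lines (short lines, headings, captions) do not dominate the estimate. The sign of the angle depends on whether the y axis points up or down in the chosen coordinate system.
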
Processing Tool
• Spatial domain operator that can handle shape information directly
• Mathematically well defined
• Neighborhood operator such that hardware implementation should be simple

Mathematical Morphology
• Mathematical morphological operators are a good choice.
Objects
• All characters, figures, drawings, i.e., black components against white background
Structuring element
• Regular geometric figures:
  – mostly line segment, square, circle, etc.

Morphological Operations
Set theoretic operations (including union, intersection, etc.):
1. Dilation
2. Erosion
3. Opening
4. Closing

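For reference, opening and closing are conventionally defined as compositions of erosion ($\ominus$) and dilation ($\oplus$). The symbols $\circ$ and $\bullet$ below are the usual textbook notation and are assumed here, since the slide only names the operations:

```latex
% Opening: erosion followed by dilation with the same structuring element B
A \circ B = (A \ominus B) \oplus B
% Closing: dilation followed by erosion with the same structuring element B
A \bullet B = (A \oplus B) \ominus B
```

Opening removes object parts smaller than the structuring element, while closing fills gaps and holes smaller than it.
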
Morphological operator: Dilation
• Expands the objects.
  $A \oplus B = \{\, a + b \mid a \in A,\ b \in B \,\}$, where A is an object and B is the SE.
• Properties: commutative, associative, distributive (over union), increasing
[Figure panels: Orig.; SE: Circ-5, Circ-9, Line-19]

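The set definition of dilation translates almost literally into code. The sketch below assumes that an object and a structuring element are both represented as Python sets of integer (row, column) coordinates; the function name `dilate` is hypothetical.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def dilate(A: Set[Pixel], B: Set[Pixel]) -> Set[Pixel]:
    """Dilation as a set of pairwise sums: every a in A shifted by every b in B."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}


# A two-pixel horizontal object and a two-pixel vertical structuring element.
obj = {(0, 0), (0, 1)}
se = {(0, 0), (1, 0)}
grown = dilate(obj, se)   # a 2 x 2 square
```

Because the definition is symmetric in A and B, `dilate(obj, se)` and `dilate(se, obj)` give the same set, which is the commutativity property listed on the slide.
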
Morphological operator: Erosion
• Shrinks the objects.
  $A \ominus B = \{\, p \mid B_p \subseteq A \,\}$, where A is an object, B is the SE, and $B_p$ is B translated by p.
• Properties: distributive (over intersection), increasing.
• Dilation and erosion are dual.
[Figure panels: Orig.; SE: Circ-5, Circ-9, Line-19]

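Erosion can be sketched in the same set-based style: a position p is kept only if the structuring element translated to p fits entirely inside the object. As an assumption for illustration, objects and structuring elements are Python sets of integer (row, column) coordinates, the structuring element is non-empty, and the name `erode` is hypothetical.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def erode(A: Set[Pixel], B: Set[Pixel]) -> Set[Pixel]:
    """Erosion: all translations p such that B shifted by p lies inside A."""
    b0 = next(iter(B))  # any element of B; p + b0 must be in A, so p = a - b0
    candidates = {(a[0] - b0[0], a[1] - b0[1]) for a in A}
    return {
        p for p in candidates
        if all((p[0] + b[0], p[1] + b[1]) in A for b in B)
    }


# A 3 x 3 square eroded by a centred horizontal line of length 3.
square = {(r, c) for r in range(3) for c in range(3)}
line3 = {(0, -1), (0, 0), (0, 1)}
core = erode(square, line3)   # only the middle column survives
```

Restricting the candidates to translations of one element of B keeps the search finite without changing the result, since any valid p must place that element inside A.
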