Text/Graphics Separation Revisited

Karl Tombre, Salvatore Tabbone, Loïc Pélissier, Bart Lamiroy, Philippe Dosch
August 2002

Text/Graphics Separation
➥ X–Y trees, white streams, etc.
   – adapted to text-rich documents
➥ RLSA filtering
   – but few attempts for graphics-rich documents
➥ Forms
   – mainly horizontal and vertical lines
   – look explicitly for lines (Hough, etc.)
➥ Directional morphological filtering
➥ Explicit search for lines on DT or vectorization
➥ Analysis of connected components → improvement on [Fletcher & Kasturi]

Why choose F&K?
❏ Because it’s there
❏ Stable on variety of documents
❏ Scalable
❏ Not many thresholds, and easy to master
❏ A reference method, well explained, sound, and known to many other people

Limitations of F&K
❏ Designed for mixed text–graphics documents
   ⇒ minor adaptations to graphics-rich documents (absolute constraint on length and width of component)
❏ Does not separate dashes from elongated symbols (I, l, ...)
   ⇒ separate size filtering from shape filtering, and add a third layer
❏ Text touching graphics
   ⇒ post-processing text recovery step

Modified algorithm
• compute CCs and histogram of BB sizes
• find most populated area $A_{mp}$ and $A_{avg}$, number of CCs of average size
• set $T_1 = n \times \max(A_{mp}, A_{avg})$ and $T_2$ (thresholds on BBs)
• move to text layer all black CCs with BB area $< T_1$, $\frac{\text{height}}{\text{width}} \in \left[\frac{1}{T_2}, T_2\right]$, and both height and width $< \sqrt{T_1}$
• compute best enclosing rectangle (BER) of each “text” component
• set $T_3$ and $T_4$ on BERs
• reclassify “text” CCs with density (wrt BERs) $> T_3$ and elongation $> T_4$ as small elongated shapes
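The size and ratio tests on bounding boxes can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the name `split_by_size`, the histogram `bin_width`, and the reading of $A_{avg}$ as the mean bounding-box area are all hypothetical choices. The BER-based reclassification with $T_3$ and $T_4$ is left out.

```python
from collections import Counter
from math import sqrt


def split_by_size(boxes, n=1.5, t2=20.0, bin_width=10):
    """Split bounding boxes (width, height) into text and graphics layers.

    Assumptions (not from the slides): A_avg is the mean box area, and
    A_mp is the centre of the most populated bin of an area histogram
    with bins of `bin_width` square pixels.
    """
    areas = [w * h for w, h in boxes]
    a_avg = sum(areas) / len(areas)
    histogram = Counter(int(a // bin_width) for a in areas)
    a_mp = (histogram.most_common(1)[0][0] + 0.5) * bin_width

    t1 = n * max(a_mp, a_avg)      # threshold on bounding-box area
    max_side = sqrt(t1)            # bound on both height and width

    text, graphics = [], []
    for w, h in boxes:
        small_enough = w * h < t1 and w < max_side and h < max_side
        ratio_ok = 1.0 / t2 <= h / w <= t2
        (text if small_enough and ratio_ok else graphics).append((w, h))
    return t1, text, graphics


# Five character-sized boxes, one long thin dash, one large graphic part.
t1, text, graphics = split_by_size([(10, 12)] * 5 + [(200, 3), (300, 250)])
```

In this toy input the thin 200 × 3 box fails the ratio test and the 300 × 250 box fails the area test, so only the five small boxes reach the text layer.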

$T_1 = 1.5 \times \max(A_{mp}, A_{avg})$, $T_2 = 20$, $T_3 = 0.5$, $T_4 = 2$

Stability of thresholds
✓ $T_1$ proportional to $\max(A_{mp}, A_{avg})$, with $n$ stable if only one character size ($n = 3$ OK for very homogeneous character set)
✓ $T_2 = 20$ good for all documents we have worked on
✓ $T_3 = 0.5$ if noisy character contours (limitation of BER)
✓ $T_4$ dependent on kinds of dashes present in drawing

Possible improvements
✓ Analysis of size and elongation distributions could be made less empirical
✓ Better elongation and size descriptor than BER (second-order moments)
✓ A fourth layer, that of dots (alignment problems in next step)
✓ Still, man must be in the loop...
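As an illustration of what an elongation descriptor based on second-order moments could look like, the sketch below takes the square root of the ratio of the eigenvalues of the pixel covariance matrix. The function name and the exact definition of elongation are assumptions made for this example; the slides only name the idea.

```python
from math import sqrt


def moment_elongation(pixels):
    """Elongation of a component from its second-order central moments.

    `pixels` is a list of (x, y) coordinates. Returns sqrt(l_max / l_min),
    where l_max and l_min are the eigenvalues of the 2x2 covariance matrix
    (hypothetical definition, chosen for illustration).
    """
    count = len(pixels)
    mx = sum(x for x, _ in pixels) / count
    my = sum(y for _, y in pixels) / count
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / count
    syy = sum((y - my) ** 2 for _, y in pixels) / count
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / count

    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    l_max, l_min = half_trace + root, half_trace - root
    if l_min <= 1e-12:             # degenerate: all pixels on one line
        return float("inf")
    return sqrt(l_max / l_min)


dash = [(x, y) for x in range(10) for y in range(2)]    # 10 x 2 block
square = [(x, y) for x in range(4) for y in range(4)]   # 4 x 4 block
dash_elongation = moment_elongation(dash)
square_elongation = moment_elongation(square)
```

Unlike a bounding rectangle, this measure does not depend on the orientation of the component, which is one reason moments are a candidate replacement for the BER.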

Extracting the Strings
Based on Hough Transform working on bounding boxes of text layer components:
• sampling step of HT set to $chdr \times H_{avg}$
• look for alignments by voting in $(\rho, \theta)$ space
• segment each alignment into words:
   – compute mean height $\bar{h}$
   – group all successive characters separated by less than $\mu \times \bar{h}$
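The voting step can be sketched as below. The sketch assumes that each character votes with the centre of its bounding box, that the sampling step $chdr \times H_{avg}$ applies to $\rho$, and that $\theta$ is sampled in 1° steps; the names `hough_alignments`, `n_theta` and `min_votes` are hypothetical.

```python
from collections import defaultdict
from math import cos, floor, pi, sin


def hough_alignments(centers, h_avg, chdr=0.2, n_theta=180, min_votes=3):
    """Vote in (rho, theta) space with bounding-box centres.

    Returns a list of (votes, (rho_bin, theta_index), member_indices),
    sorted by decreasing number of votes.
    """
    rho_step = chdr * h_avg                     # sampling step of the HT
    accumulator = defaultdict(list)
    for index, (x, y) in enumerate(centers):
        for k in range(n_theta):
            theta = k * pi / n_theta
            rho = x * cos(theta) + y * sin(theta)
            accumulator[(floor(rho / rho_step), k)].append(index)

    cells = [(len(members), cell, members)
             for cell, members in accumulator.items()
             if len(members) >= min_votes]
    cells.sort(key=lambda item: -item[0])
    return cells


# Four characters on a horizontal line plus one isolated character.
centers = [(10, 50), (30, 50), (50, 50), (70, 50), (40, 200)]
alignments = hough_alignments(centers, h_avg=20)
best_votes, best_cell, best_members = alignments[0]
```

A smaller `chdr` gives narrower $\rho$ bins, so slightly misaligned characters fall into different cells; a larger value merges neighbouring lines into one cell.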

2 options:
1. process first the highest votes of the HT, and do not consider characters already grouped in a first alignment when processing lower votes;
2. give the possibility to each character to be present in more than one word hypothesis, and wait until all votes are processed before eliminating multiple occurrences, by keeping the longest words.
⇒ No clear winner
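The difference between the two options can be made concrete with a small sketch. It assumes that each word hypothesis is a pair (votes of its alignment, list of character identifiers), that a character claimed twice is simply dropped from the later hypothesis, and that a word needs at least `min_len` characters; these details and both function names are assumptions made for this example.

```python
def highest_votes_first(hypotheses, min_len=2):
    """Option 1: process by decreasing votes, skip characters already grouped."""
    used, words = set(), []
    for _, chars in sorted(hypotheses, key=lambda hyp: -hyp[0]):
        remaining = [c for c in chars if c not in used]
        if len(remaining) >= min_len:
            words.append(remaining)
            used.update(remaining)
    return words


def keep_longest_words(hypotheses, min_len=2):
    """Option 2: collect all hypotheses, then resolve duplicates by word length."""
    used, words = set(), []
    for _, chars in sorted(hypotheses, key=lambda hyp: -len(hyp[1])):
        remaining = [c for c in chars if c not in used]
        if len(remaining) >= min_len:
            words.append(remaining)
            used.update(remaining)
    return words


# Character "c" lies on two alignments: a 5-vote line and a 4-vote line.
hypotheses = [(5, ["a", "b", "c"]), (4, ["c", "d", "e", "f"])]
by_votes = highest_votes_first(hypotheses)
by_length = keep_longest_words(hypotheses)
```

With this input, option 1 gives "c" to the word on the stronger alignment, while option 2 gives it to the longer word, which shows how the two strategies can disagree on the same data.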

Choice of parameters
✓ $chdr$: adjusts sampling step of HT. Difficult to stabilize – false clusters, or over-segmentation
[Figure: results for $chdr = 0.2$ and $chdr = 0.4$]

✓ $\mu$: adjusts maximum distance allowed between characters in the same string. Default value 2.5 seems to be quite stable
[Figure: results for $\mu = 1.5$, $\mu = 2.5$ and $\mu = 5.0$]
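The role of $\mu$ in word segmentation can be sketched as follows. The sketch assumes that each character of an alignment is described by its extent along the alignment and its height, as a tuple (start, end, height), and that the distance between characters is the gap between consecutive extents; the slides do not specify how the distance is measured.

```python
def segment_words(chars, mu=2.5):
    """Group successive characters separated by less than mu * mean height.

    `chars` is a list of (start, end, height) tuples measured along one
    alignment (hypothetical representation).
    """
    if not chars:
        return []
    chars = sorted(chars)
    mean_height = sum(height for _, _, height in chars) / len(chars)
    max_gap = mu * mean_height

    words = [[chars[0]]]
    for previous, current in zip(chars, chars[1:]):
        gap = current[0] - previous[1]
        if gap < max_gap:
            words[-1].append(current)      # same word
        else:
            words.append([current])        # gap too large: start a new word
    return words


# Three characters, a gap of 32 units, then two more characters (height 10).
line = [(0, 8, 10), (10, 18, 10), (20, 28, 10), (60, 68, 10), (70, 78, 10)]
default_words = segment_words(line)            # mu = 2.5, max gap 25
merged_words = segment_words(line, mu=5.0)     # max gap 50
```

With the default value the 32-unit gap splits the line into two words, whereas $\mu = 5.0$ merges everything into a single string.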

Possible improvements
✓ Short strings not reliably detected
   → hierarchical strategy to refine thresholds when lowering string length
✓ Artificial diagonal alignments
   → heuristics on privileged directions
✓ Refinement of string orientation for short strings
   → post-processing by Radon transform for short strings (3–4 chars)
✓ Punctuation signs, points on “i” characters and other accents
   → extract them to a 4th layer and add them after string segmentation

Recovering Touching Characters
➥ General problem with CC based methods
➥ In our case, no a priori knowledge on orientation (such as in forms) or on stroke width
➥ General idea: extend strings found by previous step (thus, method does not work if everything touches!)

Outline of method
• compute equation of best line passing through all string characters
• compute enclosing rectangle of string along direction, and define search areas (circle if only 1 char in string)
• look for characters in these areas, first in 3rd and 4th layer, then by segmenting skeleton
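The first two steps can be sketched as below. The sketch fits the line by orthogonal regression on character centres and places one search area before and one after the string; the fitting method, the `reach` factor, the `char_size` parameter and the returned dictionary layout are all assumptions made for this example, since the slides do not give them.

```python
from math import atan2, cos, sin


def string_axis(centers):
    """Centroid and orientation of the best line (orthogonal regression)."""
    count = len(centers)
    mx = sum(x for x, _ in centers) / count
    my = sum(y for _, y in centers) / count
    sxx = sum((x - mx) ** 2 for x, _ in centers)
    syy = sum((y - my) ** 2 for _, y in centers)
    sxy = sum((x - mx) * (y - my) for x, y in centers)
    return (mx, my), 0.5 * atan2(2 * sxy, sxx - syy)


def search_areas(centers, char_size, reach=2.0):
    """Search areas at both ends of a string, in the string's own frame."""
    if len(centers) == 1:                  # no direction: search in a circle
        return {"kind": "circle", "center": centers[0],
                "radius": reach * char_size}
    (mx, my), theta = string_axis(centers)
    ux, uy = cos(theta), sin(theta)
    along = [(x - mx) * ux + (y - my) * uy for x, y in centers]
    across = [(y - my) * ux - (x - mx) * uy for x, y in centers]
    low = min(along) - char_size / 2       # enclosing rectangle along the line
    high = max(along) + char_size / 2
    extent = reach * char_size
    return {"kind": "rectangles", "origin": (mx, my), "theta": theta,
            "before": (low - extent, low), "after": (high, high + extent),
            "half_width": max(abs(a) for a in across) + char_size / 2}


areas = search_areas([(0, 5), (10, 5), (30, 5)], char_size=8)
single = search_areas([(3, 4)], char_size=8)
```

The intervals `before` and `after` are expressed as coordinates along the fitted line, measured from `origin`; with very few characters the estimated `theta` becomes unreliable, which is the weakness the method's limitations point out for short strings.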

Segmentation of the Skeleton
• Compute 3–4 distance skeleton in search area
• Segment skeleton into subsets connected to skeleton outside search area by one and only one multiple point
• Retrieve candidate character fragments
• Reconstruct using inverse distance transform
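The 3–4 distance underlying the skeleton, and the inverse transform used for reconstruction, can be sketched with the classical two-pass chamfer scheme. Skeleton extraction and segmentation at multiple points are not shown, and the function names and the image representation (lists of 0/1 rows) are assumptions made for this example.

```python
INF = 10 ** 9


def chamfer34(image):
    """Two-pass 3-4 chamfer distance of each object pixel to the background."""
    rows, cols = len(image), len(image[0])
    dist = [[0 if image[y][x] == 0 else INF for x in range(cols)]
            for y in range(rows)]
    # Forward pass uses the upper and left neighbours, backward pass the others.
    passes = [(range(rows), range(cols), -1),
              (range(rows - 1, -1, -1), range(cols - 1, -1, -1), 1)]
    for ys, xs, step in passes:
        for y in ys:
            for x in xs:
                if dist[y][x] == 0:
                    continue
                best = dist[y][x]
                if 0 <= x + step < cols:
                    best = min(best, dist[y][x + step] + 3)
                if 0 <= y + step < rows:
                    best = min(best, dist[y + step][x] + 3)
                    for dx in (-1, 1):
                        if 0 <= x + dx < cols:
                            best = min(best, dist[y + step][x + dx] + 4)
                dist[y][x] = best
    return dist


def reconstruct(seeds, rows, cols):
    """Inverse transform: union of 3-4 discs around (y, x) -> distance seeds."""
    out = [[0] * cols for _ in range(rows)]
    for (sy, sx), value in seeds.items():
        for y in range(rows):
            for x in range(cols):
                dy, dx = abs(y - sy), abs(x - sx)
                if 4 * min(dx, dy) + 3 * abs(dx - dy) < value:
                    out[y][x] = 1
    return out


# A 5 x 5 square of object pixels inside a 7 x 7 image.
image = [[1 if 1 <= y <= 5 and 1 <= x <= 5 else 0 for x in range(7)]
         for y in range(7)]
distances = chamfer34(image)
rebuilt = reconstruct({(3, 3): distances[3][3]}, 7, 7)
```

In this toy case the single centre pixel, together with its distance value, is enough to rebuild the whole square, which is the property that lets character fragments be recovered from skeleton subsets.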

Limitations
✓ method does not retrieve a string completely connected to the graphics (no seed string)
✓ if string orientation not correct (regression for short strings not robust), some characters may be missed
✓ heuristic leads to non-extraction of characters intersecting search area at 2 or more points

Evaluation
Image   Nb. ch.   T/G        Retr.   Total       Errors
IMG1    63        50 (79%)   8/13    58 (92%)    7
IMG2    92        66 (72%)   5/16    71 (77%)    24
IMG3    93        78 (84%)   3/15    81 (87%)    5
IMG4    121       95 (78%)   9/26    104 (86%)   71
IMG5    31        7 (22%)    0/0     7 (22%)     1

Conclusion
✓ Robust, stable and well-mastered method
✓ Recovery of touching characters for a given class of problems
✓ Still room for improvements
✓ No panacea → we still need to put man in the loop
