Document layout analysis in SCRIBO



  1. Document layout analysis in SCRIBO
     CSI Seminar - July 2011
     Julien MARQUEGNIES
     EPITA Research and Development Laboratory
     July 2011

  2. Outline
     Introduction & Goals
     Proposed approach
     Preprocessing
     Textlines extraction
         Statistical model
         Extraction
     Paragraphs extraction
         Linking lines
         Cutting lines
     Results
     Conclusion
     Bibliography

  3. Introduction & Goals
     Extracting the different structures of a digitized document relies on a processing chain composed of several crucial steps.
     Figure: Simplified processing chain in SCRIBO.

  4. Contribution
     ◮ Textlines construction.
     ◮ Paragraphs extraction.
     ◮ Paragraphs polygon boundary.

  5. Document layout analysis (1 / 2)
     Document layout analysis studies the physical and logical layout of a document image.
     ◮ Physical: segmentation into blocks of maximum size and classification into a set of definite types like lines, paragraphs, pictures...
     ◮ Logical: retrieval of information about text regions (text reading order, titles, subtitles...).

  6. Document layout analysis (2 / 2)
     Document layout analysis algorithms fall into two categories, depending on how they process the document.
     ◮ Top-down: XY-Cut [7, 8, 9] and the whitespace analysis methods [10, 11, 12].
     ◮ Bottom-up: smearing algorithms [13, 14, 15], Voronoi diagram-based algorithms [16, 17, 18] and the Docstrum algorithm [19].

  7. Font features
     Figure: Lines features [2].

  8. Proposed approach
     ◮ A bottom-up approach strengthened by information extracted from a top-down view of the document.
     ◮ Flexible enough to adapt to arbitrarily shaped regions.
     ◮ Clustering based on extending the bounding boxes of connected components to form higher-level entities.
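
The bounding-box clustering idea above can be sketched as follows. This is a minimal illustration, not the SCRIBO implementation: the `Box` type, the horizontal-only extension, and the `dx` parameter are all assumptions made for the example. Components whose extended boxes overlap are placed in the same group with a union-find structure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box with inclusive pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def extend(box: Box, dx: int) -> Box:
    """Widen a box horizontally by dx pixels on each side (dx is an assumed parameter)."""
    return Box(box.x0 - dx, box.y0, box.x1 + dx, box.y1)


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share at least one pixel."""
    return a.x0 <= b.x1 and b.x0 <= a.x1 and a.y0 <= b.y1 and b.y0 <= a.y1


def group_components(boxes: list[Box], dx: int) -> list[list[int]]:
    """Cluster box indices whose extended boxes overlap, using union-find."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    extended = [extend(b, dx) for b in boxes]
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(extended[i], extended[j]):
                parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three characters on one line and one component further down the page.
components = [Box(0, 0, 9, 10), Box(12, 0, 20, 10), Box(23, 1, 30, 10), Box(0, 30, 8, 40)]
line_groups = group_components(components, dx=2)
```

The pairwise loop is quadratic in the number of components; a real system would likely use a spatial index, but the grouping logic stays the same.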

  9. Processing chain
     Figure: Our document layout analysis processing chain.

  10. Preprocessing (1 / 3)
     ◮ No a priori knowledge about the type of connected components.
     ◮ Clustering initialization.
     Figure: Groups after the linking step.

  11. Preprocessing (2 / 3)
     ◮ Delimiters extraction: physical delimiters and tab-stops.
     ◮ Tab-stops: vertical alignments of textline edges, inferred from margins and column edges, which are all placed at a fixed x-position.
     Figure: Document with thin column spaces.
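
One way to picture tab-stop detection is to look for x-positions where many textline edges line up. The sketch below is only an illustration under assumptions: the slides do not give the actual detection or filtering rules, and the function name, `tolerance`, and `min_lines` are hypothetical.

```python
from __future__ import annotations

from statistics import median_low


def tab_stop_candidates(edges: list[int], tolerance: int = 3, min_lines: int = 3) -> list[int]:
    """Return x-positions where at least min_lines textline edges align.

    Edges are sorted and grouped into runs whose spread from the first
    element stays within `tolerance` pixels (both thresholds are assumed).
    """
    candidates: list[int] = []
    run: list[int] = []
    for x in sorted(edges):
        if run and x - run[0] > tolerance:
            if len(run) >= min_lines:
                candidates.append(median_low(run))
            run = []
        run.append(x)
    if len(run) >= min_lines:
        candidates.append(median_low(run))
    return candidates


# Left edges of textlines: two columns starting near x=50 and x=300, plus stray edges.
left_edges = [50, 51, 50, 49, 120, 300, 301, 300, 302, 75]
stops = tab_stop_candidates(left_edges)
```

Isolated edges such as 75 and 120 never gather enough support and are dropped, which loosely mirrors the idea of keeping only alignments shared by several lines.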

  12. Preprocessing (3 / 3)
     Figure: Green lines are tab-stops. Orange lines are the tab-stops removed after filtering.

  13. Statistical model (1 / 3)
     ◮ Bottom-up approaches are sensitive to the measures used to form higher-level entities → need for reliable statistics.
     ◮ Our model heavily relies on baseline and mean line information to link groups of words.
     ◮ How to compute the mean line and baseline?
     Figure: Mean line and baseline estimation using the mean value.

  14. Statistical model (2 / 3)
     Figure: Mean line and baseline estimation using the median value.

  15. Statistical model (3 / 3)
     Proposed approach:
     ◮ Clustering over the 1D values.
     ◮ Maximize ascent and descent on textlines.
     ◮ Better estimate than the mean and median over the 250 test documents.
     Figure: Mean line and baseline estimation using clustering.
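
The three estimators discussed on the statistical model slides can be compared on a toy example. This is a sketch under stated assumptions: the slides do not say which 1D clustering is used or how the final cluster is chosen, so the gap-based grouping, the `max_gap` threshold, and the choice of the most populated cluster are all illustrative. It also ignores the "maximize ascent and descent" criterion.

```python
from __future__ import annotations

from statistics import mean, median


def cluster_1d(values: list[float], max_gap: float = 2.0) -> list[list[float]]:
    """Group sorted 1D values; a new cluster starts when the gap exceeds max_gap."""
    clusters: list[list[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def dominant_level(values: list[float], max_gap: float = 2.0) -> float:
    """Mean of the most populated cluster (values must be non-empty)."""
    largest = max(cluster_1d(values, max_gap), key=len)
    return mean(largest)


# Bottom y-coordinates of character boxes in one textline:
# most sit on the baseline near 100, descenders (g, p, y) reach 106-107.
bottoms = [100, 101, 100, 99, 100, 107, 106, 107, 100, 101]

baseline_mean = mean(bottoms)               # pulled down the page by descenders
baseline_median = median(bottoms)           # less affected, still shifted
baseline_cluster = dominant_level(bottoms)  # uses only the baseline cluster
```

On this data the descenders form their own cluster and no longer contribute to the estimate, whereas the mean is shifted by about two pixels.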

  16. Processing chain
     Figure: Textlines extraction processing chain.

  17. Tagging
     ◮ Determine which groups are likely to be textlines.
     ◮ Simple conditions for groups with 3 characters or more.
     ◮ Special case for groups composed of only 2 characters.
     ◮ Remainder considered as non-text.

  18. Merging (1 / 3)
     ◮ Merging uses the extended bounding box of each line.
     ◮ 7 anchors checked.
     ◮ Looking for intersections with other lines' extended bounding boxes.
     ◮ Use of baseline alignments and x-height similarities for lines merging.
     ◮ Specific conditions for non-text and text merging (especially for punctuation).
     Figure: Extended bounding box anchors.
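
A pairwise merge test combining the three cues named above (extended bounding box intersection, baseline alignment, x-height similarity) might look like the following. It is a hedged sketch: the `Line` type and every threshold (`margin`, `baseline_tol`, `min_xheight_ratio`) are assumptions, and the 7 anchors and the punctuation rules are not modeled because the slides do not describe them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """Textline candidate: bounding box plus font statistics (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    baseline: float
    x_height: float


def extended_boxes_intersect(a: Line, b: Line, margin: int) -> bool:
    """Extend both boxes horizontally by `margin` and test for overlap."""
    return (a.x0 - margin <= b.x1 + margin and b.x0 - margin <= a.x1 + margin
            and a.y0 <= b.y1 and b.y0 <= a.y1)


def can_merge(a: Line, b: Line, margin: int = 10,
              baseline_tol: float = 0.25, min_xheight_ratio: float = 0.8) -> bool:
    """Accept a merge only if all three cues agree (thresholds are assumed)."""
    if not extended_boxes_intersect(a, b, margin):
        return False
    smaller, larger = sorted((a.x_height, b.x_height))
    if larger <= 0 or smaller / larger < min_xheight_ratio:
        return False  # font sizes differ too much
    # Baselines must agree within a fraction of the larger x-height.
    return abs(a.baseline - b.baseline) <= baseline_tol * larger


left = Line(0, 80, 100, 106, baseline=100.0, x_height=12.0)
right = Line(112, 81, 200, 105, baseline=100.5, x_height=11.5)
headline = Line(112, 60, 300, 110, baseline=104.0, x_height=30.0)
far_away = Line(400, 80, 500, 106, baseline=100.0, x_height=12.0)
```

Here `left` and `right` would be merged, while `headline` is rejected on x-height and `far_away` on distance.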

  19. Merging (2 / 3)
     Figure: Textlines merging example.

  20. Merging (3 / 3)
     Result:
     Figure: Groups after the linking step.
     Figure: Textlines.
