AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS


  1. AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS
     - Kay-Michael Würzner, Konstantin Baierer
     - Bibliotheca Baltica 2018, Rostock, 2018-10-05

  2. OVERVIEW
     1. Why OCR-D
     2. The OCR-D initiative
     3. Architecture
     4. State of OCR-D tools
     5. Scalability
     6. Open source

  3. USERS WANT TEXT DATA

  4. MASSIVE AMOUNTS

  5. STRUCTURED

  6. EASILY ACCESSIBLE

  7. LIBRARIES PROVIDE TEXT DATA

  8. LARGE AMOUNTS

  9. UNSTRUCTURED

  10. HARD TO ACCESS

  11. WHY THE DISCREPANCY?

  12. UNDERSPECIFIED REQUIREMENTS ON OCR BY FUNDERS AND USERS

  13. OCR OF HISTORICAL DATA
      - OF LITTLE ECONOMIC INTEREST
      - LITTLE COMPETITION

  14. ACADEMIC SOLUTIONS
      - NON-SUSTAINABLE
      - INFLEXIBLE WORKFLOWS

  15. THE OCR-D INITIATIVE
      - DFG-organized expert workshop (2014): "Verfahren zur Verbesserung von OCR-Ergebnissen" (Methods for Improving OCR Results)
      - Result: A concerted effort for improving OCR is seen as required.

  16. OCR-D COORDINATION PROJECT

  17. PHASE 1: EXPLORING THE DOMAIN (2015-2017)
      - Surveyed the (open-source) ecosystem around OCR and OLR
      - Identified tasks
      - Prepared call for proposals for DFG

  18. PHASE 2: MODULE PROJECT STAGE (2018-2019)

  19. PHASE 3: GOING PRODUCTIVE (2018-2020)
      - Integrate with existing digitization workflow software, e.g. Kitodo
      - Make OCR-D-developed software uniformly deployable
      - Advise DFG on OCR requirements for "Praxisrichtlinien" (practical guidelines)

  20. NOT IN THIS TALK (BUT IN OCR-D)
      - GROUND TRUTH FOR OCR
      - OCR ENGINE TRAINING
      - OCR RESEARCH DATA REPOSITORY
      - WORKFLOW COMPOSITION AND PROVENANCE

  21. ARCHITECTURE

  22. "MULTI SOURCE"
      - existing tools by OCR-D partners (tesseract, PoCoTo, LAREX...)
      - new developments within OCR-D (font identification, post-correction...)
      - existing tools outside OCR-D (ocropus, kraken, ScanTailor, OLENA...)

  23. MODULAR vs. MONOLITHIC

  24. SPECIFICATION + IMPLEMENTATION

  25. OCR-D/SPEC
      - METS + PAGE-XML (+ ALTO)
      - STRUCTURED TOOL DESCRIPTIONS
      - COMMAND LINE INTERFACE
      - HTTP INTERFACE
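
To make the role of METS as a container format more concrete, here is a minimal sketch, not the OCR-D implementation, of how the files referenced in a METS document could be listed per file group using only the Python standard library. The sample document, the group names (`IMAGES`, `TEXT`) and the file paths are invented for illustration; only the METS and XLink namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical minimal METS document: group names and paths are made up.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGES">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="TEXT">
      <mets:file ID="TXT_0001" MIMETYPE="application/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="text/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def files_by_group(mets_xml):
    """Map each fileGrp's USE attribute to the list of file locations in it."""
    root = ET.fromstring(mets_xml)
    ns = {"mets": METS_NS}
    groups = {}
    for grp in root.iterfind(".//mets:fileGrp", ns):
        hrefs = [
            flocat.get("{%s}href" % XLINK_NS)
            for flocat in grp.iterfind("mets:file/mets:FLocat", ns)
        ]
        groups[grp.get("USE")] = hrefs
    return groups


if __name__ == "__main__":
    for use, hrefs in files_by_group(SAMPLE_METS).items():
        print(use, hrefs)
```

Each processing step could then read from one file group and write its results to another, which is one plausible way a shared container keeps independent tools interoperable.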

  26. ACTIONABLE DOCUMENTATION

  27. OCR-D/CORE
      - VALIDATION AND HELPER FUNCTIONS
      - PYTHON LIBRARY
      - SHELL LIBRARY
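
As an illustration of what validating a structured tool description might involve, the following sketch checks a JSON description against a small set of required fields. The field names (`executable`, `description`, `categories`) and the example tool are hypothetical and are not taken from the OCR-D specification; the function is not part of OCR-D/core.

```python
import json

# Hypothetical required fields and their types; NOT the OCR-D schema.
REQUIRED_FIELDS = {"executable": str, "description": str, "categories": list}


def validate_tool_description(desc):
    """Return a list of problems found; an empty list means the description passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in desc:
            problems.append("missing field: " + field)
        elif not isinstance(desc[field], expected_type):
            problems.append(
                "field %s should be of type %s" % (field, expected_type.__name__)
            )
    return problems


# Hypothetical description of an imaginary binarization wrapper.
GOOD = json.loads(
    '{"executable": "my-binarize", "description": "Binarize page images",'
    ' "categories": ["preprocessing"]}'
)
# Same tool, but with a wrongly typed field and a missing one.
BAD = json.loads('{"executable": "my-binarize", "categories": "preprocessing"}')

if __name__ == "__main__":
    print(validate_tool_description(GOOD))
    print(validate_tool_description(BAD))
```

Returning a list of problems rather than raising on the first one lets a tool author see every issue in a single run.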

  28. WHY PYTHON?
      - Python is widely used in computer vision and machine learning (keras, pytorch...)
      - Wrapping existing tools with minimal friction (ocropus, kraken ...)
      - Bindings for low-level APIs (opencv, tesserocr ...)

  29. WHY SHELL?
      - Lowest common denominator
      - Wrap arbitrary command line tools
      - Process callout possible in every framework/workflow engine/programming environment
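
The "process callout" point can be sketched as follows: any environment that can start a process can drive a command line tool. This is a minimal sketch using Python's standard `subprocess` module; the helper name `run_tool` is an assumption for illustration, and the "tool" is a stand-in Python one-liner rather than a real OCR program.

```python
import subprocess
import sys


def run_tool(argv, input_text=""):
    """Run a command line tool, feed it text on stdin, and return its stdout."""
    result = subprocess.run(
        argv,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Stand-in for a real tool: a Python one-liner that upper-cases its input.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

if __name__ == "__main__":
    print(run_tool(UPPERCASE_TOOL, "ocr"))
```

Passing the command as a list avoids shell quoting problems, and `check=True` turns a failing tool into an exception instead of a silently empty result.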

  30. STATE OF THE OCR-D TOOLSET

  31. PREPROCESSING

      | Tool        | Developer                | Functionality                                  | Wrapper  |
      |-------------|--------------------------|------------------------------------------------|----------|
      | anyOCR      | DFKI Kaiserslautern      | binarization, cropping, deskewing, dewarping   | (python) |
      | OLENA       | OCR-D                    | binarization                                   | shell    |
      | tesseract   | UB Mannheim, ASV Leipzig | binarization                                   | python   |
      | OCRopus     | OCR-D                    | binarization                                   | python   |
      | kraken      | OCR-D                    | binarization                                   | python   |
      | ImageMagick | OCR-D                    | binarization, conversion                       | shell    |

  32. LAYOUT RECOGNITION

      | Tool       | Developer                | Functionality                                                      | Wrapper  |
      |------------|--------------------------|--------------------------------------------------------------------|----------|
      | anyOCR     | DFKI Kaiserslautern      | block+line segmentation, block classification, document analysis   | (python) |
      | LAREX      | Uni Würzburg             | block+line segmentation, block classification                      | (shell)  |
      | OCRopus    | OCR-D                    | line segmentation                                                  | python   |
      | kraken     | OCR-D                    | line segmentation                                                  | python   |
      | tesseract  | UB Mannheim, ASV Leipzig | block+line segmentation                                            | python   |
      | dh_segment | OCR-D                    | block+line segmentation                                            | (shell)  |

  33. TEXT RECOGNITION

      | Tool      | Developer                | Functionality    | Wrapper  |
      |-----------|--------------------------|------------------|----------|
      | OCRopus   | OCR-D                    | text recognition | python   |
      | kraken    | OCR-D                    | text recognition | python   |
      | tesseract | UB Mannheim, ASV Leipzig | text recognition | python   |
      | calamari  | OCR-D                    | text recognition | (python) |
      | ocrad     | OCR-D                    | text recognition | (shell)  |

  34. POSTPROCESSING

      | Tool          | Developer   | Functionality   | Wrapper  |
      |---------------|-------------|-----------------|----------|
      | corASV        | ASV Leipzig | post correction | (python) |
      | PoCoTo        | CIS München | post correction | python   |
      | keraslm       | ASV Leipzig | post correction | python   |
      | ocrevalUAtion | OCR-D       | evaluation      | (shell)  |

  35. YOUR TOOL?

  36. SCALABILITY

  37. <IMPRESSIVE NUMBER HERE>

  38. GEARED TOWARDS REAL DIGITIZATION SCENARIOS
      - Cooperation with Kitodo and commercial providers
      - Frequent reality check with current practices ("Pilotbibliotheken")

  39. MODULARITY + UNIFORM INTERFACES ⇒ ADAPTIVE WORKFLOWS
      (Instantiation and composition up to users)
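
The idea that uniform interfaces make workflows composable can be sketched in a few lines. In this illustration every step is a function that takes a page record (a plain dictionary) and returns an updated one; the step names and the record layout are assumptions made for the example and do not reflect the actual OCR-D interfaces.

```python
from typing import Callable, Dict, List

# Assumed uniform interface: every step maps a page record to a new page record.
Step = Callable[[Dict], Dict]


def binarize(page: Dict) -> Dict:
    return {**page, "binarized": True}


def segment_lines(page: Dict) -> Dict:
    return {**page, "lines": ["line-1", "line-2"]}


def segment_single_line(page: Dict) -> Dict:
    # Alternative segmenter with the same interface, so it can be swapped in.
    return {**page, "lines": ["line-1"]}


def recognize(page: Dict) -> Dict:
    return {**page, "text": ["text of " + line for line in page["lines"]]}


def compose(steps: List[Step]) -> Step:
    """Chain steps into one workflow; order and choice of steps are up to the user."""
    def workflow(page: Dict) -> Dict:
        for step in steps:
            page = step(page)
        return page
    return workflow


if __name__ == "__main__":
    workflow_a = compose([binarize, segment_lines, recognize])
    workflow_b = compose([binarize, segment_single_line, recognize])
    print(workflow_a({"id": "page-1"}))
    print(workflow_b({"id": "page-1"}))
```

Because both segmenters share one signature, exchanging one for the other changes a single entry in the list and nothing else in the workflow.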

  40. OPEN SOURCE IS MORE THAN "OPEN SOURCE"

  41. STEP 1: GET FUNDED!

  42. STEP 2: DEVELOP!

  43. STEP 3: PUBLISH CODE!

  44. SUSTAINABILITY AND REUSE!

  45. BEST PRACTICES
      - Transparency from day one
      - Unit tests
      - Unified test assets
      - Continuous Integration
      - Semantic versioning
      - Docker base image
      - Releases to GitHub, PyPI, DockerHub

  46. COMMUNITY
      - Issues
      - Pull requests
      - Code review
      - Support chat

  47. OCR-D/DOCS
      - DEVELOPER DOCUMENTATION
      - "COOKBOOK"
      - USER GUIDE

  48. THANK YOU
      - ocr-d.de
      - ocr-d.github.io
      - ocr-d.github.io/docs
      - github.com/OCR-D
      - gitter.im/OCR-D/Lobby
