AN OPEN-SOURCE FRAMEWORK FOR INTEGRATING MULTI-SOURCE LAYOUT AND TEXT RECOGNITION TOOLS INTO SCALABLE OCR WORKFLOWS
KAY-MICHAEL WÜRZNER, KONSTANTIN BAIERER
Bibliotheca Baltica 2018, Rostock, 2018-10-05
1 . 1
OVERVIEW
1. Why OCR-D
2. The OCR-D initiative
3. Architecture
4. State of OCR-D tools
5. Scalability
6. Open source
2 . 1
USERS WANT TEXT DATA 3 . 1
MASSIVE AMOUNTS 3 . 2
STRUCTURED 3 . 3
EASILY ACCESSIBLE 3 . 4
LIBRARIES PROVIDE TEXT DATA 4 . 1
LARGE AMOUNTS 4 . 2
UNSTRUCTURED 4 . 3
HARD TO ACCESS 4 . 4
WHY THE DISCREPANCY? 5 . 1
UNDERSPECIFIED REQUIREMENTS ON OCR BY FUNDERS AND USERS 5 . 2
OCR OF HISTORICAL DATA IS OF LITTLE ECONOMIC INTEREST ⇒ LITTLE COMPETITION 5 . 3
ACADEMIC SOLUTIONS: NON-SUSTAINABLE, INFLEXIBLE WORKFLOWS 5 . 4
THE OCR-D INITIATIVE
- DFG-organized expert workshop (2014): "Verfahren zur Verbesserung von OCR-Ergebnissen" (Methods for improving OCR results)
- Result: A concerted effort to improve OCR is seen as required.
6 . 1
OCR-D COORDINATION PROJECT 6 . 2
PHASE 1: EXPLORING THE DOMAIN (2015-2017)
- Surveyed (open-source) ecosystem around OCR and OLR
- Identified tasks
- Prepared call for proposals for DFG
6 . 3
PHASE 2: MODULE PROJECT STAGE (2018-2019) 6 . 4
PHASE 3: GOING PRODUCTIVE (2018-2020)
- Integrate with existing digitization workflow software, e.g. Kitodo
- Make OCR-D-developed software uniformly deployable
- Advise DFG on OCR requirements for "Praxisrichtlinien" (practice guidelines)
6 . 5
NOT IN THIS TALK (BUT IN OCR-D)
- GROUND TRUTH FOR OCR
- OCR ENGINE TRAINING
- OCR RESEARCH DATA REPOSITORY
- WORKFLOW COMPOSITION AND PROVENANCE
6 . 6
ARCHITECTURE 7 . 1
"MULTI SOURCE" existing tools by OCR-D partners (tesseract, PoCoTo, LAREX...) new developments within OCR-D (font identi�cation, post-correction...) existing tools outside OCR-D (ocropus, kraken, ScanTailor, OLENA...) 7 . 2
MODULAR, NOT MONOLITHIC 7 . 3
SPECIFICATION + IMPLEMENTATION 7 . 4
OCR-D/SPEC
- METS + PAGE-XML (+ ALTO)
- STRUCTURED TOOL DESCRIPTIONS
- COMMAND LINE INTERFACE
- HTTP INTERFACE
7 . 5
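To illustrate why METS is a convenient container format for workflow data, the following minimal sketch lists the files in each METS file group using only the Python standard library. The METS and XLink namespaces and the fileSec/fileGrp/file/FLocat structure come from the METS standard; the sample document, its file group names (EXAMPLE-IMG, EXAMPLE-OCR) and its paths are invented for this example and are not taken from the OCR-D specification.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the METS and XLink standards.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, minimal METS document; group names and paths are made up.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                     xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="EXAMPLE-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="img/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="EXAMPLE-OCR">
      <mets:file ID="OCR_0001" MIMETYPE="application/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="ocr/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def files_by_group(mets_xml):
    """Map each fileGrp's USE attribute to the list of file locations in it."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NS["xlink"]
    groups = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        locations = [
            loc.get(href) for loc in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        groups[grp.get("USE")] = locations
    return groups


if __name__ == "__main__":
    for use, paths in files_by_group(SAMPLE_METS).items():
        print(use, paths)
```

In such a layout, one tool can read images from one file group and write its results to another, so intermediate results of every step stay addressable in a single document.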
ACTIONABLE DOCUMENTATION 7 . 6
OCR-D/CORE
- VALIDATION AND HELPER FUNCTIONS
- PYTHON LIBRARY
- SHELL LIBRARY
7 . 7
WHY PYTHON?
- Python widely used in computer vision and machine learning (keras, pytorch...)
- Wrapping existing tools with minimal friction (ocropus, kraken...)
- Bindings for low-level APIs (opencv, tesserocr...)
7 . 8
WHY SHELL?
- Lowest common denominator
- Wrap arbitrary command line tools
- Process callout possible in every framework/workflow engine/programming environment
7 . 9
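As a rough idea of what validating a structured tool description could look like, here is a minimal sketch. The field names (executable, description, steps, parameters) and the function validate_tool_description are assumptions made for this illustration; they are not the OCR-D schema and not the API of OCR-D/core.

```python
# Hypothetical required fields and their expected types; NOT the OCR-D schema.
REQUIRED_FIELDS = {
    "executable": str,
    "description": str,
    "steps": list,
    "parameters": dict,
}


def validate_tool_description(tool):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in tool:
            problems.append("missing field: %s" % field)
        elif not isinstance(tool[field], expected_type):
            problems.append(
                "field %s must be of type %s" % (field, expected_type.__name__)
            )
    return problems


if __name__ == "__main__":
    # Made-up description of an imaginary binarization tool.
    example = {
        "executable": "example-binarize",
        "description": "Binarize page images",
        "steps": ["preprocessing"],
        "parameters": {},
    }
    print(validate_tool_description(example))
```

Returning a list of problems instead of raising on the first one lets a tool author fix all issues in a description in one pass.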
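The "process callout" argument can be sketched in a few lines: any environment that can start a process can drive a command line tool. The example below uses Python's standard subprocess module; call_tool is a hypothetical helper name, and the "tool" is a stand-in Python one-liner rather than a real OCR program.

```python
import subprocess
import sys


def call_tool(command, input_text=""):
    """Run a command line tool, feed it text on stdin, and return its stdout.

    Raises subprocess.CalledProcessError if the tool exits with a non-zero
    status, so failures are not silently ignored.
    """
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a real tool: a Python one-liner that upper-cases stdin.
    fake_tool = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ]
    print(call_tool(fake_tool, "ocr-d"))
```

The wrapper knows nothing about what the tool does internally, which is the point of treating the command line as the lowest common denominator.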
STATE OF THE OCR-D TOOLSET 8 . 1
PREPROCESSING
Tool           Developer                 Functionality                                  Wrapper
anyOCR         DFKI Kaiserslautern       binarization, cropping, deskewing, dewarping   (python)
OLENA          OCR-D                     binarization                                   shell
tesseract      UB Mannheim, ASV Leipzig  binarization                                   python
OCRopus        OCR-D                     binarization                                   python
kraken         OCR-D                     binarization                                   python
ImageMagick    OCR-D                     binarization, conversion                       shell
8 . 2
LAYOUT RECOGNITION
Tool           Developer                 Functionality                                                      Wrapper
anyOCR         DFKI Kaiserslautern       block+line segmentation, block classification, document analysis   (python)
LAREX          Uni Würzburg              block+line segmentation, block classification                      (shell)
OCRopus        OCR-D                     line segmentation                                                  python
kraken         OCR-D                     line segmentation                                                  python
tesseract      UB Mannheim, ASV Leipzig  block+line segmentation                                            python
dh_segment     OCR-D                     block+line segmentation                                            (shell)
8 . 3
TEXT RECOGNITION
Tool           Developer                 Functionality      Wrapper
OCRopus        OCR-D                     text recognition   python
kraken         OCR-D                     text recognition   python
tesseract      UB Mannheim, ASV Leipzig  text recognition   python
calamari       OCR-D                     text recognition   (python)
ocrad          OCR-D                     text recognition   (shell)
8 . 4
POSTPROCESSING
Tool           Developer                 Functionality      Wrapper
corASV         ASV Leipzig               post correction    (python)
PoCoTo         CIS München               post correction    python
keraslm        ASV Leipzig               post correction    python
ocrevalUAtion  OCR-D                     evaluation         (shell)
8 . 5
YOUR TOOL? 8 . 6
SCALABILITY 9 . 1
<IMPRESSIVE NUMBER HERE> 9 . 2
GEARED TOWARDS REAL DIGITIZATION SCENARIOS
- Cooperation with Kitodo and commercial providers
- Frequent reality check with current practices ("Pilotbibliotheken", i.e. pilot libraries)
9 . 3
MODULARITY + UNIFORM INTERFACES ⇒ ADAPTIVE WORKFLOWS (Instantiation and composition up to users) 9 . 4
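The idea that modules with a uniform interface can be freely composed into workflows can be sketched as follows. Everything here is illustrative: the three step functions are empty stand-ins that only record their own name, the dictionary-based page representation is an assumption, and compose is a hypothetical helper, not part of any OCR-D software.

```python
from functools import reduce


# Each step has the same interface: it takes a page record (a dict) and
# returns a new one. The bodies are stand-ins that only log their name.
def binarize(page):
    return {**page, "steps": page["steps"] + ["binarize"]}


def segment_lines(page):
    return {**page, "steps": page["steps"] + ["segment_lines"]}


def recognize_text(page):
    return {**page, "steps": page["steps"] + ["recognize_text"]}


def compose(*steps):
    """Chain steps left to right into a single workflow function."""
    def workflow(page):
        return reduce(lambda current, step: step(current), steps, page)
    return workflow


if __name__ == "__main__":
    # Users decide which modules to instantiate and in which order.
    full = compose(binarize, segment_lines, recognize_text)
    print(full({"id": "page-1", "steps": []}))
```

Because every step accepts and returns the same kind of record, swapping one module for another or leaving a step out changes only the argument list of compose.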
OPEN SOURCE IS MORE THAN "OPEN SOURCE" 10 . 1
STEP 1: GET FUNDED! 10 . 2
STEP 2: DEVELOP! 10 . 3
STEP 3: PUBLISH CODE! 10 . 4
SUSTAINABILITY AND REUSE! 10 . 5
BEST PRACTICES
- Transparency from day one
- Unit tests
- Unified test assets
- Continuous Integration
- Semantic versioning
- Docker base image
- Releases to GitHub, PyPI, DockerHub
10 . 6
COMMUNITY
- Issues
- Pull requests
- Code review
- Support chat
10 . 7
OCR-D/DOCS
- DEVELOPER DOCUMENTATION
- "COOKBOOK"
- USER GUIDE
- DOCUMENTATION DOCUMENTATION
10 . 8
THANK YOU
- ocr-d.de
- ocr-d.github.io
- ocr-d.github.io/docs
- github.com/OCR-D
- gitter.im/OCR-D/Lobby
11