QATIP An Optical Character Recognition System for Arabic Heritage - PowerPoint PPT Presentation

QATIP: An Optical Character Recognition System for Arabic Heritage Collections in Libraries
QCRI, Arabic Language Technologies
Felix Stahlberg, Stephan Vogel
DAS 2016, April 14, 2016

OCR at the Qatar National Library
Document types, ordered by increasing difficulty:
• Modern prints (known font)
  - Basic image processing
  - Standard OCR engines
  - Minor error correction
  - Very low error rates
• Modern prints (unknown font)
  - Train e.g. Sakhr to the new font
  - Several man-months
• Historical documents (early prints, handwritten)
  - Left untouched after scanning
  - The QATIP system (this work)

Challenges in Historical Documents
• Document aging effects
• Many ligatures
• Curved text lines
• Non-uniform background
• Overlapping characters/words

Challenges in Historical Documents (cont’d)
• Gaps between connecting characters
• Many ligatures
• Ink erosions

Challenges in Historical Documents (cont’d)

Core Technologies
• Kaldi Speech Recognition Toolkit
  - Open source framework widely used in the speech recognition community
  - Segmentation-free (HMM+LSTM)
  - http://kaldi.sourceforge.net/
• PrepOCRessor
  - Image processing tool for (Arabic) OCR
  - Feature extraction for Kaldi
  - Developed at QCRI
  - http://alt.qcri.org/tools/prepocressor/

Specializing ASR Technology to Arabic OCR
• Position dependency
• Glyph-dependent model lengths (Dreuw et al., 2009)
• Dedicated sil and conn states between characters (Ahmad et al., 2014)
• Extended question set for decision tree generation (Likforman-Sulem et al., 2012)

Arabic OCR at QCRI
• Ligature modeling using “pronunciation variants”

Pronunciation dictionary
Word      Pronunciation
الوجيز    aaA laB waE jaB yaM zyE
          aaA laB waE hjLM zyE
…         …

• Problem 1: Dictionary size
  - Allowing all writing variants for each word blows up the pronunciation dictionary
  - → Restrict to only one ligature per word
• Problem 2: Data sparseness
  - Not enough training examples for all possible ligatures
  - → Model ligatures without dots
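The first dictionary entry above can be reproduced with a small sketch. It assumes that the suffixes A, B, M and E stand for the isolated, beginning, middle and end shape of a glyph, and that the unit names (aa, la, wa, ...) map to Arabic letters as guessed in the NAMES table below; neither is spelled out on the slide. Ligature variants such as hjLM are not covered.

```python
# Letters that never connect to the following letter (standard Arabic joining behaviour).
NON_JOINING_AFTER = set("اأإآدذرزوؤ")

# Assumed mapping from letters to the unit names seen on the slides.
NAMES = {"ا": "aa", "ل": "la", "و": "wa", "ج": "ja", "ي": "ya",
         "ز": "zy", "ب": "ba", "ه": "he", "ذ": "dh"}


def glyph_sequence(word):
    """Return position-dependent glyph units for a word (no ligatures)."""
    units = []
    joined_to_previous = False
    for i, letter in enumerate(word):
        is_last = i == len(word) - 1
        joins_next = not is_last and letter not in NON_JOINING_AFTER
        if joined_to_previous and joins_next:
            shape = "M"   # middle: connected on both sides
        elif joined_to_previous:
            shape = "E"   # end: connected to the previous letter only
        elif joins_next:
            shape = "B"   # beginning: connected to the next letter only
        else:
            shape = "A"   # isolated ("alone")
        units.append(NAMES[letter] + shape)
        joined_to_previous = joins_next
    return units


print(" ".join(glyph_sequence("الوجيز")))
```

Each additional writing variant (for example a ligature) would add one more line per word to the dictionary, which is why the slide restricts variants to one ligature per word.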

Arabic OCR at QCRI
• Morphological language model (ATB)
• Standard ATB tokenization scheme: بهذه → ب+ هذه
• Problem: Connecting characters across morpheme boundaries
  - Shape of ه (he) depends on previous morpheme

Pronunciation dictionary
Word    Pronunciation
ب+      baB
هذه     heM dhE heA
        OR: heB dhE heA

Arabic OCR at QCRI
• Extended ATB tokenization scheme with “=” marker: بهذه → ب+ =هذه

Pronunciation dictionary
Word    Pronunciation
ب+      baB
هذه     heB dhE heA
=هذه    heM dhE heA
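A minimal sketch of how the “=” marker could be assigned and then used for dictionary lookup. The function names, the placement of “=” as a token prefix, and the rule "mark a token when the previous morpheme ends in a letter that connects forward" are assumptions for illustration; only the three dictionary entries are taken from the slide.

```python
# Letters that never connect to the following letter.
NON_JOINING_AFTER = set("اأإآدذرزوؤ")

# Dictionary entries as shown on the slide.
LEXICON = {
    "ب+": "baB",
    "هذه": "heB dhE heA",
    "=هذه": "heM dhE heA",
}


def mark_joined(morphemes):
    """Prefix '=' to every morpheme that is connected to the previous one."""
    tokens = []
    previous_last_letter = None
    for morpheme in morphemes:
        joined = (previous_last_letter is not None
                  and previous_last_letter not in NON_JOINING_AFTER)
        tokens.append("=" + morpheme if joined else morpheme)
        # Ignore the ATB '+' segmentation marker when looking at the last letter.
        previous_last_letter = morpheme.strip("+")[-1]
    return tokens


def pronounce(tokens):
    """Concatenate the dictionary pronunciations of all tokens."""
    return " ".join(LEXICON[token] for token in tokens)


tokens = mark_joined(["ب+", "هذه"])
print(tokens, "->", pronounce(tokens))
```

With the marker, each token has exactly one pronunciation, so the "OR" alternative from the standard scheme is no longer needed for this word.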

Arabic OCR at QCRI
• Text image normalization (Stahlberg and Vogel, 2015)

The QATIP System
• URL to the document (PDF, ZIP, …)
• Historic or general content
• Automatic translation into English
• Accuracy/runtime tradeoff
• Three different output formats:
  - txt: Plain text files, one per page
  - xml: XML file with OCR plus page layout information
  - image: OCR results rendered into original images

Compatibility with Aletheia
http://www.primaresearch.org/tools/Aletheia

Job Monitoring

The QATIP Architecture

QATIP Training Data

Corpus      #Lines    #Word Tokens
ALTEC        2,110          23,239
IFN/ENIT    42,736          42,736
KHATT       13,363         185,321
HADARA       1,319          16,587
Sum         59,528         267,883

• ALTEC corpus
  - Modern printed books (http://www.altec-center.org/)
• IFN/ENIT database
  - Handwritten Tunisian town names (modern) (Pechwitz et al., 2002)
• KHATT corpus
  - Handwritten forms (modern) (Mahmoud et al., 2012)
• HADARA corpus
  - Historic handwritten Arabic (Pantke et al., 2014)

Results (QNL)

Early Print
System       Word Error Rate    Character Error Rate
Tesseract     99.6%             51.8%
ABBYY         99.2%             54.8%
Sakhr        126.8%             65.0%
QATIP         37.5%             12.6%

Manuscript
System       Word Error Rate    Character Error Rate
Tesseract     99.4%             78.9%
ABBYY        100.0%             85.2%
Sakhr         99.4%             65.8%
QATIP         84.6%             53.3%

Systems compared: Tesseract 3.03; ABBYY FineReader 12 Professional; Sakhr Automatic Reader Platinum Edition 11
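The slides do not say how the error rates were computed. A minimal sketch, assuming the usual definition (edit distance between reference and OCR output, divided by the reference length), shows why a word error rate can exceed 100%, as in the Sakhr early-print row: insertions count as errors but do not lengthen the reference. Whether spaces count as characters in the character error rate is also an assumption here.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference_text, hypothesis_text):
    reference = reference_text.split()
    return edit_distance(reference, hypothesis_text.split()) / len(reference)


def character_error_rate(reference_text, hypothesis_text):
    # Spaces are counted as characters in this sketch.
    return edit_distance(list(reference_text), list(hypothesis_text)) / len(reference_text)


# Two reference words, four wrong output words: 2 substitutions + 2 insertions.
print(f"{word_error_rate('a b', 'x y z w'):.0%}")
```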

Results (Modern Print – ALTEC)

Book 1 (last 5 pages)
System       Word Error Rate    Character Error Rate
Tesseract    29.2%               8.2%
ABBYY        39.6%              10.9%
Sakhr        27.1%               8.1%
QATIP        40.8%              10.3%

Book 8 (last 5 pages)
System       Word Error Rate    Character Error Rate
Tesseract    38.3%              12.3%
ABBYY        66.1%              24.0%
Sakhr        57.1%              19.2%
QATIP        40.5%               9.7%

Systems compared: Tesseract 3.03; ABBYY FineReader 12 Professional; Sakhr Automatic Reader Platinum Edition 11

Current Runtime of QATIP
• 2.1 GHz, 8 core, 10 GB RAM
30 images per hour
~8.7 images per hour
14.5 images per hour

How Fast is Fast Enough?
• Scanning: 2,500 pages in 8 hours (= 1 working day)
  - 12,500 pages in 5 working days (= 1 week)
• An OCR system needs to process 12,500 / (7 ⋅ 24) ≈ 74.4 pages per hour to keep up with a single operator

Current Runtime of the QATIP system
Document Complexity    Required #Machines* per Operator
Simple                 2.5
Complex                5.1
*2.1 GHz, 8 core, 10 GB RAM
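The arithmetic behind this slide can be checked with a short sketch. It assumes that the OCR machines run around the clock (7 ⋅ 24 hours per week) and that the "Simple" and "Complex" rows correspond to the measured throughputs of 30 and 14.5 images per hour; the slide does not state that pairing explicitly. Under these assumptions the required rate works out to about 74.4 pages per hour.

```python
PAGES_PER_WORKING_DAY = 2500   # one scanning operator, 8 hours
WORKING_DAYS_PER_WEEK = 5
OCR_HOURS_PER_WEEK = 7 * 24    # assumption: OCR machines run continuously


def required_pages_per_hour():
    """Pages per hour the OCR side must sustain to keep up with one operator."""
    pages_per_week = PAGES_PER_WORKING_DAY * WORKING_DAYS_PER_WEEK
    return pages_per_week / OCR_HOURS_PER_WEEK


def machines_per_operator(images_per_hour):
    """Number of machines needed at a given per-machine throughput."""
    return required_pages_per_hour() / images_per_hour


print(f"required: {required_pages_per_hour():.1f} pages/hour")
for label, throughput in [("simple", 30.0), ("complex", 14.5)]:
    print(f"{label}: {machines_per_operator(throughput):.1f} machines")
```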

Thank You
