
QATIP: An Optical Character Recognition System for Arabic Heritage Collections in Libraries



  1. QATIP: An Optical Character Recognition System for Arabic Heritage Collections in Libraries
     QCRI, Arabic Language Technologies
     Felix Stahlberg, Stephan Vogel
     DAS 2016, April 14, 2016

  2. OCR at the Qatar National Library (document types ordered by increasing difficulty)
     • Modern prints (known font)
       - Basic image processing
       - Standard OCR engines
       - Minor error correction
       - Very low error rates
     • Modern prints (unknown font)
       - Train e.g. Sakhr to the new font
       - Several man-months
     • Historical documents (early prints) and historical documents (handwritten)
       - Left untouched after scanning
       - The QATIP system (this work)

  3. Challenges in Historical Documents
     • Document aging effects
     • Many ligatures
     • Curved text lines
     • Non-uniform background
     • Overlapping characters/words

  4. Challenges in Historical Documents (cont'd)
     • Gaps between connecting characters
     • Many ligatures
     • Ink erosions

  5. Challenges in Historical Documents (cont'd)

  6. Core Technologies
     • Kaldi Speech Recognition Toolkit
       - Open source framework widely used in the speech recognition community
       - Segmentation-free (HMM+LSTM)
       - http://kaldi.sourceforge.net/
     • PrepOCRessor
       - Image processing tool for (Arabic) OCR
       - Feature extraction for Kaldi
       - Developed at QCRI
       - http://alt.qcri.org/tools/prepocressor/

  7. Specializing ASR Technology to Arabic OCR
     • Position dependency
     • Glyph dependent model lengths (Dreuw et al., 2009)
     • Dedicated sil and conn states between characters (Ahmad et al., 2014)
     • Extended question set for decision tree generation (Likforman-Sulem et al., 2012)
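The position dependency mentioned above can be illustrated with a short sketch. It assumes that the suffixes A, B, M and E seen in the pronunciation dictionaries of this talk stand for the isolated, beginning, middle and end forms of a glyph; that reading is inferred from the examples, not stated on the slides. The function name `position_tags` is hypothetical, and the rule used is only the standard Arabic script rule that a few letters never join to the following letter. Diacritics and ligatures are ignored.

```python
# Letters that do not join to the following letter (standard Arabic script rule).
NO_FORWARD_JOIN = {
    "\u0621",                                # hamza (joins on neither side)
    "\u0622", "\u0623", "\u0625", "\u0627",  # alef variants
    "\u062F", "\u0630",                      # dal, thal
    "\u0631", "\u0632",                      # reh, zain
    "\u0648", "\u0624",                      # waw, waw with hamza
    "\u0629",                                # teh marbuta
}
HAMZA = "\u0621"


def position_tags(word):
    """Return one tag per letter: A (isolated), B (begin), M (middle), E (end)."""
    tags = []
    joined_prev = False  # is the current letter joined to the one before it?
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else None
        joins_next = nxt is not None and ch not in NO_FORWARD_JOIN and nxt != HAMZA
        if joined_prev and joins_next:
            tags.append("M")
        elif joined_prev:
            tags.append("E")
        elif joins_next:
            tags.append("B")
        else:
            tags.append("A")
        joined_prev = joins_next
    return tags


# alef, lam, waw, jeem, yeh, zain
print(position_tags("\u0627\u0644\u0648\u062C\u064A\u0632"))
# ['A', 'B', 'E', 'B', 'M', 'E']
```

The printed tags line up with the suffixes in the dictionary entry `aaA laB waE jaB yaM zyE` shown later in the talk.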

  8. Arabic OCR at QCRI
     • Ligature modeling using "pronunciation variants"
       Pronunciation dictionary:
         Word      Pronunciation
         الوجيز    aaA laB waE jaB yaM zyE
                   aaA laB waE hjLM zyE
         …         …
     • Problem 1: Dictionary size
       - Allowing all writing variants for each word blows up the pronunciation dictionary
       - → Restrict to only one ligature per word
     • Problem 2: Data sparseness
       - Not enough training examples for all possible ligatures
       - → Model ligatures without dots
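The "one ligature per word" restriction can be sketched as follows. This is an illustration under assumptions, not the QATIP implementation: the function names are hypothetical, ligatures are assumed to replace exactly two adjacent glyph models, and only the pair `jaB yaM` → `hjLM` is taken from the dictionary entry above. The second function counts how many variants an unrestricted dictionary would need, which shows why the size grows quickly with the number of ligature sites.

```python
def one_ligature_variants(units, ligatures):
    """Base sequence plus every variant that applies exactly one ligature."""
    variants = [list(units)]
    for i in range(len(units) - 1):
        pair = (units[i], units[i + 1])
        if pair in ligatures:
            variants.append(units[:i] + [ligatures[pair]] + units[i + 2:])
    return variants


def count_all_variants(units, ligatures):
    """Count variants when any set of non-overlapping ligatures is allowed."""
    n = len(units)
    ways = [0] * (n + 2)
    ways[n] = 1  # one way to write the empty suffix
    for i in range(n - 1, -1, -1):
        ways[i] = ways[i + 1]  # write units[i] as a plain glyph
        if i + 1 < n and (units[i], units[i + 1]) in ligatures:
            ways[i] += ways[i + 2]  # or merge units[i], units[i+1] into a ligature
    return ways[0]


# Pair and ligature model name taken from the dictionary entry on this slide.
ligatures = {("jaB", "yaM"): "hjLM"}
units = "aaA laB waE jaB yaM zyE".split()
for variant in one_ligature_variants(units, ligatures):
    print(" ".join(variant))

# Hypothetical word with three independent ligature sites.
toy_units = ["g1", "g2", "g3", "g4", "g5", "g6"]
toy_ligatures = {("g1", "g2"): "L12", ("g3", "g4"): "L34", ("g5", "g6"): "L56"}
print(len(one_ligature_variants(toy_units, toy_ligatures)))  # 4
print(count_all_variants(toy_units, toy_ligatures))          # 8
```

With k independent ligature sites the unrestricted dictionary needs 2^k entries for the word, while the restricted one needs k + 1.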

  9. Arabic OCR at QCRI
     • Morphological language model (ATB)
     • Standard ATB tokenization scheme: بهذه → ب + هذه
     • Problem: Connecting characters across morpheme boundaries
       - Shape of ه (he) depends on previous morpheme
       Pronunciation dictionary:
         Word    Pronunciation
         ب +     baB
         هذه     heM dhE heA  OR: heB dhE heA

  10. Arabic OCR at QCRI
      • Extended ATB tokenization scheme with "=" marker: بهذه is split into the prefix ب + and the stem هذه carrying the "=" marker
        Pronunciation dictionary:
          Word                   Pronunciation
          ب +                    baB
          هذه                    heB dhE heA
          هذه (with = marker)    heM dhE heA
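A minimal sketch of how such a marker could select the right dictionary entry is given below. It assumes that the "=" marker means "the first glyph of this token is joined to the previous morpheme"; the slides do not spell this out. The function names, the tuple-based lexicon keys and the waw-prefix example are hypothetical; only the two pronunciations of هذه come from the slide.

```python
# Letters that do not join to the following letter (standard Arabic script rule).
NO_FORWARD_JOIN = {
    "\u0621", "\u0622", "\u0623", "\u0625", "\u0627",
    "\u062F", "\u0630", "\u0631", "\u0632",
    "\u0648", "\u0624", "\u0629",
}

BEH = "\u0628"                 # prefix "bi"
WAW = "\u0648"                 # prefix "wa" (never joins forward)
HADHIHI = "\u0647\u0630\u0647"  # heh, thal, heh

# Lexicon keyed by (token, carries "=" marker); pronunciations from the slide.
LEXICON = {
    (HADHIHI, False): "heB dhE heA",
    (HADHIHI, True): "heM dhE heA",
}


def mark_connection(prefix, stem):
    """Return (stem, marked): marked is True if the prefix joins onto the stem."""
    return stem, prefix[-1] not in NO_FORWARD_JOIN


def pronounce_stem(prefix, stem):
    """Look up the stem pronunciation that matches its connection state."""
    return LEXICON[mark_connection(prefix, stem)]


print(pronounce_stem(BEH, HADHIHI))  # heM dhE heA
print(pronounce_stem(WAW, HADHIHI))  # heB dhE heA
```

Keeping the connection state in the token itself means the language model and the dictionary stay consistent: each token has exactly one glyph sequence, so the "OR" entry from the standard scheme is no longer needed.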

  11. Arabic OCR at QCRI
      • Text image normalization (Stahlberg and Vogel, 2015)

  12. The QATIP System
      • URL to the document (PDF, ZIP, …)
      • Historic or general content
      • Three different output formats:
        - txt: Plain text files, one per page
        - xml: XML file with OCR plus page layout information
        - image: OCR results rendered into original images
      • Automatic translation into English
      • Accuracy/Runtime tradeoff

  13. Compatibility with Aletheia
      http://www.primaresearch.org/tools/Aletheia

  14. Job Monitoring

  15. Job Monitoring

  16. The QATIP Architecture

  17. QATIP Training Data
      Corpus      #Lines    #Word Tokens
      ALTEC        2,110          23,239
      IFN/ENIT    42,736          42,736
      KHATT       13,363         185,321
      HADARA       1,319          16,587
      Sum         59,528         267,883

      • ALTEC corpus
        - Modern printed books (http://www.altec-center.org/)
      • IFN/ENIT database
        - Handwritten Tunisian town names (modern) (Pechwitz et al., 2002)
      • KHATT corpus
        - Handwritten forms (modern) (Mahmoud et al., 2012)
      • HADARA corpus
        - Historic handwritten Arabic (Pantke et al., 2014)

  18. Results (QNL)
      Early Print    Word Error Rate    Character Error Rate
      Tesseract                99.6%                   51.8%
      ABBYY                    99.2%                   54.8%
      Sakhr                   126.8%                   65.0%
      QATIP                    37.5%                   12.6%

      Manuscript     Word Error Rate    Character Error Rate
      Tesseract                99.4%                   78.9%
      ABBYY                   100.0%                   85.2%
      Sakhr                    99.4%                   65.8%
      QATIP                    84.6%                   53.3%

      Systems compared: Tesseract 3.03; ABBYY FineReader 12 Professional; Sakhr Automatic Reader Platinum Edition 11
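A word error rate above 100%, as reported for Sakhr on early prints, is possible under the usual definition: the edit distance between reference and hypothesis divided by the reference length, so a hypothesis with many inserted words can accumulate more errors than the reference has words. The sketch below assumes that standard definition; the slides do not state exactly how the rates were computed, and conventions such as whether spaces count as characters may differ. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Two substitutions plus one insertion against a two-word reference.
print(word_error_rate("a b", "x y z"))       # 1.5, i.e. 150%
print(char_error_rate("kitten", "sitting"))  # 0.5
```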

  19. Results (Modern Print – ALTEC)
      Book 1 (last 5 pages)    Word Error Rate    Character Error Rate
      Tesseract                          29.2%                    8.2%
      ABBYY                              39.6%                   10.9%
      Sakhr                              27.1%                    8.1%
      QATIP                              40.8%                   10.3%

      Book 8 (last 5 pages)    Word Error Rate    Character Error Rate
      Tesseract                          38.3%                   12.3%
      ABBYY                              66.1%                   24.0%
      Sakhr                              57.1%                   19.2%
      QATIP                              40.5%                    9.7%

      Systems compared: Tesseract 3.03; ABBYY FineReader 12 Professional; Sakhr Automatic Reader Platinum Edition 11

  20. Current Runtime of QATIP
      • 2.1 GHz, 8 core, 10 GB RAM
      • 30 images per hour
      • ~8.7 images per hour
      • 14.5 images per hour

  21. How Fast is Fast Enough?
      • Scanning: 2,500 pages in 8 hours (= 1 working day)
        - 12,500 pages in 5 working days (= week)
      • The OCR system needs to process 12,500 / (7 · 24) ≈ 74.4 pages per hour to keep up with a single operator

      Current runtime of the QATIP system:
      Document Complexity    Required #Machines* per Operator
      Simple                 2.5
      Complex                5.1
      *2.1 GHz, 8 core, 10 GB RAM
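The arithmetic behind this slide can be worked through in a few lines. One operator scans 12,500 pages per working week, while the OCR machines can run around the clock, so the required rate is 12,500 pages spread over 7 · 24 hours, which comes to about 74.4 pages per hour. The sketch assumes that "Simple" documents correspond to the 30 images per hour figure and "Complex" documents to the 14.5 images per hour figure from the runtime slide; that mapping is inferred because it reproduces the machine counts in the table. The variable and function names are hypothetical.

```python
PAGES_PER_DAY = 2500        # one operator, 8-hour working day
WORKING_DAYS_PER_WEEK = 5
HOURS_PER_WEEK = 7 * 24     # OCR machines run continuously

pages_per_week = PAGES_PER_DAY * WORKING_DAYS_PER_WEEK
required_pages_per_hour = pages_per_week / HOURS_PER_WEEK


def machines_needed(images_per_hour):
    """Machines needed to keep up with one operator at the given throughput."""
    return required_pages_per_hour / images_per_hour


print(round(required_pages_per_hour, 1))  # 74.4
print(round(machines_needed(30.0), 1))    # 2.5 (assumed: simple documents)
print(round(machines_needed(14.5), 1))    # 5.1 (assumed: complex documents)
```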

  22. Thank You
