
Historical Document Analysis (Marcus Liwicki, University of Fribourg)


  1. Historical Document Analysis
     Marcus Liwicki (University of Fribourg; University of Kaiserslautern; Insiders Technologies GmbH), marcus.liwicki@unifr.ch
     Marcus Liwicki, Historical Document Analysis

  2. Typical Tasks of Scholars in the Humanities
     - Cataloging
     - Transcribing
     - Searching
     - Comparing texts
     - …
     It's hard to find interesting and relevant documents.

  3. State-of-the-Art Tool in the Humanities: Catalogs
     - But automatic methods can help!

  4. Vision: DIVA Desk, a scientific workbench for scholars

  5. Outline
     - Challenge: Why Historical Documents?
     - State-of-the-Art
     - Recent Trends
     - DIVA Services: Approach Towards Interoperability

  6. What is the main Challenge?
     - Data variation?

  7. Data Variation
     - Different languages and alphabets
     - Writing style differs
     - Quality of the images/data
     - Changing writing instruments
     - Abbreviations and misspellings
     - Graphics & handwriting
     - Language and writing evolve
     - Annotations
     - Change of support

  8. What is the main Challenge?
     - Data variation?
     - Degradation?

  9. What is the main Challenge?
     - Data variation?
     - Degradation?
     - Communication between humanist scholars and computer science experts!

  10. Communication between Humanist Scholars and DIA Experts
      - Different expectations
        - Clearly defined challenging datasets vs. useful systems
      - Bridging the gap is the biggest challenge

  11. Success in Computer Science ?!?
      - HIP 2011 (27 papers accepted)
        - Information retrieval (text / graphic)
        - Projects
        - Text/Character recognition + Calligraphy
        - Visualization
        - Digitization
      - HIP 2013 (18 papers accepted)
        - Information Extraction and Retrieval
        - Reconstruction and Degradation
        - Text and Image Recognition
        - Segmentation, Layout Analysis and Databases
      - HIP 2015 (18 papers accepted)
        - Text Transcription
        - Segmentation and Layout Analysis
        - Templates, Date Estimation, and Script Specific Approaches
      But: ask a random scholar attending the Digital Humanities conference: Do you know about HIP?
      Thanks to Mickael Coustaty, IDAKS 2016

  12. Overview of Projects on Hist-OCR
      * If you ask scholars who want to use the systems
      - EU IMPACT Project (2008-2012)
      - EU TRANSKRIBUS (2012-2016)
      - EU READ (2016-now)
      - CIS, LMU München, Post-OCR Correction
      - OCR-D project, DFG (since 2015, 1.5 million books)
      - Early Modern OCR Project, Texas A&M (2012-2015)
      - Kallimachos (Uni Würzburg, 2014-2017)
      - Ocular, University of California, Berkeley (2013-now)
      - …

  13. Communication Problems and Approaches for Solution
      - For Computer Science Experts:
        - Not a unique representation of knowledge
        - Same content has a lot of interpretation
        - A description is not shared by all scientists
        - Focus on different aspects
      - For Scholars in the Humanities:
        - Methods are not understandable
        - Not clear what 95% means
        - Systems not accessible
        - Too specific solutions
      - Approaches:
        - We need more interdisciplinary discussions
        - Reduce black box effects (describe methods, give examples)
        - Approximate results are not enough
        - Interfaces needed
        - Alternatives to be reported

  14. Outline
      - Challenge: Why Historical Documents?
      - State-of-the-Art
      - Recent Trends
      - DIVA Services: Approach Towards Interoperability

  15. Processing Steps of Automatic DIA
      - Steps (diagram boxes): Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction
      - Binarization: threshold (local, global), Sauvola
      - Layout Analysis: top-down vs. bottom-up, classification
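
The slide names Sauvola as a local thresholding method for the binarization step. A minimal sketch of that idea follows, assuming an 8-bit greyscale image stored as a list of rows; the function name `sauvola_binarize` and the values for `window`, `k` and `R` are illustrative assumptions, not settings from the talk. The threshold for each pixel is computed from the mean and standard deviation of its neighbourhood.

```python
from math import sqrt

def sauvola_binarize(img, window=3, k=0.2, R=128.0):
    """Binarize a greyscale image (list of rows, 0..255); 1 = ink, 0 = background.

    Sauvola threshold per pixel: T = mean * (1 + k * (std / R - 1)),
    where mean and std are taken over a window around the pixel.
    window, k and R are illustrative defaults (assumptions).
    """
    h, w = len(img), len(img[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighbourhood values, clipped at the image border.
            vals = [img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            std = sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            threshold = mean * (1.0 + k * (std / R - 1.0))
            out[y][x] = 1 if img[y][x] < threshold else 0
    return out

# A bright page (200) with one dark vertical stroke (30) in the middle column.
page = [[200, 200, 30, 200, 200] for _ in range(5)]
binary = sauvola_binarize(page)
```

Because the threshold adapts to the local contrast, the same code can cope with uneven illumination or stains better than a single global threshold, which is the usual argument for local methods on degraded documents.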

  16. Layout Analysis Methods
      - Based on connected components
      - XY-cut
      - Other histogram-based approaches
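
XY-cut, listed above, can be sketched as a recursive split along empty runs in the row and column projection profiles. The version below is only one possible reading, assuming a binary image (1 = ink) as a list of rows; the name `xy_cut`, the `min_gap` parameter and the rule "cut at the widest gap" are assumptions made for illustration.

```python
def xy_cut(img, min_gap=2):
    """Split a binary image (1 = ink) into blocks.

    Returns boxes as (top, left, bottom, right) with bottom/right exclusive.
    """
    blocks = []

    def trim(t, l, b, r):
        # Shrink the box to the ink it contains; None if the region is empty.
        rows = [y for y in range(t, b) if any(img[y][l:r])]
        if not rows:
            return None
        cols = [x for x in range(l, r) if any(img[y][x] for y in range(t, b))]
        return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

    def widest_gap(profile):
        # Longest run of empty bins with length >= min_gap, as (start, end).
        best, start = None, None
        for i, v in enumerate(profile):
            if v == 0:
                if start is None:
                    start = i
            elif start is not None:
                if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                    best = (start, i)
                start = None
        return best

    def split(t, l, b, r):
        box = trim(t, l, b, r)
        if box is None:
            return
        t, l, b, r = box
        rows = widest_gap([sum(img[y][l:r]) for y in range(t, b)])
        cols = widest_gap([sum(img[y][x] for y in range(t, b)) for x in range(l, r)])
        if rows is None and cols is None:
            blocks.append(box)                     # no cut possible: leaf block
        elif cols is None or (rows is not None and rows[1] - rows[0] >= cols[1] - cols[0]):
            split(t, l, t + rows[0], r)            # horizontal cut
            split(t + rows[1], l, b, r)
        else:
            split(t, l, b, l + cols[0])            # vertical cut
            split(t, l + cols[1], b, r)

    split(0, 0, len(img), len(img[0]))
    return blocks

# Two small blocks side by side above one full-width block.
layout = [
    [1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
regions = xy_cut(layout)
```

This is a top-down method: it starts from the whole page and only works well when blocks are separated by straight white gaps.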

  17. Processing Steps of Automatic DIA
      - Steps (diagram boxes): Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction
      - Binarization: threshold (local, global), Sauvola
      - Layout Analysis:
        - Top-down: XY-cut, histograms
        - Bottom-up: connected components
        - Classification
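
The bottom-up route starts from connected components. As a minimal sketch, assuming a binary image (1 = ink) and 8-connectivity, the hypothetical helper `connected_components` below collects the bounding box of every component with a breadth-first search; grouping these boxes into words, lines and blocks would be the next (omitted) step.

```python
from collections import deque

def connected_components(img):
    """Return bounding boxes (top, left, bottom, right), inclusive, of all
    8-connected ink components in a binary image (1 = ink)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first search from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, left = min(top, cy), min(left, cx)
                bottom, right = max(bottom, cy), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

glyphs = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
boxes = connected_components(glyphs)
```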

  18. Feature Extraction
      - Marti, Bunke (2001)
        - Use a sliding window (similar to ASR)
      1. Average grey value
      2. Center of gravity
      3. 2nd order moment (vertical)
      4. Uppermost pixel
      5. Lowermost pixel
      6. Gradient uppermost
      7. Gradient lowermost
      8. Number of b/w-transitions
      9. #pix/d(upper,lower)
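
A rough sketch of the sliding-window idea, assuming a binarized text-line image (1 = ink) and a window that is one pixel column wide. The helper name `column_features` and the exact normalisations are assumptions and may differ from the cited work; the two gradient features (6 and 7) are left out because they need the neighbouring columns.

```python
def column_features(img):
    """One feature tuple per pixel column of a binary text-line image.

    Tuple layout (numbers refer to the list above):
    (1 average value, 2 centre of gravity, 3 second-order moment,
     4 uppermost ink row, 5 lowermost ink row, 8 b/w transitions,
     9 ink pixels divided by the distance between upper and lower row).
    Empty columns get zeros and None for the two contour rows (assumption).
    """
    height, width = len(img), len(img[0])
    features = []
    for x in range(width):
        col = [img[y][x] for y in range(height)]
        ink = [y for y, v in enumerate(col) if v]
        if not ink:
            features.append((0.0, 0.0, 0.0, None, None, 0, 0.0))
            continue
        upper, lower = ink[0], ink[-1]
        features.append((
            len(ink) / height,                          # 1
            sum(ink) / len(ink),                        # 2
            sum(y * y for y in ink) / height ** 2,      # 3
            upper,                                      # 4
            lower,                                      # 5
            sum(a != b for a, b in zip(col, col[1:])),  # 8
            len(ink) / (lower - upper + 1),             # 9
        ))
    return features

line = [
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
feats = column_features(line)
```

The resulting sequence of feature vectors, read left to right, plays the same role as the frame sequence in speech recognition and is what the sequence classifiers on the following slides consume.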

  19. Classification
      - Machine learning methods for sequences
        - HMMs
        - Recurrent NNs

  20. Bidirectional Long Short-Term Memory Network
      - Importance of context
      (Diagram: Features, Input Layer, three Hidden Layers, Output Layer, Transcription; annotations: multilayer perceptron network, recurrent connections, bidirectional, memory instead of perceptron)

  21. Limits of MLP
      - Limit: static input/output operation, $(x_1, \dots, x_n) \mapsto y$
      - Human brain is capable of memorizing
      - Needed for solving many problems
        - Sequence recognition
        - Navigation through a labyrinth
        - Video analysis
        $\big((x_1^1, \dots, x_n^1), \dots, (x_1^T, \dots, x_n^T)\big) \mapsto (y^1, \dots, y^U), \quad U \le T$
      - Idea: add backward connections to maintain state

  22. Recurrent Neural Networks (RNNs)
      - Recurrent connections are added in order to keep information of previous time stamps in the network
      - Novel equation for the activation: $a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1}$
      - Context information is used
      - How to train those networks …?
      (Diagram: Features, Input Layer, Hidden Layer, Output Layer, Output)

  23. Training of RNNs – Backpropagation Through Time
      - Unfold the network in time
        - k timestamps (parameter)
        - Perform backpropagation for the output at t
      - Repeat this for each $0 \le t \le T-1$
      (Diagram: features, input layer and hidden layer unfolded for t-k, …, t-1, t; output layer and output at t)
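
To make the unfolding concrete, here is a minimal sketch of backpropagation through time for a network with a single recurrent unit, unfolded over the whole input (so k equals the sequence length). The squared-error loss on the last output, the tanh activation and the names `bptt_scalar`, `w_in`, `w_rec` are assumptions chosen for illustration.

```python
from math import tanh

def bptt_scalar(xs, w_in, w_rec, target):
    """Loss and gradients for b_t = tanh(w_in * x_t + w_rec * b_{t-1}), b_0 = 0,
    with loss = 0.5 * (b_T - target)^2 on the last output."""
    # Forward pass: store every state, they are needed on the way back.
    bs = [0.0]
    for x in xs:
        bs.append(tanh(w_in * x + w_rec * bs[-1]))
    loss = 0.5 * (bs[-1] - target) ** 2

    # Backward pass through the unfolded network, last time step first.
    delta = bs[-1] - target                  # dL/db_T
    grad_in = grad_rec = 0.0
    for t in range(len(xs), 0, -1):
        da = delta * (1.0 - bs[t] ** 2)      # dL/da_t (derivative of tanh)
        grad_in += da * xs[t - 1]            # the same weight is shared by every step
        grad_rec += da * bs[t - 1]
        delta = da * w_rec                   # dL/db_{t-1}
    return loss, grad_in, grad_rec

loss, g_in, g_rec = bptt_scalar([0.5, -1.0, 0.25], 0.8, 0.6, 0.3)
```

Since the weights are shared across time, the gradient for each weight is the sum of its contributions from all unfolded copies.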

  24. Recurrent Neural Networks (RNN)
      - Recurrent connections are added in order to keep information of previous time stamps in the network
      - Novel equation for activation: $a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1}$
      - Can be written in matrix form: $A^t = W_i X^t + W_h B^{t-1}$
      - Context information is used, however: impossible to store precise information over long durations
      (Diagram: Features, Input Layer, Hidden Layer, Output Layer, Output)
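
The activation equation can be sketched as a forward pass over a sequence. This is a minimal illustration, assuming a tanh activation and zero initial state; `rnn_forward`, `W_in` and `W_rec` are hypothetical names for the input-to-hidden and hidden-to-hidden weights.

```python
from math import tanh

def rnn_forward(xs, W_in, W_rec):
    """Hidden states of a simple recurrent layer.

    xs:    list of input vectors, one per time step
    W_in:  W_in[h][i] is the weight from input i to hidden unit h
    W_rec: W_rec[h][j] is the weight from hidden unit j (previous step) to unit h
    """
    n_hidden = len(W_rec)
    b_prev = [0.0] * n_hidden
    states = []
    for x in xs:
        # a^t = sum_i w_i x_i^t + sum_h w_h b_h^{t-1}
        a = [sum(W_in[h][i] * x[i] for i in range(len(x)))
             + sum(W_rec[h][j] * b_prev[j] for j in range(n_hidden))
             for h in range(n_hidden)]
        b_prev = [tanh(v) for v in a]
        states.append(b_prev)
    return states

# One hidden unit: the input at the first step still influences the second step.
states = rnn_forward([[1.0], [0.0]], W_in=[[1.0]], W_rec=[[0.5]])
```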

  25. Vanishing Gradient
      - Standard RNNs forget information after a short period of time
      - Example: one neuron over 7 timestamps; the information vanishes
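
The effect can be reproduced numerically. The sketch below, assuming a single tanh unit with recurrent weight `w_rec` and no input, tracks how strongly the state after t steps still depends on the initial state; the function name and the starting value are illustrative assumptions.

```python
from math import tanh

def gradient_through_time(w_rec, steps, b0=0.5):
    """Derivative of the state after 1..steps time steps with respect to b0,
    for b_t = tanh(w_rec * b_{t-1})."""
    b, grad, history = b0, 1.0, []
    for _ in range(steps):
        b = tanh(w_rec * b)
        grad *= w_rec * (1.0 - b * b)   # chain rule: d b_t / d b_{t-1}
        history.append(grad)
    return history

history = gradient_through_time(w_rec=0.5, steps=7)
```

Each step multiplies the gradient by a factor below 0.5 here, so after 7 steps it has shrunk by more than two orders of magnitude; the error signal from a late output barely reaches the early time steps.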

  26. Core Idea: New Memory Cell Instead of Perceptron

  27. No Vanishing Gradient
      $a^t = W_i X^t + W_h B^{t-1} + W_c S^t$
      (Diagram: Output Gate, Output; neuron now O: open ($\sigma = 1$), |: closed ($\sigma = 0$))
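
A minimal sketch of a gated memory cell in the common LSTM formulation, reduced to scalars. It is an assumption-laden simplification: it includes a forget gate and omits the direct connections from the cell state to the gates, and the names `lstm_step` and the parameter dictionary are hypothetical. It is meant to show why a closed input gate and an open forget gate let the cell keep its content.

```python
from math import exp, tanh

def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))

def lstm_step(x, h_prev, s_prev, p):
    """One time step of a scalar LSTM cell.

    p maps "input", "forget", "output", "cell" to (w_x, w_h, bias).
    Returns the new output h and the new cell state s.
    """
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))    # how much new information is written
    f = sigmoid(pre("forget"))   # how much of the old state is kept
    o = sigmoid(pre("output"))   # how much of the state is exposed
    g = tanh(pre("cell"))        # candidate content
    s = f * s_prev + i * g
    h = o * tanh(s)
    return h, s

# Input gate closed, forget gate open: the stored value survives 100 steps.
params = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
          "output": (0.0, 0.0, 20.0), "cell": (1.0, 0.0, 0.0)}
h, s = 0.0, 0.7
for _ in range(100):
    h, s = lstm_step(1.0, h, s, params)
```

Because the state is carried forward by a multiplication with a gate value close to 1 rather than through a squashing function and a weight, the stored information and its gradient do not fade the way they do in the plain recurrent unit.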

  28. Bidirectional RNN
      (Diagram, unfolded for t-1, t, t+1: Features, Input Layer, Forward Layer and Backward Layer as hidden layers, Output Layer, Output)
      - Trained with backpropagation through time (forward pass through all time stamps for each hidden layer sequentially)
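
The forward and backward layers can be sketched with two independent recurrent units, one reading the sequence left to right and one right to left. This is a scalar illustration under assumed names (`run_rnn`, `bidirectional`) and a tanh activation; the output layer that would combine both states is omitted.

```python
from math import tanh

def run_rnn(xs, w_in, w_rec):
    """States of a single recurrent unit over the sequence xs."""
    b, states = 0.0, []
    for x in xs:
        b = tanh(w_in * x + w_rec * b)
        states.append(b)
    return states

def bidirectional(xs, fwd, bwd):
    """Pair (forward state, backward state) for every time step.

    fwd and bwd are (w_in, w_rec) tuples for the two directions.
    """
    forward = run_rnn(xs, *fwd)
    # The backward layer reads the sequence in reverse; its states are
    # flipped back so that index t refers to the same time step.
    backward = run_rnn(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

# The only non-zero input is at the last step.
pairs = bidirectional([0.0, 0.0, 1.0], fwd=(1.0, 0.5), bwd=(1.0, 0.5))
```

At the first time step the forward state is still zero, while the backward state already carries information about the future input; this is the access to context on both sides that the slide refers to.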

  29. Connectionist Temporal Classification (CTC)
      - Additional blank label (b, shown in green)
      - Allows application to whole sequences
      - Output with normalized likelihood for each word
      - Training: objective function is smoothed and recalculated after each iteration (details in references)
      - Testing: similar to the Viterbi algorithm for HMMs
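
The role of the blank label is easiest to see in decoding. The sketch below shows best-path decoding, the simplest decoding rule for this kind of output: take the most likely label per frame, merge repeated labels, then drop blanks. It is not the dictionary-based decoding mentioned on the slide; the function name `ctc_best_path` and the blank symbol "-" are assumptions.

```python
def ctc_best_path(frame_probs, labels, blank="-"):
    """Decode per-frame label probabilities into a label string.

    frame_probs: one probability list per frame, aligned with `labels`
    labels:      label symbols, including the blank symbol
    """
    # Most likely label in every frame.
    path = [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # Merge runs of the same label, then remove blanks.
    decoded, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            decoded.append(sym)
        prev = sym
    return "".join(decoded)

labels = ["-", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a  (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b  (repeat, merged)
]
text = ctc_best_path(frames, labels)   # "aab"
```

Without the blank between them, the two runs of "a" would be merged into one, so the blank is what makes double letters representable.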

  30. Processing Steps of Automatic DIA
      - Steps (diagram boxes): Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction
      - Binarization: threshold (local, global), Sauvola
      - Layout Analysis:
        - Top-down: XY-cut, histograms
        - Bottom-up: connected components
        - Classification
      - OCR:
        - HMM on features
        - LSTM with CTC
        - New: MDLSTM on pixels

  31. Outline
      - Challenge: Why Historical Documents?
      - State-of-the-Art
      - Recent Trends
      - DIVA Services: Approach Towards Interoperability
