Historical Document Analysis Marcus Liwicki University of Fribourg University of Kaiserslautern Insiders Technologies GmbH marcus.liwicki@unifr.ch Marcus Liwicki, Historical Document Analysis
Typical Tasks of Scholars in the Humanities
• Cataloging
• Transcribing
• Searching
• Comparing texts
• …
It’s hard to find interesting and relevant documents
State of the Art Tool in the Humanities: Catalogs
But automatic methods can help!
Vision: DIVA Desk
A scientific workbench for scholars
Outline
• Challenge: Why Historical Documents?
• State-of-the-Art
• Recent Trends
• DIVA Services: Approach Towards Interoperability
What is the main Challenge?
• Data variation?
Data Variation
• Different languages and alphabets
• Writing style differs
• Quality of the images/data
• Changing writing instruments
• Abbreviations and misspellings
• Graphics & handwriting
• Language and writing evolve
• Annotations
• Change of support
What is the main Challenge?
• Data variation?
• Degradation?
What is the main Challenge?
• Data variation?
• Degradation?
• Communication between humanist scholars and computer science experts!
Communication between Humanist Scholars and DIA Experts
• Different expectations
  • Clearly defined challenging datasets vs. useful systems
• Bridging the gap is the biggest challenge
Success in Computer Science ?!?
• HIP 2011 (27 papers accepted)
  • Information retrieval (text / graphic)
  • Projects
  • Text/Character recognition + Calligraphy
  • Visualization
  • Digitization
• HIP 2013 (18 papers accepted)
  • Information Extraction and Retrieval
  • Reconstruction and Degradation
  • Text and Image Recognition
  • Segmentation, Layout Analysis and Databases
• HIP 2015 (18 papers accepted)
  • Text Transcription
  • Segmentation and Layout Analysis
  • Templates, Date Estimation, and Script Specific Approaches
But: ask a random scholar attending the Digital Humanities conference: Do you know about HIP?
Thanks to Mickael Coustaty, IDAKS 2016
Overview of Projects on Hist-OCR
• EU IMPACT Project (2008-2012)
• EU TRANSKRIBUS (2012-2016)
• EU READ (2016-now)
• CIS, LMU München, Post-OCR Correction
• OCR-D project, DFG (since 2015, 1.5 million books)
• Early Modern OCR Project, Texas A&M (2012-2015)
• Kallimachos (University of Würzburg, 2014-2017)
• Ocular, University of California, Berkeley (2013-now)
• …
* If you ask scholars who want to use the systems
Communication Problems and Approaches for Solution
For Computer Science Experts:
• Not a unique representation of knowledge
• Same content has a lot of interpretation
• A description is not shared by all scientists
• Focus on different aspects
For Scholars in the Humanities:
• Methods are not understandable
• Not clear what 95% means
• Systems not accessible
• Too specific solutions
Approaches for solution:
• We need more interdisciplinary discussions
• Reduce black box effects (describe methods, give examples)
• Approximate results are not enough
• Interfaces needed
• Alternatives to be reported
Outline
• Challenge: Why Historical Documents?
• State-of-the-Art
• Recent Trends
• DIVA Services: Approach Towards Interoperability
Processing Steps of Automatic DIA
Steps: Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction
• Binarization: threshold (local, global)
  • Sauvola
• Layout analysis: top-down vs. bottom-up
  • Classification
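The difference between a global and a local threshold can be sketched in a few lines. The block below is a minimal pure-Python illustration, not the implementation used in the talk: it applies the Sauvola rule T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of a local window. The function names, the tiny window size, and the values k = 0.5 and R = 128 are assumptions chosen for the demonstration.

```python
import math

def global_threshold(image, t=128):
    """Binarize with one threshold for the whole page (1 = ink, 0 = background)."""
    return [[1 if px < t else 0 for px in row] for row in image]

def sauvola_threshold(image, window=3, k=0.5, r=128.0):
    """Binarize with a per-pixel threshold T = m * (1 + k * (s / r - 1)).

    m and s are the mean and standard deviation of the grey values inside a
    window centred on the pixel. window, k and r are illustrative defaults.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Grey values of the local window, clipped at the image borders.
            vals = [image[j][i]
                    for j in range(max(0, y - half), min(height, y + half + 1))
                    for i in range(max(0, x - half), min(width, x + half + 1))]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            t = m * (1 + k * (s / r - 1))
            row.append(1 if image[y][x] < t else 0)
        out.append(row)
    return out

# One scan line: bright paper on the left, a dark stain on the right.
line = [[200, 200, 60, 200, 200, 100, 100, 20, 100, 100]]
stained_global = global_threshold(line)
stained_local = sauvola_threshold(line)
```

On this toy scan line the global threshold marks the whole stained half as ink, while the local threshold keeps only the two dark strokes.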
Layout Analysis Methods
• Based on connected components
• XY-cut
• Other histogram-based approaches
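As an illustration of the XY-cut idea, the sketch below recursively splits a binary page (1 = ink) at the widest empty gap of its row or column projection profile. It is one simple variant under stated assumptions: the function name `xy_cut`, the `min_gap` parameter, and the rule of always cutting at the widest gap are choices made for this example, and real systems add noise handling and size-dependent thresholds.

```python
def xy_cut(image, min_gap=1):
    """Return leaf blocks as (top, bottom, left, right), bounds inclusive."""
    min_gap = max(1, min_gap)
    blocks = []

    def profile(top, bottom, left, right, axis):
        # axis 0: ink count per row; axis 1: ink count per column.
        if axis == 0:
            return [sum(image[y][left:right + 1]) for y in range(top, bottom + 1)]
        return [sum(image[y][x] for y in range(top, bottom + 1))
                for x in range(left, right + 1)]

    def widest_gap(counts):
        # Longest run of zeros with ink on both sides, as (length, start).
        best = (0, -1)
        run_start = None
        seen_ink = False
        for i, c in enumerate(counts):
            if c == 0:
                if seen_ink and run_start is None:
                    run_start = i
            else:
                if run_start is not None:
                    best = max(best, (i - run_start, run_start))
                    run_start = None
                seen_ink = True
        return best

    def trim(top, bottom, left, right):
        # Shrink the region to the bounding box of its ink, or None if empty.
        rows = [y for y in range(top, bottom + 1) if any(image[y][left:right + 1])]
        if not rows:
            return None
        cols = [x for x in range(left, right + 1)
                if any(image[y][x] for y in range(rows[0], rows[-1] + 1))]
        return rows[0], rows[-1], cols[0], cols[-1]

    def split(top, bottom, left, right):
        box = trim(top, bottom, left, right)
        if box is None:
            return
        top, bottom, left, right = box
        h_len, h_start = widest_gap(profile(top, bottom, left, right, 0))
        v_len, v_start = widest_gap(profile(top, bottom, left, right, 1))
        if max(h_len, v_len) < min_gap:
            blocks.append(box)          # no usable gap: this is a leaf block
        elif h_len >= v_len:
            cut = top + h_start         # horizontal cut between two row groups
            split(top, cut - 1, left, right)
            split(cut + h_len, bottom, left, right)
        else:
            cut = left + v_start        # vertical cut between two column groups
            split(top, bottom, left, cut - 1)
            split(top, bottom, cut + v_len, right)

    split(0, len(image) - 1, 0, len(image[0]) - 1)
    return blocks

page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
layout = xy_cut(page)
```

The toy page is first cut horizontally at the two empty rows, then the upper part is cut vertically into two columns.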
Processing Steps of Automatic DIA
Steps: Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction
• Binarization: threshold (local, global)
  • Sauvola
• Layout analysis
  • Top-down: XY-cut, histograms
  • Classification
  • Bottom-up: connected components
Feature Extraction
Marti, Bunke (2001)
• Use a sliding window (similar to ASR)
1. Average grey value
2. Center of gravity
3. 2nd order moment (vertical)
4. Uppermost pixel
5. Lowermost pixel
6. Gradient uppermost
7. Gradient lowermost
8. Number of b/w-transitions
9. #pix / d(upper, lower)
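The nine sliding-window features listed above can be sketched for a one-pixel-wide window on a binary text line (1 = ink). This is an approximation for illustration only: the helper names, the normalizations, the use of the previous column for the two gradients, and the all-zero vector for empty columns are assumptions of this sketch and may differ from the original paper.

```python
def contour(image, x):
    """Rows of the uppermost and lowermost ink pixel in column x, or None."""
    rows = [y for y in range(len(image)) if image[y][x]]
    return (rows[0], rows[-1]) if rows else None

def window_features(image, x):
    """Nine features for the one-pixel-wide window at column x."""
    height = len(image)
    col = [image[y][x] for y in range(height)]
    here = contour(image, x)
    if here is None:
        return [0.0] * 9                      # empty column (sketch convention)
    upper, lower = here
    ink = [y for y in range(height) if col[y]]
    prev = contour(image, x - 1) if x > 0 else None
    prev_upper, prev_lower = prev if prev else (upper, lower)
    transitions = sum(1 for a, b in zip(col, col[1:]) if a != b)
    between = sum(col[upper:lower + 1])
    return [
        len(ink) / height,                    # 1. average grey value
        sum(ink) / len(ink),                  # 2. center of gravity
        sum(y * y for y in ink) / len(ink),   # 3. second-order vertical moment
        float(upper),                         # 4. uppermost pixel
        float(lower),                         # 5. lowermost pixel
        float(upper - prev_upper),            # 6. gradient of uppermost pixel
        float(lower - prev_lower),            # 7. gradient of lowermost pixel
        float(transitions),                   # 8. number of b/w transitions
        between / (lower - upper + 1),        # 9. ink pixels / d(upper, lower)
    ]

line_image = [
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1],
]
sequence = [window_features(line_image, x) for x in range(2)]
```

Sliding the window from left to right turns the text-line image into a sequence of feature vectors, which is the input a sequence classifier such as an HMM or a recurrent network expects.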
Classification
• Machine learning methods for sequences
  • HMMs
  • Recurrent NNs
Bidirectional Long Short-Term Memory Network
[Figure: network diagram from Features through Input Layer, three Hidden Layers, and Output Layer to Transcription]
• Importance of context
• Multilayer perceptron network
• Recurrent connections
• Bidirectional
• Memory instead of perceptron
November 1, 2007
Limits of MLP
• Limit: static input/output operation: $(x_1, \dots, x_n) \mapsto y$
• Human brain is capable of memorizing
• Needed for solving many problems
  • Sequence recognition
  • Navigation through a labyrinth
  • Video analysis
  $\big((x_1^{1}, \dots, x_n^{1}), \dots, (x_1^{T}, \dots, x_n^{T})\big) \mapsto (y^{1}, \dots, y^{U}), \quad U \le T$
• Idea: add backward-connections to maintain state
Recurrent Neural Networks (RNNs)
[Figure: Features, Input Layer, Hidden Layer, Output Layer, Output]
• Recurrent connections are added in order to keep information of previous time steps in the network
• Novel equation for the activation: $a^{t} = \sum_i w_i x_i^{t} + \sum_h w_h b_h^{t-1}$
• Context information is used
• How to train those networks …?
Training of RNNs – Backpropagation Through Time
[Figure: network unfolded in time, with Features, Input Layer, and Hidden Layer copies at t-k, …, t-1, t, and Output Layer and Output at t]
• Unfold the network in time
  • k time steps (parameter)
• Perform backpropagation for the output at t
• Repeat this for each t with $0 \le t \le T-1$
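To make the unfolding concrete, here is a minimal sketch of backpropagation through time for a scalar recurrent unit h_t = tanh(w_x * x_t + w_h * h_{t-1}). It simplifies the slide's setting in one respect that should be kept in mind: only the final output contributes to the loss. The function names and the squared-error loss are assumptions of this example, and `k` plays the role of the unfolding depth.

```python
import math

def forward(xs, w_x, w_h):
    """Return the states [h_0, h_1, ..., h_T] with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs

def loss(xs, target, w_x, w_h):
    """Squared error of the last state against the target."""
    return 0.5 * (forward(xs, w_x, w_h)[-1] - target) ** 2

def bptt(xs, target, w_x, w_h, k=None):
    """Gradient of the loss w.r.t. (w_x, w_h), unfolding the last k steps.

    k=None unfolds the whole sequence; a smaller k truncates the unfolding.
    """
    hs = forward(xs, w_x, w_h)
    steps = len(xs) if k is None else min(k, len(xs))
    grad_x = grad_h = 0.0
    delta = hs[-1] - target                      # dLoss / dh_T
    for t in range(len(xs), len(xs) - steps, -1):
        delta_a = delta * (1.0 - hs[t] ** 2)     # back through tanh at step t
        grad_x += delta_a * xs[t - 1]            # shared weight: sum over steps
        grad_h += delta_a * hs[t - 1]
        delta = delta_a * w_h                    # pass the error on to h_{t-1}
    return grad_x, grad_h

xs = [0.5, -0.3, 0.8]
g_x, g_h = bptt(xs, target=0.2, w_x=0.7, w_h=-0.4)
```

Because the weights are shared across the unfolded copies, their gradients are summed over the time steps. With the full unfolding the result agrees with a finite-difference check of `loss`.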
Recurrent Neural Networks (RNN)
[Figure: Features, Input Layer, Hidden Layer, Output Layer, Output]
• Recurrent connections are added in order to keep information of previous time steps in the network
• Novel equation for activation: $a^{t} = \sum_i w_i x_i^{t} + \sum_h w_h b_h^{t-1}$
• Can be written in matrix form: $A^{t} = W_i X^{t} + W_h B^{t-1}$
• Context information is used, however: impossible to store precise information over long durations
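The matrix form of the activation can be sketched directly in plain Python. The block below assumes that the hidden output is B^t = tanh(A^t) and that B starts at zero; the tanh squashing function, the zero initial state, and the names `matvec` and `rnn_forward` are assumptions of this example rather than details given on the slide.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, w_i, w_h):
    """Return the hidden outputs B^t for A^t = W_i X^t + W_h B^{t-1}."""
    b_prev = [0.0] * len(w_h)
    outputs = []
    for x in inputs:
        a = [p + q for p, q in zip(matvec(w_i, x), matvec(w_h, b_prev))]
        b_prev = [math.tanh(v) for v in a]
        outputs.append(b_prev)
    return outputs

# Two hidden units: unit 0 reads the input, unit 1 reads unit 0's previous output.
w_i = [[1.0], [0.0]]
w_h = [[0.0, 0.0], [1.0, 0.0]]
states = rnn_forward([[1.0], [0.0]], w_i, w_h)
```

In this toy network the input at the first step reaches the second unit only one step later, through the recurrent weight, which is the sense in which the network keeps context.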
Vanishing Gradient
• Usual RNNs forget information after a short period of time
• Example: one neuron over 7 time steps
• Information vanishes
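The effect can be reproduced numerically for a single recurrent neuron. The sketch below tracks how strongly the state after t steps still depends on the initial state in h_t = tanh(w_h * h_{t-1}); the weight 0.5, the start value, and the function name are illustrative assumptions.

```python
import math

def gradient_through_time(w_h, steps, h0=0.5):
    """Return |dh_t / dh_0| for t = 1..steps in h_t = tanh(w_h * h_{t-1})."""
    h, grad, history = h0, 1.0, []
    for _ in range(steps):
        h = math.tanh(w_h * h)
        grad *= w_h * (1.0 - h * h)   # chain rule: one factor per time step
        history.append(abs(grad))
    return history

influence = gradient_through_time(w_h=0.5, steps=7)
```

Every time step multiplies the gradient by a factor of magnitude at most |w_h|, so with |w_h| < 1 the influence of the first input shrinks at least geometrically over the 7 steps.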
Core Idea: New Memory Cell Instead of Perceptron
No Vanishing Gradient
$a^{t} = W_i X^{t} + W_h B^{t-1} + W_c S^{t}$
[Figure labels: Output Gate, Output, Neuron now]
O: open (σ = 1), |: closed (σ = 0)
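How open and closed gates protect the stored value can be sketched with a scalar memory cell. The block below uses a common LSTM formulation with input, forget, and output gates and, unlike the equation on the slide, no connection from the cell state to the gates; the function names, the weight layout, and the hand-set biases are assumptions for illustration, since in a trained network the gates are learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, b_prev, s_prev, w):
    """One step of a scalar memory cell; w maps a name to (w_x, w_b, bias)."""
    def pre(name):
        w_x, w_b, bias = w[name]
        return w_x * x + w_b * b_prev + bias
    i = sigmoid(pre("input"))     # input gate: 1 = open, 0 = closed
    f = sigmoid(pre("forget"))    # forget gate: 1 = keep the old state
    o = sigmoid(pre("output"))    # output gate: 1 = expose the state
    s = f * s_prev + i * math.tanh(pre("cell"))
    b = o * math.tanh(s)
    return b, s

# Gates fixed by large biases: input closed, forget open, output closed.
weights = {
    "input": (0.0, 0.0, -50.0),
    "forget": (0.0, 0.0, 50.0),
    "output": (0.0, 0.0, -50.0),
    "cell": (1.0, 0.0, 0.0),
}
b, s = 0.0, 1.0
for x in [0.3, -0.9, 0.5] * 30:
    b, s = lstm_step(x, b, s, weights)
```

With the input gate closed and the forget gate open, the state written at the start survives 90 steps of unrelated input, which is what a plain recurrent neuron cannot do.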
Bidirectional RNN
[Figure: network unfolded over t-1, t, t+1; at each step the Features feed the Input Layer, which feeds a Forward Layer and a Backward Layer (the hidden layers), which feed the Output Layer and Output]
• Trained with backpropagation through time (forward pass through all time steps for each hidden layer sequentially)
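A minimal sketch of the bidirectional idea: one recurrent layer reads the sequence left to right, a second one reads it right to left, and both outputs are available at every time step. Scalar units, the tanh activation, the default weights, and the function names are assumptions of this example.

```python
import math

def run_layer(xs, w_x, w_h):
    """Scalar recurrent layer: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        out.append(h)
    return out

def bidirectional(xs, fwd=(1.0, 0.5), bwd=(1.0, 0.5)):
    """Return (forward, backward) hidden outputs for every time step."""
    forward = run_layer(xs, *fwd)
    # The backward layer processes the reversed sequence; its outputs are
    # reversed again so that index t refers to the same time step.
    backward = run_layer(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

context = bidirectional([0.0, 0.0, 1.0])
```

For an input that is non-zero only at the last step, the forward layer knows nothing about it at the first step, while the backward layer already carries it there as future context.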
Connectionist Temporal Classification
• Additional blank label (b, green)
• Allows application to whole sequences
• Output with normalized likelihood for each word
• Training: objective function is smoothed and recalculated after each iteration (details in references)
• Testing: similar to HMM Viterbi-algorithm
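The role of the blank label is easiest to see in decoding. The sketch below shows greedy best-path decoding, a simpler approximation than the Viterbi-like search mentioned above: take the best label per frame, merge repeated labels, then drop the blanks. The function name, the `-` symbol for the blank, and the toy scores are assumptions of this example.

```python
def ctc_best_path(frame_scores, labels, blank="-"):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks."""
    best = [labels[max(range(len(labels)), key=scores.__getitem__)]
            for scores in frame_scores]
    decoded, previous = [], None
    for label in best:
        # A label is emitted when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)

labels = ["-", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
]
text = ctc_best_path(frames, labels)
```

The blank frame between the two runs of `a` is what allows a doubled letter to be transcribed without any alignment between frames and characters.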
Processing Steps of Automatic DIA
Steps: Preprocessing, Binarization, Layout Analysis, Classification, OCR, Information Extraction
• Binarization: threshold (local, global)
  • Sauvola
• Layout analysis
  • Top-down: XY-cut, histograms
  • Classification
  • Bottom-up: connected components
• Classification, OCR
  • HMM on features
  • LSTM with CTC
  • New: MDLSTM on pixels
Outline
• Challenge: Why Historical Documents?
• State-of-the-Art
• Recent Trends
• DIVA Services: Approach Towards Interoperability