Analyzing and Improving the Quality of a Historical News Collection

Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
Kimmo Kettunen 1, Timo Honkela 1,2, Krister Lindén 2, Pekka Kauppinen 2, Tuula Pääkkönen 1 & Jukka Kervinen 1
Presented by Timo Honkela at the IFLA Pre-Conference, Geneva, Switzerland, 13 August 2014
Kettunen, Honkela, Lindén, Kauppinen, Pääkkönen & Kervinen 2014

Department of Modern Languages, Language Technology (Helsinki)
Center for Preservation and Digitisation (Mikkeli)

HonkeLA KettuNEN KauppiNEN PääkköNEN KerviNEN Lindén www.fmi.fi http://oppimateriaalit.internetix.fi

Structure of the presentation
● Some background on the digitisation process
● Introducing the paper content: analysis and correction of OCR results
● Discussion of future steps: in-depth analysis of newspaper contents to promote research in the humanities and social sciences

Historical newspaper collection
● The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001, 2005).
● This collection contains approximately 1.95 million pages in Finnish and Swedish.
● According to the Legal Deposit law, the National Library of Finland receives a copy of each newspaper and magazine published in Finland.

Digitisation of the historical newspaper collection
● In the post-processing phase, the material is processed so that it can be shared with the library sector, researchers, and the general public.
● The scanned images are enhanced and run through background software and processes that create METS/ALTO metadata (CCS Docworks).
● Optical character recognition (OCR) is carried out at the same time to extract the text content from the materials.

Two channels
● Search and exploration interface (“Digi”)
  – Approximate search, focusing based on time/place, indexed contents, index creation using morphological analysis, etc.
  – Digitalkoot: enables the public to collectively mark and collect articles (crowdsourcing)
● Corpus (FIN-CLARIN)
  – Mainly used by linguists
  – Includes keyword-in-context (n-gram) view
  – Morphological and syntactic analysis results

Search interface http://digi.kansalliskirjasto.fi

FIN-CLARIN corpus www.kielipankki.fi

OCR Challenges
● Despite recent advances in OCR software, challenges remain, as some of the material is very old, with
  – varying paper and print quality,
  – varying number of columns and layout patterns,
  – different languages (mainly Finnish and Swedish but also French, German, etc.), and
  – varying font types (Fraktur and Antiqua)

OCR Challenges
● The amount of material is such that human effort, even when crowdsourced, can only be a partial solution
● Fully or partially automated processes are needed

A very long tail of low frequency forms...

zzhdysvautki Yhdyspankki u, n, ll ? v, u, p ?

tavallisuuden taioafliftiutpn

Sources of complexity
Word (lexeme) → Inflections → Historical differences → Typos → Recognition errors → “Recognized” surface word

Inflections: Complexity of Finnish at the level of word forms
Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications) https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1

Typos
● Not a major source of problems, but they do exist
Most likely not a stain
Basel

Historical differences
● New names and words are introduced all the time
● Even the more static morphological aspects evolve over centuries

Net outcome
● A collection of millions of newspaper pages gives rise to a list of hundreds of millions of different word forms found in the process
● A large proportion of these forms are not correct

Detection and correction
● Improving OCR quality: not considered here
● Improving the OCR output based on linguistic knowledge and statistical considerations
  – Detecting incorrect forms
  – Correcting the incorrect forms
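The detection step can be illustrated with a small Python sketch. It is not the authors' implementation: a plain word set stands in for the morphological analyzer and special dictionaries, and a character trigram check stands in for the n-gram component. The function names, the tiny lexicon and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_ngram_counts(trusted_words, n=3):
    """Count the character n-grams seen in a list of trusted word forms."""
    counts = Counter()
    for word in trusted_words:
        counts.update(char_ngrams(word, n))
    return counts

def known_ngram_ratio(word, counts, n=3):
    """Share of the word's n-grams that also occur in the trusted data."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if counts[g] > 0) / len(grams)

def is_suspicious(word, lexicon, counts, threshold=0.5):
    """Flag a form that is not in the lexicon and looks unlike trusted words.

    The lexicon lookup is a stand-in for a morphological analyzer; the
    threshold is a hypothetical value, not one taken from the paper.
    """
    if word.lower() in lexicon:
        return False
    return known_ngram_ratio(word, counts) < threshold

# Hypothetical toy data; a real run would use a large analyzed word list.
lexicon = {"yhdyspankki", "tavallisuuden", "pankki", "yhdys"}
counts = train_ngram_counts(lexicon)
for form in ["Yhdyspankki", "yhdyspankin", "zzhdysvautki"]:
    print(form, is_suspicious(form, lexicon, counts))
```

An inflected form that is missing from the lexicon can still pass the trigram check, which matters for a language with as many word forms as Finnish.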

Introduction to the basic ideas
● Detection
  – Morphological analyzer
  – Special dictionaries (e.g. names)
  – N-grams
● Correction
  – Transformation rules created through a supervised learning scheme
  – Edit distance approach using corpus statistics
  – Weighted edit distance based on letter shapes
  – Future: context information (problem of sparsity)
Please see the paper for methodological details and analysis results
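Two of the correction ideas above, edit distance with corpus statistics and edit distance weighted by letter shape, can be combined in a short Python sketch. This is only an illustration under stated assumptions, not the method described in the paper: the confusion pairs, their costs, the frequency table and the distance cut-off are hypothetical values, and the function names are made up for this example.

```python
def weighted_edit_distance(source, target, sub_cost):
    """Levenshtein distance in which listed letter pairs are cheaper to substitute.

    sub_cost maps (letter, letter) pairs to a cost below 1.0; unlisted
    substitutions, insertions and deletions all cost 1.0.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)
    for j in range(1, cols):
        dist[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            if a == b:
                cost = 0.0
            else:
                # Treat confusion pairs as symmetric.
                cost = sub_cost.get((a, b), sub_cost.get((b, a), 1.0))
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,       # deletion
                dist[i][j - 1] + 1.0,       # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[rows - 1][cols - 1]

def best_correction(word, corpus_freq, sub_cost, max_distance=2.0):
    """Pick the closest known form; break ties by higher corpus frequency."""
    candidates = []
    for candidate, freq in corpus_freq.items():
        d = weighted_edit_distance(word, candidate, sub_cost)
        if d <= max_distance:
            candidates.append((d, -freq, candidate))
    return min(candidates)[2] if candidates else None

# Hypothetical confusion costs and corpus counts, for illustration only.
sub_cost = {("u", "n"): 0.3, ("v", "p"): 0.3}
corpus_freq = {"pankki": 120, "penkki": 40, "yhdyspankki": 15}
print(best_correction("paukki", corpus_freq, sub_cost))
```

In practice the substitution costs would be derived from measured letter-shape similarity rather than set by hand, and the candidate list would be restricted before scoring, since comparing every suspicious form against a full corpus vocabulary is expensive.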


Similarity diagram of Fraktur letter shapes (a self-organizing map)

Research direction
Socio-Historical Text Mining of Newspaper Collections

Areas of analysis
● Named entity recognition (people, organizations, places, events)
● Time series analysis
● Social network analysis
● Topic modeling
cf. Virginie Fortun's presentation

Areas of analysis
● Multidimensional sentiment analysis
● Analysis of social and historical context
● Intercultural and multilingual analysis
● Analysis of point of view
● Analysis of subjective understanding
Stella Wisdom & Neil Smyth

Earlier related results

Learning meaning from context: Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995

Multidimensional sentiment using the PERMA model
● Seligman and his colleagues have developed the PERMA model, which addresses different aspects of well-being.
● The model includes five components related to subjective well-being:
  – Positive emotion (P),
  – Engagement (E),
  – Relationships (R),
  – Meaning (M) and
  – Achievement (A)
Honkela, Korhonen, Lagus & Saarinen 2014

PERMA profiles of different corpora
Honkela, Korhonen, Lagus & Saarinen 2014

Analysis of the subjective meaning: word 'health'
Analysis of the State of the Union Addresses
Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar: Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity (IJCNN 2012)

Socio-Historical Text Mining of Newspaper Collections
A call for interdisciplinary international collaboration
Libraries, researchers within journalism, corpus linguistics, history, sociology, political science, psychology, computer science, machine learning, etc.

Merci! Danke schön! Grazie! Multumesc! ¡Gracias! Thank you! Kiitos! Tack! 謝謝！ Σας ευχαριστούμε!
