Finding Commodities in the Nineteenth Century British World: A collaboration between text miners and historians Beatrice Alex balex@inf.ed.ac.uk Robarts Centre for Canadian Studies, York University, Toronto October 11th 2013
OVERVIEW Trading Consequences Text mining Lexicon/thesaurus creation Evaluation and fine-tuning Robarts Centre for Canadian Studies, York University, 11/10/2013
TRADING CONSEQUENCES
PROJECT OVERVIEW JISC/SSHRC Digging into Data Challenge II, Jan 2012 - Dec 2013. Text mining, data extraction and information visualisation to explore big historical datasets. Focus on how commodities were traded across the globe in the 19th century. Help historians to discover novel patterns and explore new research questions.
PROJECT TEAM
Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
Colin Coates, Jim Clifford: historical analysis
James Reid: data management & integration
Aaron Quigley, Uta Hinrichs: information visualisation
TRADITIONAL HISTORICAL RESEARCH Global Fats Supply 1894-98 Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.
DOCUMENT COLLECTIONS
Collection               # of Documents   # of Images
HCPP                     118,526          6,448,739
ECO                      83,016           3,938,758
Kew Directors’ Letters   14,340           n/a
Confidential Prints      1,315            140,010
FCOC (partial)           1,000            41,611
NEW: NCCO AATW           4,725            948,773 (OCRed: 450,841)
SYSTEM [Architecture diagram. Components: Documents, Text Mining, Lexicons & Gazetteers, Annotated Documents, XML2RDB, Commodities RDB, Commodities Ontology (SKOS), Query Interface, Visualisation]
USER INTERFACE
COMMODITY RELATIONS
TEXT MINING
TEXT MINING Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. relational database or linked data). Is very useful for analysing large text collections automatically. (data paralysis)
TM IN DIGITAL HISTORY Goal: By analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses. Change to traditional history.
TEXT MINING TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation, syntactic parsing (chunking). Our focus is on named entity recognition, entity grounding and relation extraction.
MINED INFORMATION
Example sentence:
Normalised and grounded entities:
commodity: cassia bark
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
MINED INFORMATION
Example sentence:
Extracted entity attributes and relations:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
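One way to picture the structured output listed above is as a small record type. The following is a minimal sketch only: the class and field names (`Location`, `CommodityMention`, `relations`) are assumptions made for illustration and are not the project's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # country code, or None where the slide shows n/a


@dataclass(frozen=True)
class CommodityMention:
    commodity: str
    year: int
    quantity: float
    unit: str
    origin: Location
    destination: Location

    def relations(self) -> List[Tuple[str, str, str]]:
        """Return the three relation types shown on the slide as tuples."""
        return [
            ("commodity-date", self.commodity, str(self.year)),
            ("commodity-location", self.commodity, self.origin.name),
            ("commodity-location", self.commodity, self.destination.name),
        ]


# Values taken from the slide's example; the structure itself is hypothetical.
padang = Location("Padang", -0.94924, 100.35427, "ID")
america = Location("America", 39.76, -98.50, None)
mention = CommodityMention("cassia bark", 1871, 6127, "piculs", padang, america)

for relation in mention.relations():
    print(relation)
```

Flat tuples like these map naturally onto rows in a relational database, which is the target format named earlier in the talk.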
NOISY DATA Optical character recognition (OCR) output contains many errors, and the structure of the page layout is often lost. Contributing factors: sophistication of the OCR engine and scanning equipment; quality of the original print and paper; use of historical language; information in page margins (headers, page numbers, etc.); information in tables; language of the text.
FIXING NOISY DATA Text normalisation and correction. End-of-line soft hyphen removal: dehyphenate all token-splitting hyphens using a dictionary-based approach. “False f”-to-s conversion: convert all false f characters to s using a corpus. Example: reduced the number of words unrecognised by a spell checker from 61 to 21 -> 67%; on average a 12% reduction in word error rate in a random sample (Alex et al., 2012).
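The two correction steps can be sketched as follows. This is an illustrative approximation, not the project's implementation: the tiny `VOCAB` set stands in for both the dictionary and the corpus evidence mentioned on the slide, and the function names are hypothetical.

```python
import re
from itertools import product

# Toy word list standing in for the dictionary/corpus (an assumption).
VOCAB = {"commodity", "several", "some", "ships", "first", "fast", "of"}


def dehyphenate(text, vocab=VOCAB):
    """Join a word split by an end-of-line hyphen when the joined form is known.

    If the joined form is not in the vocabulary, the hyphen is kept and only
    the line break is removed.
    """
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        return joined if joined.lower() in vocab else left + "-" + right

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)


def fix_false_f(token, vocab=VOCAB):
    """Replace 'f' with 's' where that turns an unknown token into a known word."""
    if token.lower() in vocab or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of f -> s swaps; fine for the few f's in one word.
    for choice in product([False, True], repeat=len(positions)):
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in vocab:
            return candidate
    return token


def normalise(text, vocab=VOCAB):
    """Apply dehyphenation first, then false-f correction token by token."""
    text = dehyphenate(text, vocab)
    return re.sub(r"[A-Za-z]+", lambda m: fix_false_f(m.group(0), vocab), text)


print(normalise("feveral fhips carried the commo-\ndity"))
```

Checking a known-word list before rewriting is what keeps genuine words such as "fast" or "of" from being altered.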
FIXING NOISY DATA
HOW NOISY IS TOO NOISY? qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT 'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx 'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx Extract from document 10.2307/60238580 in FCOC.
COMMODITY LEXICON CREATION
EXTRACTED INFO
Example sentence:
Normalised and grounded entities:
commodity: cassia bark [concept: Cinnamomum cassia]
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
SEED SET Customs import records.
STRUCTURE How should synonyms be represented? How should commodity mentions be grounded? How do we group commodities together by type?
SKOS Simple Knowledge Organization System. Designed to bridge between thesauri, classifications and legacy KOS on the one hand and OWL-based formal ontologies on the other. Looser semantics than strict hierarchies.
EXAMPLE
ex:Cassia_bark rdf:type skos:Concept
ex:Cassia_bark skos:prefLabel “cassia bark”@en
ex:Cassia_bark skos:altLabel “cinnamomum cassia”@en
EXAMPLE
ex:Cassia_bark rdf:type skos:Concept
ex:Cassia_bark skos:prefLabel “cassia bark”@en
ex:Cassia_bark skos:altLabel “cinnamomum cassia”@en
ex:Cassia_bark skos:broader ex:Commodity
ex:Mahogany rdf:type skos:Concept
ex:Mahogany skos:prefLabel “mahogany”@en
ex:Mahogany skos:broader ex:Commodity
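To show how preferred and alternative labels let different surface forms ground to one concept, here is a minimal sketch that uses a plain dictionary in place of an RDF store. The `CONCEPTS` layout and the helper names `build_label_index` and `ground` are assumptions for illustration; the slides do not say how the lookup was implemented.

```python
# SKOS-style concepts from the example, held in a plain dict (hypothetical layout).
CONCEPTS = {
    "ex:Cassia_bark": {
        "prefLabel": "cassia bark",
        "altLabels": ["cinnamomum cassia"],
        "broader": "ex:Commodity",
    },
    "ex:Mahogany": {
        "prefLabel": "mahogany",
        "altLabels": [],
        "broader": "ex:Commodity",
    },
}


def build_label_index(concepts):
    """Map every lower-cased label (preferred or alternative) to its concept URI."""
    index = {}
    for uri, entry in concepts.items():
        for label in [entry["prefLabel"], *entry["altLabels"]]:
            index[label.lower()] = uri
    return index


def ground(mention, index):
    """Return the concept URI for a text mention, or None if it is not in the lexicon."""
    return index.get(mention.strip().lower())


index = build_label_index(CONCEPTS)
print(ground("Cassia Bark", index))        # ex:Cassia_bark
print(ground("Cinnamomum cassia", index))  # ex:Cassia_bark
print(ground("teak", index))               # None
```

Because synonyms resolve to the same URI, mention counts can be aggregated per concept rather than per spelling.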
LEXICON DEVELOPMENT Concepts labeled by URIs (global IDs): reuse rather than coin. V1: Umbel (derived from OpenCyc). V2: DBpedia (ontology based on Wikipedia).
EXAMPLE
HIERARCHY [Diagram: root concept, Wikimedia categories, leaf concepts]
SIBLING ACQUISITION ? DBpedia ?
LEXICON BOOTSTRAPPING Seed lexicon: ~600. DBpedia-extended lexicon: ~17,000. With pluralisation of single-word entries: ~20,500.
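The pluralisation step can be sketched as below. The suffix rules and the example entries (other than cassia bark and mahogany) are simplified assumptions for illustration; the slides do not state which pluralisation rules were actually used.

```python
def pluralise(word):
    """Naive English pluralisation (an assumed rule set, not the project's)."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def add_plurals(lexicon):
    """Add a plural form for every single-word entry, mapped to the same concept."""
    expanded = dict(lexicon)
    for term, concept in lexicon.items():
        if " " not in term:  # multi-word entries are left untouched
            expanded.setdefault(pluralise(term), concept)
    return expanded


# Hypothetical lexicon fragment: term -> concept identifier.
lexicon = {
    "mahogany": "ex:Mahogany",
    "cassia bark": "ex:Cassia_bark",
    "wax": "ex:Wax",
    "hide": "ex:Hide",
}
expanded = add_plurals(lexicon)
print(sorted(expanded))
```

Each plural points at the concept of its singular, so the number of lexicon entries grows while the number of concepts stays the same.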
EVALUATION
INTERMEDIATE RESULTS Lexicon with 20,476 entries and 16,928 concepts. Need to evaluate lexicon precision and recall. Frequency distribution of all commodities detected in our data (31,169,104 in 7 billion words). Found 5,841 different commodities (belonging to 4,466 concepts) in the data: 28.5% of commodities in the lexicon.
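The coverage figure above can be checked with simple arithmetic. The sketch below recomputes it from the reported counts and derives two further ratios (concept coverage and mention density) that the slide does not report; "7 billion words" is treated as an approximate corpus size, so the density is indicative only.

```python
# Counts as reported on the slide.
lexicon_entries = 20_476
lexicon_concepts = 16_928
found_entries = 5_841
found_concepts = 4_466
mentions = 31_169_104
corpus_words = 7_000_000_000  # approximate figure from the slide

entry_coverage = found_entries / lexicon_entries
concept_coverage = found_concepts / lexicon_concepts  # derived, not on the slide
mentions_per_1000_words = mentions / corpus_words * 1000  # derived, not on the slide

print(f"lexicon entries found in data: {entry_coverage:.1%}")    # 28.5%
print(f"concepts found in data:        {concept_coverage:.1%}")  # 26.4%
print(f"commodity mentions per 1,000 words: {mentions_per_1000_words:.2f}")  # 4.45
```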