  1. Finding Commodities in the Nineteenth Century British World: A collaboration between text miners and historians
     Beatrice Alex, balex@inf.ed.ac.uk
     Robarts Centre for Canadian Studies, York University, Toronto, October 11th 2013

  2. OVERVIEW
     - Trading Consequences
     - Text mining
     - Lexicon/thesaurus creation
     - Evaluation and fine-tuning
     Robarts Centre for Canadian Studies, York University, 11/10/2013

  3. TRADING CONSEQUENCES

  4. PROJECT OVERVIEW
     - JISC/SSHRC Digging into Data Challenge II, Jan 2012 - Dec 2013.
     - Text mining, data extraction and information visualisation to explore big historical datasets.
     - Focus on how commodities were traded across the globe in the 19th century.
     - Help historians to discover novel patterns and explore new research questions.

  5. PROJECT TEAM
     - Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
     - Colin Coates, Jim Clifford: historical analysis
     - James Reid: data management & integration
     - Aaron Quigley, Uta Hinrichs: information visualisation

  7. TRADITIONAL HISTORICAL RESEARCH
     - Global Fats Supply 1894-98
     - Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.

  10. DOCUMENT COLLECTIONS
      Collection                # of Documents    # of Images
      HCPP                             118,526      6,448,739
      ECO                               83,016      3,938,758
      Kew Directors’ Letters            14,340            n/a
      Confidential Prints                1,315        140,010
      FCOC (partial)                     1,000         41,611
      NEW: NCCO AATW                     4,725        948,773 (OCRed: 450,841)

  11. SYSTEM
      Diagram labels: Documents; Lexicons & Gazetteers; Text Mining; Annotated Documents; XML2RDB; Commodities RDB; Commodities Ontology; SKOS; Query Interface; Visualisation.

  12. USER INTERFACE

  13. COMMODITY RELATIONS

  14. TEXT MINING

  15. TEXT MINING
      - Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.
      - Turns unstructured text into structured data (e.g. relational database or linked data).
      - Is very useful for analysing large text collections automatically (data paralysis).

  16. TM IN DIGITAL HISTORY
      - Goal: by analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses.
      - Change to traditional history.

  17. TEXT MINING
      - TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation, syntactic parsing (chunking).
      - Our focus is on named entity recognition, entity grounding and relation extraction.
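
The first two pre-processing steps on this slide can be illustrated with a deliberately naive sketch in plain Python. The regular expressions, the function names and the sample text are assumptions made for illustration only; they are not the project's pipeline, and steps such as part-of-speech tagging, lemmatisation and chunking normally need trained models rather than rules like these.

```python
import re


def split_sentences(text):
    """Naive sentence detection: break after . ! or ? when a capital letter follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


def tokenise(sentence):
    """Naive tokenisation: numbers with separators, then words, then single punctuation marks."""
    return re.findall(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]", sentence)


# Invented sample text (the slide's own example sentence is not in the transcript).
text = "Cassia bark was shipped in 1871. The quantity was 6,127 piculs."
for sentence in split_sentences(text):
    print(tokenise(sentence))
```

Keeping "6,127" as a single token matters later on, because quantity recognition would otherwise see two separate numbers.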

  18. MINED INFORMATION
      Example sentence:

  19. MINED INFORMATION
      Example sentence:
      Normalised and grounded entities:
      - commodity: cassia bark
      - date: 1871 (year=1871)
      - location: Padang (lat=-0.94924;long=100.35427;country=ID)
      - location: America (lat=39.76;long=-98.50;country=n/a)
      - quantity + unit: 6,127 piculs

  20. MINED INFORMATION
      Example sentence:
      Extracted entity attributes and relations:
      - origin location: Padang
      - destination location: America
      - commodity–date relation: cassia bark – 1871
      - commodity–location relation: cassia bark – Padang
      - commodity–location relation: cassia bark – America
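
One way to picture the mined output on these two slides is as small typed records. The sketch below is an assumption about shape only: the class and field names are hypothetical and do not describe the project's actual database schema; the values are the ones shown on the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # None where the slide gives country=n/a


@dataclass(frozen=True)
class Relation:
    kind: str       # e.g. "commodity-date" or "commodity-location"
    commodity: str
    value: str


# Grounded entities as listed on the slide.
padang = Location("Padang", -0.94924, 100.35427, "ID")
america = Location("America", 39.76, -98.50, None)

# Relations as listed on the slide.
relations = [
    Relation("commodity-date", "cassia bark", "1871"),
    Relation("commodity-location", "cassia bark", padang.name),
    Relation("commodity-location", "cassia bark", america.name),
]

locations_for_cassia = [r.value for r in relations if r.kind == "commodity-location"]
print(locations_for_cassia)  # ['Padang', 'America']
```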

  21. NOISY DATA
      Optical character recognition (OCR) output contains many errors, and the structure of the page layout is often lost. Contributing factors:
      - Sophistication of the OCR engine and scanning equipment.
      - Quality of the original print and paper.
      - Use of historical language.
      - Information in page margins (header, page numbers, etc.).
      - Information in tables.
      - Language of the text.

  22. FIXING NOISY DATA
      Text normalisation and correction:
      - End-of-line soft hyphen removal: dehyphenate all token-splitting hyphens using a dictionary-based approach.
      - “False f”-to-s conversion: convert all false f characters to s using a corpus.
      Example: reduced number of words unrecognised by spell checker from 61 to 21 -> 67%; on average 12% reduction in word error rate in a random sample (Alex et al., 2012).
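
The two corrections named on this slide can be sketched roughly as follows. This is a minimal illustration, not the method of Alex et al. (2012): `KNOWN_WORDS` is a hypothetical toy word list standing in for a real dictionary or corpus, and both function names are invented here.

```python
from itertools import product

# Hypothetical toy word list; a real system would use a large dictionary or corpus.
KNOWN_WORDS = {"commodities", "first", "vessel", "passage", "off", "self-evident"}


def dehyphenate(left, right, known=KNOWN_WORDS):
    """Join 'left-' and 'right' across a line break if the joined form is a known word."""
    stem = left.rstrip("-")
    joined = stem + right
    if joined.lower() in known:
        return joined
    return stem + "-" + right  # otherwise treat the hyphen as genuine


def fix_false_f(word, known=KNOWN_WORDS):
    """Try swapping 'f' characters for 's' until a known word is found."""
    if word.lower() in known:
        return word  # already a real word, e.g. "off"
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for choice in product([False, True], repeat=len(positions)):
        chars = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in known:
            return candidate
    return word  # no known variant found; leave unchanged


print(dehyphenate("commod-", "ities"))  # commodities
print(dehyphenate("self-", "evident"))  # self-evident
print(fix_false_f("veffel"))            # vessel
print(fix_false_f("firft"))             # first
```

Checking the unchanged word first is what stops genuine words containing f from being rewritten.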

  23. FIXING NOISY DATA

  24. FIXING NOISY DATA

  25. HOW NOISY IS TOO NOISY?
      qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT 'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx 'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx
      Extract from document 10.2307/60238580 in FCOC.

  27. COMMODITY LEXICON CREATION

  28. EXTRACTED INFO
      Example sentence:
      Normalised and grounded entities:
      - commodity: cassia bark [concept: Cinnamomum cassia]
      - date: 1871 (year=1871)
      - location: Padang (lat=-0.94924;long=100.35427;country=ID)
      - location: America (lat=39.76;long=-98.50;country=n/a)
      - quantity + unit: 6,127 piculs

  29. SEED SET
      Customs import records.

  30. SEED SET

  31. SEED SET

  32. STRUCTURE
      - How should synonyms be represented?
      - How should commodity mentions be grounded?
      - How do we group commodities together by type?

  33. SKOS
      - Simple Knowledge Organization System
      - Designed to bridge between thesauri, classifications and legacy KOS on one side and OWL-based formal ontologies on the other
      - Looser semantics than strict hierarchies

  34. EXAMPLE
      ex:Cassia_Bark rdf:type skos:Concept
      ex:Cassia_Bark skos:prefLabel "cassia bark"@en
      ex:Cassia_Bark skos:altLabel "cinnamomum cassia"@en

  35. EXAMPLE
      ex:Cassia_bark rdf:type skos:Concept
      ex:Cassia_bark skos:prefLabel "cassia bark"@en
      ex:Cassia_bark skos:altLabel "cinnamomum cassia"@en
      ex:Cassia_bark skos:broader ex:Commodity
      ex:Mahogany rdf:type skos:Concept
      ex:Mahogany skos:prefLabel "mahogany"@en
      ex:Mahogany skos:broader ex:Commodity
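
The SKOS example can be read as a direct answer to the three structure questions: labels carry synonyms, concept identifiers ground mentions, and skos:broader groups commodities. The sketch below shows that reading with plain Python tuples rather than an RDF library. The helper names are hypothetical, language tags are dropped for brevity, and the species name is spelled as on the earlier slide (Cinnamomum cassia).

```python
# Triples as (subject, predicate, object), mirroring the slide's example.
TRIPLES = [
    ("ex:Cassia_bark", "rdf:type", "skos:Concept"),
    ("ex:Cassia_bark", "skos:prefLabel", "cassia bark"),
    ("ex:Cassia_bark", "skos:altLabel", "cinnamomum cassia"),
    ("ex:Cassia_bark", "skos:broader", "ex:Commodity"),
    ("ex:Mahogany", "rdf:type", "skos:Concept"),
    ("ex:Mahogany", "skos:prefLabel", "mahogany"),
    ("ex:Mahogany", "skos:broader", "ex:Commodity"),
]


def label_index(triples):
    """Map every preferred or alternative label to the concept it grounds to."""
    index = {}
    for subject, predicate, obj in triples:
        if predicate in ("skos:prefLabel", "skos:altLabel"):
            index[obj.lower()] = subject
    return index


def narrower(triples, parent):
    """List the concepts that name `parent` as their broader concept."""
    return sorted(s for s, p, o in triples if p == "skos:broader" and o == parent)


index = label_index(TRIPLES)
print(index["cinnamomum cassia"])         # ex:Cassia_bark
print(narrower(TRIPLES, "ex:Commodity"))  # ['ex:Cassia_bark', 'ex:Mahogany']
```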

  36. LEXICON DEVELOPMENT
      - Concepts labeled by URIs (global IDs): reuse rather than coin
      - V1: Umbel (derived from OpenCyc)
      - V2: DBpedia (ontology based on Wikipedia)

  38. EXAMPLE

  39. EXAMPLE

  40. HIERARCHY
      Diagram labels: root concept; Wikimedia categories; leaf concepts.

  41. SIBLING ACQUISITION ? DBpedia ?
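
The slide itself only shows a diagram, so the following is one plausible reading rather than the project's documented procedure: starting from a seed concept, look up the categories it belongs to and treat the other members of those categories as candidate siblings. The category names and memberships below are hypothetical toy data, not real DBpedia content, and `acquire_siblings` is an invented helper.

```python
# Hypothetical category memberships; real ones would be looked up in DBpedia.
CATEGORY_MEMBERS = {
    "Category:Spices": {"Cassia_bark", "Nutmeg", "Clove"},
    "Category:Wood": {"Mahogany", "Teak"},
}


def acquire_siblings(seed, category_members):
    """Return concepts that share at least one category with the seed concept."""
    siblings = set()
    for members in category_members.values():
        if seed in members:
            siblings |= members - {seed}
    return sorted(siblings)


print(acquire_siblings("Cassia_bark", CATEGORY_MEMBERS))  # ['Clove', 'Nutmeg']
print(acquire_siblings("Mahogany", CATEGORY_MEMBERS))     # ['Teak']
```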

  42. LEXICON BOOTSTRAPPING
      - Seed lexicon: ~600
      - DBpedia extended lexicon: ~17,000
      - With pluralisation of single word entries: ~20,500
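
The last bootstrapping step, adding plural forms for single-word entries, might look roughly like the sketch below. The pluralisation rules are a crude assumption for illustration and are not the rules the project used; the seed entries are a toy sample.

```python
def pluralise(word):
    """Very rough English pluralisation rules (assumed for illustration)."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def expand_lexicon(entries):
    """Add plural forms for single-word entries only; multi-word entries are kept as they are."""
    expanded = set(entries)
    for entry in entries:
        if " " not in entry:
            expanded.add(pluralise(entry))
    return expanded


seed = {"mahogany", "cassia bark", "box", "tea"}
print(sorted(expand_lexicon(seed)))
# ['box', 'boxes', 'cassia bark', 'mahoganies', 'mahogany', 'tea', 'teas']
```

Rule-based expansion of this kind over-generates (for example "mahoganies"), which is one reason lexicon precision needs checking afterwards.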

  43. EVALUATION

  44. INTERMEDIATE RESULTS
      - Lexicon with 20,476 entries and 16,928 concepts.
      - Need to evaluate lexicon precision and recall.
      - Frequency distribution of all commodities detected in our data (31,169,104 mentions in 7 billion words).
      - Found 5,841 different commodities (belonging to 4,466 concepts) in the data: 28.5% of commodities in the lexicon.
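
The 28.5% figure follows from the counts on this slide, and the precision and recall still to be measured have standard set-based definitions. The sketch below reproduces the arithmetic and shows those definitions on hypothetical toy sets; the example sets are invented, and "7 billion words" is treated as a round figure.

```python
# Counts taken from the slide.
lexicon_entries = 20_476
commodities_found = 5_841
mentions = 31_169_104
words = 7_000_000_000  # round figure from the slide

coverage = commodities_found / lexicon_entries
print(f"{coverage:.1%} of lexicon entries found in the data")  # 28.5%
print(f"{mentions / words * 1000:.2f} mentions per 1,000 words")  # 4.45


def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted items against a gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy sets, for illustration only.
predicted = {"tea", "mahogany", "victoria"}
gold = {"tea", "mahogany", "cassia bark", "sugar"}
precision, recall = precision_recall(predicted, gold)
print(round(precision, 2), recall)  # 0.67 0.5
```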
