
Bootstrapping a historical commodities lexicon with SKOS and DBpedia



  1. Bootstrapping a historical commodities lexicon with SKOS and DBpedia. Ewan Klein, Beatrice Alex, Jim Clifford @digtrade LaTeCH 2014, Gothenburg, April 26th 2014

  2. PROJECT OVERVIEW
     - JISC/SSHRC Digging into Data Challenge II, Jan 2012 - Dec 2013.
     - Text mining, data extraction and information visualisation to explore big historical datasets.
     - Focus on how commodities were traded across the globe in the 19th century.
     - Help historians to discover novel patterns and explore new research questions.

  3. PROJECT TEAM
     - Ewan Klein, Beatrice Alex, Claire Grover, Richard Tobin: text mining
     - Colin Coates, Andrew Watson: historical analysis
     - Jim Clifford: historical analysis
     - James Reid, Nicola Osborne: data management, social media
     - Aaron Quigley, Uta Hinrichs: information visualisation

  4. COMMODITY LEXICON CREATION

  5. SEED SET
     - Seed set from customs import records.



  8. STRUCTURE
     - How should synonyms be represented? donkey ~ ass
     - How should commodity mentions be grounded? cinnamon -> Cinnamomum verum -> Cinnamomum cassia
     - How do we group commodities together by type? lemons, limes, oranges -> citrus fruit

  9. SKOS
     - Simple Knowledge Organisation System.
     - A W3C initiative for the representation of thesauri, classification schemes, taxonomies etc.
     - A standard way to represent knowledge organisation systems using the Resource Description Framework.
     - Looser semantics than strict hierarchies.

  10. EXAMPLE (graph diagram, written out as triples)
      - dbp:Cassia_bark rdf:type skos:Concept
      - dbp:Cassia_bark skos:prefLabel "cassia bark"@en
      - dbp:Cassia_bark skos:altLabel "cinnamomum cassia"@en
      - dbp:Cassia_bark skos:broader dbp:Commodity
      - dbp:Mahogany rdf:type skos:Concept
      - dbp:Mahogany skos:prefLabel "mahogany"@en
      - dbp:Mahogany skos:broader dbp:Commodity
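
The example slide above can be read as a small set of subject-predicate-object triples. The following is a minimal sketch, using only the Python standard library, of how such triples answer the questions on the STRUCTURE slide: labels map surface forms to concepts, and `skos:broader` groups concepts. The helper names (`build_label_index`, `narrower_than`) are hypothetical and are not taken from the project's code; the `@en` language tags are omitted for brevity.

```python
from collections import defaultdict

# Triples transcribed from the example slide. The genus name is spelled
# "cinnamomum" here; language tags (@en) are left out (assumption: English only).
TRIPLES = [
    ("dbp:Cassia_bark", "rdf:type", "skos:Concept"),
    ("dbp:Cassia_bark", "skos:prefLabel", "cassia bark"),
    ("dbp:Cassia_bark", "skos:altLabel", "cinnamomum cassia"),
    ("dbp:Cassia_bark", "skos:broader", "dbp:Commodity"),
    ("dbp:Mahogany", "rdf:type", "skos:Concept"),
    ("dbp:Mahogany", "skos:prefLabel", "mahogany"),
    ("dbp:Mahogany", "skos:broader", "dbp:Commodity"),
]

LABEL_PROPERTIES = {"skos:prefLabel", "skos:altLabel"}


def build_label_index(triples):
    """Map each lower-cased label to the set of concepts that carry it."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in LABEL_PROPERTIES:
            index[obj.lower()].add(subject)
    return index


def narrower_than(triples, broader_concept):
    """Return concepts linked directly to broader_concept via skos:broader."""
    return sorted(
        s for s, p, o in triples if p == "skos:broader" and o == broader_concept
    )


label_index = build_label_index(TRIPLES)
print(label_index["cinnamomum cassia"])
print(narrower_than(TRIPLES, "dbp:Commodity"))
```

Because preferred and alternative labels point at the same concept, a synonym in the text is grounded to the same identifier as the preferred term.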

  11. SEED SET IN SKOS

      | Concept | prefLabel | altLabel |
      |---|---|---|
      | dbp:Cork_(material) | cork | |
      | dbp:Cornmeal | cornmeal | indian corn meal, corn meal |
      | dbp:Cotton | cotton | cotton fiber |
      | dbp:Cotton_seed | cotton seed | |
      | dbp:Cowry | cowry | cowrie |
      | dbp:Coypu | coypu | nutria, river rat |
      | dbp:Cranberry | cranberry | |
      | dbp:Croton_cascarilla | croton cascarilla | cascarilla |
      | dbp:Croton_oil | croton oil | |
      | dbp:Cubeb | cubeb | cubib, java pepper |
      | dbp:Culm | culm | |
      | dbp:Dammar_gum | dammar gum | gum dammar |
      | dbp:Deer | deer | |
      | dbp:Dipsacus | dipsacus | teasel |
      | dbp:Domestic_sheep | domestic sheep | |
      | dbp:Donkey | donkey | ass |
      | dbp:Dracaena_cinnabari | dracaena cinnabari | sanguis draconis, gum dragon's blood |

  12. EXAMPLE

  13. SIBLING ACQUISITION
      - Diagram stages: base thesaurus, category acquisition, sibling acquisition.

  14. HIERARCHY
      - Diagram levels: root concept, Wikimedia categories, leaf concepts.
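
The stages and levels named on the two slides above (category acquisition, sibling acquisition, and a root / category / leaf hierarchy) can be illustrated with a toy sketch. It assumes an in-memory mapping from concepts to categories; the category names and all function names below are hypothetical, and the real project drew its categories from DBpedia rather than from a hand-written dictionary.

```python
# Toy stand-in for concept -> category data (assumption: invented for illustration).
CONCEPT_CATEGORIES = {
    "Lemon": {"Citrus fruit"},
    "Lime": {"Citrus fruit"},
    "Orange": {"Citrus fruit"},
    "Donkey": {"Livestock"},
    "Mule": {"Livestock"},
    "Gothenburg": {"Cities"},
}


def acquire_categories(seeds, concept_categories):
    """Category acquisition: collect every category of every seed concept."""
    categories = set()
    for concept in seeds:
        categories |= concept_categories.get(concept, set())
    return categories


def acquire_siblings(seeds, concept_categories):
    """Sibling acquisition: non-seed concepts sharing a category with a seed."""
    categories = acquire_categories(seeds, concept_categories)
    return {
        concept
        for concept, concept_cats in concept_categories.items()
        if concept not in seeds and concept_cats & categories
    }


def broader_links(root, concepts, concept_categories, categories):
    """Build (narrower, broader) pairs: leaf -> category and category -> root."""
    links = {(category, root) for category in categories}
    for concept in concepts:
        for category in concept_categories.get(concept, set()) & categories:
            links.add((concept, category))
    return links


seeds = {"Lemon", "Donkey"}
categories = acquire_categories(seeds, CONCEPT_CATEGORIES)
siblings = acquire_siblings(seeds, CONCEPT_CATEGORIES)
links = broader_links("Commodity", seeds | siblings, CONCEPT_CATEGORIES, categories)
print(sorted(siblings))
```

Concepts that share no category with any seed (here, "Gothenburg") are never pulled in, which is what keeps the expansion anchored to the seed set.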

  15. LEXICON IN XML

  16. LEXICON BOOTSTRAPPING
      - Seed lexicon: 319 concepts.
      - Extended lexicon: 16,928 concepts.
      - With pluralisation of single-word entries: 20,476 entries.
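
The slide above mentions pluralising single-word entries but does not say which rules were used. The following is a minimal sketch under the assumption of a few naive English suffix rules; `pluralise` and `expand_with_plurals` are hypothetical names, and irregular nouns (for example "deer") would need separate handling.

```python
def pluralise(word):
    """Naive English pluralisation covering common suffix patterns only."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def expand_with_plurals(entries):
    """Add a plural form for each single-word entry; keep multi-word entries as-is."""
    expanded = set(entries)
    for entry in entries:
        if " " not in entry:
            expanded.add(pluralise(entry))
    return expanded


print(sorted(expand_with_plurals({"cranberry", "donkey", "cotton seed"})))
```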

  17. EVALUATION

  18. DOCUMENT COLLECTIONS

      | Collection | # of Documents | # of Images |
      |---|---|---|
      | House of Commons Parliamentary Papers (ProQuest) | 118,526 | 6,448,739 |
      | Early Canadiana Online | 83,016 | 3,938,758 |
      | Directors' Letters of Correspondence (Kew) | 14,340 | n/a |
      | Confidential Prints (Adam Matthews) | 1,315 | 140,010 |
      | Foreign and Commonwealth Office Collection | 1,000 | 41,611 |

  19. DOCUMENT COLLECTIONS (totals)
      - Over 10 million document pages.
      - Over 7 billion word tokens.

  20. INTERMEDIATE RESULTS
      - Lexicon with 20,476 entries and 16,928 concepts.
      - Need to evaluate lexicon precision and recall.
      - Commodity recognition uses rule-based matching that is sensitive to context and linguistic features.
      - Frequency distribution of all commodities detected in our data (31,169,104 mentions in 7 billion words).
      - Found 5,841 different commodities (belonging to 4,466 concepts) in the data: 28.5% of lexicon entries (26.4% of concepts).
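
The percentages on the slide above follow directly from the counts it reports. A short check, using only figures from the slide (the 7 billion token count is a rounded figure, so the per-1,000-words rate is approximate):

```python
LEXICON_ENTRIES = 20_476
LEXICON_CONCEPTS = 16_928
ENTRIES_FOUND = 5_841
CONCEPTS_FOUND = 4_466
MENTIONS = 31_169_104
WORD_TOKENS = 7_000_000_000  # rounded figure from the slide

entry_coverage = ENTRIES_FOUND / LEXICON_ENTRIES
concept_coverage = CONCEPTS_FOUND / LEXICON_CONCEPTS
mentions_per_1000_words = 1000 * MENTIONS / WORD_TOKENS

print(f"entries found:  {entry_coverage:.1%}")
print(f"concepts found: {concept_coverage:.1%}")
print(f"mentions per 1,000 words: {mentions_per_1000_words:.2f}")
```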

  21. EVALUATION
      - How well does our commodity recognition perform on a random test set?
      - Indirect evaluation using an annotated gold standard: a human annotator manually marked up commodities in 120 documents, and the annotations were compared against the text mining output.
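
Comparing system output against a gold standard is commonly scored with precision, recall and F1 over exactly matching spans. The sketch below assumes each mention is a `(document id, start offset, end offset)` tuple; that representation, the function name and the toy data are assumptions for illustration, not the project's evaluation code.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted mentions against gold mentions using exact span match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy spans: one boundary error ("doc1", 40, 50) and one spurious match.
gold = {("doc1", 10, 21), ("doc1", 40, 46), ("doc2", 5, 13)}
predicted = {("doc1", 10, 21), ("doc1", 40, 50), ("doc2", 5, 13), ("doc2", 30, 34)}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under exact matching, a boundary error counts against both precision (the predicted span is wrong) and recall (the gold span is missed).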

  22. PROTOTYPE EVALUATION
      - Error analysis showed that errors in the lexicon and boundary errors affect precision.
      - Boundary errors, OCR errors and spelling variations affect recall.


  24. LEXICON PRECISION ...
      - From the top 1,757 entries only 84 (4.8%) had to be filtered.
      - The top 1,757 entries amount to 99.8% of mentions.
      - Error types: wrong (village account), too general (crop), ambiguous due to OCR error (lime), not in definition (paper).


  26. FALSE NEGATIVES
      - Hand-annotated texts contain 1,107 commodity mentions (506 different entities).
      - 178 entities (683 mentions) are in the first version of the expanded lexicon.
      - 329 terms (424 mentions) are not in the lexicon:
        - 110 (115 mentions) contain OCR errors, approx. 10% of all commodity mentions.
        - 160 commodities are missing.
        - 59 should not be added.
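
The mention and term counts on the slide above can be cross-checked with simple arithmetic. The sketch uses only numbers from the slide; the variable names are illustrative.

```python
TOTAL_MENTIONS = 1_107
IN_LEXICON_MENTIONS = 683
NOT_IN_LEXICON_MENTIONS = 424
NOT_IN_LEXICON_TERMS = 329
OCR_ERROR_TERMS = 110
OCR_ERROR_MENTIONS = 115
MISSING_TERMS = 160
REJECTED_TERMS = 59

# Mentions split into those covered by the lexicon and those not covered.
mentions_add_up = IN_LEXICON_MENTIONS + NOT_IN_LEXICON_MENTIONS == TOTAL_MENTIONS
# Terms outside the lexicon split into OCR errors, missing and rejected terms.
terms_add_up = OCR_ERROR_TERMS + MISSING_TERMS + REJECTED_TERMS == NOT_IN_LEXICON_TERMS

ocr_share = OCR_ERROR_MENTIONS / TOTAL_MENTIONS
in_lexicon_share = IN_LEXICON_MENTIONS / TOTAL_MENTIONS

print(mentions_add_up, terms_add_up)
print(f"mentions with OCR errors: {ocr_share:.1%}")
print(f"mentions whose term is in the lexicon: {in_lexicon_share:.1%}")
```

The OCR share comes out at roughly 10%, matching the slide, and a little over 60% of annotated mentions use a term already present in the first expanded lexicon.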

  27. IMPROVING RECALL
      - Bigram analysis to bootstrap further commodities semi-automatically.
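
The slide above does not describe the bigram analysis in detail. One plausible reading, sketched here purely as an assumption, is to count bigrams whose second word is already a lexicon entry and to propose frequent unseen ones for manual review. The function name, the tokenisation and the `min_count` threshold are all hypothetical.

```python
import re
from collections import Counter


def candidate_bigrams(text, lexicon, min_count=2):
    """Propose 'modifier + known commodity' bigrams that are not yet in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(zip(tokens, tokens[1:]))
    candidates = {}
    for (first, second), count in counts.items():
        phrase = f"{first} {second}"
        if second in lexicon and phrase not in lexicon and count >= min_count:
            candidates[phrase] = count
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(candidates.items(), key=lambda item: (-item[1], item[0]))


lexicon = {"oil", "croton oil", "cotton"}
text = (
    "Imports of palm oil rose. Palm oil and croton oil were shipped; "
    "the oil was taxed. Some palm oil spoiled."
)
print(candidate_bigrams(text, lexicon))
```

The frequency threshold filters out one-off pairs such as "the oil", but a human still needs to vet the candidates, which is what makes the process semi-automatic.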

  28. IMPROVEMENTS
      i. Removing terms based on frequency analysis
      ii. Boundary extension rules
      iii. Adding terms based on bigram analysis
      iv. Combination of i-iii (with new lexicon: 17,247 concepts and 22,723 entries)

  29. SYSTEM PERFORMANCE

  30. LESSONS LEARNED
      - SKOS is useful for organising a lexicon.
      - We developed a method for bootstrapping from a seed set using categorial similarity of other entities.
      - Expert knowledge and historians' input were important for optimisation.
      - Bootstrapping a lexicon and text mining are not error free (but even human experts can disagree).

  31. USER INTERFACE

  32. THANK YOU
      - Website: http://tradingconsequences.blogs.edina.ac.uk/
      - Demo: http://tcqdev.edina.ac.uk/search/commodity/ , http://tcqdev.edina.ac.uk/vis/tradConVis
      - Contact: balex@inf.ed.ac.uk

  33. TRADITIONAL HISTORICAL RESEARCH
      - Global Fats Supply 1894-98
      - Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v. XII, 1998.

  34. SYSTEM
      - Architecture diagram components: Documents, Text Mining, Lexicons & Gazetteers, Annotated Documents, XML 2 RDB, Commodities RDB, Commodities Ontology (SKOS), Query Interface, Visualisation.

  35. MINED INFORMATION
      - Normalised and grounded entities from the example sentence:
        - commodity: cassia bark [concept: Cinnamomum cassia]
        - date: 1871 (year=1871)
        - location: Padang (lat=-0.94924; long=100.35427; country=ID)
        - location: America (lat=39.76; long=-98.50; country=n/a)
        - quantity + unit: 6,127 piculs

  36. MINED INFORMATION
      - Extracted entity attributes and relations from the example sentence:
        - origin location: Padang
        - destination location: America
        - commodity–date relation: cassia bark – 1871
        - commodity–location relation: cassia bark – Padang
        - commodity–location relation: cassia bark – America
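
The entities, attributes and relations listed on the two MINED INFORMATION slides can be held in a small typed record. This is an illustrative sketch only: the class and field names are hypothetical and do not reflect the project's XML or relational schema; the values are the ones shown on the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # None where the slide shows "n/a"
    role: Optional[str] = None  # "origin" or "destination" attribute


@dataclass(frozen=True)
class MinedRecord:
    commodity: str
    concept: str
    year: int
    quantity: int
    unit: str
    locations: Tuple[Location, ...] = ()


def relations(record):
    """List commodity-date and commodity-location relations for one record."""
    rels = [("commodity-date", record.commodity, str(record.year))]
    rels += [
        ("commodity-location", record.commodity, loc.name)
        for loc in record.locations
    ]
    return rels


padang = Location("Padang", -0.94924, 100.35427, "ID", role="origin")
america = Location("America", 39.76, -98.50, None, role="destination")
record = MinedRecord(
    commodity="cassia bark",
    concept="Cinnamomum cassia",
    year=1871,
    quantity=6127,
    unit="piculs",
    locations=(padang, america),
)

for relation in relations(record):
    print(relation)
```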

  37. SIBLING ACQUISITION

  38. EXAMPLE

  39. CATEGORY ACQUISITION
      - Diagram stages: base thesaurus, category acquisition, sibling acquisition.
