  2. OVERVIEW What is text mining? Text mining in digital history. Trading Consequences. “Big data”. Visualisation. Challenge of noisy data. Collaborating with historians. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013

  3. TEXT MINING Describes a set of linguistic, statistical and machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. a relational database or linked data). Is very useful for analysing large text collections automatically.

  4. TEXT MINING TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation and syntactic parsing (chunking). Currently our focus is on named entity recognition, entity grounding and relation extraction.
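
The first two pre-processing steps named on this slide can be illustrated with a minimal sketch. It assumes simple regular-expression rules; the function names and the example text (adapted from the Ridley excerpt used later in the talk) are illustrative and are not the project's actual pipeline, which would normally use trained components.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence detection: split after . ! or ? when followed by
    whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenise(sentence: str) -> list[str]:
    """Naive tokenisation: numbers (keeping , or . separators), words
    (keeping internal hyphens/apostrophes), or single punctuation marks."""
    return re.findall(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Example text adapted for illustration only.
text = ("From Padang was exported, in 1871, 6,127 piculs of cassia bark. "
        "A large portion was shipped to America.")
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
```

Keeping "6,127" as one token matters for the later quantity extraction step; part-of-speech tagging, lemmatisation and chunking are left out of this sketch.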

  5. TM IN DIGITAL HISTORY Goal: by analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses. Methods: linguistic text analysis, named entity recognition, geo-grounding and relation extraction to transform the text into structured data. A sea change from the methods used in ‘traditional’ history.

  "TRADITIONAL" HISTORICAL RESEARCH Global Fats Supply 1894-98 Cinchona plantations in George King's A Manual of Cinchona Cultivation in India (1880).

  TRADING CONSEQUENCES Digging into Data II project (till Dec. 2013) Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex, Dr. Claire Grover, Clare Llewellyn, Richard Tobin, James Reid, Nicola Osborne, Ian Fieldhouse

  8. TRADING CONSEQUENCES Bea Alex, Timothy Bristow, Jim Clifford, Colin Coates, Ian Fieldhouse, Claire Grover, Uta Hinrichs, Ewan Klein, Clare Llewellyn, Nicola Osborne, Aaron Quigley, James Reid and Richard Tobin. Twitter: @digtrade. [Project overview figure. Panels: Historical analysis & ontology development; Text mining and ontology management; Data integration & dissemination; Information visualisation. Source collections: ProQuest’s House of Commons Parliamentary Papers; Kew Garden’s Director’s Correspondence Archive; JSTOR’s Foreign and Commonwealth Office collection (sample); AMD Confidential Prints; Early Canadiana Online. Example text: “From Padang was exported, in 1871, 6,127 piculs of cassia bark, of which a large portion was shipped to America (Fliickiger and Hanbury). ...” (excerpt from Spices, Ridley, 1912). The figure also shows a database schema (Document, Collection, Location, LocationMention, DateMention, CommodityMention, QuantityMention) populated from this excerpt, and a fragment of the commodity ontology. Image credits: “Captive Tomes” by traceyp3031 on Flickr; “Library Archives 05” by peteashton on Flickr.]

  9. TRADING CONSEQUENCES What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century? Scope: global, but with a focus on Canadian natural resources. Example questions: ‣ What were the routes and volumes of international trade in resource commodities in the nineteenth century?

  10. DOCUMENT COLLECTIONS Big data for historians:

  12. MINED INFORMATION Example sentence: “From Padang was exported, in 1871, 6,127 piculs of cassia bark, of which a large portion was shipped to America” (excerpt from Spices, Ridley, 1912). Extracted entities: commodity: cassia bark; date: 1871; location: Padang; location: America; quantity + unit: 6,127 piculs.
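
As a rough illustration of how such entities could be pulled out of the sentence, here is a minimal lexicon-and-pattern sketch. The word lists, the date range and the function name are assumptions made for this example; the project's own named entity recognition is more sophisticated than this.

```python
import re

SENTENCE = ("From Padang was exported, in 1871, 6,127 piculs of cassia bark, "
            "of which a large portion was shipped to America")

# Hypothetical mini-lexicons, for illustration only.
COMMODITIES = ["cassia bark", "cinnamon"]
LOCATIONS = ["Padang", "America"]
UNITS = ["piculs", "tons", "lbs"]

def extract_entities(text):
    """Return (type, surface text, start offset) tuples in reading order."""
    entities = []
    for term in COMMODITIES:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            entities.append(("commodity", m.group(0), m.start()))
    for term in LOCATIONS:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            entities.append(("location", m.group(0), m.start()))
    # A quantity is a number (with optional thousands separators) plus a known unit.
    unit_pattern = "|".join(re.escape(u) for u in UNITS)
    for m in re.finditer(r"\b\d{1,3}(?:,\d{3})*\s+(?:" + unit_pattern + r")\b", text):
        entities.append(("quantity", m.group(0), m.start()))
    # A date is assumed here to be a four-digit year between 1500 and 1999.
    for m in re.finditer(r"\b1[5-9]\d{2}\b", text):
        entities.append(("date", m.group(0), m.start()))
    return sorted(entities, key=lambda e: e[2])

found = [(kind, surface) for kind, surface, _ in extract_entities(SENTENCE)]
```

On the example sentence this yields the same five entities listed on the slide, in the order they occur in the text.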

  13. MINED INFORMATION Example sentence: “From Padang was exported, in 1871, 6,127 piculs of cassia bark, of which a large portion was shipped to America” (excerpt from Spices, Ridley, 1912). Normalised and grounded entities: commodity: cassia bark; date: 1871 (year=1871); location: Padang (lat=-0.94924; long=100.35427; country=ID); location: America (lat=39.76; long=-98.50; country=n/a); quantity + unit: 6,127 piculs.
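
Normalisation and grounding can be sketched as simple lookups and conversions. The tiny in-memory gazetteer below holds only the two entries shown on the slide and stands in for a real gazetteer service; the function names are hypothetical, and a real system also has to disambiguate place names that have several candidates.

```python
# Stand-in gazetteer with the coordinates shown on the slide (illustration only).
GAZETTEER = {
    "Padang": {"lat": -0.94924, "long": 100.35427, "country": "ID"},
    "America": {"lat": 39.76, "long": -98.50, "country": None},  # n/a on the slide
}

def ground_location(name):
    """Return coordinates and country for a place name, or None if unknown."""
    return GAZETTEER.get(name)

def normalise_date(text):
    """Turn a year mention such as '1871' into a structured value."""
    return {"year": int(text)}

def normalise_quantity(text):
    """Split '6,127 piculs' into a numeric amount and a unit."""
    amount, unit = text.split()
    return {"quantity": int(amount.replace(",", "")), "unit": unit}

padang = ground_location("Padang")
year = normalise_date("1871")
quantity = normalise_quantity("6,127 piculs")
```

Storing the amount as an integer and the unit separately is what makes later aggregation over many documents possible.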

  14. MINED INFORMATION Example sentence: “From Padang was exported, in 1871, 6,127 piculs of cassia bark, of which a large portion was shipped to America” (excerpt from Spices, Ridley, 1912). Extracted entity attributes and relations: origin location: Padang; destination location: America; commodity–date relation: cassia bark – 1871; commodity–location relation: cassia bark – Padang; commodity–location relation: cassia bark – America.
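
One simple way to arrive at attributes and relations like these is to use cue words and sentence-level co-occurrence. The sketch below assumes that "from X" marks an origin, "to X" marks a destination, and that a commodity is related to every date and location in the same sentence. These rules and names are illustrative assumptions, not the project's actual relation extraction.

```python
import re

SENTENCE = ("From Padang was exported, in 1871, 6,127 piculs of cassia bark, "
            "of which a large portion was shipped to America")

def location_roles(sentence, locations):
    """Assign 'origin' or 'destination' to locations using preposition cues."""
    roles = {}
    for loc in locations:
        if re.search(r"\bfrom\s+" + re.escape(loc) + r"\b", sentence, re.IGNORECASE):
            roles[loc] = "origin"
        elif re.search(r"\bto\s+" + re.escape(loc) + r"\b", sentence, re.IGNORECASE):
            roles[loc] = "destination"
    return roles

def cooccurrence_relations(commodity, dates, locations):
    """Relate a commodity to each date and location found in the same sentence."""
    relations = [("commodity-date", commodity, d) for d in dates]
    relations += [("commodity-location", commodity, l) for l in locations]
    return relations

roles = location_roles(SENTENCE, ["Padang", "America"])
relations = cooccurrence_relations("cassia bark", ["1871"], ["Padang", "America"])
```

Co-occurrence is a deliberately crude rule: it over-generates when a sentence mentions several commodities, which is one reason feedback from historians on errors is valuable.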

  15. COMMODITY ONTOLOGY :Cinnamon_Spice and tc:Cassia_Bark are each linked to :Spice by skos:narrowerThan. :Cinnamon_Spice: skos:prefLabel “cinnamon”; skos:altLabel “cinnamomum vera”. tc:Cassia_Bark: skos:prefLabel “cassia bark”; skos:altLabel “cinnamomum cassia”.
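
The practical use of preferred and alternative labels is that different surface forms in the text resolve to the same concept. The sketch below represents the ontology fragment as a plain Python dictionary; the pairing of labels with concepts is inferred from the flattened figure, and the dictionary layout and function names are assumptions made for this example.

```python
# Ontology fragment as a plain dictionary (pairing of labels inferred from the figure).
ONTOLOGY = {
    ":Cinnamon_Spice": {
        "broader": ":Spice",
        "prefLabel": "cinnamon",
        "altLabels": ["cinnamomum vera"],
    },
    "tc:Cassia_Bark": {
        "broader": ":Spice",
        "prefLabel": "cassia bark",
        "altLabels": ["cinnamomum cassia"],
    },
}

def build_label_index(ontology):
    """Map every lower-cased preferred or alternative label to its concept."""
    index = {}
    for concept, entry in ontology.items():
        for label in [entry["prefLabel"], *entry["altLabels"]]:
            index[label.lower()] = concept
    return index

def lookup(mention, index):
    """Resolve a commodity mention to a concept identifier, or None."""
    return index.get(mention.strip().lower())

INDEX = build_label_index(ONTOLOGY)
concept = lookup("Cinnamomum cassia", INDEX)
```

Because both concepts share the broader concept :Spice, a search for spices can also be expanded to cover mentions of either commodity.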

  IMPROVED SEARCH & VISUALISATIONS

  19. NOISY DATA Optical character recognition (OCR) output contains many errors, and the structure of the page layout is often lost. Contributing factors: sophistication of the OCR engine and scanning equipment; quality of the original print and paper; use of historical language; information in page margins (headers, page numbers, etc.); information in tables; language of the text.

  20. FIXING NOISY DATA Text normalisation and correction: End-of-line soft hyphen removal: dehyphenate all token-splitting hyphens using a dictionary-based approach. “False f”-to-s conversion: convert all false f characters to s using a corpus. Example: reduced the number of words unrecognised by a spell checker from 61 to 21 (67%); on average a 12% reduction in word error rate in a random sample (Alex et al., 2012).
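
Both corrections can be sketched with a word list. Note the substitution: the slide describes a corpus-based method for the false-f conversion, whereas this example uses a small hypothetical lexicon (which could itself be derived from a corpus) for both steps. The function names and the lexicon contents are assumptions.

```python
import re
from itertools import combinations

# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"consequences", "trading", "vessel", "ships", "first", "some", "fish"}

def dehyphenate(text, lexicon):
    """Join a word split by an end-of-line hyphen when the joined form is known."""
    def join(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in lexicon else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

def fix_false_f(token, lexicon):
    """Replace lower-case f with s (long-s misread by OCR) if that yields a known word."""
    if token.lower() in lexicon:
        return token  # already a real word, e.g. 'fish'
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try replacing one f, then two, and so on, accepting the first known word.
    for size in range(1, len(positions) + 1):
        for combo in combinations(positions, size):
            chars = list(token)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate.lower() in lexicon:
                return candidate
    return token

joined = dehyphenate("trading conse-\nquences", LEXICON)
corrected = [fix_false_f(t, LEXICON) for t in ["veffel", "fhips", "firft", "fish"]]
```

Checking the lexicon before rewriting is what keeps genuine words such as "fish" intact, and words whose joined form is unknown keep their hyphen.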

  FIXING NOISY DATA

  23. HOW NOISY IS TOO NOISY? qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT 'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx 'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx Extract from document 10.2307/60238580 in FCOC.
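
One possible way to put a number on this question, in the spirit of the spell-checker count on the "Fixing noisy data" slide, is the share of alphabetic tokens found in a word list. This is only a sketch under stated assumptions: the talk does not say how noise was measured, and the lexicon and function name here are hypothetical.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"the", "first", "star", "is", "looking", "with", "from", "sky"}

def recognised_ratio(text, lexicon):
    """Fraction of alphabetic tokens that appear in the lexicon (0.0 if none)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in lexicon)
    return known / len(tokens)

clean_score = recognised_ratio("The first star is looking from the sky", LEXICON)
noisy_score = recognised_ratio("qBiu si }S3A:req s,uauuaqsu aq}", LEXICON)
```

A document whose score falls below some chosen cut-off could be flagged or excluded before text mining; where to set that cut-off is an empirical question.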

  25. THE USERS (HISTORIANS) Involvement of historians: Everything is based on the use cases and built on users’ hypotheses/research questions. They are responsible for identifying relevant collections and are involved in the ontology development. They provide feedback so that we can improve the technology iteratively: partners at York use the prototype for their research and track errors; a workshop at CHESS 2013 with a group of independent historians. Clarity of the text mining accuracy is

  26. SUMMARY Text mining historic documents in Trading Consequences. Processing “big data”. Power of visualising structured data. Fixing noisy data. Importance of two-way collaboration between technology experts and users in digital history.

  THANK YOU Questions? Fire away or contact me at:

