Data Mining in the Humanities: Text ⊂ Data, by Beatrice Alex


  1. Data Mining in the Humanities Text ⊂ Data Beatrice Alex balex@inf.ed.ac.uk DigitalHSS Seminar, University of Edinburgh November 19th 2013 Monday, 25 November 2013

  2. MINING WHAT DATA? Electronic text or things that can be turned into it: born-electronic text (research papers, literature, tweets, blogs, comments on blogs etc.); metadata (collection and document level); image subtitles (Flickr image titles and subtitles); video/audio transcripts (YouTube transcripts, TED talks, MOOC transcripts etc.).

  3. TEXT MINING Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. a relational database or linked data). Is very useful for analysing large text collections automatically (overcoming data paralysis). Goal in DHSS research: by analysing large amounts of digitised data, help HSS scholars to discover novel patterns and explore hypotheses.

  4. TYPES OF ANALYSES Word or n-gram frequencies, concordances or collocation analysis. Named entity recognition. Grounding, e.g. geo-referencing. Relation extraction. Clustering, e.g. topic modelling.
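
The first two analyses on this slide, n-gram frequencies and concordances, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokeniser is a crude regular expression, the sample text is invented, and the function names are hypothetical rather than taken from the talk.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, width=3):
    """Keyword-in-context lines: up to `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text, used only to exercise the functions.
sample = "The ship carried cassia bark. The ship carried tea and cassia bark."
tokens = tokenize(sample)
bigrams = ngram_counts(tokens, 2)

print(bigrams.most_common(3))
print(concordance(tokens, "tea", width=2))
```

Collocation analysis would build on the same counts by comparing how often two words co-occur with how often each occurs alone.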

  5. NAMED ENTITY RECOGNITION Identification and classification of entity mentions in the text, where entity refers to things like persons, locations, dates, ... Often used for improving access to collections.
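
As a rough illustration of identifying and classifying mentions, the sketch below combines a gazetteer lookup with a regular expression for years. It is a toy under stated assumptions: the gazetteer entries, the year pattern and the `recognise` function are all hypothetical, and real recognisers usually rely on linguistic rules or machine learning rather than exact string matching.

```python
import re

# Hypothetical mini-gazetteers; a real system would use far larger resources.
GAZETTEER = {
    "PERSON": ["Martha Ballard", "Ian Gregory"],
    "LOCATION": ["Edinburgh", "Padang", "America"],
}
# Crude date pattern: four-digit years from 1000 to 1999.
YEAR = re.compile(r"\b1[0-9]{3}\b")

def recognise(text):
    """Return (start, end, type, surface form) for each mention, sorted by position."""
    mentions = []
    for etype, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                mentions.append((m.start(), m.end(), etype, m.group()))
    for m in YEAR.finditer(text):
        mentions.append((m.start(), m.end(), "DATE", m.group()))
    return sorted(mentions)

# Invented example sentence.
found = recognise("In 1871 cassia bark was shipped from Padang to America.")
for start, end, etype, surface in found:
    print(etype, surface, start, end)
```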

  6. CONNECTED HISTORIES http://www.connectedhistories.org

  7. OLD BAILEY ONLINE http://www.oldbaileyonline.org

  8. GROUNDING Linking entity mentions in text to a unique identifier, e.g. Wikipedia pages, lat/longs or Geonames IDs, gene ontologies. The goal is to disambiguate between mentions with the same surface form.
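
Disambiguating mentions that share a surface form can be sketched as choosing among gazetteer candidates using the surrounding words. Everything in the example is an assumption made for illustration: the identifiers are placeholders rather than real Geonames IDs, the coordinates are approximate, and the word-overlap score stands in for whatever ranking a real grounding system uses.

```python
# Hypothetical gazetteer: placeholder identifiers, approximate coordinates.
CANDIDATES = {
    "Perth": [
        {"id": "example:perth-scotland", "lat": 56.40, "long": -3.43,
         "context": {"scotland", "tay", "dundee"}},
        {"id": "example:perth-australia", "lat": -31.95, "long": 115.86,
         "context": {"australia", "swan", "fremantle"}},
    ],
}

def ground(mention, sentence):
    """Pick the candidate whose context words overlap most with the sentence."""
    words = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    candidates = CANDIDATES.get(mention, [])
    if not candidates:
        return None  # the mention is not in the gazetteer
    # max() keeps the first candidate on ties, so list order acts as a default ranking.
    return max(candidates, key=lambda c: len(c["context"] & words))

# Invented example sentence.
choice = ground("Perth", "Wool was shipped from Fremantle and Perth, Australia.")
print(choice["id"], choice["lat"], choice["long"])
```

The same surface form "Perth" resolves to a different identifier when the sentence mentions Dundee or the Tay instead.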

  9. GEO-REFERENCING http://placenames.org.uk

  10. GEO-REFERENCING http://placenames.org.uk

  11. GEO-REFERENCING Ian Gregory: Mapping the Lakes, followed by Spatial Humanities. Recognises the importance of geo-referencing in DH research.

  12. RELATION EXTRACTION Identifying relations between entities in text or in metadata. Triples: person - author_of - book title; commodity - traded_at - location; person - born_in - location.
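
The triples on this slide map naturally onto a small data structure. The sketch below pairs it with hand-written sentence templates, one per relation; the templates, the `Triple` type and the `extract` function are hypothetical and far too brittle for real text, where relation extraction normally builds on recognised entities and linguistic analysis.

```python
import re
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str

# One illustrative sentence template per relation named on the slide.
PATTERNS = [
    (re.compile(r"^(?P<s>.+?) wrote (?P<o>.+?)\.?$"), "author_of"),
    (re.compile(r"^(?P<s>.+?) was traded at (?P<o>.+?)\.?$"), "traded_at"),
    (re.compile(r"^(?P<s>.+?) was born in (?P<o>.+?)\.?$"), "born_in"),
]

def extract(sentence):
    """Return a triple for every template that matches the whole sentence."""
    triples = []
    for pattern, relation in PATTERNS:
        m = pattern.match(sentence.strip())
        if m:
            triples.append(Triple(m.group("s"), relation, m.group("o")))
    return triples

# Example sentences written for this sketch.
for sentence in ["Walter Scott wrote Waverley.",
                 "Cassia bark was traded at Padang.",
                 "Walter Scott was born in Edinburgh."]:
    print(extract(sentence))
```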

  13. RELATION EXTRACTION ChartEX, Discovering spatial descriptions and relationships in medieval charters, October 2013.

  14. RELATION EXTRACTION ChartEX, Discovering spatial descriptions and relationships in medieval charters, October 2013.

  15. RELATION EXTRACTION ChartEX, Discovering spatial descriptions and relationships in medieval charters, October 2013.

  16. CLUSTERING Documents or words are clustered into groups based on word probabilities and other features. Single-membership clustering versus multi-membership clustering. Hierarchical clustering. Topic modelling.
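
To make the idea of grouping documents by their words concrete, here is a minimal single-membership sketch: single-link agglomerative clustering over Jaccard distance between word sets. The choice of distance, the linkage rule and the four tiny documents are assumptions made for the example; the slide does not specify any of them, and topic modelling (a multi-membership approach) is not covered here.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two non-empty sets of words."""
    return 1.0 - len(a & b) / len(a | b)

def agglomerate(docs, k):
    """Single-link agglomerative clustering: merge the two closest clusters until k remain."""
    k = max(1, k)
    bags = {name: set(text.lower().split()) for name, text in docs.items()}
    clusters = [[name] for name in docs]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the smallest pairwise document distance.
                d = min(jaccard_distance(bags[a], bags[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented documents: two about trade, two about the stage.
docs = {
    "d1": "tea sugar coffee trade",
    "d2": "sugar coffee trade ports",
    "d3": "king queen court play",
    "d4": "queen court play stage",
}
result = agglomerate(docs, 2)
print(result)
```

Recording each merge instead of stopping at `k` clusters would give the tree that hierarchical clustering is usually drawn as.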

  17. HIERARCHICAL CLUSTERING Shakespeare Project and Visualising English Print. Allison et al., 2011, Literary Lab Pamphlet 1.

  18. TOPIC MODELLING Analysis of Martha Ballard’s diary (27 years of daily entries) by Cameron Blevins

  19. TOPIC MODELLING Cameron Blevins’ blog. Analysis of Martha Ballard’s diary (27 years of daily entries) by Cameron Blevins

  20. Ted Underwood’s blog

  21. SUMMARY SO FAR Different types of text analysis have been applied to historical and literary research, but the new opportunities are endless. Text mining can assist scholars in their research, but it is not replacing them! Traditional scholarship is well suited to close reading. DHSS scholars can focus on questions which can be answered by computational methods. Human interpretation is vital. Visualisation of TM output is important.

  23. PROJECT OVERVIEW JISC/SSHRC Digging into Data Challenge II, Jan 2012 - Dec 2013. Text mining, data extraction and information visualisation to explore big historical datasets. Focus on how commodities were traded across the globe in the 19th century. Help historians to discover novel patterns and explore new research questions.

  24. PROJECT TEAM Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining. Colin Coates, Jim Clifford: historical analysis. James Reid, Nicola Osborne: data management, social media. Aaron Quigley, Uta Hinrichs: information visualisation.

  25. TRADITIONAL HISTORICAL RESEARCH Global Fats Supply 1894-98. Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.

  26. DOCUMENT COLLECTIONS

      | Collection | # of Documents | # of Images |
      |---|---|---|
      | House of Commons Parliamentary Papers (ProQuest) | 118,526 | 6,448,739 |
      | Early Canadiana Online | 83,016 | 3,938,758 |
      | Directors’ Letters of Correspondence (Kew) | 14,340 | n/a |
      | Confidential Prints (Adam Matthews) | 1,315 | 140,010 |
      | Foreign and Commonwealth Office Collection | 1,000 | 41,611 |
      | Asia and the West (Gale) | 4,725 | 948,773 (OCRed: 450,841) |

  27. SYSTEM [Architecture diagram. Component labels: Documents, Text Mining, Lexicons & Gazetteers, Annotated Documents, XML 2 RDB, Commodities RDB, Commodities Ontology (SKOS), Query Interface, Visualisation]

  28. MINED INFORMATION Example sentence: [not preserved in this transcript]. Normalised and grounded entities: commodity: cassia bark [concept: Cinnamomum cassia]; date: 1871 (year=1871); location: Padang (lat=-0.94924, long=100.35427, country=ID); location: America (lat=39.76, long=-98.50, country=n/a); quantity + unit: 6,127 piculs.

  29. MINED INFORMATION Example sentence: [not preserved in this transcript]. Extracted entity attributes and relations: origin location: Padang; destination location: America; commodity–date relation: cassia bark – 1871; commodity–location relation: cassia bark – Padang; commodity–location relation: cassia bark – America.
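
The mined information on slides 28 and 29 can be held in two small record types, one for grounded entities and one for relations. The sketch below uses the values from the slides, but the record layout, the field names and the pairing rule (link every commodity to every date and location in the same sentence) are assumptions made for illustration, not the project's actual schema or extraction logic.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    etype: str
    text: str
    attrs: dict

class Relation(NamedTuple):
    rtype: str
    source: str
    target: str

# Values taken from the slides; the attribute names are hypothetical.
entities = [
    Entity("commodity", "cassia bark", {"concept": "Cinnamomum cassia"}),
    Entity("date", "1871", {"year": 1871}),
    Entity("location", "Padang",
           {"lat": -0.94924, "long": 100.35427, "country": "ID", "role": "origin"}),
    Entity("location", "America",
           {"lat": 39.76, "long": -98.50, "country": None, "role": "destination"}),
    Entity("quantity", "6,127 piculs", {"value": 6127, "unit": "picul"}),
]

def link_commodities(entities):
    """Assumed rule: pair each commodity with every date and location in the sentence."""
    relations = []
    for commodity in (e for e in entities if e.etype == "commodity"):
        for other in entities:
            if other.etype in ("date", "location"):
                relations.append(
                    Relation(f"commodity-{other.etype}", commodity.text, other.text))
    return relations

relations = link_commodities(entities)
for r in relations:
    print(r.rtype, ":", r.source, "-", r.target)
```

Records of this shape are straightforward to load into relational tables, which is the role the system diagram gives to the XML-to-RDB step.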

  30. EDINBURGH GEOPARSER

  31. COMMODITY LEXICON Seed set from customs import records.

  32. COMMODITY LEXICON Seed set from customs import records.

  33. COMMODITY LEXICON Seed set from customs import records.

  34. SIBLING ACQUISITION ? DBpedia ?

  35. LEXICON BOOTSTRAPPING Seed lexicon: ~600 entries. Extended lexicon: ~17,000 entries. With pluralisation of single-word entries: ~20,500 entries.
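
The last bootstrapping step, adding plural forms for single-word entries, might look roughly like the sketch below. The pluralisation rules are a deliberately naive assumption (irregular nouns are not handled), and the seed entries and function names are invented for the example; the slides do not say how the project generated plurals.

```python
def pluralise(word):
    """Very rough English pluralisation; irregular nouns are not handled."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"

def extend_lexicon(entries):
    """Add a plural form for each single-word entry; multi-word entries are left alone."""
    extended = set(entries)
    for entry in entries:
        if " " not in entry:
            extended.add(pluralise(entry))
    return extended

# Invented seed entries.
seed = {"nutmeg", "cassia bark", "berry", "wax"}
lexicon = extend_lexicon(seed)
print(sorted(lexicon))
```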

  36. EVALUATION

  37. INTERMEDIATE RESULTS Lexicon with 20,476 entries and 16,928 concepts. The prototype detected 31,169,104 commodity mentions in 7 billion words. They correspond to 5,841 different commodities (4,466 concepts) and cover 28.5% of commodities in the lexicon.
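
The coverage figure can be checked directly from the counts on this slide. The short calculation below reproduces the 28.5% entry coverage; the concept-level coverage and the mention density are derived here from the same numbers and do not appear on the slide, and the word count is treated as exactly 7 billion although it is clearly a rounded figure.

```python
# Counts as given on the slide.
lexicon_entries = 20_476
lexicon_concepts = 16_928
detected_entries = 5_841
detected_concepts = 4_466
mentions = 31_169_104
words = 7_000_000_000  # rounded figure, assumed exact for this calculation

entry_coverage = detected_entries / lexicon_entries
concept_coverage = detected_concepts / lexicon_concepts
mentions_per_1000_words = mentions / words * 1000

print(f"entry coverage:   {entry_coverage:.1%}")
print(f"concept coverage: {concept_coverage:.1%}")
print(f"mentions per 1,000 words: {mentions_per_1000_words:.2f}")
```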
