Text Mining and Historical Research Beatrice Alex balex@inf.ed.ac.uk MSc Historical Research, University of Edinburgh, March 21st 2014 Friday, 21 March 2014
OVERVIEW
What is text mining?
Types of text analyses.
Trading Consequences: text mining applied.
TEXT MINING
Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.
Turns unstructured text into structured data (e.g. a relational database or linked data).
Is very useful for analysing large text collections automatically (overcoming data paralysis).
Goal in DHSS research: by analysing large amounts of textual data, help humanities and social sciences (HSS) scholars to discover novel patterns and explore hypotheses.
MINING WHAT TEXT?
Electronic text, or anything that can be turned into it:
Born-electronic text (research papers, literature, tweets, blogs, comments on blogs etc.).
Digitised text documents.
Metadata (collection and document level).
Image subtitles (Flickr image titles and subtitles).
Video/audio transcripts (YouTube transcripts, TED talks, MOOC transcripts etc.).
TYPES OF ANALYSES
Named entity recognition.
Grounding, e.g. geo-referencing.
Relation extraction.
Clustering, e.g. topic modelling.
Sentiment analysis.
NAMED ENTITY RECOGNITION
Identification and classification of entity mentions in text, such as:
Names of persons, locations, organisations, ...
Dates, amounts, ...
Often used for improving access to collections.
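To make the idea concrete, here is a minimal sketch of named entity recognition using a lookup list plus a regular expression for years. It is only an illustration: the gazetteer, the year rule and the example sentence are invented for this sketch, and real systems typically rely on much richer linguistic or machine learning models.

```python
import re

# Hypothetical toy gazetteer: surface form -> entity type (illustrative only).
GAZETTEER = {
    "Edinburgh": "location",
    "Padang": "location",
    "America": "location",
}

# Assumption: four-digit numbers from 1500 to 2099 are years (a deliberate simplification).
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def recognise_entities(text):
    """Return (start, end, surface form, type) tuples, sorted by position."""
    entities = []
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.start(), match.end(), match.group(), entity_type))
    for match in YEAR_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), match.group(), "date"))
    return sorted(entities)


# Invented example sentence, used only to exercise the function.
sentence = "In 1871 cassia bark was shipped from Padang."
for start, end, surface, entity_type in recognise_entities(sentence):
    print(f"{surface!r} [{start}:{end}] -> {entity_type}")
```

A lookup-based recogniser like this only finds names it already knows, which is one reason statistical and machine learning approaches are used in practice.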
CONNECTED HISTORIES
http://www.connectedhistories.org
OLD BAILEY ONLINE
http://www.oldbaileyonline.org
GROUNDING
Linking entity mentions in text to a unique identifier, e.g.:
Person names to their Wikipedia pages.
Location names to lat/longs or Geonames IDs.
Gene names to gene ontologies.
The goal is to disambiguate between mentions with the same surface form (e.g. “Paris”, “Victoria”).
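The disambiguation step can be sketched as choosing one candidate from a gazetteer. The code below is a simplified illustration, not the method of any particular geoparser: the identifiers are made up (they are not Geonames IDs), the coordinates are approximate, and the "prominence" scores are invented.

```python
# Hypothetical toy gazetteer. Identifiers are made up, coordinates are
# approximate, and the "prominence" scores are invented for illustration.
GAZETTEER = {
    "Paris": [
        {"id": "paris-fr", "country": "FR", "lat": 48.86, "long": 2.35, "prominence": 0.9},
        {"id": "paris-tx", "country": "US", "lat": 33.66, "long": -95.56, "prominence": 0.2},
    ],
}


def ground(mention, context_countries=()):
    """Pick one gazetteer candidate for a mention, or None if it is unknown.

    Candidates in a country already seen in the document are preferred;
    remaining ties are broken by the (invented) prominence score.
    """
    candidates = GAZETTEER.get(mention, [])
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: (c["country"] in context_countries, c["prominence"]),
    )


print(ground("Paris")["id"])                             # no context: most prominent candidate
print(ground("Paris", context_countries={"US"})["id"])   # US context: the Texas candidate
```

The design choice here is that document context outranks general prominence, so a text otherwise about the United States resolves "Paris" differently from a text with no context.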
EDINBURGH GEOPARSER
RELATION EXTRACTION
Identifying relations between entities in text or in metadata, expressed as triples:
person - author_of -> book title
commodity - traded_at -> location
person - born_in -> location
person - born_at -> date
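Triples of this kind map naturally onto a small data structure. The sketch below stores them as named tuples and filters them by subject, predicate or object; the predicate names follow the slide, while the `Triple` type, the `query` helper and the example instances are assumptions made for illustration.

```python
from collections import namedtuple

# Assumed representation: one named tuple per extracted relation.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Example instances chosen for illustration (predicate names follow the slide).
triples = [
    Triple("Walter Scott", "author_of", "Waverley"),
    Triple("Walter Scott", "born_in", "Edinburgh"),
    Triple("Walter Scott", "born_at", "1771-08-15"),
    Triple("cassia bark", "traded_at", "Padang"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


for t in query(triples, subject="Walter Scott"):
    print(f"{t.subject} - {t.predicate} -> {t.object}")
```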
RELATION EXTRACTION
ChartEX, Discovering spatial descriptions and relationships in medieval charters, October 2013.
SUMMARY SO FAR
Different types of text analysis are already applied to historical and literary research, and there are many new opportunities.
Text mining can assist scholars in their research, but it does not replace them!
Traditional scholarship is well suited to close reading.
HSS scholars can focus on questions which can be answered by computational methods.
Human interpretation is vital.
Visualisation of text mining (TM) output is important.
PROJECT TEAM
Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
Colin Coates, Andrew Watson: historical analysis
Jim Clifford: historical analysis
James Reid, Nicola Osborne: data management, social media
Aaron Quigley, Uta Hinrichs: information visualisation
TRADITIONAL HISTORICAL RESEARCH
Global Fats Supply 1894-98
Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.
PROJECT GOALS
Text mining, data extraction and information visualisation to explore big historical datasets.
Focus on how commodities were traded across the globe in the 19th century.
Help historians to discover novel patterns and explore new research questions.
DOCUMENT COLLECTIONS
Collection | # of Documents | # of Images
House of Commons Parliamentary Papers (ProQuest) | 118,526 | 6,448,739
Early Canadiana Online | 83,016 | 3,938,758
Directors’ Letters of Correspondence (Kew) | 14,340 | n/a
Confidential Prints (Adam Matthews) | 1,315 | 140,010
Foreign and Commonwealth Office Collection | 1,000 | 41,611
Asia and the West (Gale) | 4,725 | 948,773 (OCRed: 450,841)
Across these collections: over 10 million document pages and over 7 billion word tokens.
MINED INFORMATION
Normalised and grounded entities for an example sentence:
commodity: cassia bark [concept: Cinnamomum cassia]
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
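Normalisation of this kind can be sketched as parsing surface strings into typed records. The values below are the ones listed on the slide; the class names, field names and the `parse_quantity` helper are assumptions for illustration, not the project's actual data model.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # None where the slide gives "n/a"


@dataclass
class Quantity:
    value: int
    unit: str


def parse_quantity(text: str) -> Optional[Quantity]:
    """Parse strings such as '6,127 piculs' into a number and a unit."""
    match = re.fullmatch(r"\s*([\d,]+)\s+([A-Za-z]+)\s*", text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    if not digits:  # the number part consisted of commas only
        return None
    return Quantity(value=int(digits), unit=match.group(2))


padang = Location("Padang", -0.94924, 100.35427, "ID")
america = Location("America", 39.76, -98.50, None)
print(parse_quantity("6,127 piculs"))
```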
MINED INFORMATION
Extracted entity attributes and relations for the same example sentence:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
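One way to hold such output as structured data is a small relational schema. The sketch below loads the slide's entities and relations into an in-memory SQLite database and queries them. The table and column names are hypothetical, and attaching the year directly to each commodity-location row is a simplification made here, not something the slide states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema, invented for this sketch.
conn.executescript("""
    CREATE TABLE location (
        name TEXT PRIMARY KEY,
        latitude REAL,
        longitude REAL,
        country TEXT
    );
    CREATE TABLE commodity_location (
        commodity TEXT,
        location TEXT REFERENCES location(name),
        role TEXT,      -- 'origin' or 'destination'
        year INTEGER    -- simplification: the year is stored on each relation
    );
""")

conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?)", [
    ("Padang", -0.94924, 100.35427, "ID"),
    ("America", 39.76, -98.50, None),
])
conn.executemany("INSERT INTO commodity_location VALUES (?, ?, ?, ?)", [
    ("cassia bark", "Padang", "origin", 1871),
    ("cassia bark", "America", "destination", 1871),
])

# Where was cassia bark traded in 1871, and in which role?
rows = conn.execute("""
    SELECT cl.role, l.name, l.latitude, l.longitude
    FROM commodity_location AS cl
    JOIN location AS l ON l.name = cl.location
    WHERE cl.commodity = ? AND cl.year = ?
    ORDER BY cl.role
""", ("cassia bark", 1871)).fetchall()

for role, name, latitude, longitude in rows:
    print(role, name, latitude, longitude)
```

Once relations are stored this way, questions such as "which commodities left Padang in a given decade" become ordinary database queries.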
COMMODITY LEXICON
Seed set from customs import records.
LEXICON CREATION
Seed lexicon: ~600 entries
Extended lexicon: ~17,000 entries
With pluralisation of single-word entries: ~20,500 entries
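The pluralisation step might look roughly like the sketch below. The slides do not say how plurals were generated, so the rules here are a naive assumption, and the sample entries are illustrative rather than taken from the actual lexicon.

```python
def pluralise(word):
    """Very naive English pluralisation; irregular nouns are not handled."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def expand_lexicon(entries):
    """Add a plural form for every single-word entry; keep multi-word entries as they are."""
    expanded = set(entries)
    for entry in entries:
        if " " not in entry:
            expanded.add(pluralise(entry))
    return expanded


# Illustrative sample entries, not the real seed lexicon.
sample = {"nutmeg", "berry", "cassia bark"}
print(sorted(expand_lexicon(sample)))
```

Only single-word entries gain a plural, which is consistent with the lexicon growing from ~17,000 to ~20,500 entries rather than doubling.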
LEXICON PRECISION
...
From the top 1,757 entries only 84 (4.8%) had to be filtered.
The top 1,757 entries account for 99.8% of mentions.
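The percentage can be checked directly, and the coverage figure corresponds to a simple calculation over mention counts. In the sketch below the first part uses the numbers from the slide; the `coverage` helper and its toy counts are assumptions added for illustration.

```python
# Numbers from the slide: 84 of the top 1,757 entries were filtered out.
filtered, inspected = 84, 1757
filtered_share = filtered / inspected
print(f"{filtered_share:.1%}")  # 4.8%


def coverage(mention_counts, top_n):
    """Share of all mentions accounted for by the top_n most frequent entries."""
    ranked = sorted(mention_counts.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Toy counts, invented to show the calculation.
toy_counts = {"sugar": 90, "timber": 9, "cassia bark": 1}
print(f"{coverage(toy_counts, 2):.1%}")
```

Because mention frequencies are heavily skewed, checking only the most frequent entries by hand covers almost all mentions.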
OCR ERRORS
Extract of Early Canadiana Online document 9_00952_3, p. vi.
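One common way to soften the effect of OCR errors is approximate matching against a known vocabulary. The sketch below uses Python's standard `difflib` module. It is an illustration only, not the approach taken in the project, and the lexicon sample and noisy token are invented.

```python
import difflib

# Hypothetical lexicon sample, invented for this sketch.
LEXICON = ["coffee", "cotton", "sugar", "timber"]


def match_noisy_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry for a possibly misrecognised token, or None."""
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_noisy_token("c0ffee"))      # a zero misread for the letter "o"
print(match_noisy_token("parliament"))  # not close to any lexicon entry
```

The cutoff is a trade-off: a lower value recovers more damaged words but also produces more false matches, so any such threshold needs checking against the collection at hand.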
BRINGING ARCHIVES ALIVE
SUMMARY
You have access to enormous amounts of data.
Text mining can be applied to process large text collections, enrich existing text with information or pull out trends which can be visualised.
Text mining is a way to enable distant reading, even if such technology is not 100% accurate.
OCR errors in digitised collections can skew your results.