What? Investigating what a corpus is about Max Kemman University of - PowerPoint PPT Presentation

What? Investigating what a corpus is about Max Kemman University of Luxembourg October 25, 2015 Doing Digital History: Introduction to Tools and Technology

Recap from last time What is distant reading? What is an n-gram? What do the Y-axis and X-axis show?

Recap - Assignment How did the assignment go? What did you think of the tools used? Could this be useful for your research?

One more thing on HTML: special characters http://www.ascii.cl/htmlcodes.htm Find the symbol and the HTML number: é & ü -> &#233; & &#252; The browser turns them back: &#233; & &#252; -> é & ü In your HTML, write longue dur&#233;e to display longue durée
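The lookup in the linked table can also be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `to_html_number` and `to_html_name` are made up for illustration.

```python
import html
from html.entities import codepoint2name

def to_html_number(char: str) -> str:
    """Return the numeric HTML code for one character, built from its code point."""
    return f"&#{ord(char)};"

def to_html_name(char: str) -> str:
    """Return the named HTML code for one character (raises KeyError if none exists)."""
    return f"&{codepoint2name[ord(char)]};"

print(to_html_number("é"), to_html_number("ü"))
print(to_html_name("é"), to_html_name("ü"))
# html.unescape does what the browser does: it turns the codes back into symbols.
print(html.unescape("longue dur&#233;e"))
```

Both the numeric and the named form are decoded by the browser in the same way, so either can be used in an HTML page.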

One more thing: what is an algorithm? A set of rules to follow to solve a problem Pretty much like a cooking recipe a = 0 while(a < 10) { a = a + 1 }

Today • The W's of research • What a corpus is about • The entities in a corpus • Another look at our emails • Voyant Tools • Next time • Assignment

The W's of research Thus far: 1. Abundance of sources 2. Writing for the Web 3. Digitisation and Digital Libraries 4. Big Data 5. Distant Reading Now: we have a digital corpus, what to do with it?

Research the corpus Now come the W's of research: 1. What - Investigating what a corpus is about 2. Where - Investigating the spatial entities in a corpus 3. When - Investigating the temporal entities in a corpus 4. Who - Investigating the social entities in a corpus

What? The first W of interest: what is this corpus actually about? Different methods are possible • Find a description of the corpus to read • Select a sample of documents to read • Visualize the words used

What a corpus is about

What is this conference about?

Word clouds Advantages of word clouds • Very easy to create • Visually pleasing • Gives a quick overview

What does a word cloud do? Put very simply, a word cloud does the following: 1. Count the number of occurrences per word 2. Size each word by its frequency 3. Lay out the words to form a shape 4. Optional: colorize words for distinguishing and better readability
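Steps 1 and 2 can be sketched in Python as follows. This is an illustration under assumptions, not how any particular word cloud tool works: the function name `word_sizes`, the linear scaling, and the point sizes are all hypothetical choices. Layout and colour (steps 3 and 4) are left out.

```python
from collections import Counter

def word_sizes(text: str, min_pt: int = 10, max_pt: int = 48) -> dict[str, int]:
    """Count each word and map its frequency to a font size in points."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    # Linear scaling: the most frequent word gets max_pt, rarer words get smaller sizes.
    return {
        word: min_pt + round((max_pt - min_pt) * n / top)
        for word, n in counts.items()
    }

sizes = word_sizes("history digital history corpus digital history")
print(sizes)
```

The more often a word occurs, the larger its size; nothing else about the word is recorded.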

Layout Unlike the Ngram viewer: no X or Y axes The position of each word is meaningless The meaning is in the size of the words

Counting Word clouds visualize the frequency of words But how to count words that vary in spelling? • E.g. "Digital" and "digital" and "digitally", "digitize" and "digitization" Normalization: • Lowercase • Tokenize • Stemming or lemmatizing • Stopwords

Lowercase We were on vacation in France in August 2015 we were on vacation in france in august 2015

Tokenize we were on vacation, in france, in august 2015 we|were|on|vacation|in|france|in|august|2015
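The lowercase and tokenize steps can be sketched in Python like this. The function name `normalise` and the tiny `STOPWORDS` set are hypothetical; real stopword lists contain hundreds of words.

```python
import re

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"we", "were", "on", "in", "the", "a", "of"}

def normalise(text: str, drop_stopwords: bool = False) -> list[str]:
    """Lowercase the text, split it into word tokens, optionally drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

sentence = "We were on vacation, in France, in August 2015"
print("|".join(normalise(sentence)))
# → we|were|on|vacation|in|france|in|august|2015
print("|".join(normalise(sentence, drop_stopwords=True)))
# → vacation|france|august|2015
```

Punctuation disappears because the pattern `\w+` only keeps runs of letters, digits and underscores.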

Stemming or lemmatizing digitized|digital|digitization|digitizing Stemming: digit Lemmatizing: digitiz|digital Could be very useful especially with Latin texts
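To give an idea of how stemming reaches "digit", here is a toy suffix stripper in Python. It is only a sketch: the suffix list is hypothetical and chosen to fit this example, whereas established stemmers apply many more rules.

```python
# Hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ["ization", "izing", "ized", "ize", "ally", "al"]

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["digitized", "digital", "digitization", "digitizing"]
stems = {w: toy_stem(w) for w in words}
print(stems)
```

All four words end up with the same stem, so a word count would treat them as one word.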

What are these grants about? (normalized)

Comparing between different parts of the corpus Sources separated by their citation behaviour

Representing a model of the text What if we do not know how to separate sources? Or if we want to know what other words are related to our keywords?

Topic modelling Documents and words can be directly observed, but topics are latent How to represent the topics in a corpus? • Statistics to find topics represented by groups of words • Document is a mix of topics • Topic is a mix of words (Slides on topic modelling from Pim Huijnen and Marijn Koolen)

Topic modelling Assumption: two documents with the same topics will have overlap in words For a given corpus, modelling process does: 1. Create word probability distribution for topics 2. Create topic probability distribution for documents
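The two distributions can be combined in one formula. This assumes a standard mixture formulation (the slides do not name a specific model), with hypothetical symbols: $w$ is a word, $d$ a document, and $z$ one of $K$ topics.

```latex
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d)
```

Here $P(w \mid z = k)$ is the word probability distribution of topic $k$ (step 1) and $P(z = k \mid d)$ is the topic probability distribution of document $d$ (step 2).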

Topic modelling In short: a corpus is represented by statistical topics This allows us to: • Separate sources by topics • Find related keywords

Comparing different parts of the corpus Mendeley Research Maps Comparing the topical similarity Assigned documents to disciplines to map disciplines by topics Which form of machine learning would this be?

What is the corpus about? We can now represent the words or the topics of a corpus But, remember: World War I ≠ "World War I"

The entities in a corpus Thus far we know the frequencies of all the words But what are we interested in? What do we need for the other W's? • Where - places • When - dates • Who - people

People in the corpus Ter Braake & Fokkens - Fairly easy to discover famous people (with biographical dictionaries and Ngram viewers) Ngrams help top-down: when you know who to search for But how to discover who did not become famous, while prominent in their own time? Need to find all people bottom-up by identifying all the names

Bottom-up process Ter Braake & Fokkens 1. Identify all names in the corpus 2. Give all names an identifier 3. Disambiguate names referring to the same person 4. Compare results with a non-digital corpus 5. Visualize the results 6. Interpret!

Identifying names Combinations of words that start with a capital This won't work for German Their algorithm allows for two sequential lower case words: Johan van der Capellen Note: built for recall, not precision
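The capital-letter rule can be approximated with a regular expression. This is a sketch under assumptions, not the actual algorithm of Ter Braake & Fokkens: the pattern and the function name `find_names` are hypothetical, and only unaccented letters are handled.

```python
import re

# A capitalised word, then one or more capitalised words,
# each optionally preceded by up to two lowercase words (e.g. "van der").
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:(?: [a-z]+){0,2} [A-Z][a-z]+)+")

def find_names(text: str) -> list[str]:
    """Return every word sequence that looks like a multi-word name."""
    return NAME_PATTERN.findall(text)

print(find_names("Johan van der Capellen wrote a letter to Pieter Paulus."))
# → ['Johan van der Capellen', 'Pieter Paulus']
print(find_names("Erasmus lived in Rotterdam."))
# → ['Erasmus lived in Rotterdam']
```

The second call shows why such a rule favours recall over precision: any two capitalised words with at most two lowercase words between them are returned, whether or not they form a name.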

Recall & Precision Recall: retrieve all relevant entities Precision: do not retrieve irrelevant entities Algorithms usually have to choose which of the two to optimize • Recall of people referred to with a single name (Erasmus, Rembrandt) would lead to too much noise = lower precision
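Both measures are simple fractions, as this Python sketch shows. The two sets of names are made-up example data, not results from the study.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = correct / retrieved; recall = correct / relevant."""
    correct = len(retrieved & relevant)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: the people actually mentioned, and what an algorithm found.
relevant = {"Johan van der Capellen", "Pieter Paulus", "Erasmus"}
retrieved = {"Johan van der Capellen", "Pieter Paulus",
             "Erasmus lived in Rotterdam", "New Year"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# → precision = 0.50, recall = 0.67
```

Two of the four retrieved items are correct (precision 0.50), and two of the three relevant people were found (recall 0.67).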

Difficulties Spelling of names (especially before 19th century) People with the same name Nicknames and changing names People with the same title Context matters!

Named Entity Recognition We want to identify the entities We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.

Named Entities Or we want to see: We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2. • People: Max • Places: France, Apt • Organizations: Intermarche • Dates: August 2015 • Currencies: €2
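A very simple way to get this kind of output is to look names up in lists and to match dates and amounts with patterns. The Python sketch below does only that; the lists, the patterns and the function name `tag_entities` are hypothetical. A real NER tool also uses context and does not need every name listed in advance.

```python
import re

# Hypothetical lookup lists (gazetteers), filled in by hand for this example.
GAZETTEER = {
    "People": ["Max"],
    "Places": ["France", "Apt"],
    "Organizations": ["Intermarche"],
}
# Simple patterns for entity types that follow a regular shape.
PATTERNS = {
    "Dates": r"\b(?:January|February|March|April|May|June|July|August|"
             r"September|October|November|December) \d{4}\b",
    "Currencies": r"€\d+(?:\.\d+)?",
}

def tag_entities(text: str) -> dict[str, list[str]]:
    """Return the entities found in the text, grouped by type."""
    found = {}
    for label, names in GAZETTEER.items():
        found[label] = [n for n in names if re.search(rf"\b{re.escape(n)}\b", text)]
    for label, pattern in PATTERNS.items():
        found[label] = re.findall(pattern, text)
    return found

text = ("We were on vacation in France in August 2015. I went to shop at the "
        "Intermarche. The area around Apt is really nice. "
        "Max also bought ice cream, which cost €2.")
for label, entities in tag_entities(text).items():
    print(f"{label}: {', '.join(entities)}")
```

The weakness of this approach is visible in the lists themselves: a person or place that is not listed will never be found.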

Another look at our emails For all 30k emails, we performed text normalisation and named entity recognition Let's take a look at https://www.wikileaks.org/clinton-emails/emailid/8 Exercise 1: try to normalise the text Exercise 2: try to discover the named entities: People, Places, Organisations

Normalised See Email8-normalised.txt in Moodle under "Emails" unclassife, us, department, state, case, f--, doc, date, release, full, hrod, clintonemailcom, sent, friday, july, pm, sullivanjj, stategov, subject, re, pakistan, bomb, ok, go, original, message, sullivan, jacob, sullivanjj, stategov, sent, fri, jul, subject, pakistan, bomb, fyi, put, follow, statement, statement, secretary, clinton, bomb, shrine, sy, ali, hujviri, lahore, shock, sadden, yesterday, attack, one, pakistan, popular, place, worship, shrine, sy, ali, hujviri, data, ganjbakhsh, lahore, claime, live, many, innocent, pakistane, extremist, shown, respect, neither, human, dignity, fundamental, religious, value, pakistani, society, violact, sanctity, rever, shrine, particularly, sinister, attempt, destabilize, pakistan, intimidate, people, attacker, will, succeed, pakistani, public, refuse, cow, violence, condemn, brutal, crime, reaffirm, commitment, support, pakistani, people, effort, defend, democracy, violent,

extremist, seek, destroy, thought, prayer, family, victim, people, pakistan

Named Entities Try to do it by hand NER tool: http://nlp.stanford.edu:8080/ner/ • People: Sullivan, Jacob, CLINTON, Ali Hujviri • Places: Pakistan, Pakistan, Lahore, Pakistan, Lahore, Pakistan, Pakistan • Organisations: U.S. Department of State Case No, Shrine of Syed Ali Hujviri

Visualise the email Go to http://tagcrowd.com/ Compare with and without stopwords Compare normal and normalised text

What? So, what's the email about? Do we get different perspectives?

Voyant Tools Go to www.voyant-tools.org/ Use Mozilla Firefox; it doesn't work in Chrome (that's what went wrong during the lecture) From Moodle: download the files for emails 6000-6019: f6-20-raw.txt and f6-20-normalised.txt You can paste in text, or upload the file Continue by hitting reveal
