  1. Empirical Methods in Natural Language Processing
     Lecture 1, Introduction (I): Words and Probability
     Philipp Koehn (lecture given by Tommy Herbert), 7 January 2008

     Welcome to EMNLP
     • Lecturer: Philipp Koehn
     • TA: Tommy Herbert
     • Lectures: Mondays and Thursdays, 17:10, DHT 4.18
     • Practical sessions: 4 extra sessions
     • Project (worth 30%) will be given out next week
     • Exam counts for 70% of the grade

  2. Outline
     • Introduction: words, probability, information theory, n-grams and language modeling
     • Methods: tagging, finite state machines, statistical modeling, parsing, clustering
     • Applications: word sense disambiguation, information retrieval, text categorisation, summarisation, information extraction, question answering
     • Statistical machine translation

     References
     • Manning and Schütze: "Foundations of Statistical Natural Language Processing", 1999, MIT Press, available online
     • Jurafsky and Martin: "Speech and Language Processing", 2000, Prentice Hall
     • Koehn: "Statistical Machine Translation", 2007, Cambridge University Press, not yet published
     • also: research papers, other handouts

  3. What are Empirical Methods in Natural Language Processing?
     • Empirical methods: work on corpora using statistical models or other machine learning methods
     • Natural language processing: computational linguistics vs. natural language processing

     Quotes
     "It must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." (Noam Chomsky, 1969)
     "Whenever I fire a linguist our system performance improves." (Frederick Jelinek, 1988)

  4. Conflicts?
     • Scientist vs. engineer
     • Explaining language vs. building applications
     • Rationalist vs. empiricist
     • Insight vs. data analysis

     Why is Language Hard?
     • Ambiguities on many levels
     • Rules, but many exceptions
     • No clear understanding of how humans process language → ignore humans, learn from data?

  5. Language as Data
     A lot of text is now available in digital form:
     • billions of words of news text distributed by the LDC
     • billions of documents on the web (trillions of words?)
     • tens of thousands of sentences annotated with syntactic trees for a number of languages (around one million words for English)
     • tens to hundreds of millions of words translated between English and other languages

     Word Counts
     One simple statistic: counting words in Mark Twain's Tom Sawyer:

     Word   Count
     the     3332
     and     2973
     a       1775
     to      1725
     of      1440
     was     1161
     it      1027
     in       906
     that     877

     (from Manning+Schütze, page 21)
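     A minimal sketch of how such a count table can be produced. The file name tom_sawyer.txt and the crude lowercase regex tokenisation are assumptions, not part of the lecture, so the exact numbers will depend on the text and tokeniser used.

```python
# Count word frequencies in a plain-text corpus (sketch; the file name and
# the simple tokeniser are assumptions).
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = re.findall(r"[a-z]+", text)   # very simple tokeniser
counts = Counter(tokens)

# Print the ten most frequent words and their counts.
for word, count in counts.most_common(10):
    print(f"{word}\t{count}")
```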

  6. Counts of counts
     count f   number of words with count f
     1         3993
     2         1292
     3          664
     4          410
     5          243
     6          199
     7          172
     ...        ...
     10          91
     11-50      540
     51-100      99
     > 100      102

     • 3993 singletons (words that occur only once in the text)
     • Most words occur only a very few times.
     • Most of the text consists of a few hundred high-frequency words.

     Zipf's Law
     Zipf's law: f × r = k

     Rank r   Word         Count f   f × r
     1        the             3332    3332
     2        and             2973    5944
     3        a               1775    5235
     10       he               877    8770
     20       but              410    8400
     30       be               294    8820
     100      two              104   10400
     1000     family             8    8000
     8000     applausive         1    8000
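     A sketch of how the counts-of-counts table and a rough Zipf check can be derived, assuming the counts Counter from the previous sketch is available; the values will differ somewhat from the slide depending on tokenisation.

```python
# Counts of counts and a rough check of Zipf's law (continuing the sketch above).
from collections import Counter

count_of_counts = Counter(counts.values())
print("singletons:", count_of_counts[1])   # words occurring exactly once

# Zipf's law: frequency f times rank r is roughly constant.
ranked = counts.most_common()
for r in (1, 2, 3, 10, 20, 30, 100, 1000):
    if r <= len(ranked):
        word, f = ranked[r - 1]
        print(r, word, f, f * r)
```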

  7. Probabilities
     • Given word counts we can estimate a probability distribution:
       p(w) = count(w) / Σ_w' count(w')
     • This type of estimation is called maximum likelihood estimation. Why? We will get to that later.
     • Estimating probabilities based on frequencies is called the frequentist approach to probability.
     • This probability distribution answers the question: if we randomly pick a word out of a text, how likely is it to be the word w?

     A bit more formal
     • We introduced a random variable W.
     • We defined a probability distribution p that tells us how likely it is that the variable W is the word w:
       prob(W = w) = p(w)
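     A sketch of the maximum likelihood (relative frequency) estimate p(w) = count(w) / Σ_w' count(w'), again assuming the counts Counter from the first sketch.

```python
# Relative-frequency (maximum likelihood) estimate of p(w),
# reusing the `counts` Counter from the earlier sketch.
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}

print(p.get("the", 0.0))                    # chance a randomly picked word is "the"
print(abs(sum(p.values()) - 1.0) < 1e-9)    # sanity check: probabilities sum to one
```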

  8. Joint probabilities
     • Sometimes, we want to deal with two random variables at the same time.
     • Example: words w1 and w2 that occur in sequence (a bigram). We model this with the distribution p(w1, w2).
     • If the occurrence of words in bigrams is independent, we can reduce this to p(w1, w2) = p(w1) p(w2). Intuitively, this is not the case for word bigrams.
     • We can estimate joint probabilities over two variables the same way we estimated the probability distribution over a single variable:
       p(w1, w2) = count(w1, w2) / Σ_w1',w2' count(w1', w2')

     Conditional probabilities
     • Another useful concept is the conditional probability p(w2 | w1). It answers the question: if the random variable W1 = w1, what is the value of the second random variable W2?
     • Mathematically, we can define conditional probability as
       p(w2 | w1) = p(w1, w2) / p(w1)
     • If W1 and W2 are independent: p(w2 | w1) = p(w2)
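     A sketch of bigram joint and conditional estimates built the same way, assuming the tokens list from the first sketch; the helper names are illustrative only.

```python
# Joint and conditional bigram probabilities (continuing the sketch above).
from collections import Counter

bigram_counts = Counter(zip(tokens, tokens[1:]))
total_bigrams = sum(bigram_counts.values())

def p_joint(w1, w2):
    """MLE of p(w1, w2) over adjacent word pairs."""
    return bigram_counts[(w1, w2)] / total_bigrams

def p_cond(w2, w1):
    """p(w2 | w1) = count(w1, w2) / count(w1 as the first word of a bigram)."""
    w1_total = sum(c for (first, _), c in bigram_counts.items() if first == w1)
    return bigram_counts[(w1, w2)] / w1_total if w1_total else 0.0

print(p_joint("of", "the"), p_cond("the", "of"))
```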

  9. Chain rule
     • A bit of math gives us the chain rule:
       p(w2 | w1) = p(w1, w2) / p(w1)
       p(w1) p(w2 | w1) = p(w1, w2)
     • What if we want to break down large joint probabilities like p(w1, w2, w3)? We can repeatedly apply the chain rule:
       p(w1, w2, w3) = p(w1) p(w2 | w1) p(w3 | w1, w2)

     Bayes rule
     • Finally, another important rule: Bayes rule
       p(x | y) = p(y | x) p(x) / p(y)
     • It can easily be derived from the chain rule:
       p(x, y) = p(x, y)
       p(x | y) p(y) = p(y | x) p(x)
       p(x | y) = p(y | x) p(x) / p(y)
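     A tiny numeric check of the chain rule and Bayes rule on a toy joint distribution; the numbers are made up purely for illustration.

```python
# Verify the chain rule and Bayes rule on a small made-up joint distribution.
joint = {            # p(x, y) for x in {a, b}, y in {c, d}; values sum to 1
    ("a", "c"): 0.30, ("a", "d"): 0.20,
    ("b", "c"): 0.10, ("b", "d"): 0.40,
}

def p_x(x):
    return sum(v for (xx, _), v in joint.items() if xx == x)

def p_y(y):
    return sum(v for (_, yy), v in joint.items() if yy == y)

def p_y_given_x(y, x):
    return joint[(x, y)] / p_x(x)

def p_x_given_y(x, y):
    return joint[(x, y)] / p_y(y)

# Chain rule: p(x) * p(y | x) == p(x, y)
print(abs(p_x("a") * p_y_given_x("c", "a") - joint[("a", "c")]) < 1e-12)

# Bayes rule: p(x | y) == p(y | x) * p(x) / p(y)
lhs = p_x_given_y("a", "c")
rhs = p_y_given_x("c", "a") * p_x("a") / p_y("c")
print(abs(lhs - rhs) < 1e-12)
```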
