Introduction to Artificial Intelligence
Natural Language Processing
Janyl Jumadinova
November 14, 2016
Credit: NLP Stanford
Question Answering: IBM’s Watson 2/25
Information Extraction 3/25
Sentiment Extraction 4/25 Source: Washington Post
Machine Translation 5/25
Language Technology 6/25
Ambiguity makes NLP hard 7/25
◮ Teacher Strikes Idle Kids
◮ Red Tape Holds Up New Bridges
◮ Juvenile Court to Try Shooting Defendant
◮ Local High School Dropouts Cut in Half
Other NLP Difficulties 8/25
Progress 9/25
◮ What tools do we need?
  ◮ Knowledge about language
  ◮ Knowledge about the world
  ◮ A way to combine knowledge sources
◮ How we generally do this:
  ◮ Probabilistic models built from language data
  ◮ P(“maison” → “house”) → high
  ◮ P(“L’avocat général” → “the general avocado”) → low
Basic Text Processing: Regular Expressions 10/25
◮ A formal language for specifying text strings.
◮ How can we search for any of these?
  woodchuck, woodchucks, Woodchuck, Woodchucks
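One way to cover all four spellings with a single pattern is a character class for the first letter plus an optional plural "s". The sketch below uses Python's `re` module; the sample sentence and the names `WOODCHUCK` and `matches` are illustrative only.

```python
import re

# [wW] accepts either case of the first letter; s? makes the plural optional.
WOODCHUCK = re.compile(r"[wW]oodchucks?")

text = "A woodchuck met two Woodchucks; the woodchucks ignored the Woodchuck."
matches = WOODCHUCK.findall(text)
print(matches)
```

The same pattern syntax works in most regex tools, so the choice of Python here is only for illustration.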
Regular Expressions: Disjunctions 11/25
Regular Expressions: Negation in Disjunction 12/25
◮ Negations: [^Ss]
◮ Caret means negation only when first in []
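A small sketch of both caret behaviours, using Python's `re` module. The sample strings and variable names are made up for illustration.

```python
import re

# Caret first inside the brackets: match any character that is NOT S or s.
first_non_s = re.search(r"[^Ss]", "Sssnake")
print(first_non_s.group())

# Caret anywhere else inside the brackets is a literal caret.
literal_carets = re.findall(r"[e^]", "look up a^b now")
print(literal_carets)
```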
Regular Expressions: More Disjunction 13/25
◮ Woodchuck is another name for groundhog!
◮ The pipe | for disjunction
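The pipe can be sketched in Python's `re` module as follows. The sample sentences are invented, and combining the pipe with character classes is one possible way to handle capitalisation, not the only one.

```python
import re

# Either whole word on the left of the pipe or on the right of it.
either = re.findall(r"groundhog|woodchuck", "The groundhog, or woodchuck, is a rodent.")
print(either)

# The pipe combines with character classes to allow a capitalised first letter.
mixed_case = re.findall(r"[gG]roundhog|[wW]oodchuck", "Woodchuck and Groundhog")
print(mixed_case)
```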
Regular Expressions: ? * + . 14/25
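A minimal sketch of the four operators in Python's `re` module: `?` (optional previous character), `*` (zero or more of the previous character), `+` (one or more) and `.` (any single character). The patterns, test words and the helper `fully_matches` are chosen here for illustration and are not taken from the slide.

```python
import re

def fully_matches(pattern, candidate):
    """Return True when the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

checks = {
    r"colou?r": ["color", "colour", "colouur"],  # ? : the u is optional
    r"oo*h!": ["oh!", "oooh!", "h!"],            # * : zero or more extra o's
    r"o+h!": ["oh!", "ooh!", "h!"],              # + : at least one o
    r"beg.n": ["begin", "begun", "begn"],        # . : exactly one character
}

results = {
    pattern: [fully_matches(pattern, word) for word in words]
    for pattern, words in checks.items()
}

for pattern, words in checks.items():
    print(pattern, dict(zip(words, results[pattern])))
```

In each row the first two candidates match and the third does not.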
Regular Expressions: Example Find all instances of the word “the” in a text 15/25
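A sketch of why this task is harder than it looks, using Python's `re` module. The sample sentence is invented. The naive pattern misses the capitalised "The" and also fires inside longer words, while the refined pattern uses a character class and word boundaries (`\b`, one possible fix among several).

```python
import re

text = "The other day the theology student bathed there, then read the book."

# Naive: misses "The" and matches inside "other", "theology", "bathed", ...
naive = re.findall(r"the", text)

# Refined: allow either case for t and require word boundaries on both sides.
precise = re.findall(r"\b[tT]he\b", text)

print(len(naive), naive)
print(len(precise), precise)
```

The two failure modes correspond to false negatives (missing "The") and false positives (matching inside "other").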
Basic Text Processing: Word tokenization 16/25
Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
How Many Words? 17/25
Simple Tokenization in UNIX 18/25
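The commands from this slide are not reproduced here. As a rough Python analogue of a split-on-non-letters, then count pipeline, the sketch below treats every maximal run of ASCII letters as a token and counts how often each one occurs. The function name `word_counts` and the sample text are hypothetical.

```python
import re
from collections import Counter

def word_counts(text):
    """Split on anything that is not an ASCII letter and count each token."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(tokens)

counts = word_counts("the cat sat on the mat. The end")

# Most frequent tokens first, similar to sorting the counts in descending order.
for token, count in counts.most_common():
    print(count, token)
```

Note that "the" and "The" are counted separately, which is one reason normalization is needed after tokenization.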
Basic Text Processing: Normalization 19/25
Issues in Tokenization 20/25
◮ Finland’s capital → Finland? Finlands? Finland’s?
◮ what’re, I’m, isn’t → what are, I am, is not
◮ Hewlett-Packard → Hewlett Packard?
◮ state-of-the-art → state of the art?
◮ Lowercase → lower-case? lowercase? lower case?
◮ San Francisco → one token or two?
◮ Language issues: French, German, Japanese, Chinese, ...
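Two of these choices, expanding contractions and splitting on hyphens, can be sketched as explicit options in a toy tokenizer. Everything here is an assumption for illustration: the `CONTRACTIONS` table holds only the three examples from the list, and `tokenize` with its `split_hyphens` flag is a hypothetical helper, not a standard API.

```python
import re

# Hypothetical lookup table with only the three contractions listed above.
CONTRACTIONS = {"what're": "what are", "i'm": "i am", "isn't": "is not"}

def tokenize(text, split_hyphens=True):
    """Lowercase, optionally split hyphenated words, and expand contractions."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    if split_hyphens:
        text = text.replace("-", " ")
    tokens = []
    for raw in re.findall(r"[a-z0-9'-]+", text):
        # An expanded contraction becomes two tokens; other words pass through.
        tokens.extend(CONTRACTIONS.get(raw, raw).split())
    return tokens

sentence = "I'm sure Hewlett-Packard isn't state-of-the-art"
split_tokens = tokenize(sentence)
joined_tokens = tokenize(sentence, split_hyphens=False)
print(split_tokens)
print(joined_tokens)
```

Neither setting is right in general; the sketch only makes the decision visible. Multiword names such as San Francisco would need a separate lookup step that is not shown.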
Basic Text Processing: Stemming 21/25
Stemming 22/25
◮ Reduce terms to their stems in information retrieval
◮ Stemming is crude chopping of affixes
  ◮ Language dependent
◮ Example: automate(s), automatic, automation all reduced to automat.
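The "crude chopping" idea can be sketched as stripping the first matching suffix from a short list. The suffix list below is hypothetical and was picked only so that the example words above collapse to "automat"; it is not a real stemmer.

```python
# Hypothetical suffix list, checked in order; chosen only for this example.
SUFFIXES = ("ion", "ic", "es", "e", "s")

def crude_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ("automate", "automates", "automatic", "automation")]
print(stems)

# The same chopping also mangles words it should leave alone.
print(crude_stem("nation"))
```

The second print shows the cost of the approach: "nation" is cut down to "nat", which is why practical stemmers add conditions to each rule.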
Porter’s Algorithm Most common English stemmer. 23/25
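Porter's algorithm applies a sequence of suffix-rewrite steps. As a partial illustration, the sketch below covers only Step 1a (plural handling: SSES → SS, IES → I, SS → SS, S → nothing); the remaining steps and their measure conditions are omitted, so this is not a complete stemmer. The function name `porter_step_1a` is chosen here for illustration.

```python
def porter_step_1a(word):
    """Apply only Step 1a of Porter's algorithm; the longest suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

step_1a = {w: porter_step_1a(w) for w in ("caresses", "ponies", "caress", "cats")}
print(step_1a)
```

Checking the rules from longest suffix to shortest is what keeps "caress" from losing its final "s".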
Sentence Segmentation 24/25
◮ !, ? are relatively unambiguous
◮ Period “.” is quite ambiguous
  - Sentence boundary
  - Abbreviations like Inc. or Dr.
  - Numbers like .02 or 4.3
◮ Build a binary classifier
  - Classifiers: hand-written rules, regular expressions, or machine learning
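A hand-written-rules version of such a binary classifier might look like the sketch below. The rules are assumptions made for illustration: the abbreviation list holds only the two examples above, and the "next word is capitalised" test is a common heuristic, not something the slide prescribes. Because the text is split on whitespace, numbers like 4.3 or .02 never end in a period and so are not treated as boundaries.

```python
# Hypothetical abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"inc.", "dr."}

def is_sentence_end(token, next_token=None):
    """Decide EndOfSentence (True) or NotEndOfSentence (False) for one token."""
    if token.endswith(("!", "?")):
        return True
    if not token.endswith("."):
        return False  # also covers 4.3 and .02, whose period is not final
    if token.lower() in ABBREVIATIONS:
        return False
    # Assumed heuristic: a final period ends a sentence when the text stops
    # or the next word starts with a capital letter.
    return next_token is None or next_token[:1].isupper()

def split_sentences(text):
    """Group whitespace-separated tokens into sentences using the rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_sentence_end(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

sentences = split_sentences(
    "Dr. Smith paid 4.3 million to Acme Inc. last year. Was it worth it? Yes!"
)
print(sentences)
```

Rules like these fail in easy-to-find cases, for example an abbreviation that really does end a sentence, which is the motivation for learned classifiers such as decision trees.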
Determining if a word is end-of-sentence: a Decision Tree 25/25