

1. Natural Language Processing CSCI 4152/6509 — Lecture 17: N-gram Model Smoothing
   Instructor: Vlado Keselj
   Time and date: 09:35–10:25, 14-Feb-2020
   Location: Dunn 135

2. Previous Lecture
   - N-gram model
     ◮ Language modeling
   - N-gram model assumption
   - Graphical representation
   - N-gram model as Markov chain
   - Perplexity
   - Text classification using language modeling
   - Reading: [JM] Ch. 4, N-Grams

3. N-gram Model Smoothing
   - Smoothing is used to avoid zero probabilities caused by sparse data
   - Some smoothing methods:
     ◮ Add-one smoothing (Laplace smoothing)
     ◮ Witten-Bell smoothing
     ◮ Good-Turing smoothing
     ◮ Kneser-Ney smoothing (new edition of [JM])

4. Example: Character Unigram Probabilities
   - Training example: mississippi
   - What are the letter unigram probabilities?
   - What would be the probability of the word ‘river’ based on this model?
     (see the sketch below)
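To make the question concrete, here is a minimal Python sketch (not from the slides) that computes the maximum-likelihood letter unigram probabilities from mississippi and shows why the unsmoothed model assigns ‘river’ probability zero:

```python
from collections import Counter

def unigram_mle(text):
    """Maximum-likelihood letter unigram probabilities: P(w) = #(w) / N."""
    counts = Counter(text)
    n = len(text)
    return {c: counts[c] / n for c in counts}

def word_prob(word, probs):
    """Probability of a word as the product of its letter probabilities."""
    p = 1.0
    for c in word:
        p *= probs.get(c, 0.0)  # letters never seen in training get probability 0
    return p

probs = unigram_mle("mississippi")
print(probs)                      # m: 1/11, i: 4/11, s: 4/11, p: 2/11
print(word_prob("river", probs))  # 0.0, because 'r', 'v', and 'e' were never seen
```

A single unseen letter zeroes out the whole product, which is exactly the sparse-data problem that smoothing addresses.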

5. Add-one Smoothing (Laplace Smoothing)
   - Idea: start with count 1 for all events
   - V = vocabulary size (number of unique tokens); N = length of the text in tokens
   - Smoothed unigram probabilities: $P(w) = \frac{\#(w) + 1}{N + V}$
   - Smoothed bigram probabilities: $P(a \mid b) = \frac{\#(ba) + 1}{\#(b) + V}$

6. Mississippi Example: Add-one Smoothing
   - Let us again consider the example trained on the word: mississippi
   - What are the letter unigram probabilities with add-one smoothing?
   - What is the probability of: river (see the sketch below)
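A sketch of the computation, assuming the vocabulary is the 26-letter English alphabet (an assumption; the slides do not fix V for this example):

```python
from collections import Counter
import string

def unigram_addone(text, vocab):
    """Add-one smoothed unigram probabilities: P(w) = (#(w) + 1) / (N + V)."""
    counts = Counter(text)
    n, v = len(text), len(vocab)
    return {c: (counts[c] + 1) / (n + v) for c in vocab}

# Assumption: the vocabulary is the lowercase English alphabet, so V = 26.
probs = unigram_addone("mississippi", string.ascii_lowercase)
print(probs["i"])  # (4 + 1) / (11 + 26) = 5/37
print(probs["r"])  # (0 + 1) / (11 + 26) = 1/37, no longer zero

p_river = 1.0
for c in "river":
    p_river *= probs[c]
print(p_river)     # (1 * 5 * 1 * 1 * 1) / 37**5 ≈ 7.2e-08
```

With smoothing, ‘river’ receives a small but nonzero probability instead of 0.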

7. Witten-Bell Discounting
   - Idea from data compression (Witten and Bell, 1991)
   - Encode tokens as numbers as they are read
   - Use a special (escape) code to introduce a new token
   - The frequency of ‘escape’ estimates the probability of unseen events
   - Consider again the example: mississippi
   - What is the probability of: river

8. Witten-Bell Discounting: Formulae
   - Modified unigram probability for a seen token: $P(w) = \frac{\#(w)}{n + r}$
   - Probability of each unseen token: $P(w) = \frac{r}{(n + r)(|V| - r)}$
   - Here $n$ is the number of training tokens and $r$ is the number of distinct
     tokens seen in training (a sketch follows below)
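A minimal sketch of these formulae, again assuming a 26-letter alphabet as the vocabulary; for mississippi, n = 11 tokens and r = 4 distinct letters:

```python
from collections import Counter
import string

def unigram_witten_bell(text, vocab):
    """Witten-Bell smoothed unigram probabilities.

    Seen w:   P(w) = #(w) / (n + r)
    Unseen w: P(w) = r / ((n + r) * (|V| - r))
    where n = number of tokens and r = number of distinct seen tokens.
    """
    counts = Counter(text)
    n, r = len(text), len(counts)
    p_unseen = r / ((n + r) * (len(vocab) - r))
    return {c: counts[c] / (n + r) if c in counts else p_unseen for c in vocab}

probs = unigram_witten_bell("mississippi", string.ascii_lowercase)
print(probs["i"])  # 4/15, a seen letter
print(probs["r"])  # 4 / (15 * 22) ≈ 0.0121, an unseen letter

p_river = 1.0
for c in "river":
    p_river *= probs[c]
print(p_river)     # (4/15) * (4/330)**4 ≈ 5.8e-09
```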

9. Higher-order N-grams
   - Modified probability for seen bigrams: $P(a \mid b) = \frac{\#(ba)}{\#(b) + r_b}$
   - Remaining probability mass for unseen events: $\frac{r_b}{\#(b) + r_b}$
     (here $r_b$ is the number of distinct tokens seen after $b$)
   - Estimate for unseen bigrams starting with $b$, where $N_b$ is the set of
     tokens that never follow $b$ in the training text (a sketch follows below):
     $P(a \mid b) = \frac{r_b}{\#(b) + r_b} \cdot \frac{P(a)}{\sum_{x \in N_b} P(x)}$
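A sketch of the bigram case over characters. Using the MLE unigram for $P(a)$ in the back-off term is an assumption here (the slides do not fix the unigram estimate); note that with MLE unigrams, a token unseen even as a unigram still gets probability zero.

```python
from collections import Counter
import string

def witten_bell_bigram(text, vocab):
    """Witten-Bell smoothed character bigram model P(a | b).

    Seen (b, a):   P(a|b) = #(ba) / (#(b) + r_b)
    Unseen (b, a): P(a|b) = r_b / (#(b) + r_b) * P(a) / sum(P(x) for x in N_b)
    where r_b is the number of distinct tokens seen after b, and N_b is
    the set of vocabulary tokens never seen after b.
    """
    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))
    n = len(text)
    followers = {b: {a for (x, a) in bi if x == b} for b in uni}

    def p(a, b):  # assumes b occurs in the training text
        r_b = len(followers[b])
        if (b, a) in bi:
            return bi[(b, a)] / (uni[b] + r_b)
        # Back off to MLE unigrams, renormalized over the unseen followers N_b
        norm = sum(uni[x] / n for x in vocab if x not in followers[b])
        return r_b / (uni[b] + r_b) * (uni[a] / n) / norm

    return p

p = witten_bell_bigram("mississippi", string.ascii_lowercase)
print(p("s", "i"))  # seen bigram 'is': 2 / (4 + 2) = 1/3
print(p("m", "i"))  # unseen bigram 'im': escape mass times renormalized P(m)
```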

10. The Next Model: HMM
    - HMM — Hidden Markov Model
    - Typically used to annotate sequences of tokens
    - Most common annotation: Part-of-Speech tags (POS tags)
    - First, we will review the parts of speech in English

11. Part-of-Speech Tags (POS Tags)
    - Reading: Sections 5.1–5.2 (Ch. 8 in the new edition)
    - Word classes are called Part-of-Speech (POS) classes
      ◮ also known as syntactic categories, grammatical categories, or lexical categories
    - Ambiguous example: Time flies like an arrow.
          Time flies like an arrow .
      1.   N    V     P    D   N
      2.   N    N     V    D   N
      . . .
    - POS tags: labels used to indicate the POS class
    - POS tagging: the task of assigning POS tags (a small NLTK example follows)
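As an illustration (not part of the slides), NLTK ships an off-the-shelf tagger that assigns Penn Treebank tags; the exact tagging of this ambiguous sentence depends on the tagger model and version:

```python
import nltk

# One-time model downloads for the tokenizer and the tagger
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Time flies like an arrow.")
print(nltk.pos_tag(tokens))
# Expect Penn Treebank tags, e.g. something like:
# [('Time', 'NN'), ('flies', 'VBZ'), ('like', 'IN'),
#  ('an', 'DT'), ('arrow', 'NN'), ('.', '.')]
```

The tagger commits to a single analysis; the point of the slide is that several tag sequences are grammatically possible.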

12. POS Tag Sets
    - Traditionally based on an Ancient Greek source: eight parts of speech
      ◮ nouns, verbs, pronouns, prepositions, adverbs, conjunctions, participles, and articles
    - Computer processing introduced a need for a larger set of categories
    - Useful in NLP, e.g., for named entity recognition and information extraction
    - Various POS tag sets (in NLP): Brown Corpus, Penn Treebank, CLAWS, C5, C7, . . .
    - We will use the Penn Treebank system of tags

13. WSJ Dataset
    - WSJ — Wall Street Journal data set
    - Most commonly used to train and test POS taggers
    - Consists of 25 sections, about 1.2 million words
    - Example (the NLTK snippet below loads a sample of this data):
        Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB
        the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
        Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/,
        the/DT Dutch/NNP publishing/VBG group/NN ./.
        Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC former/JJ
        chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/,
        was/VBD named/VBN
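The full WSJ data set is licensed, but a roughly 10% sample ships with NLTK, which is enough to reproduce the beginning of the example above (a sketch, assuming NLTK is installed):

```python
import nltk
nltk.download("treebank")  # ~10% sample of the WSJ portion of the Penn Treebank

from nltk.corpus import treebank

# First tagged words of the sample, e.g.:
# [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
#  ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD')]
print(treebank.tagged_words()[:8])
```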

14. Open and Closed Categories
    - Word POS categories are divided into two sets: open and closed categories
    - Open categories:
      ◮ dynamic set
      ◮ content words
      ◮ larger set
      ◮ e.g., nouns, verbs, adjectives
    - Closed categories (or functional categories):
      ◮ fixed set
      ◮ small set
      ◮ frequent words
      ◮ e.g., articles, auxiliaries, prepositions

15. Open Word Categories
    - nouns (NN, NNS, NNP, NNPS)
      ◮ concepts, objects, people, and similar
    - adjectives (JJ, JJR, JJS)
      ◮ modify (describe) nouns
    - verbs (VB, VBP, VBZ, VBG, VBD, VBN)
      ◮ actions
    - adverbs (RB, RBR, RBS)
      ◮ modify verbs, but other words too

16. Nouns (NN, NNS, NNP, NNPS)
    - Nouns refer to people, animals, objects, concepts, and similar
    - Features:
      ◮ number: singular, plural
      ◮ case: subject (nominative), object (accusative)
    - Some languages have more cases and more number values
    - Some languages have grammatical gender

17. Noun Tags and Examples
    - NN for common singular nouns; e.g., company, year, market
    - NNS for common plural nouns; e.g., shares, years, sales, prices, companies
    - NNP for proper nouns (names); e.g., Bush, Japan, Federal, New York, Corp,
      Mr., Friday, James A. Talcott (“James/NNP A./NNP Talcott/NNP”)
    - NNPS for proper plural nouns; e.g., Canadians, Americans, Securities,
      Systems, Soviets, Democrats
