Language Models. Prof. Srijan Kumar with Roshan Pati and Arindum Roy



  1. CSE 6240: Web Search and Text Mining, Spring 2020. Language Models. Prof. Srijan Kumar with Roshan Pati and Arindum Roy

  2. Language Models • What are language models? • Statistical language models – Unigram, bigram and n-gram language models • Neural language models

  3. Language Models: Objective • Key question: How well does a model represent the language? – Character language model: Given alphabet vocabulary V, models the probability of generating strings in the language – Word language model: Given word vocabulary V, models the probability of generating sentences in the language

  4. Language Model: Applications • Assign a probability to sentences – Machine translation: P(high wind tonight) > P(large wind tonight) – Spell correction: "The office is about fifteen minuets from my house": P(about fifteen minutes from) > P(about fifteen minuets from) – Speech recognition: P(I saw a van) >> P(eyes awe of an) – Information retrieval: use words that you expect to find in matching documents as your query – Many more: summarization, question answering, and more

  5. Language Models • What are language models? • Statistical language models • Neural language models

  6. Language Model: Definition • Goal: Compute the probability of a sentence or sequence of words: P(s) = P(w_1, w_2, ..., w_n) • Related task: Probability of an upcoming word: P(w_5 | w_1, w_2, w_3, w_4) • A model that computes either of these is a language model • How to compute the joint probability? – Intuition: apply the chain rule

  7. How To Compute Sentence Probability? • Given sentence s = t_1 t_2 t_3 t_4 • Applying the chain rule under language model M: P(s | M) = P(t_1 | M) x P(t_2 | t_1, M) x P(t_3 | t_1, t_2, M) x P(t_4 | t_1, t_2, t_3, M)
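A minimal sketch of this chain-rule computation in Python; the cond_prob argument is a hypothetical stand-in for whatever conditional probabilities the language model M provides (a unigram model would ignore the history, a bigram model would look only at the last token):

    def sentence_prob(tokens, cond_prob):
        """Chain rule: P(s) = product over i of P(t_i | t_1 ... t_{i-1})."""
        prob = 1.0
        for i, token in enumerate(tokens):
            prob *= cond_prob(token, tokens[:i])  # P(t_i | history so far)
        return prob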

  8. Complexity of Language Models • The complexity of a language model depends on the window of word-word or character-character dependency it can handle • Common types are: – Unigram language model – Bigram language model – N-gram language model

  9. Unigram Model • A unigram language model only models the probability of each word in isolation – Does NOT model word-word dependency – Word order is irrelevant – Akin to the "bag of words" model

  10. Bigram Model • A bigram language model models the dependency between consecutive words – Does NOT model longer dependencies – Word order is relevant here

  11. N-gram Model • An n-gram language model models longer sequences of word dependency (a window of n words) – Most complex among the three

  12. Unigram Language Model: Example • What is the probability of the sentence s under language model M? • Example: s = "the man likes the woman"

    Language Model M
    Word    Probability
    the     0.2
    a       0.1
    man     0.01
    woman   0.01
    said    0.03
    likes   0.02

  P(s | M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01 = 0.00000008
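A small sketch of this computation, using the probability table from the slide; giving words outside the table probability 0 is a simplifying assumption:

    # Unigram language model M from the slide
    M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

    def unigram_prob(sentence, model):
        """P(s | M) is the product of the per-word probabilities; word order is ignored."""
        prob = 1.0
        for word in sentence.split():
            prob *= model.get(word, 0.0)  # unseen words get probability 0 in this sketch
        return prob

    print(unigram_prob("the man likes the woman", M))  # 8e-08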

  13. Comparing Language Models • Given two language models, how can we decide which one is better? • Solution: – Take a set S of sentences we want to model – For each language model, find the probability of each sentence and average the probability scores – The language model with the highest average probability is the better fit for the language

  14. Comparing Language Models • s: "the man likes the woman" • M1: 0.2 x 0.01 x 0.02 x 0.2 x 0.01, so P(s|M1) = 0.00000008 • M2: 0.1 x 0.1 x 0.01 x 0.1 x 0.1, so P(s|M2) = 0.000001 • P(s|M2) > P(s|M1), so M2 is the better language model

    Word    P under M1    P under M2
    the     0.2           0.1
    a       0.1           0.02
    man     0.01          0.1
    woman   0.01          0.1
    said    0.03          0.02
    likes   0.02          0.01
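A sketch of the comparison procedure from the previous slide, reusing the unigram_prob helper above; using the single example sentence as the test set is just for illustration:

    M1 = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
    M2 = {"the": 0.1, "a": 0.02, "man": 0.1, "woman": 0.1, "said": 0.02, "likes": 0.01}

    def average_prob(sentences, model):
        """Average sentence probability over a test set S."""
        return sum(unigram_prob(s, model) for s in sentences) / len(sentences)

    S = ["the man likes the woman"]
    print(average_prob(S, M1))  # 8e-08
    print(average_prob(S, M2))  # 1e-06, so M2 fits this set better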

  15. Estimating Probabilities • N-gram conditional probabilities can be estimated from raw occurrence counts in the observed corpus • Unigram: P(w_i) = C(w_i) / N, where N is the total number of tokens • Bigram: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}) • N-gram: P(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1}, ..., w_i) / C(w_{i-n+1}, ..., w_{i-1})
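A minimal sketch of this count-based (maximum-likelihood) bigram estimation; the tiny corpus below is made up purely for illustration:

    from collections import Counter

    # Toy corpus, one sentence per string (illustrative only)
    corpus = ["i want chinese food", "i want to eat", "tell me about chinese food"]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def bigram_prob(prev, word):
        """P(word | prev) = C(prev, word) / C(prev)."""
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_prob("i", "want"))  # 2/2 = 1.0 in this toy corpus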

  16. Estimating Bigram Probabilities: Case Study • Corpus: Berkeley Restaurant Project sentences

  17. Raw Bigram Counts: Case Study • Bigram count matrix created from 9222 sentences

  18. Raw Bigram Probabilities: Case Study • Start from the unigram counts • Normalize the bigram counts by the unigram counts: P(want | i) = C(i, want) / C(i) = 827 / 2533 = 0.33
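The same normalization in two lines, using only the counts given on the slide:

    count_i_want = 827   # C(i, want) from the bigram count matrix
    count_i = 2533       # C(i) from the unigram counts
    print(count_i_want / count_i)  # ~0.3265, i.e. the 0.33 reported on the slide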

  19. Language Models • What are language models? • Statistical language models – Unigram, bigram, and n-gram language models • Neural language models • Language models for IR

  20. Neural Language Models • So far, the language models have been based on statistics and counting • Now, language models are built with neural networks / deep learning • Key question: how do we model sequences?

  21. Neural-based Bigram Language Model • Each input word is represented with a one-hot encoding • Problem: does not model sequential information (too local)
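A rough sketch of what such a neural bigram model can look like; the vocabulary, single weight matrix, and softmax layer below are illustrative assumptions rather than the exact architecture on the slide:

    import numpy as np

    vocab = ["the", "man", "likes", "woman"]   # illustrative vocabulary
    V = len(vocab)

    def one_hot(word):
        vec = np.zeros(V)
        vec[vocab.index(word)] = 1.0
        return vec

    W = np.random.randn(V, V) * 0.01  # learned weights: previous word -> next-word scores

    def next_word_probs(prev_word):
        """P(next | prev): softmax over a linear transform of the one-hot input."""
        scores = W @ one_hot(prev_word)
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()

    print(dict(zip(vocab, next_word_probs("the"))))

Only the previous word is visible to the model, which is exactly the "too local" limitation the slide points out.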

  22. Sequences in Inputs or Outputs?

  23. Sequences in Inputs or Outputs?

  24. Key Conceptual Ideas • Parameter sharing – in computational graphs, sharing a parameter means adding the gradients from each place it is used • "Unrolling" – in computational graphs with parameter sharing • Parameter sharing + "unrolling" – Allows modeling arbitrary-length sequences! – Keeps the number of parameters in check

  25. Recurrent Neural Network

  26. Recurrent Neural Network • We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t) • The same function f_W is used at every time step and shared across all data

  27. (Vanilla) Recurrent Neural Network • [Figure: the recurrence written out with learned weight matrices]
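For reference, a standard vanilla RNN step uses three learned weight matrices; the W_xh, W_hh, W_hy naming below is the common convention and is assumed here rather than copied from the slide:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
        """One vanilla RNN time step: update the hidden state, then emit an output."""
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # h_t = f_W(h_{t-1}, x_t)
        y_t = W_hy @ h_t + b_y                           # output read off the hidden state
        return h_t, y_t

The same matrices are reused at every time step; that weight sharing is what lets the unrolled graph handle arbitrary-length sequences with a fixed number of parameters.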

  28. RNN Computational Graph • [Figure: the graph unrolled over time, from the initial hidden state through the inputs at times 1, 2, 3 to the final hidden state]

  29. RNN Computational Graph • The same weight matrix W is shared across all time steps

  30. RNN Computational Graph: Many to Many • The many-to-many architecture has one output per time step (outputs at times 1, 2, 3, ...)

  31. RNN Computational Graph: Many to Many • A loss is computed at each time step (loss at time 1, loss at time 2, loss at time 3, ...)

  32. RNN Computational Graph: Many to Many • The total loss is the sum of the per-time-step losses
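A sketch of the many-to-many forward pass, reusing the rnn_step function above; cross-entropy against integer targets is an assumption for the per-step loss:

    def forward_many_to_many(xs, ys, h0, params):
        """Run the RNN over a sequence and sum a per-time-step loss into the total loss."""
        W_xh, W_hh, W_hy, b_h, b_y = params
        h, total_loss = h0, 0.0
        for x_t, y_true in zip(xs, ys):                    # one input and one target per step
            h, y_t = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
            probs = np.exp(y_t - y_t.max()); probs /= probs.sum()  # softmax over output scores
            total_loss += -np.log(probs[y_true])                   # cross-entropy loss at this step
        return total_loss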

  33. RNN Computational Graph: Many to one • The many-to-one architecture has one final output

  34. RNN Computational Graph: One to many • The one-to-many architecture has one input and several outputs

  35. Example: Character-level Language Model • Input: one-hot representation of the characters • Vocabulary: ['h', 'e', 'l', 'o']

  36. Example: Character-level Language Model • Transform every input into a hidden vector

  37. Example: Character-level Language Model • Transform each hidden vector into an output vector
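Putting the three example slides together, a minimal end-to-end sketch of the character-level model on the vocabulary ['h', 'e', 'l', 'o']; the hidden size and the random, untrained weights are assumptions for illustration:

    import numpy as np

    chars = ['h', 'e', 'l', 'o']
    char_to_idx = {c: i for i, c in enumerate(chars)}
    V, H = len(chars), 3                        # vocabulary size, hidden size (assumed)

    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(H, V))   # input -> hidden
    W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden
    W_hy = rng.normal(scale=0.1, size=(V, H))   # hidden -> output

    h = np.zeros(H)
    for ch in "hell":                           # predict the next character at each step
        x = np.zeros(V); x[char_to_idx[ch]] = 1.0          # one-hot input
        h = np.tanh(W_xh @ x + W_hh @ h)                    # hidden vector
        y = W_hy @ h                                        # output vector (scores over chars)
        probs = np.exp(y - y.max()); probs /= probs.sum()   # softmax
        print(ch, '->', chars[int(np.argmax(probs))])       # untrained, so predictions are arbitrary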
