CS11-747 Neural Networks for NLP: Language Modeling, Efficiency/Training Tricks. Graham Neubig. Site: https://phontron.com/class/nn4nlp2020/
Are These Sentences OK? • Jane went to the store. • store to Jane went the. • Jane went store. • Jane goed to the store. • The store went to Jane. • The food truck went to Jane.
Language Modeling: Calculating the Probability of a Sentence
$P(X) = \prod_{i=1}^{I} P(x_i \mid x_1, \ldots, x_{i-1})$ (next word given context)
The big problem: How do we predict $P(x_i \mid x_1, \ldots, x_{i-1})$?!?!
Covered Concept Tally
Review: Count-based Language Models
Count-based Language Models
• Count up the frequency and divide:
  $P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) := \frac{c(x_{i-n+1}, \ldots, x_i)}{c(x_{i-n+1}, \ldots, x_{i-1})}$
• Add smoothing, to deal with zero counts:
  $P(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) = \lambda P_{ML}(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) + (1-\lambda) P(x_i \mid x_{i-n+2}, \ldots, x_{i-1})$
• Modified Kneser-Ney smoothing
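To make the count-and-divide and interpolation formulas above concrete, here is a minimal sketch of a bigram model with unigram interpolation (the toy corpus, the fixed λ, and all variable/function names are invented for illustration; the Modified Kneser-Ney variant mentioned above is not implemented here):

```python
from collections import Counter

# Toy corpus; in practice this would be a large training set.
corpus = [["she", "bought", "a", "car"],
          ["she", "purchased", "a", "bicycle"]]

bigram_counts = Counter()   # c(x_{i-1}, x_i)
context_counts = Counter()  # c(x_{i-1})
unigram_counts = Counter()  # c(x_i)
total_words = 0

for sent in corpus:
    prev = "<s>"
    for w in sent + ["</s>"]:
        bigram_counts[(prev, w)] += 1
        context_counts[prev] += 1
        unigram_counts[w] += 1
        total_words += 1
        prev = w

def p_interp(word, prev, lam=0.9, vocab_size=10000):
    # P_ML: count up the frequency and divide
    p_ml = (bigram_counts[(prev, word)] / context_counts[prev]
            if context_counts[prev] > 0 else 0.0)
    # Interpolate with an (add-one smoothed) unigram distribution
    p_uni = (unigram_counts[word] + 1) / (total_words + vocab_size)
    return lam * p_ml + (1 - lam) * p_uni

print(p_interp("car", prev="a"))
```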
A Refresher on Evaluation
• Log-likelihood: $LL(\mathcal{E}_{test}) = \sum_{E \in \mathcal{E}_{test}} \log P(E)$
• Per-word Log Likelihood: $WLL(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} \log P(E)$
• Per-word (Cross) Entropy: $H(\mathcal{E}_{test}) = \frac{1}{\sum_{E \in \mathcal{E}_{test}} |E|} \sum_{E \in \mathcal{E}_{test}} -\log_2 P(E)$
• Perplexity: $ppl(\mathcal{E}_{test}) = 2^{H(\mathcal{E}_{test})} = e^{-WLL(\mathcal{E}_{test})}$
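These four quantities are tightly linked; the sketch below computes them together (sentence_log_prob is a hypothetical scoring function standing in for whatever model is being evaluated, returning log P(E) in natural log):

```python
import math

def evaluate(test_sentences, sentence_log_prob):
    """Per-word log likelihood, per-word cross entropy (bits), and perplexity."""
    total_ll = sum(sentence_log_prob(s) for s in test_sentences)   # LL(E_test)
    total_words = sum(len(s) for s in test_sentences)
    wll = total_ll / total_words                                   # WLL(E_test)
    entropy = -total_ll / (total_words * math.log(2))              # H(E_test)
    ppl = 2 ** entropy                                             # = math.exp(-wll)
    return wll, entropy, ppl
```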
What Can we Do w/ LMs?
• Score sentences:
  Jane went to the store . → high
  store to Jane went the . → low
  (same as calculating loss for training)
• Generate sentences:
  while didn’t choose end-of-sentence symbol:
    calculate probability
    sample a new word from the probability distribution
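A minimal sketch of this generation loop (next_word_distribution is a hypothetical function returning a word-to-probability dict for the sentence so far; it stands in for any of the models in this lecture):

```python
import random

def generate(next_word_distribution, eos="</s>", max_len=100):
    """Sample a sentence word by word until the end-of-sentence symbol."""
    sentence = []
    while len(sentence) < max_len:
        # Calculate the probability distribution over the next word
        dist = next_word_distribution(sentence)
        words, probs = zip(*dist.items())
        # Sample a new word from that distribution
        word = random.choices(words, weights=probs, k=1)[0]
        if word == eos:  # chose the end-of-sentence symbol
            break
        sentence.append(word)
    return sentence
```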
Problems and Solutions?
• Cannot share strength among similar words:
  she bought a car / she bought a bicycle / she purchased a car / she purchased a bicycle
  → solution: class-based language models
• Cannot condition on context with intervening words:
  Dr. Jane Smith / Dr. Gertrude Smith
  → solution: skip-gram language models
• Cannot handle long-distance dependencies:
  for tennis class he wanted to buy his own racquet / for programming class he wanted to buy his own computer
  → solution: cache, trigger, topic, syntactic models, etc.
An Alternative: Featurized Log-Linear Models
An Alternative: Featurized Models • Calculate features of the context • Based on the features, calculate probabilities • Optimize feature weights using gradient descent, etc.
Example: Previous words: “giving a"
For each word we’re predicting: b (how likely are they?), w_{1,a} (how likely given the prev. word is “a”?), w_{2,giving} (how likely given the 2nd-prev. word is “giving”?), and s (total score):
  word   b      w_{1,a}   w_{2,giving}   s
  a       3.0    -6.0      -0.2          -3.2
  the     2.5    -5.1      -0.3          -2.9
  talk   -0.2     0.2       1.0           1.0
  gift    0.1     0.1       2.0           2.2
  hat     1.2     0.5      -1.2           0.6
  …       …       …         …             …
Softmax
• Convert scores into probabilities by taking the exponent and normalizing (softmax):
  $P(x_i \mid x_{i-n+1}^{i-1}) = \frac{e^{s(x_i \mid x_{i-n+1}^{i-1})}}{\sum_{\tilde{x}_i} e^{s(\tilde{x}_i \mid x_{i-n+1}^{i-1})}}$
• e.g. s = [-3.2, -2.9, 1.0, 2.2, 0.6, …] → p = [0.002, 0.003, 0.329, 0.444, 0.090, …]
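Putting the feature table and the softmax together, here is a small numpy sketch of the log-linear computation for the “giving a” context (the vocabulary is truncated to the five words shown above, so the resulting probabilities are illustrative rather than the full-vocabulary values):

```python
import numpy as np

vocab   = ["a", "the", "talk", "gift", "hat"]          # truncated vocabulary
b       = np.array([ 3.0,  2.5, -0.2,  0.1,  1.2])     # how likely are they?
w1_a    = np.array([-6.0, -5.1,  0.2,  0.1,  0.5])     # ... given prev. word "a"
w2_give = np.array([-0.2, -0.3,  1.0,  2.0, -1.2])     # ... given 2nd-prev. word "giving"

# Total score: bias plus the feature weights selected by the context
s = b + w1_a + w2_give                # -> [-3.2, -2.9, 1.0, 2.2, 0.6]

# Softmax: exponentiate and normalize
p = np.exp(s) / np.exp(s).sum()
print(dict(zip(vocab, p.round(3))))
```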
A Computation Graph View
[Figure: lookup2(“giving”) + lookup1(“a”) + bias = scores; softmax(scores) = probs]
Each vector is the size of the output vocabulary.
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single vector from a big matrix of word embeddings (num. words × vector size), e.g. lookup(2) grabs row 2.
• Similarly, it can be viewed as multiplying the matrix by a “one-hot” vector (length num. words, with a 1 in position 2 and 0 everywhere else).
• The former tends to be faster.
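A quick sketch of this equivalence (the matrix here is random, just to show that row indexing and one-hot multiplication give the same vector):

```python
import numpy as np

num_words, vector_size = 5, 3
embeddings = np.random.randn(num_words, vector_size)   # num. words x vector size

# View 1: "grab" a single vector -- lookup(2) is just row 2
looked_up = embeddings[2]

# View 2: multiply by a one-hot vector
one_hot = np.zeros(num_words)
one_hot[2] = 1.0
multiplied = one_hot @ embeddings

assert np.allclose(looked_up, multiplied)   # same result; indexing is faster
```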
Training a Model
• Reminder: to train, we calculate a “loss function” (a measure of how bad our predictions are), and move the parameters to reduce the loss.
• The most common loss function for probabilistic models is “negative log likelihood”:
  e.g. for p = [0.002, 0.003, 0.329, 0.444, 0.090, …], if element 3 (or zero-indexed, 2) is the correct answer: -log(0.329) = 1.112
Parameter Update
• Back propagation allows us to calculate the derivative of the loss with respect to the parameters: $\frac{\partial \ell}{\partial \theta}$
• Simple stochastic gradient descent optimizes parameters according to the following rule:
  $\theta \leftarrow \theta - \alpha \frac{\partial \ell}{\partial \theta}$
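To connect the two slides above, here is a toy training step that pairs the negative log likelihood loss with the SGD rule (for brevity the parameters are just a single bias-style vector of scores, so the gradient is the standard softmax-plus-NLL gradient p - one_hot(correct); everything here is illustrative):

```python
import numpy as np

vocab_size = 5
theta = np.zeros(vocab_size)   # a single bias-style parameter vector, for brevity
alpha = 0.1                    # learning rate
correct = 2                    # index of the correct next word (zero-indexed)

for step in range(10):
    # Forward: scores -> softmax probabilities -> negative log likelihood
    p = np.exp(theta) / np.exp(theta).sum()
    loss = -np.log(p[correct])

    # Backward: for softmax + NLL, d loss / d scores = p - one_hot(correct)
    grad = p.copy()
    grad[correct] -= 1.0

    # SGD update: theta <- theta - alpha * d loss / d theta
    theta -= alpha * grad
    print(f"step {step}: loss = {loss:.3f}")
```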
Choosing a Vocabulary
Unknown Words
• Necessity for UNK words:
  • We won’t have all the words in the world in training data
  • Larger vocabularies require more memory and computation time
• Common ways to limit the vocabulary:
  • Frequency threshold (usually words with count <= 1 become UNK)
  • Rank threshold
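A small sketch of both thresholding strategies (the cut-off values and names are illustrative, not from a particular implementation):

```python
from collections import Counter

def build_vocab(corpus, min_count=2, max_rank=None):
    """Keep words above a frequency threshold, or only the top max_rank words;
    everything else is mapped to <unk>."""
    counts = Counter(w for sent in corpus for w in sent)
    if max_rank is not None:                       # rank threshold
        kept = {w for w, _ in counts.most_common(max_rank)}
    else:                                          # frequency threshold
        kept = {w for w, c in counts.items() if c >= min_count}
    return kept | {"<unk>"}

def unk_replace(sentence, vocab):
    return [w if w in vocab else "<unk>" for w in sentence]
```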
Evaluation and Vocabulary • Important: the vocabulary must be the same over models you compare • Or more accurately, all models must be able to generate the test set (it’s OK if they can generate more than the test set, but not less) • e.g. Comparing a character-based model to a word-based model is fair, but not vice-versa
Let’s try it out! ( loglin-lm.py )
What Problems are Handled?
• Cannot share strength among similar words:
  she bought a bicycle / she bought a car / she purchased a car / she purchased a bicycle
  → not solved yet 😟
• Cannot condition on context with intervening words:
  Dr. Jane Smith / Dr. Gertrude Smith
  → solved! 😁
• Cannot handle long-distance dependencies:
  for tennis class he wanted to buy his own racquet / for programming class he wanted to buy his own computer
  → not solved yet 😟
Beyond Linear Models
Linear Models can’t Learn Feature Combinations
  students take tests → high    teachers take tests → low
  students write tests → low    teachers write tests → high
• These can’t be expressed by linear features
• What can we do?
  • Remember combinations as features (individual scores for “students take”, “teachers write”) → feature space explosion!
  • Neural nets
Neural Language Models
• (See Bengio et al. 2003)
[Figure: concatenate lookup(“giving”) and lookup(“a”), compute the hidden layer tanh(W_1*h + b_1), then multiply by W and add the bias to get scores; softmax(scores) = probs]
Where is Strength Shared?
[Same figure, annotated:]
• Word embeddings: similar input words get similar vectors
• Similar contexts get similar hidden states
• Similar output words get similar rows in the softmax matrix
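A compact PyTorch sketch of this feedforward architecture (hyperparameters and word ids are made up, and this is a simplification rather than the course's actual nn-lm.py):

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, emb_size=64, hidden_size=128, context_len=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)        # similar input words -> similar vectors
        self.hidden = nn.Linear(context_len * emb_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)          # softmax matrix W and bias

    def forward(self, context_ids):
        # context_ids: (batch, context_len), e.g. ids for ["giving", "a"]
        embs = self.embed(context_ids).view(context_ids.size(0), -1)  # concatenate the lookups
        h = torch.tanh(self.hidden(embs))                              # tanh(W_1*h + b_1)
        return self.out(h)                                             # scores; softmax is folded into the loss

model = FeedForwardLM(vocab_size=10000)
scores = model(torch.tensor([[15, 7]]))                    # made-up ids for "giving a"
loss = nn.CrossEntropyLoss()(scores, torch.tensor([42]))   # NLL of the (made-up) correct next word
```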
What Problems are Handled?
• Cannot share strength among similar words:
  she bought a bicycle / she bought a car / she purchased a car / she purchased a bicycle
  → solved, and similar contexts as well! 😁
• Cannot condition on context with intervening words:
  Dr. Jane Smith / Dr. Gertrude Smith
  → solved! 😁
• Cannot handle long-distance dependencies:
  for tennis class he wanted to buy his own racquet / for programming class he wanted to buy his own computer
  → not solved yet 😟
Let’s Try it Out! ( nn-lm.py )
Tying Input/Output Embeddings
• We can share parameters between the input and output embeddings (Press et al. 2016, inter alia).
[Figure: instead of separate input lookups for “giving” and “a”, pick the corresponding rows from the softmax matrix W, then tanh(W_1*h + b_1), W, bias, scores, softmax, probs as before]
• Want to try? Delete the input embeddings, and instead pick a row from the softmax matrix.
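One way this tying could be implemented, sketched in PyTorch (an illustrative guess at the exercise above, not the course's reference solution):

```python
import torch
import torch.nn as nn

class TiedFeedForwardLM(nn.Module):
    def __init__(self, vocab_size, emb_size=64, context_len=2):
        super().__init__()
        self.out = nn.Linear(emb_size, vocab_size)              # softmax matrix W (+ bias)
        self.hidden = nn.Linear(context_len * emb_size, emb_size)
        # No separate input embedding table: inputs reuse rows of W

    def forward(self, context_ids):
        # "Pick a row" of the softmax matrix for each context word
        embs = self.out.weight[context_ids]                     # (batch, context_len, emb_size)
        h = torch.tanh(self.hidden(embs.flatten(start_dim=1)))  # tanh(W_1*h + b_1)
        return self.out(h)                                      # scores over the vocabulary
```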
Optimizers