Deep Learning Basics Lecture 10: Neural Language Models
Princeton University COS 495 Instructor: Yingyu Liang
Natural Language Processing (NLP): the processing of human languages by computers. One of the oldest AI tasks, and one of the most important.
Language modeling: view a sentence as a sequence of tokens, e.g., a nine-token sentence ending in “duck.”
Tokens: x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9
A language model assigns a probability P[x_1, x_2, x_3, …, x_{τ−1}, x_τ] to each sequence of tokens (x_1, x_2, x_3, …, x_{τ−1}, x_τ).
n-gram model: assume each token depends only on the preceding n − 1 tokens (a Markovian assumption), so that
P[x_1, x_2, …, x_τ] = P[x_1, …, x_{n−1}] ∏_{t=n}^{τ} P[x_t | x_{t−n+1}, …, x_{t−1}]
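As a concrete illustration (not from the slides), a minimal Python sketch of scoring a sentence with this factorization for n = 2, assuming the conditional probabilities are already given; all numbers are made up:

    # Score a sentence under a bigram model (n = 2) via the factorization above.
    bigram_table = {              # P[x_t | x_{t-1}], hypothetical values
        ("the", "dog"): 0.1,
        ("dog", "ran"): 0.2,
        ("ran", "away"): 0.3,
    }
    start_prob = {"the": 0.05}    # P[x_1]

    def sequence_prob(tokens):
        # P[x_1] * prod_{t=2}^{tau} P[x_t | x_{t-1}]
        p = start_prob.get(tokens[0], 0.0)
        for prev, word in zip(tokens, tokens[1:]):
            p *= bigram_table.get((prev, word), 0.0)
        return p

    print(sequence_prob(["the", "dog", "ran", "away"]))  # 0.05 * 0.1 * 0.2 * 0.3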
The conditional probabilities come from the n-gram statistics: for all n-grams (x_{t−n+1}, …, x_{t−1}, x_t),
P[x_t | x_{t−n+1}, …, x_{t−1}] = P[x_{t−n+1}, …, x_{t−1}, x_t] / P[x_{t−n+1}, …, x_{t−1}]
Example (trigram model, n = 3):
P[the dog ran away] = P[the dog ran] · P[away | dog ran]
                    = P[the dog ran] · P[dog ran away] / P[dog ran]
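A sketch of the corresponding count-based estimate; the toy corpus is hypothetical:

    from collections import Counter

    # Estimate P[away | dog ran] by counting trigrams and bigrams.
    corpus = "the dog ran away . the dog ran home . the cat ran away .".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

    def cond_prob(w, context):
        # P[w | u v] = count(u v w) / count(u v)
        return trigrams[context + (w,)] / bigrams[context] if bigrams[context] else 0.0

    print(cond_prob("away", ("dog", "ran")))  # 1/2 in this toy corpus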
Sparsity issue: for rare n-grams the estimated P[…] is most likely to be 0, e.g.
P[dog ran away] = 0 and P[dog ran] = 0.
Back-off: when P[away | dog ran] does not work, use P[away | ran] as a replacement.
Interpolation: combine the two estimates P[away | ran] and P[away | dog ran].
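A sketch of both fixes, continuing the counting snippet above (the weight lam = 0.5 is an arbitrary choice):

    # Continues the previous snippet (corpus, bigrams, trigrams, cond_prob).
    from collections import Counter

    unigrams = Counter(corpus)

    def bigram_prob(w, prev):
        # P[w | prev] = count(prev w) / count(prev)
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

    def backoff_prob(w, context):
        # Back-off: use P[away | ran] when the trigram estimate is zero.
        p = cond_prob(w, context)
        return p if p > 0 else bigram_prob(w, context[-1])

    def interpolated_prob(w, context, lam=0.5):
        # Interpolation: lam * P[away | dog ran] + (1 - lam) * P[away | ran]
        return lam * cond_prob(w, context) + (1 - lam) * bigram_prob(w, context[-1])

    print(interpolated_prob("away", ("dog", "ran")))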
Word classing: another remedy is to replace each token with its class.
Neural language models instead fight sparsity by using a distributed representation of words (also called a word embedding).
Input encoding: the i-th word of the vocabulary is a one-hot vector, with a 1 in the i-th entry and 0 elsewhere.
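A minimal sketch of this encoding, with a hypothetical four-word vocabulary:

    import numpy as np

    vocab = ["the", "dog", "ran", "away"]           # hypothetical vocabulary
    index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        # 1 in the i-th entry, 0 elsewhere
        v = np.zeros(len(vocab))
        v[index[word]] = 1.0
        return v

    print(one_hot("dog"))  # [0. 1. 0. 0.]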
Traditional ways to capture word semantics: logic/grammar derivation, or a discrete probabilistic model.
Distributional hypothesis: “You shall know a word by the company it keeps” (Firth).
Build a word–word co-occurrence matrix P[w, w′] and take the word vector to be the corresponding row, v_w ≔ P[w, :].
More generally, build a word–context matrix P[w, c]; the context c can be replaced by a larger unit such as a phrase, and again v_w ≔ P[w, :].
Instead of the raw row vector for the word in P[w, w′], one can use pointwise mutual information:
PMI(w, w′) = ln( P[w, w′] / (P[w] P[w′]) )
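A sketch of building the co-occurrence matrix and its PMI variant; the toy corpus and the ±2 window are assumptions for illustration:

    import numpy as np

    corpus = "the dog ran away the cat ran away".split()   # hypothetical
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}

    # Count co-occurrences within a +/-2 window.
    C = np.zeros((len(vocab), len(vocab)))
    for t, w in enumerate(corpus):
        for u in range(max(0, t - 2), min(len(corpus), t + 3)):
            if u != t:
                C[idx[w], idx[corpus[u]]] += 1

    P = C / C.sum()                     # joint probabilities P[w, w']
    Pm = P.sum(axis=1, keepdims=True)   # marginals P[w]
    with np.errstate(divide="ignore"):
        PMI = np.log(P / (Pm * Pm.T))   # ln( P[w,w'] / (P[w] P[w']) )

    v_dog = P[idx["dog"], :]            # word vector = row of the matrix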
Figure from “Efficient Estimation of Word Representations in Vector Space”, by Mikolov, Chen, Corrado, and Dean.
word2vec (continuous bag-of-words): predict a word from the average of its context word vectors,
P[w_t | w_{t−2}, …, w_{t+2}] ∝ exp( v_{w_t} · mean(v_{w_{t−2}}, …, v_{w_{t+2}}) )
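A sketch of this scoring rule, with random vectors standing in for trained embeddings (vocabulary size and dimension are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 5, 8                      # hypothetical vocabulary size and dimension
    emb = rng.normal(size=(V, d))    # v_w for each word id

    def cbow_probs(context_ids):
        # P[w | context] proportional to exp( v_w . mean(context vectors) )
        h = emb[context_ids].mean(axis=0)
        scores = emb @ h
        e = np.exp(scores - scores.max())    # numerically stable softmax
        return e / e.sum()

    print(cbow_probs([0, 1, 3, 4]))  # a distribution over the 5 word ids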
The learned vectors show linear structure for analogies:
v_man − v_woman ≈ v_king − v_queen
v_run − v_running ≈ v_walk − v_walking
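Analogies are typically answered by nearest-neighbor search on v_woman − v_man + v_king; a sketch, with random placeholders where trained embeddings would go:

    import numpy as np

    # Placeholder vectors; in practice these come from word2vec or GloVe.
    vecs = {w: np.random.default_rng(i).normal(size=50)
            for i, w in enumerate(["man", "woman", "king", "queen"])}

    def analogy(a, b, c):
        # Solve a : b :: c : ?  by the nearest cosine neighbor of v_b - v_a + v_c.
        target = vecs[b] - vecs[a] + vecs[c]
        cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        return max((w for w in vecs if w not in (a, b, c)),
                   key=lambda w: cos(vecs[w], target))

    print(analogy("man", "woman", "king"))  # "queen" with real embeddings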
GloVe: fit word vectors and biases to the co-occurrence counts X by weighted least squares,
min Σ_{w,w′} f(X[w, w′]) ( v_w · v_{w′} + b_w + b_{w′} − ln X[w, w′] )²
where the b's are bias terms and f(x) = min{100, x}^{3/4}.
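A sketch of the weighting function and one term of this objective; the vectors, biases, and count below are placeholder arguments, not trained values:

    import numpy as np

    def f(x):
        # Weighting: grows like x^(3/4), capped so frequent pairs don't dominate.
        return min(100.0, x) ** 0.75

    def glove_term(v_w, v_c, b_w, b_c, X_wc):
        # f(X[w,w']) * ( v_w . v_w' + b_w + b_w' - ln X[w,w'] )^2
        return f(X_wc) * (v_w @ v_c + b_w + b_c - np.log(X_wc)) ** 2

    rng = np.random.default_rng(0)
    print(glove_term(rng.normal(size=10), rng.normal(size=10), 0.0, 0.0, 5.0))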
Lots of mysterious things:
What are the reasons behind the success of these methods?
What are the connections between them? Is there a unified framework?
Why do the word vectors have linear structure for analogies?
RAND-WALK: A Latent Variable Model Approach to Word Embeddings (Arora, Li, Liang, Ma, Risteski)
Can’t miss!