Statistical NLP, Spring 2011
Lecture 3: Language Models II
Dan Klein – UC Berkeley

Smoothing
- We often want to make estimates from sparse statistics, e.g.
    P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request  (7 total)
- Smoothing flattens spiky distributions so they generalize better:
    P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other  (7 total)
  [bar charts over allegations, reports, claims, request, charges, benefits, motion, …]
- Very important all over NLP, but easy to do badly!
- We'll illustrate with bigrams today (h = previous word, but could be anything)

Kneser-Ney
- Kneser-Ney smoothing combines these two ideas (see the sketch below):
  - Absolute discounting
  - Lower-order continuation probabilities
- KN smoothing repeatedly proven effective
- Why should things work like this?

Predictive Distributions
- Parameter estimation: from the data "a b a c", θ = P(w) = [a:0.5, b:0.25, c:0.25]
- With parameter variable: [graphical model in which Θ generates the observations a b a c]
- Predictive distribution: [same model, now also generating the next word W given Θ]

Hierarchical Models
[diagram: context-specific distributions Θa, Θb, …, Θg tied together by a shared parent Θ0, each generating its own observed words]
[MacKay and Peto, 94, Teh 06]

"Chinese Restaurant" Processes
- Dirichlet Process:
  - P(next in class k) ∝ c_k
  - P(next in new class) ∝ α
- Pitman-Yor Process:
  - P(next in class k) ∝ c_k − d
  - P(next in new class) ∝ α + dK   (K = number of classes so far, d = discount)
- Each new class is labeled with a word drawn from the base distribution, here uniform: P(label = w) = θ0(w) = 1/V
[Teh, 06, diagrams from Teh]
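To make the Kneser-Ney slide above concrete, here is a minimal sketch of interpolated Kneser-Ney for bigrams, combining absolute discounting with a lower-order continuation distribution. The class and field names are illustrative assumptions (not the lecture's reference code), and the count tables are assumed to have been filled from training data elsewhere.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of interpolated Kneser-Ney for bigrams. */
class KneserNeyBigram {
    // Hypothetical count tables, assumed filled from training data elsewhere.
    Map<String, Map<String, Integer>> bigramCounts = new HashMap<>(); // c(h, w)
    Map<String, Integer> historyTotals = new HashMap<>();             // sum over w of c(h, w)
    Map<String, Integer> continuationCounts = new HashMap<>();        // number of distinct h with c(h, w) > 0
    int totalBigramTypes = 0;                                         // number of distinct (h, w) pairs seen
    double d = 0.75;                                                  // absolute discount

    double prob(String h, String w) {
        Map<String, Integer> row = bigramCounts.getOrDefault(h, Collections.emptyMap());
        int cHW = row.getOrDefault(w, 0);
        int cH = historyTotals.getOrDefault(h, 0);
        if (cH == 0) return continuationProb(w); // unseen history: use the lower-order model alone

        // Absolute discounting: subtract d from every observed bigram count.
        double discounted = Math.max(cHW - d, 0.0) / cH;

        // The mass freed by discounting is spread over the continuation distribution.
        double lambda = d * row.size() / cH;
        return discounted + lambda * continuationProb(w);
    }

    // Continuation probability: how many distinct histories precede w,
    // normalized by the total number of bigram types ("how novel is w?").
    double continuationProb(String w) {
        return (double) continuationCounts.getOrDefault(w, 0) / totalBigramTypes;
    }
}

The continuation probability is what separates Kneser-Ney from plain absolute discounting: a word that is frequent overall but follows very few distinct histories gets little lower-order mass, which is exactly the behavior the discounted Pitman-Yor process above produces.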
What Actually Works?
- Trigrams and beyond:
  - Unigrams and bigrams are generally useless on their own
  - Trigrams are much better (when there's enough data)
  - 4- and 5-grams are really useful in MT, but not so much for speech
- Discounting:
  - Absolute discounting, Good-Turing, held-out estimation, Witten-Bell
- Context counting:
  - Kneser-Ney construction of lower-order models
- See [Chen+Goodman] reading for tons of graphs!

Data >> Method?
- Having more data is better…
  [graph: test entropy vs. n-gram order (1–20) for Katz and Kneser-Ney smoothing at 100,000, 1,000,000, 10,000,000, and all training tokens; graphs from Joshua Goodman]
- … but so is using a better estimator
- Another issue: N > 3 has huge costs in speech and MT decoders

Tons of Data?
[graph from Brants et al., 2007]

Large Scale Methods
- Language models get big, fast:
  - English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams
  - Google N-grams: 13M unigrams, 0.3G bigrams, ~1G 3-, 4-, 5-grams
- Need to access entries very often, ideally in memory
- What do you do when language models get too big?
  - Distributing LMs across machines
  - Quantizing probabilities
  - Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]

A Simple Java Hashmap?
- Per 3-gram, at best:
  - 1 pointer = 8 bytes
  - 1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers)
  - 1 Double = 8 bytes (obj) + 8 bytes (double)
  - 1 String[] = 8 bytes (obj) + 3x8 bytes (pointers), assuming Strings are canonicalized
  - Total: > 88 bytes
- Obvious alternatives: sorted arrays, open addressing (see the sketch below)

Word+Context Encodings
[figure-only slide]
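To show the kind of alternative the hashmap slide has in mind, here is a hypothetical open-addressing table over primitive arrays: a trigram of integer word ids is packed into a single long and paired with one float, roughly 12 bytes per stored entry (plus empty slots) instead of 88+. The 21-bit word ids, linear probing, and class name are illustrative assumptions, not the lecture's implementation.

/**
 * Sketch of a compact trigram store using open addressing over primitive
 * arrays, avoiding per-entry object overhead. Assumes word ids fit in
 * 21 bits and that the table is never completely full.
 */
class CompactTrigramTable {
    private final long[] keys;     // packed (w1, w2, w3); 0 = empty slot
    private final float[] values;  // log-probabilities (could be quantized further)

    CompactTrigramTable(int capacity) {
        keys = new long[capacity];
        values = new float[capacity];
    }

    // Pack three 21-bit ids into bits 0..62; bit 63 marks "occupied" so keys are never 0.
    private static long pack(int w1, int w2, int w3) {
        return ((long) w1 << 42) | ((long) w2 << 21) | w3 | (1L << 63);
    }

    void put(int w1, int w2, int w3, float logProb) {
        int i = slotFor(pack(w1, w2, w3));
        keys[i] = pack(w1, w2, w3);
        values[i] = logProb;
    }

    float get(int w1, int w2, int w3, float missing) {
        long key = pack(w1, w2, w3);
        int i = slotFor(key);
        return keys[i] == key ? values[i] : missing;
    }

    // Linear probing: walk forward until we find this key or an empty slot.
    private int slotFor(long key) {
        int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % keys.length;
        while (keys[i] != 0 && keys[i] != key) i = (i + 1) % keys.length;
        return i;
    }
}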
Word+Context Encodings / Compression / Memory Requirements / Speed and Caching / Full LM / LM Interfaces
[figure-only slides]

Approximate LMs
- Simplest option: hash-and-hope
  - Array of size K ~ N
  - (optional) store hash of keys
  - Store values in direct-address or open addressing
  - Collisions: store the max
  - What kind of errors can there be?
- More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
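A minimal sketch of the hash-and-hope idea above, under illustrative assumptions (log-probability values, a 16-bit fingerprint as the optional "hash of keys", direct addressing): n-grams are hashed into a fixed-size array, collisions keep the max value, and the fingerprint catches most cross-n-gram lookups.

/** Sketch of "hash-and-hope": fixed-size array, store max on collision, optional key fingerprint. */
class HashAndHopeLM {
    private final float[] values;       // one value per slot, NEGATIVE_INFINITY = empty
    private final short[] fingerprints; // optional second hash of the key, to catch most collisions

    HashAndHopeLM(int numSlots) {
        values = new float[numSlots];
        java.util.Arrays.fill(values, Float.NEGATIVE_INFINITY);
        fingerprints = new short[numSlots];
    }

    void put(String ngram, float logProb) {
        int slot = Math.floorMod(ngram.hashCode(), values.length);
        // Collisions: keep the larger value rather than tracking both keys.
        if (logProb > values[slot]) {
            values[slot] = logProb;
            fingerprints[slot] = fingerprint(ngram);
        }
    }

    float get(String ngram, float unseenLogProb) {
        int slot = Math.floorMod(ngram.hashCode(), values.length);
        if (values[slot] == Float.NEGATIVE_INFINITY) return unseenLogProb;
        // Errors: an n-gram that lost its slot to a higher-valued collider is treated as
        // unseen here, and an accidental fingerprint match returns another n-gram's score.
        if (fingerprints[slot] != fingerprint(ngram)) return unseenLogProb;
        return values[slot];
    }

    // A second hash (different multiplier than String.hashCode) used as a small fingerprint.
    private static short fingerprint(String ngram) {
        int h = 0;
        for (int i = 0; i < ngram.length(); i++) h = h * 131 + ngram.charAt(i);
        return (short) h;
    }
}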
Beyond N-Gram LMs
- Lots of ideas we won't have time to discuss:
  - Caching models: recent words are more likely to appear again (sketched below)
  - Trigger models: recent words trigger other words
  - Topic models
- A few other classes of ideas:
  - Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]
  - Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]
  - Structural zeros: some n-grams are syntactically forbidden, so keep their estimates at zero if they look like real zeros [Mohri and Roark, 06]
  - Bayesian document and IR models [Daume 06]
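Of these, the caching idea is easy to make concrete. Below is a hypothetical sketch, not a specific published model: a unigram cache over the last few hundred words is interpolated with any base model; the interface, window size, and mixing weight are all assumptions.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Illustrative cache LM: mixes a base model with a unigram cache of recent words. */
class CacheLM {
    private final BiFunction<String, String, Double> baseProb; // any base LM, e.g. a smoothed bigram model
    private final Deque<String> recent = new ArrayDeque<>();
    private final Map<String, Integer> cacheCounts = new HashMap<>();
    private final int window;    // how many recent words the cache remembers
    private final double lambda; // weight on the cache component

    CacheLM(BiFunction<String, String, Double> baseProb, int window, double lambda) {
        this.baseProb = baseProb;
        this.window = window;
        this.lambda = lambda;
    }

    /** P(w | h) = (1 - lambda) * P_base(w | h) + lambda * P_cache(w). */
    double prob(String h, String w) {
        if (recent.isEmpty()) return baseProb.apply(h, w);
        double cache = cacheCounts.getOrDefault(w, 0) / (double) recent.size();
        return (1 - lambda) * baseProb.apply(h, w) + lambda * cache;
    }

    /** Call after each observed word so the cache tracks the recent history. */
    void observe(String w) {
        recent.addLast(w);
        cacheCounts.merge(w, 1, Integer::sum);
        if (recent.size() > window) {
            cacheCounts.merge(recent.removeFirst(), -1, Integer::sum);
        }
    }
}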