KenLM: Faster and Smaller Language Model Queries, by Kenneth Heafield


  1. KenLM: Faster and Smaller Language Model Queries
     Kenneth Heafield, heafield@cs.cmu.edu
     Carnegie Mellon
     July 30, 2011
     kheafield.com/code/kenlm

  2. What KenLM Does
     Answer language model queries using less time and memory.
       log p(<s> → iran) = -3.33437
       log p(<s> iran → is) = -1.05931
       log p(<s> iran is → one) = -1.80743
       log p(<s> iran is one → of) = -0.03705
       log p(iran is one of → the) = -0.08317
       log p(is one of the → few) = -1.20788

  4. Related Work
     Downloadable Baselines
       SRI: Popular and considered fast, but high-memory
       IRST: Open source, low-memory, single-threaded
       Rand: Low-memory lossy compression
       MIT: Mostly estimates models but also does queries
     Papers Without Code
       TPT: Better memory locality
       Sheffield: Lossy compression techniques
     After KenLM's Public Release
       Berkeley: Java; slower and larger than KenLM

  7. Why I Wrote KenLM
     Decoding takes too long
       Answer queries quickly
       Load quickly with memory mapping
       Thread-safe
     Bigger models
       Conserve memory
     SRI doesn't compile
       Distribute and compile with decoders

  8. Outline
     1 Backoff Models: State
     2 Data Structures: Probing, Trie, Chop
     3 Results: Perplexity, Translation

  9. Example Language Model
     Unigrams               Bigrams                    Trigrams
     Words  log p  Back     Words     log p  Back      Words        log p
     <s>    -∞     -2.0     <s> iran  -3.3   -1.2      <s> iran is  -1.1
     iran   -4.1   -0.8     iran is   -1.7   -0.4      iran is one  -2.0
     is     -2.5   -1.4     is one    -2.0   -0.9      is one of    -0.3
     one    -3.3   -0.9     one of    -1.4   -0.6
     of     -2.5   -1.1

  10. Example Queries (using the example language model from slide 9)
      Query: <s> iran is
        log p(<s> iran → is) = -1.1
      Query: iran is of
        log p(of)             -2.5
        Backoff(is)           -1.4
        Backoff(iran is)    + -0.4
        log p(iran is → of) = -4.3
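
The two example queries can be reproduced with a short sketch. This is only an illustration of backoff scoring, not KenLM's code: the `MODEL` dictionary layout and the `backoff_score` helper are assumptions made for this example, and the loop tries the longest n-gram first, which is a simplification.

```python
# Example model from the slides: n-gram -> (log probability, backoff weight).
# Trigrams are the highest order here, so their backoff is stored as 0.0.
MODEL = {
    ("<s>",): (float("-inf"), -2.0),
    ("iran",): (-4.1, -0.8),
    ("is",): (-2.5, -1.4),
    ("one",): (-3.3, -0.9),
    ("of",): (-2.5, -1.1),
    ("<s>", "iran"): (-3.3, -1.2),
    ("iran", "is"): (-1.7, -0.4),
    ("is", "one"): (-2.0, -0.9),
    ("one", "of"): (-1.4, -0.6),
    ("<s>", "iran", "is"): (-1.1, 0.0),
    ("iran", "is", "one"): (-2.0, 0.0),
    ("is", "one", "of"): (-0.3, 0.0),
}


def backoff_score(model, context, word):
    """Return log p(word | context) under a backoff model (hypothetical helper)."""
    context = tuple(context)
    penalty = 0.0
    while True:
        entry = model.get(context + (word,))
        if entry is not None:
            return penalty + entry[0]
        if not context:
            raise KeyError(f"unknown word: {word!r}")
        # Charge the backoff weight of the context being dropped (0.0 if absent),
        # then retry with a context that is one word shorter.
        penalty += model.get(context, (0.0, 0.0))[1]
        context = context[1:]


print(backoff_score(MODEL, ["<s>", "iran"], "is"))
print(round(backoff_score(MODEL, ["iran", "is"], "of"), 1))
```

The first call finds the trigram directly; the second falls back to the unigram for "of" and adds Backoff(iran is) and Backoff(is), matching the -1.1 and -4.3 on the slide.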

  13. Lookups Performed by Queries
      Query: <s> iran is
        Lookups: 1. is   2. iran is   3. <s> iran is
        Score:
          log p(<s> iran → is) = -1.1
      Query: iran is of
        Lookups: 1. of   2. is of (not found)   3. is   4. iran is
        State: Backoff(is), Backoff(iran is)
        Score:
          log p(of)             -2.5
          Backoff(is)           -1.4
          Backoff(iran is)    + -0.4
          log p(iran is → of) = -4.3

  15. Stateful Query Pattern
      State: Backoff(<s>)
      log p(<s> → iran) = -3.3
      State: Backoff(iran), Backoff(<s> iran)
      log p(<s> iran → is) = -1.1
      State: Backoff(is), Backoff(iran is)
      log p(iran is → one) = -2.0
      State: Backoff(one), Backoff(is one)
      log p(is one → of) = -0.3
      State: Backoff(of), Backoff(one of)
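
The stateful pattern can be sketched as a function that takes a state and returns a score plus the next state. This is a rough illustration under stated assumptions, not KenLM's actual interface: the `(context, backoffs)` tuple, the `score_with_state` name, and the `MODEL` dictionary are all invented for this example. The state carries the backoff weights of the context, so a query that backs off never has to look the context up again.

```python
ORDER = 3  # highest n-gram order in the example model

# n-gram -> (log probability, backoff weight); trigram backoffs are unused.
MODEL = {
    ("<s>",): (float("-inf"), -2.0),
    ("iran",): (-4.1, -0.8),
    ("is",): (-2.5, -1.4),
    ("one",): (-3.3, -0.9),
    ("of",): (-2.5, -1.1),
    ("<s>", "iran"): (-3.3, -1.2),
    ("iran", "is"): (-1.7, -0.4),
    ("is", "one"): (-2.0, -0.9),
    ("one", "of"): (-1.4, -0.6),
    ("<s>", "iran", "is"): (-1.1, 0.0),
    ("iran", "is", "one"): (-2.0, 0.0),
    ("is", "one", "of"): (-0.3, 0.0),
}


def score_with_state(model, state, word, order=ORDER):
    """Score `word` after `state`; return (log p, next state).

    state = (context words, backoffs), where backoffs[i] is the backoff
    weight of the last i + 1 context words.
    """
    context, backoffs = state
    logp, back = model[(word,)]  # raises KeyError for an unknown word
    new_backoffs = [back]
    matched = 0  # number of context words in the longest n-gram found
    for n in range(1, len(context) + 1):
        ngram = context[len(context) - n:] + (word,)
        entry = model.get(ngram)
        if entry is None:
            break
        logp = entry[0]
        matched = n
        if len(ngram) < order:
            new_backoffs.append(entry[1])
    # Charge stored backoffs for every context length that failed to match.
    logp += sum(backoffs[matched:])
    new_context = (context[len(context) - matched:] + (word,))[-(order - 1):]
    return logp, (new_context, tuple(new_backoffs))


state = (("<s>",), (MODEL[("<s>",)][1],))
for word in ["iran", "is", "one", "of"]:
    logp, state = score_with_state(MODEL, state, word)
    print(word, logp, state)
```

Each printed state holds the same backoff values the slide lists between queries, for example Backoff(is) and Backoff(iran is) after scoring "is".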

  16. Data Structures
      Probing: Fast. Uses hash tables.
      Trie: Small. Uses sorted arrays.
      Chop: Smaller. Trie with compressed pointers.
      Key Subproblem
        Sparse lookup: efficiently retrieve values for sparse keys

  17. Sparse Lookup Speed
      [Figure: lookups/µs versus number of entries (10 to 10^7) for probing, hash set, unordered, interpolation, binary search, and set.]
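
Interpolation search, one of the methods named in the plot, can be sketched as follows. This is a generic textbook version over a sorted list of integers, not the code that was benchmarked; the function name and the sample keys are made up for illustration.

```python
def interpolation_search(keys, target):
    """Return the index of `target` in the sorted list `keys`, or None."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal, and target lies between them.
            return lo
        # Guess the position by assuming keys are spread evenly over the range.
        pivot = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pivot] == target:
            return pivot
        if keys[pivot] < target:
            lo = pivot + 1
        else:
            hi = pivot - 1
    return None


sample_keys = [2, 9, 14, 21, 30, 47, 58]  # hypothetical sorted keys
print(interpolation_search(sample_keys, 21))
print(interpolation_search(sample_keys, 22))
```

Where binary search always probes the midpoint, this version estimates where the key should sit from its value, which tends to take fewer probes when keys are roughly uniformly distributed.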

  19. Linear Probing Hash Table
      Store 64-bit hashes and ignore collisions.
      Bigrams
      Words     Hash                log p  Back
      <s> iran  0xf0ae9c2442c6920e  -3.3   -1.2
      iran is   0x959e48455f4a2e90  -1.7   -0.4
      is one    0x186a7caef34acf16  -2.0   -0.9
      one of    0xac66610314db8dac  -1.4   -0.6

  20. Linear Probing Hash Table
      1.5 buckets/entry (so buckets = 6).
      Ideal bucket = hash mod buckets.
      Resolve bucket collisions using the next free bucket.
      Bigrams (array)
      Words     Ideal  Hash                log p  Back
      iran is   0      0x959e48455f4a2e90  -1.7   -0.4
                       0x0                 0      0
      is one    2      0x186a7caef34acf16  -2.0   -0.9
      one of    2      0xac66610314db8dac  -1.4   -0.6
      <s> iran  4      0xf0ae9c2442c6920e  -3.3   -1.2
                       0x0                 0      0
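
The bucket layout above can be rebuilt with a small linear probing table. This is a minimal sketch, not KenLM's implementation: the `ProbingTable` class and its method names are hypothetical, and a hash of 0 is assumed to mark an empty bucket, as the `0x0` rows on the slide suggest. The hashes are the ones shown on the slides.

```python
EMPTY = 0  # assumed marker for an unused bucket


class ProbingTable:
    """Linear probing table keyed by 64-bit hash (illustrative sketch)."""

    def __init__(self, entries, multiplier=1.5):
        # 1.5 buckets per entry leaves free buckets, so probing always stops.
        self.buckets = [(EMPTY, 0.0, 0.0)] * int(entries * multiplier)

    def insert(self, key_hash, logp, back):
        i = key_hash % len(self.buckets)  # ideal bucket
        while self.buckets[i][0] != EMPTY:
            i = (i + 1) % len(self.buckets)  # next bucket, wrapping around
        self.buckets[i] = (key_hash, logp, back)

    def find(self, key_hash):
        i = key_hash % len(self.buckets)
        while True:
            stored, logp, back = self.buckets[i]
            if stored == key_hash:
                return logp, back
            if stored == EMPTY:
                return None  # reached a free bucket: key is absent
            i = (i + 1) % len(self.buckets)


BIGRAMS = [
    (0xF0AE9C2442C6920E, -3.3, -1.2),  # <s> iran
    (0x959E48455F4A2E90, -1.7, -0.4),  # iran is
    (0x186A7CAEF34ACF16, -2.0, -0.9),  # is one
    (0xAC66610314DB8DAC, -1.4, -0.6),  # one of
]

table = ProbingTable(len(BIGRAMS))
for key_hash, logp, back in BIGRAMS:
    table.insert(key_hash, logp, back)

for index, (key_hash, logp, back) in enumerate(table.buckets):
    print(index, hex(key_hash), logp, back)
```

"is one" and "one of" both hash to bucket 2, so "one of" lands in bucket 3, the next free one. Only the hash is stored, not the words, which is what "ignore collisions" refers to: two different n-grams with the same 64-bit hash would be indistinguishable.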

  21. Probing Data Structure
      Storage for each order of the example language model from slide 9:
        Unigrams: Array
        Bigrams: Probing Hash Table
        Trigrams: Probing Hash Table

  22. Probing Hash Table Summary
      Hash tables are fast, but memory is 24 bytes/entry.
      Next: Saving memory with Trie.

  23. Trie Uses Sorted Arrays
      Sort in suffix order.
      Unigrams               Bigrams                    Trigrams
      Words  log p  Back     Words     log p  Back      Words        log p
      <s>    -∞     -2.0     <s> iran  -3.3   -1.2      <s> iran is  -1.1
      iran   -4.1   -0.8     iran is   -1.7   -0.4      <s> one is   -2.3
      is     -2.5   -1.4     one is    -2.3   -0.3      iran is one  -2.0
      one    -3.3   -0.9     <s> one   -2.3   -1.1      <s> one of   -0.5
      of     -2.5   -1.1     is one    -2.0   -0.9      is one of    -0.3
                             one of    -1.4   -0.6

  24. Trie
      Sort in suffix order. Encode suffix using pointers.
      Unigrams (array)            Bigrams (array)                 Trigrams (array)
      Words  log p  Back  Ptr     Words     log p  Back  Ptr      Words        log p
      <s>    -∞     -2.0  0       <s> iran  -3.3   -1.2  0        <s> iran is  -1.1
      iran   -4.1   -0.8  0       <s> is    -2.9   -1.0  0        <s> one is   -2.3
      is     -2.5   -1.4  1       iran is   -1.7   -0.4  0        iran is one  -2.0
      one    -3.3   -0.9  4       one is    -2.3   -0.3  1        <s> one of   -0.5
      of     -2.5   -1.1  6       <s> one   -2.3   -1.1  2        is one of    -0.3
                          7       is one    -2.0   -0.9  2
                                  one of    -1.4   -0.6  3
                                                         5
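
A lookup in this layout can be sketched by walking from the last word of the n-gram backwards. The sketch below is an assumption-laden illustration, not KenLM's code: the word ids, the parallel lists, and the `find_in_range` and `lookup` helpers are invented here, and plain binary search from Python's `bisect` module stands in for whatever search the real trie uses. Each `Ptr` gives the start of a record's extensions in the next array, and the following record's `Ptr` gives the end.

```python
import bisect

# Hypothetical word ids in the order the unigram array lists them.
VOCAB = {"<s>": 0, "iran": 1, "is": 2, "one": 3, "of": 4}

UNIGRAM_LOGP = [float("-inf"), -4.1, -2.5, -3.3, -2.5]
UNIGRAM_PTR = [0, 0, 1, 4, 6, 7]  # final 7 marks the end of the bigram array

# Bigrams in suffix order; BIGRAM_WORD holds the id of the preceding word.
BIGRAM_WORD = [0, 0, 1, 3, 0, 2, 3]
BIGRAM_LOGP = [-3.3, -2.9, -1.7, -2.3, -2.3, -2.0, -1.4]
BIGRAM_PTR = [0, 0, 0, 1, 2, 2, 3, 5]  # final 5 marks the end of the trigrams

# Trigrams in suffix order; TRIGRAM_WORD holds the id of the first word.
TRIGRAM_WORD = [0, 0, 1, 0, 2]
TRIGRAM_LOGP = [-1.1, -2.3, -2.0, -0.5, -0.3]


def find_in_range(words, word_id, begin, end):
    """Binary search for word_id in the sorted slice words[begin:end]."""
    i = bisect.bisect_left(words, word_id, begin, end)
    return i if i < end and words[i] == word_id else None


def lookup(ngram):
    """Return log p stored for an n-gram of up to three words, else None."""
    ids = [VOCAB[w] for w in ngram]
    last = ids[-1]
    if len(ids) == 1:
        return UNIGRAM_LOGP[last]
    # Bigrams ending in `last` occupy [UNIGRAM_PTR[last], UNIGRAM_PTR[last + 1]).
    i = find_in_range(BIGRAM_WORD, ids[-2], UNIGRAM_PTR[last], UNIGRAM_PTR[last + 1])
    if i is None:
        return None
    if len(ids) == 2:
        return BIGRAM_LOGP[i]
    # Trigrams extending bigram i occupy [BIGRAM_PTR[i], BIGRAM_PTR[i + 1]).
    j = find_in_range(TRIGRAM_WORD, ids[-3], BIGRAM_PTR[i], BIGRAM_PTR[i + 1])
    return None if j is None else TRIGRAM_LOGP[j]


print(lookup(["is", "one", "of"]))
print(lookup(["iran", "is", "of"]))
```

Looking up "is one of" starts at "of", searches bigram range [6, 7) for "one", then searches trigram range [3, 5) for "is". "iran is of" fails at the bigram step because "is of" is not in the model.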
