Statistical Natural Language Processing: A refresher on information theory
Çağrı Çöltekin, Seminar für Sprachwissenschaft, University of Tübingen, Summer Semester 2017


  1. Statistical Natural Language Processing: A refresher on information theory. Çağrı Çöltekin, University of Tübingen, Seminar für Sprachwissenschaft, Summer Semester 2017.

  2. Information theory
     • Information theory is concerned with measurement, storage and transmission of information.
     • It has its roots in communication theory, but is applied to many different fields, including NLP.
     • We will revisit some of the major concepts.

  3. Information theory: Noisy channel model
     a → encoder → 10010010 → noisy channel → 10000010 → decoder → a
     • We want codes that are efficient: we do not want to waste the channel bandwidth.
     • We want codes that are resilient to errors: we want to be able to detect and correct errors.
     • This simple model has many applications in NLP, including speech recognition and machine translation.
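
The channel picture is easy to play with in code. The sketch below is mine, not code from the course: it simulates a binary symmetric channel that flips each bit with some probability, and shows how a simple repetition code lets the decoder detect and correct most single-bit errors. All names are illustrative.

```python
import random

def noisy_channel(bits, p_flip=0.1):
    """Binary symmetric channel: flip each bit independently with probability p_flip."""
    return [b ^ (random.random() < p_flip) for b in bits]

def encode_repetition(bits, n=3):
    """A very simple redundant code: repeat every bit n times."""
    return [b for b in bits for _ in range(n)]

def decode_repetition(bits, n=3):
    """Majority vote over each block of n repeated bits."""
    return [int(sum(bits[i:i + n]) > n / 2) for i in range(0, len(bits), n)]

message = [1, 0, 0, 1, 0, 0, 1, 0]   # the 8-bit word from the slide
received = noisy_channel(message)     # may arrive with errors
decoded = decode_repetition(noisy_channel(encode_repetition(message)))
print(message, received, decoded)     # decoded usually matches message
```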

  4. Information theory: Coding example (one-hot representation of an eight-letter alphabet)
     letter  code
     a       00000001
     b       00000010
     c       00000100
     d       00001000
     e       00010000
     f       00100000
     g       01000000
     h       10000000
     • We can encode an 8-letter alphabet with 8 bits using one-hot coding.
     • Can we do better than a one-hot representation?

  5.-6. Information theory: Coding example (binary coding of an eight-letter alphabet)
     letter  code
     a       00000000
     b       00000001
     c       00000010
     d       00000011
     e       00000100
     f       00000101
     g       00000110
     h       00000111
     • We can encode an 8-letter alphabet with 8 bits using one-hot coding.
     • Can we do better than a one-hot representation? Yes: the table shows the same alphabet coded by counting in binary.
     • Can we do even better?
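
As a quick sanity check (my own sketch, not from the slides): a one-hot code spends one dedicated bit per letter, while an ordinary binary code needs only ceil(log₂ 8) = 3 bits per letter. The letter-to-code assignment is arbitrary, as it is on the slides.

```python
import math

alphabet = "abcdefgh"

# One-hot: one dedicated bit per letter, so 8 bits for an 8-letter alphabet.
one_hot = {c: format(1 << i, "08b") for i, c in enumerate(alphabet)}

# Plain binary counting: ceil(log2(8)) = 3 bits are enough.
n_bits = math.ceil(math.log2(len(alphabet)))
binary = {c: format(i, f"0{n_bits}b") for i, c in enumerate(alphabet)}

print(one_hot["c"], binary["c"])   # e.g. 00000100 010
```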

  7. Information theory: Self-information / surprisal
     Self-information (or surprisal) associated with an event x is
     I(x) = log (1 / P(x)) = − log P(x)
     • If the event is certain, the information (or surprise) associated with it is 0.
     • Low-probability (surprising) events have higher information content.
     • The base of the log determines the unit of information: base 2 gives bits, base e gives nats, base 10 gives dits (bans, hartleys).
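
A small helper (my own sketch, not course code) makes the unit question concrete: the same probability gives different numbers depending on the log base.

```python
import math

def surprisal(p, base=2):
    """Self-information I(x) = log(1 / P(x)); the base sets the unit (2: bits, e: nats, 10: dits)."""
    return math.log(1 / p, base)

print(surprisal(1.0))               # 0.0: a certain event carries no information
print(surprisal(0.5))               # 1.0 bit
print(surprisal(1 / 8))             # about 3 bits: rarer events are more surprising
print(surprisal(0.5, base=math.e))  # about 0.69 nats
```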

  8. Information theory: Why log?
     • Reminder: logarithms transform exponential relations into linear relations.
     • In most systems, a linear increase in capacity increases the number of possible outcomes exponentially.
       – The number of possible strings you can fit into two pages is exponentially more than into one page.
       – But we expect the information to double, not to increase exponentially.
     • Working with logarithms is mathematically and computationally more suitable.
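
A tiny illustration of that point (my own, with made-up numbers): the count of possible strings explodes exponentially with length, but its logarithm, which is what information measures, grows linearly.

```python
import math

alphabet_size = 26
for n_chars in (10, 20, 40):               # think "one page, two pages, four pages"
    n_strings = alphabet_size ** n_chars   # grows exponentially with length
    print(n_chars, n_strings, round(math.log2(n_strings), 1))
# Doubling the length squares the number of strings, but only doubles log2(n_strings).
```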

  9. Information theory: Entropy
     Entropy is a measure of the uncertainty of a random variable:
     H(X) = − Σ_x P(x) log P(x)
     • Entropy is the lower bound on the best average code length, given the distribution P that generates the data.
     • Entropy is average surprisal: H(X) = E[− log P(X)]
     • It generalizes to continuous distributions as well (replace the sum with an integral).
     Note: entropy is about a distribution, while self-information is about individual events.
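
The definition translates directly into code; this is a minimal sketch of my own, using base-2 logs so the result is in bits.

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x P(x) log P(x); outcomes with zero probability contribute nothing."""
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([1.0]))        # 0.0: a certain outcome, no uncertainty
```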

  10. Information theory: Example: entropy of a Bernoulli distribution
     [Figure: H(X) in bits plotted against P(X = 1); the curve rises from 0 at P(X = 1) = 0 to a maximum of 1 bit at P(X = 1) = 0.5 and falls back to 0 at P(X = 1) = 1.]
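
The curve on the slide can be reproduced with a few lines (again a sketch of my own):

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) random variable."""
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, round(bernoulli_entropy(p), 2))
# Symmetric around p = 0.5, where the uncertainty (1 bit) is largest.
```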

  11.-13. Information theory: Entropy demonstration
     Increasing the number of (equally likely) outcomes increases the entropy:
     • one outcome:   H = − log 1 = 0
     • two outcomes:  H = − (1/2) log₂ (1/2) − (1/2) log₂ (1/2) = 1
     • four outcomes: H = − (1/4) log₂ (1/4) − (1/4) log₂ (1/4) − (1/4) log₂ (1/4) − (1/4) log₂ (1/4) = 2
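
For uniform distributions the sum collapses to a closed form, H = log₂ n, which gives exactly the 0, 1 and 2 bits above; a quick check of my own:

```python
import math

# Uniform over n outcomes: H = n * (1/n) * log2(n) = log2(n).
for n in (1, 2, 4, 8):
    print(n, math.log2(n))   # 0.0, 1.0, 2.0, 3.0 bits
```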

  14.-17. Information theory: Entropy demonstration (continued)
     The distribution matters: for a fixed set of outcomes, the uniform distribution has the highest entropy. The three example distributions on the slide have entropies of 2, 1.47 and 0.97 bits; the more concentrated the probability mass, the lower the entropy.
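
Plugging a few four-outcome distributions into the entropy formula shows the effect; the skewed distributions here are illustrative examples of my own, not necessarily the exact ones plotted on the slide.

```python
import math

def entropy_bits(probs):
    # H in bits; skip zero-probability outcomes.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: uniform, maximal uncertainty
print(entropy_bits([0.70, 0.10, 0.10, 0.10]))   # about 1.36 bits: skewed, less uncertain
print(entropy_bits([0.97, 0.01, 0.01, 0.01]))   # about 0.24 bits: nearly deterministic
```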

  18.-19. Information theory: Back to coding letters
     letter  prob  code
     a       1/8   000
     b       1/8   001
     c       1/8   010
     d       1/8   011
     e       1/8   100
     f       1/8   101
     g       1/8   110
     h       1/8   111
     • Can we do better? No. H = 3 bits, and we need 3 bits on average.
     • The uniform distribution has the maximum uncertainty, hence the maximum entropy.

  20.-21. Information theory: Back to coding letters (continued)
     letter  prob  code
     a       1/2   0
     b       1/4   10
     c       1/8   110
     d       1/16  1110
     e       1/64  111100
     f       1/64  111101
     g       1/64  111110
     h       1/64  111111
     • If the probabilities were different, could we do better?
     • Yes. Now H = 2 bits, and we need 2 bits on average.
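
A quick check of the claim, using the probabilities and the prefix code from the table above (which letter gets which row is immaterial for the point): both the entropy and the expected code length come out to exactly 2 bits, versus 3 bits for the uniform case.

```python
import math

probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16,
         "e": 1/64, "f": 1/64, "g": 1/64, "h": 1/64}
code = {"a": "0", "b": "10", "c": "110", "d": "1110",
        "e": "111100", "f": "111101", "g": "111110", "h": "111111"}

entropy = sum(p * math.log2(1 / p) for p in probs.values())
avg_length = sum(probs[c] * len(code[c]) for c in probs)
print(entropy, avg_length)   # 2.0 2.0
```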

  22. Information theory: Entropy of your random numbers
     [Figure: histogram of the relative frequencies of the numbers 1-20 picked by the class, ranging from 0 to about 0.21.]
     • Entropy of the distribution: H = −(0.04 × log₂ 0.04 + 0.11 × log₂ 0.11 + ... + 0.11 × log₂ 0.11) = 3.63
     • If it was uniformly distributed, the entropy would be log₂ 20 ≈ 4.32 bits.
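
This is how such a number is computed from raw data; the list of picks below is made up for illustration, but the pattern is typical: human-picked "random" numbers are far from uniform, so their entropy falls short of the uniform bound.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Entropy in bits of the empirical (relative-frequency) distribution of the samples."""
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

picks = [7, 7, 7, 13, 13, 17, 17, 17, 17, 3,    # hypothetical survey answers
         3, 11, 19, 5, 7, 13, 17, 2, 9, 17]
print(empirical_entropy(picks))   # well below the uniform bound ...
print(math.log2(20))              # ... of log2(20), about 4.32 bits
```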
