
Mathematical Foundations, Foundations of Statistical Natural Language Processing - PowerPoint PPT Presentation



1. Mathematical Foundations
Foundations of Statistical Natural Language Processing, Chapter 2
Presented by Jen-Wei Kuo (郭人瑋), CSIE, NTNU, rogerkuo@csie.ntnu.edu.tw

2. References
• A First Course in Probability, Sheldon Ross
• Probability and Random Processes for Electrical Engineering, Alberto Leon-Garcia

3. Outline
• Elementary Probability Theory
  – Probability spaces
  – Conditional probability and independence
  – Bayes' theorem
  – Random variables
  – Expectation and variance
  – Joint and conditional distributions
  – Gaussian distributions
• Essential Information Theory
  – Entropy
  – Joint entropy and conditional entropy
  – Mutual information
  – Relative entropy or Kullback-Leibler divergence

4. Essential Information Theory: Entropy
• Entropy measures the amount of information in a random variable. It is normally measured in bits.
  H(X) = -∑_{x∈X} p(x) log₂ p(x)
• We define 0 log₂ 0 = 0.

5. Essential Information Theory: Entropy
• Example: Suppose you are reporting the result of rolling an 8-sided die. Then the entropy is:
  H(X) = -∑_{i=1}^{8} p(i) log p(i) = -∑_{i=1}^{8} (1/8) log (1/8) = -log (1/8) = log 8 = 3 bits
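The following is a minimal Python sketch, not part of the original slides, that computes the entropy of a discrete pmf in bits and reproduces the 3-bit result for the 8-sided die:

    import math

    def entropy(pmf):
        """H(X) = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
        return -sum(p * math.log2(p) for p in pmf if p > 0)

    # Uniform 8-sided die: H = log2(8) = 3 bits, matching the slide.
    die = [1/8] * 8
    print(entropy(die))  # 3.0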

6. Essential Information Theory: Entropy
• Entropy is the average amount of information needed to transmit the outcome; when we build a system, we want the entropy to be as low as possible.
• When transmitting a probability, since a probability never exceeds 1, we only need to transmit the value of the denominator.

7. Essential Information Theory: Entropy
• Properties of entropy:
  H(X) = -∑_{x∈X} p(x) log₂ p(x)
       = ∑_{x∈X} p(x) log₂ (1/p(x))
       = E[ log₂ (1/p(X)) ]

8. Essential Information Theory: Joint Entropy and Conditional Entropy
• Joint entropy:
  H(X, Y) = -∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
• Conditional entropy:
  H(Y | X) = -∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y | x)

9. Essential Information Theory: Joint Entropy and Conditional Entropy
• Proof of the conditional entropy formula:
  H(Y | X) = ∑_{x∈X} p(x) H(Y | X = x)
           = ∑_{x∈X} p(x) [ -∑_{y∈Y} p(y | x) log p(y | x) ]
           = -∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y | x)

10. Essential Information Theory: Joint Entropy and Conditional Entropy
• Chain rule for entropy: H(X, Y) = H(X) + H(Y | X)
• Proof:
  H(X, Y) = -∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
          = -∑_{x∈X} ∑_{y∈Y} p(x, y) log ( p(y | x) p(x) )
          = -∑_{x∈X} ∑_{y∈Y} p(x, y) [ log p(y | x) + log p(x) ]
          = -∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y | x) - ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x)
          = H(Y | X) + H(X)
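As a numeric check, here is a minimal Python sketch, not from the slides, that verifies the chain rule H(X, Y) = H(X) + H(Y | X) on a small made-up joint pmf:

    import math

    # Invented joint pmf p(x, y) over X = {a, b} and Y = {0, 1}.
    p_xy = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

    def H_joint(p):
        return -sum(v * math.log2(v) for v in p.values() if v > 0)

    # Marginal p(x).
    p_x = {}
    for (x, _), v in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + v

    H_X = -sum(v * math.log2(v) for v in p_x.values())
    # H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
    H_Y_given_X = -sum(v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items() if v > 0)

    print(H_joint(p_xy), H_X + H_Y_given_X)  # the two values agree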

11. Essential Information Theory: Mutual Information
[Diagram: the relationship between H(X), H(Y), H(X, Y), H(X | Y), H(Y | X), and I(X; Y)]
  I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)

12. Essential Information Theory: Mutual Information
• This difference is called the mutual information between X and Y.
• It is the amount of information one random variable contains about another.
• It is 0 only when the two variables are independent; in other words, the mutual information of two independent random variables is 0.

13. Essential Information Theory: Mutual Information
• How do we calculate mutual information in practice?
  I(X; Y) = H(X) - H(X | Y)
          = H(X) + H(Y) - H(X, Y)
          = ∑_x p(x) log (1/p(x)) + ∑_y p(y) log (1/p(y)) + ∑_{x,y} p(x, y) log p(x, y)
          = ∑_{x,y} p(x, y) log (1/p(x)) + ∑_{x,y} p(x, y) log (1/p(y)) + ∑_{x,y} p(x, y) log p(x, y)
          = ∑_{x,y} p(x, y) [ log (1/p(x)) + log (1/p(y)) - log (1/p(x, y)) ]
          = ∑_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )
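The sketch below (again with an invented joint pmf, not from the slides) computes I(X; Y) both from the entropies and directly from the final sum, confirming that the two forms agree:

    import math

    p_xy = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}
    p_x, p_y = {}, {}
    for (x, y), v in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + v   # marginal p(x)
        p_y[y] = p_y.get(y, 0.0) + v   # marginal p(y)

    H = lambda d: -sum(v * math.log2(v) for v in d.values() if v > 0)

    I_from_entropies = H(p_x) + H(p_y) - H(p_xy)
    I_direct = sum(v * math.log2(v / (p_x[x] * p_y[y]))
                   for (x, y), v in p_xy.items() if v > 0)
    print(I_from_entropies, I_direct)  # both give the same value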

14. Essential Information Theory: Mutual Information
• Define the pointwise mutual information between two particular points:
  I(x, y) = log ( p(x, y) / (p(x) p(y)) )
• This has sometimes been used as a measure of association between elements.
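A minimal sketch of pointwise mutual information for one pair (x, y), estimated from invented co-occurrence counts (none of these numbers come from the slides):

    import math

    N = 1000                                # total number of observed pairs
    count_x, count_y, count_xy = 60, 50, 20 # marginal and joint counts for x and y

    pmi = math.log2((count_xy / N) / ((count_x / N) * (count_y / N)))
    print(pmi)  # log2(0.02 / 0.003), about 2.74 bits: x and y co-occur more than chance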

15. Essential Information Theory: Relative Entropy or Kullback-Leibler Divergence
• For two probability mass functions p(x) and q(x), their relative entropy is given by:
  D(p || q) = ∑_{x∈X} p(x) log ( p(x) / q(x) )
• We define 0 log (0/q) = 0 and p log (p/0) = ∞.
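Here is a minimal Python sketch (not part of the slides) of relative entropy in bits, following the conventions above:

    import math

    def kl_divergence(p, q):
        d = 0.0
        for px, qx in zip(p, q):
            if px == 0:
                continue            # 0 log (0/q) = 0
            if qx == 0:
                return math.inf     # p log (p/0) = infinity
            d += px * math.log2(px / qx)
        return d

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # > 0; equals 0 only when p == q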

16. Essential Information Theory: Relative Entropy or Kullback-Leibler Divergence
• Meaning: it is the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
• Some authors use the name "KL distance", but note that relative entropy is not a metric (it does not satisfy the triangle inequality).

17. Essential Information Theory: Relative Entropy or Kullback-Leibler Divergence
• Properties of the KL divergence:
  I(X; Y) = ∑_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) ) = D( p(x, y) || p(x) p(y) )
• Define the conditional relative entropy:
  D( p(y | x) || q(y | x) ) = ∑_x p(x) ∑_y p(y | x) log ( p(y | x) / q(y | x) )

19. Essential Information Theory: The Noisy Channel Model
[Diagram: message W → Encoder → X (input to channel, over a finite alphabet) → Channel p(y | x) → Y (output from channel) → Decoder → Ŵ, an attempt to reconstruct the message based on the output]

20. Essential Information Theory: The Noisy Channel Model
[Diagram: a binary symmetric channel, with inputs 0 and 1 mapped to outputs 0 and 1]

21. Essential Information Theory: The Noisy Channel Model
• Capacity: the channel capacity describes the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output.
  C = max_{p(X)} I(X; Y)
• For the binary symmetric channel with crossover probability p:
  C = max_{p(X)} [ H(Y) - H(Y | X) ] = max_{p(X)} H(Y) - H(p) = 1 - H(p)
  0 ≤ C ≤ 1
  If p = 0 or p = 1, then C = 1; if p = 1/2, then C = 0.
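A minimal Python sketch (not from the slides) of the binary symmetric channel capacity C = 1 - H(p), where H(p) is the binary entropy in bits:

    import math

    def binary_entropy(p):
        if p in (0.0, 1.0):
            return 0.0   # 0 log 0 = 0 by convention
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def bsc_capacity(p):
        return 1.0 - binary_entropy(p)

    print(bsc_capacity(0.0))   # 1.0 : noiseless channel
    print(bsc_capacity(0.5))   # 0.0 : output carries no information about the input
    print(bsc_capacity(0.1))   # about 0.531 bits per channel use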

22. Essential Information Theory: The Noisy Channel Model
• Application (in speech recognition):
  Input: word sequences
  Output: observed speech signal
  p(input): probability of the word sequence
  p(output | input): acoustic model (channel probability)
• By Bayes' theorem:
  Î = argmax_i p(i | o) = argmax_i p(i) p(o | i) / p(o) = argmax_i p(i) p(o | i)
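The following toy sketch of the decision rule argmax_i p(i) p(o | i) is purely illustrative; the candidate word sequences and all probabilities are invented and are not part of the slides:

    # Toy prior p(i) over candidate word sequences and toy p(o|i) for one observed signal o.
    prior = {'recognize speech': 0.6, 'wreck a nice beach': 0.4}
    likelihood = {'recognize speech': 0.2, 'wreck a nice beach': 0.25}

    # Pick the candidate maximizing p(i) * p(o|i); p(o) is constant and can be dropped.
    best = max(prior, key=lambda i: prior[i] * likelihood[i])
    print(best)  # 'recognize speech' (0.6 * 0.2 = 0.12 > 0.4 * 0.25 = 0.10)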

23. Essential Information Theory: Cross Entropy
• The cross entropy between a random variable X with true probability distribution p(X) and another pmf q (normally a model of p) is given by:
  H(X, q) = H(X) + D(p || q)
          = ∑_{x∈X} p(x) log (1/p(x)) + ∑_{x∈X} p(x) log ( p(x) / q(x) )
          = ∑_{x∈X} p(x) [ log (1/p(x)) + log ( p(x) / q(x) ) ]
          = ∑_{x∈X} p(x) log (1/q(x))
          = -∑_{x∈X} p(x) log q(x)
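A minimal sketch (with made-up distributions, not from the slides) computing H(X, q) = -∑ p(x) log₂ q(x) and checking that it equals H(X) + D(p || q):

    import math

    p = [0.5, 0.25, 0.25]   # true distribution
    q = [0.4, 0.4, 0.2]     # model of p

    cross = -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)
    H_p = -sum(px * math.log2(px) for px in p if px > 0)
    kl = sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)
    print(cross, H_p + kl)  # both values agree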

24. Essential Information Theory: Cross Entropy
• Cross entropy of a language L = (X_i) ~ p(x) according to a model m:
  H(L, m) = -lim_{n→∞} (1/n) ∑_{x_{1n}} p(x_{1n}) log m(x_{1n})
• We cannot calculate this quantity without knowing p. But if we make certain assumptions that the language is 'nice', then the cross entropy for the language can be calculated as:
  H(L, m) = -lim_{n→∞} (1/n) log m(x_{1n})

25. Essential Information Theory: Cross Entropy
• We do not actually attempt to calculate the limit, but approximate it by computing the value for a sufficiently large n:
  H(L, m) ≈ -(1/n) log m(x_{1n})
• This measure is just the figure for our average surprise. Our goal will be to try to minimize this number. Because H(X) is fixed, this is equivalent to minimizing the relative entropy, which is a measure of how much our probability distribution departs from actual language use.
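To make the approximation concrete, here is a minimal sketch that evaluates -(1/n) log₂ m(x_{1n}) on a sample; the toy unigram model m and the word sequence are invented for illustration and assume the model scores words independently:

    import math

    m = {'the': 0.5, 'cat': 0.2, 'sat': 0.2, 'mat': 0.1}   # toy model m(w)
    sample = ['the', 'cat', 'sat', 'the', 'mat']            # observed sequence x_1..n

    log_prob = sum(math.log2(m[w]) for w in sample)         # log2 m(x_1..n) under the toy model
    print(-log_prob / len(sample))                          # average surprise, in bits per word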
