Probability, Entropy, and Inference

Based on David J.C. MacKay: Information Theory, Inference and Learning Algorithms, 2003, Chapter 2

Juha Raitio, juha.raitio@iki.fi
HUT T-61.182 Information Theory and Machine Learning
5th February 2004


Outline

1. On notation of probabilities
2. Meaning of probability
3. Forward and inverse probabilities
4. Probabilistic inference
5. Shannon information and entropy
6. On convexity of functions
7. Exercises


Ensembles and probabilities

• An ensemble X is a triple (x, A_X, P_X), where
  – x is the outcome of a random variable
  – A_X = {a_1, a_2, ..., a_I} are the possible values for x
  – P_X = {p_1, p_2, ..., p_I} are the probabilities of the outcomes, P(x = a_i) = p_i
  – p_i ≥ 0
  – Σ_{a_i ∈ A_X} P(x = a_i) = 1
• P(x = a_i) may be written as P(a_i) or P(x)
• Probability of a subset T of A_X:

      P(T) = P(x ∈ T) = Σ_{a_i ∈ T} P(x = a_i)                                        (1)


Joint ensembles and marginal probabilities

• Joint ensemble XY
  – The outcome is an ordered pair x, y (or xy)
  – Possible values A_X = {a_1, a_2, ..., a_I} and A_Y = {b_1, b_2, ..., b_J}
  – Joint probability P(x, y)
• Marginal probabilities

      P(x = a_i) ≡ Σ_{y ∈ A_Y} P(x = a_i, y)                                          (2)

      P(y) ≡ Σ_{x ∈ A_X} P(x, y)                                                      (3)
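• As an illustration of the notation above, here is a minimal Python sketch of a discrete ensemble and of the subset and marginalisation rules (1)-(3). The outcome alphabets and all probability values are made-up illustrative numbers, not taken from MacKay.

      # A minimal sketch of an ensemble and of marginalisation, eqs. (1)-(3).
      # All outcomes and probabilities are made-up illustrative values.

      A_X = ["a", "b", "c", "d"]
      P_X = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

      assert all(p >= 0 for p in P_X.values())            # p_i >= 0
      assert abs(sum(P_X.values()) - 1.0) < 1e-12         # probabilities sum to 1

      def prob_subset(T, P):
          """P(T) = P(x in T) = sum of P(x = a_i) over a_i in T, eq. (1)."""
          return sum(P[a] for a in T)

      print(prob_subset({"a", "c"}, P_X))                 # 0.625

      # A joint ensemble XY stored as a dictionary keyed by (x, y) pairs.
      P_XY = {("a", 0): 0.3, ("a", 1): 0.2,
              ("b", 0): 0.1, ("b", 1): 0.15,
              ("c", 0): 0.05, ("c", 1): 0.2}

      def marginal_x(P_joint):
          """P(x) = sum over y of P(x, y), eq. (2)."""
          P = {}
          for (x, y), p in P_joint.items():
              P[x] = P.get(x, 0.0) + p
          return P

      print(marginal_x(P_XY))                             # {'a': 0.5, 'b': 0.25, 'c': 0.25}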
Conditioning rules

• Conditional probability

      P(x = a_i | y = b_j) ≡ P(x = a_i, y = b_j) / P(y = b_j),   if P(y = b_j) ≠ 0    (4)

• Assumptions H: "the probability that x equals a_i, given H"
• Product (chain) rule

      P(x, y | H) = P(x | y, H) P(y | H) = P(y | x, H) P(x | H)                       (5)

• Sum rule

      P(x | H) = Σ_y P(x, y | H) = Σ_y P(x | y, H) P(y | H)                           (6)


Bayes' theorem and independence

• Bayes' theorem

      P(y | x, H) = P(x | y, H) P(y | H) / P(x | H)                                   (7)
                  = P(x | y, H) P(y | H) / Σ_{y'} P(x | y', H) P(y' | H)              (8)

• Two random variables X and Y are independent (X ⊥ Y) if and only if

      P(x, y) = P(x) P(y)                                                             (9)


Two meanings for probability

• Frequentist view of probability
  – Probabilities are frequencies of outcomes in random experiments
  – Probabilities describe random variables
• Bayesian view of probability
  – Probabilities are degrees of belief in propositions
  – Probabilities describe assumptions, and inferences given those assumptions
  – Subjective interpretation of probability
  – "You cannot do inference without making assumptions"


Forward and inverse probabilities

• Assume a generative model describing the process that gives rise to some data
• Forward probability
  – The task is to compute the probability distribution of some quantity that depends on the data
• Inverse probability
  – The task is to compute the probability distribution of unobserved variables given the data
  – Requires the use of Bayes' theorem (see the numerical sketch below)
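• As a sanity check of the conditioning rules and Bayes' theorem (4)-(8), here is a small Python sketch that computes P(x | y) and P(y | x) from a joint table. The joint distribution P_XY is invented purely for illustration.

      # A numerical check of the conditioning rules and Bayes' theorem, eqs. (4)-(8).
      # The joint distribution below is a made-up example.

      P_XY = {("a", 0): 0.3, ("a", 1): 0.2,
              ("b", 0): 0.1, ("b", 1): 0.15,
              ("c", 0): 0.05, ("c", 1): 0.2}

      def marginal(P_joint, axis):
          """Sum rule: marginalise the joint distribution onto one variable."""
          P = {}
          for key, p in P_joint.items():
              P[key[axis]] = P.get(key[axis], 0.0) + p
          return P

      P_X = marginal(P_XY, 0)
      P_Y = marginal(P_XY, 1)

      def cond_x_given_y(x, y):
          """P(x | y) = P(x, y) / P(y), defined only when P(y) != 0, eq. (4)."""
          return P_XY[(x, y)] / P_Y[y]

      def bayes_y_given_x(x, y):
          """P(y | x) = P(x | y) P(y) / P(x), eq. (7)."""
          return cond_x_given_y(x, y) * P_Y[y] / P_X[x]

      # Compare with the direct computation of P(y | x) from the joint.
      assert abs(bayes_y_given_x("a", 1) - P_XY[("a", 1)] / P_X["a"]) < 1e-12

      # Independence test, eq. (9): here X and Y are *not* independent.
      print(P_XY[("a", 0)], P_X["a"] * P_Y[0])            # 0.3 vs 0.225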
Inference with inverse probabilities

• Inference on parameters θ given data D and hypothesis H, by Bayes' theorem:

      P(θ | D, H) = P(D | θ, H) P(θ | H) / P(D | H)                                   (10)

  where
  – P(θ | H) is the prior probability of the parameters
  – P(D | θ, H) is the likelihood of the parameters given the data
  – P(D | H) is the evidence
  – P(θ | D, H) is the posterior probability of the parameters
• In words:

      posterior = likelihood × prior / evidence                                        (11)


Shannon information and entropy

• Shannon information content of an outcome x = a_i (in bits)

      h(x = a_i) = log_2 (1 / P(x = a_i))                                              (12)

• Entropy of an ensemble X (in bits)

      H(X) ≡ Σ_{x ∈ A_X} P(x) log_2 (1 / P(x))                                         (13)

• Joint entropy of X, Y

      H(X, Y) ≡ Σ_{xy ∈ A_X A_Y} P(x, y) log_2 (1 / P(x, y))                           (14)


Decomposability of entropy

• The entropy of a probability distribution p = {p_1, p_2, ..., p_I} can be split off one outcome at a time:

      H(p) = H(p_1, 1 − p_1) + (1 − p_1) H(p_2 / (1 − p_1), p_3 / (1 − p_1), ..., p_I / (1 − p_1))   (15)

• More generally, splitting the outcomes into two groups,

      H(p) = H[(p_1 + ... + p_m), (p_{m+1} + ... + p_I)]
             + (p_1 + ... + p_m) H(p_1 / (p_1 + ... + p_m), ..., p_m / (p_1 + ... + p_m))
             + (p_{m+1} + ... + p_I) H(p_{m+1} / (p_{m+1} + ... + p_I), ..., p_I / (p_{m+1} + ... + p_I))   (16)


Relative entropy

• Kullback-Leibler divergence between P(x) and Q(x) over the alphabet A_X

      D_KL(P ‖ Q) = Σ_x P(x) log_2 (P(x) / Q(x))                                       (17)

• Properties of relative entropy
  – Gibbs' inequality: D_KL(P ‖ Q) ≥ 0, with equality if and only if P = Q
  – In general D_KL(P ‖ Q) ≠ D_KL(Q ‖ P)
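• To make the use of (10)-(11) concrete, here is a minimal grid-based Python sketch of inferring a Bernoulli parameter θ from coin-toss data. The uniform prior, the grid and the data (7 heads in 10 tosses) are hypothetical choices for illustration, not an example taken from the slides.

      import numpy as np

      # Hypothetical setup: theta is an unknown probability of heads, D is
      # "7 heads in 10 tosses", and H assumes a uniform prior on a theta grid.
      theta = np.linspace(0.0, 1.0, 101)
      prior = np.ones_like(theta) / len(theta)                  # P(theta | H)

      n, n_heads = 10, 7                                        # made-up data D
      likelihood = theta**n_heads * (1 - theta)**(n - n_heads)  # P(D | theta, H)

      evidence = np.sum(likelihood * prior)                     # P(D | H)
      posterior = likelihood * prior / evidence                 # P(theta | D, H), eq. (10)

      print(theta[np.argmax(posterior)])                        # posterior mode, here about 0.7

• Similarly, the entropy definition (13), the decomposition (15) and the relative entropy (17) can be checked numerically; the distributions p and q below are again arbitrary illustrative values.

      import numpy as np

      def H(p):
          """Entropy in bits, eq. (13); terms with p_i = 0 contribute nothing."""
          p = np.asarray(p, dtype=float)
          p = p[p > 0]
          return -np.sum(p * np.log2(p))

      p = np.array([0.5, 0.25, 0.125, 0.125])                   # made-up distribution

      # Decomposability, eq. (15): split off p_1 and renormalise the remainder.
      lhs = H(p)
      rhs = H([p[0], 1 - p[0]]) + (1 - p[0]) * H(p[1:] / (1 - p[0]))
      assert abs(lhs - rhs) < 1e-12                             # both equal 1.75 bits

      def D_KL(P, Q):
          """Relative entropy, eq. (17); assumes Q(x) > 0 wherever P(x) > 0."""
          P, Q = np.asarray(P, float), np.asarray(Q, float)
          mask = P > 0
          return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

      q = np.array([0.4, 0.3, 0.2, 0.1])
      print(D_KL(p, q), D_KL(q, p))                             # both >= 0, and not equal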
Convex and concave functions

• f(x) is convex over (a, b) if, for all x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

      f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2)                              (18)

• f(x) is concave if the above holds for f with the inequality reversed
• f(x) is strictly convex (concave) if equality in (18) holds only for λ = 0 and λ = 1
• Jensen's inequality for a convex function f(x) of a random variable x:

      E[f(x)] ≥ f(E[x]),   where E denotes expectation                                 (19)

• If f(x) is convex (concave) and ∇f(x) = 0, then f has its minimum (maximum) value at x


Problems

1. A circular coin of diameter a is thrown onto a square grid whose squares are b × b (a < b).
   What is the probability that the coin will lie entirely within one square? (MacKay exercise 2.31)

2. The inhabitants of an island tell the truth one third of the time; they lie with probability 2/3.
   On one occasion, after one of them made a statement, you ask another "was that statement true?"
   and he says "yes". What is the probability that the statement was indeed true? (MacKay exercise 2.37)
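• A quick numerical check of Jensen's inequality (19), using the convex function f(x) = x² and a made-up discrete random variable, both chosen purely for illustration:

      import numpy as np

      rng = np.random.default_rng(0)
      # Sample a made-up discrete random variable x and apply the convex f(x) = x^2.
      x = rng.choice([-2.0, 0.0, 1.0, 3.0], size=100_000, p=[0.2, 0.3, 0.3, 0.2])

      f = lambda t: t**2
      print(np.mean(f(x)), f(np.mean(x)))   # E[f(x)] >= f(E[x]), here roughly 2.9 vs 0.25

• For problem 1, a Monte Carlo simulation is one way to check an analytic answer; the values of a and b below are arbitrary, and the sketch simply drops the coin centre uniformly in a single b × b square:

      import numpy as np

      rng = np.random.default_rng(1)
      a, b = 1.0, 3.0                                      # arbitrary sizes with a < b
      centres = rng.uniform(0.0, b, size=(1_000_000, 2))   # coin centres in one square
      # The coin lies entirely within the square iff its centre is at least a/2
      # away from every edge.
      inside = np.all((centres >= a / 2) & (centres <= b - a / 2), axis=1)
      print(inside.mean())                                 # compare with your analytic result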