Probability and Information Theory Lecture slides for Chapter 3 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26
Probability Mass Function
• The domain of P must be the set of all possible states of x.
• ∀ x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
• ∑_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
Example: uniform distribution over k states: P(x = x_i) = 1/k
Probability Density Function
• The domain of p must be the set of all possible states of x.
• ∀ x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
Example: uniform distribution: u(x; a, b) = 1/(b − a) on the interval [a, b], with no mass outside the interval; it integrates to 1.
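As a quick numeric illustration (not part of the original slides; the values of k, a, b, and the grid resolution are arbitrary), the sketch below checks the normalization properties above for a uniform PMF over k states and a uniform density u(x; a, b).

```python
import numpy as np

# Uniform PMF over k discrete states: each state has probability 1/k.
k = 5
pmf = np.full(k, 1.0 / k)
assert np.isclose(pmf.sum(), 1.0)            # normalized: probabilities sum to 1
assert np.all((pmf >= 0) & (pmf <= 1))       # each probability lies in [0, 1]

# Uniform density u(x; a, b) = 1/(b - a) on [a, b], zero mass outside the interval.
a, b = 2.0, 6.0
x = np.linspace(a, b, 10001)
density = np.full_like(x, 1.0 / (b - a))
dx = x[1] - x[0]
integral = np.sum(0.5 * (density[:-1] + density[1:]) * dx)   # trapezoidal rule
assert np.isclose(integral, 1.0)             # the density integrates to 1
```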
Computing Marginal Probability with the Sum Rule
∀ x ∈ x, P(x = x) = ∑_y P(x = x, y = y).  (3.3)
p(x) = ∫ p(x, y) dy.  (3.4)
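A minimal sketch of the sum rule (the joint table below is invented for illustration): summing the joint P(x, y) over y recovers the marginal P(x).

```python
import numpy as np

# Hypothetical joint distribution P(x, y) over 2 states of x and 3 states of y.
P_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
assert np.isclose(P_xy.sum(), 1.0)    # a valid joint distribution

# Sum rule (3.3): marginalize out y by summing over the y axis.
P_x = P_xy.sum(axis=1)
print(P_x)                            # [0.5 0.5]
assert np.isclose(P_x.sum(), 1.0)     # the marginal is itself normalized
```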
Conditional Probability
P(y = y | x = x) = P(y = y, x = x) / P(x = x).  (3.5)
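A small sketch of equation (3.5), reusing an invented joint table: the conditional distribution over y given x = x is the corresponding row of the joint divided by the marginal P(x = x).

```python
import numpy as np

# Hypothetical joint distribution P(x, y); rows index x, columns index y.
P_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

x = 0                                # condition on the event x = 0
P_x = P_xy[x].sum()                  # marginal P(x = 0)
P_y_given_x = P_xy[x] / P_x          # equation (3.5), applied element-wise over y
print(P_y_given_x)                   # [0.2 0.5 0.3]
assert np.isclose(P_y_given_x.sum(), 1.0)
```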
Chain Rule of Probability
P(x^(1), …, x^(n)) = P(x^(1)) ∏_{i=2}^{n} P(x^(i) | x^(1), …, x^(i−1)).  (3.6)
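An illustrative check of the chain rule (the joint below is random and the variable ordering is arbitrary): factoring an arbitrary joint over three binary variables as P(x1) P(x2 | x1) P(x3 | x1, x2) reconstructs it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                                      # arbitrary joint over three binary variables

P1 = P.sum(axis=(1, 2))                           # P(x1)
P2_given_1 = P.sum(axis=2) / P1[:, None]          # P(x2 | x1)
P3_given_12 = P / P.sum(axis=2, keepdims=True)    # P(x3 | x1, x2)

# Chain rule (3.6): the product of the conditional factors recovers the joint.
reconstructed = P1[:, None, None] * P2_given_1[:, :, None] * P3_given_12
assert np.allclose(reconstructed, P)
```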
Independence
Two random variables x and y are independent if their joint distribution factorizes:
∀ x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y).  (3.7)
Conditional Independence
∀ x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z).  (3.8)
We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
Expectation
E_{x∼P}[f(x)] = ∑_x P(x) f(x),  (3.9)
E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.  (3.10)
Linearity of expectations:
E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)].  (3.11)
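A sketch of equations (3.9) and (3.11) with an invented discrete distribution and arbitrary f, g, α, β: the expectation is a probability-weighted sum, and it is linear.

```python
import numpy as np

# Hypothetical discrete distribution P over four states of x.
states = np.array([0.0, 1.0, 2.0, 3.0])
P = np.array([0.1, 0.2, 0.3, 0.4])

f = lambda x: x ** 2
g = lambda x: np.sin(x)
alpha, beta = 2.0, -1.5

# Equation (3.9): E[f(x)] = sum over x of P(x) f(x).
E_f = np.sum(P * f(states))
E_g = np.sum(P * g(states))

# Equation (3.11): linearity of expectations.
lhs = np.sum(P * (alpha * f(states) + beta * g(states)))
assert np.isclose(lhs, alpha * E_f + beta * E_g)
```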
Variance and Covariance
Var(f(x)) = E[(f(x) − E[f(x)])²].  (3.12)
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])].  (3.13)
Covariance matrix: Cov(x)_{i,j} = Cov(x_i, x_j).  (3.14)
The diagonal elements of the covariance matrix give the variance: Cov(x_i, x_i) = Var(x_i).
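A sketch of (3.13) and (3.14) on simulated data (the mixing matrix and sample size are arbitrary assumptions): the covariance matrix computed from the definition matches numpy's estimator, and its diagonal holds the variances.

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw samples of a 2-D random vector x with correlated components.
A = np.array([[1.0, 0.0],
              [0.8, 0.5]])
x = rng.standard_normal((100_000, 2)) @ A.T        # each row is one sample of x

mu = x.mean(axis=0)
centered = x - mu
cov = centered.T @ centered / (len(x) - 1)         # Cov(x)_{i,j} = Cov(x_i, x_j)

assert np.allclose(cov, np.cov(x, rowvar=False))
assert np.allclose(np.diag(cov), x.var(axis=0, ddof=1))   # diagonal = variances
```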
Bernoulli Distribution
P(x = 1) = φ  (3.16)
P(x = 0) = 1 − φ  (3.17)
P(x = x) = φ^x (1 − φ)^(1−x)  (3.18)
E_x[x] = φ  (3.19)
Var_x(x) = φ(1 − φ)  (3.20)
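A small simulation sketch (the value of φ and the sample count are my own choices) of (3.19) and (3.20): the sample mean and variance of Bernoulli draws approach φ and φ(1 − φ).

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.3
x = (rng.random(1_000_000) < phi).astype(float)   # Bernoulli(φ) samples as 0/1

print(x.mean())     # ≈ φ = 0.3            (equation 3.19)
print(x.var())      # ≈ φ(1 − φ) = 0.21    (equation 3.20)
```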
Gaussian Distribution
Parametrized by variance:
N(x; µ, σ²) = √(1 / (2πσ²)) exp(−(x − µ)² / (2σ²)).  (3.21)
Parametrized by precision:
N(x; µ, β⁻¹) = √(β / (2π)) exp(−β(x − µ)² / 2).  (3.22)
See figure 3.1 for a plot of the density function.
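A sketch (µ, σ², and the evaluation grid are arbitrary) checking that the variance parametrization (3.21) and the precision parametrization (3.22) with β = 1/σ² give the same density, and that it integrates to roughly 1.

```python
import numpy as np

def gaussian_var(x, mu, sigma2):
    """Equation (3.21): Gaussian density parametrized by variance."""
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

def gaussian_prec(x, mu, beta):
    """Equation (3.22): Gaussian density parametrized by precision."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

mu, sigma2 = 1.0, 2.0
x = np.linspace(-10, 12, 20001)
p1 = gaussian_var(x, mu, sigma2)
p2 = gaussian_prec(x, mu, 1.0 / sigma2)            # precision β = 1/σ²
assert np.allclose(p1, p2)

dx = x[1] - x[0]
print(np.sum(0.5 * (p1[:-1] + p1[1:]) * dx))       # ≈ 1 over a wide enough grid
```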
Gaussian Distribution
Figure 3.1: Plot of the Gaussian density p(x). Maximum at x = µ; inflection points at x = µ ± σ.
Multivariate Gaussian
Parametrized by covariance matrix:
N(x; µ, Σ) = √(1 / ((2π)^n det(Σ))) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)).  (3.23)
Parametrized by precision matrix:
N(x; µ, β⁻¹) = √(det(β) / (2π)^n) exp(−½ (x − µ)ᵀ β (x − µ)).  (3.24)
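A sketch of equation (3.23) (function names and the test case are invented): implement the density directly; for a diagonal Σ it should equal the product of the corresponding univariate Gaussian densities, which gives a simple correctness check.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Equation (3.23): multivariate Gaussian parametrized by covariance Σ."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt(1.0 / ((2 * np.pi) ** n * np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def univariate(x, mu, sigma2):
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

mu = np.array([1.0, -2.0])
Sigma = np.diag([0.5, 3.0])          # diagonal covariance: independent components
x = np.array([0.3, -1.0])

joint = mvn_density(x, mu, Sigma)
product = univariate(x[0], mu[0], 0.5) * univariate(x[1], mu[1], 3.0)
assert np.isclose(joint, product)
```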
More Distributions
Exponential: p(x; λ) = λ 1_{x≥0} exp(−λx).  (3.25)
The exponential distribution uses the indicator function 1_{x≥0} to assign zero probability to all negative values of x.
Laplace: Laplace(x; µ, γ) = (1 / (2γ)) exp(−|x − µ| / γ).  (3.26)
Dirac: p(x) = δ(x − µ).  (3.27)
Empirical Distribution
p̂(x) = (1/m) ∑_{i=1}^{m} δ(x − x^(i))  (3.28)
Mixture Distributions
P(x) = ∑_i P(c = i) P(x | c = i)  (3.29)
Figure 3.2: Gaussian mixture with three components (axes x₁ and x₂).
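A sketch of ancestral sampling from a mixture of the form (3.29) (the component weights, means, and scales are invented): first sample the component identity c, then sample x from the chosen Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture with three components.
weights = np.array([0.5, 0.3, 0.2])   # P(c = i)
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.7])

m = 10_000
c = rng.choice(3, size=m, p=weights)   # sample the component identity c
x = rng.normal(means[c], stds[c])      # then sample x from P(x | c)

print(x.mean(), weights @ means)       # sample mean ≈ Σ_i P(c = i) µ_i
```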
Logistic Sigmoid
Figure 3.3: The logistic sigmoid function σ(x). Commonly used to parametrize Bernoulli distributions.
Softplus Function
Figure 3.4: The softplus function ζ(x).
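A brief sketch of both functions (the numerically stable formulations are my own implementation choices, not from the slides): σ(x) = 1/(1 + exp(−x)) and ζ(x) = log(1 + exp(x)), checked against the identities ζ(x) − ζ(−x) = x and log σ(x) = −ζ(−x).

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid σ(x) = 1 / (1 + exp(-x)), evaluated stably for both signs."""
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def softplus(x):
    """Softplus ζ(x) = log(1 + exp(x)), written to avoid overflow for large |x|."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

x = np.linspace(-10, 10, 101)
assert np.allclose(softplus(x) - softplus(-x), x)        # ζ(x) − ζ(−x) = x
assert np.allclose(np.log(sigmoid(x)), -softplus(-x))    # log σ(x) = −ζ(−x)
```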
Bayes’ Rule
P(x | y) = P(x) P(y | x) / P(y).  (3.42)
Although P(y) appears in the formula, it is usually feasible to compute it as P(y) = ∑_x P(y | x) P(x), so we do not need to know P(y) in advance.
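A minimal numeric sketch of (3.42) with invented numbers (a toy test for a rare condition): the posterior P(x | y) comes from the prior P(x) and likelihood P(y | x), with P(y) obtained by summing over x.

```python
import numpy as np

# Hypothetical setup: x ∈ {0, 1} (condition absent / present), y = "test is positive".
P_x = np.array([0.99, 0.01])             # prior P(x)
P_y_given_x = np.array([0.05, 0.90])     # likelihood P(y | x)

P_y = np.sum(P_y_given_x * P_x)          # P(y) = Σ_x P(y | x) P(x)
P_x_given_y = P_x * P_y_given_x / P_y    # Bayes' rule (3.42)

print(P_x_given_y)                       # ≈ [0.846, 0.154]
assert np.isclose(P_x_given_y.sum(), 1.0)
```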
Change of Variables
p_x(x) = p_y(g(x)) |det(∂g(x)/∂x)|.  (3.47)
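A one-dimensional sketch of (3.47) (the particular map and distributions are chosen for illustration): with x ∼ N(0, 1) and y = g(x) = 2x + 3, y is N(3, 4), and plugging into the formula recovers the density of x.

```python
import numpy as np

def normal_pdf(z, mu, sigma2):
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(z - mu) ** 2 / (2 * sigma2))

g = lambda x: 2.0 * x + 3.0          # invertible map y = g(x); |∂g/∂x| = 2
x = np.linspace(-4, 4, 201)

p_x_direct = normal_pdf(x, 0.0, 1.0)                # x ~ N(0, 1)
p_x_change = normal_pdf(g(x), 3.0, 4.0) * 2.0       # p_y(g(x)) |∂g/∂x|, eq. (3.47)

assert np.allclose(p_x_direct, p_x_change)
```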
Information Theory
Information: I(x) = −log P(x).  (3.48)
Entropy: H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)].  (3.49)
KL divergence: D_KL(P‖Q) = E_{x∼P}[log (P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].  (3.50)
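A sketch of (3.48)–(3.50) for small discrete distributions (P and Q are invented): self-information, Shannon entropy in nats, and the KL divergence, which is non-negative and in general asymmetric.

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.10, 0.10])

info = -np.log(P)                     # I(x) = −log P(x), in nats      (3.48)
H = np.sum(P * info)                  # H(P) = E_P[I(x)]               (3.49)

kl_pq = np.sum(P * (np.log(P) - np.log(Q)))   # D_KL(P‖Q)              (3.50)
kl_qp = np.sum(Q * (np.log(Q) - np.log(P)))   # D_KL(Q‖P)

print(H, kl_pq, kl_qp)
assert kl_pq >= 0 and kl_qp >= 0      # KL divergence is non-negative
assert not np.isclose(kl_pq, kl_qp)   # and generally not symmetric
```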
Entropy of a Bernoulli Variable
Figure 3.5: Shannon entropy (in nats) of a Bernoulli variable as a function of the Bernoulli parameter.
The KL Divergence is Asymmetric
Figure 3.6: Fitting an approximating density q*(x) to p(x); one panel shows q* = argmin_q D_KL(p‖q), the other q* = argmin_q D_KL(q‖p).
Directed Model
Figure 3.7: A directed graphical model over the random variables a, b, c, d, e.
p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c).  (3.54)
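A sketch of the factorization (3.54) with invented conditional probability tables over binary variables: multiplying the local conditionals gives a joint distribution that sums to 1 over all 2⁵ configurations.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Random conditional distribution over a binary variable given binary parents."""
    t = rng.random((2,) * n_parents + (2,))
    return t / t.sum(axis=-1, keepdims=True)

p_a = random_cpt(0)   # p(a)
p_b = random_cpt(1)   # p(b | a)
p_c = random_cpt(2)   # p(c | a, b)
p_d = random_cpt(1)   # p(d | b)
p_e = random_cpt(1)   # p(e | c)

total = 0.0
for a, b, c, d, e in product(range(2), repeat=5):
    # Equation (3.54): the joint is the product of the local conditionals.
    total += p_a[a] * p_b[a, b] * p_c[a, b, c] * p_d[b, d] * p_e[c, e]
assert np.isclose(total, 1.0)
```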
Undirected Model
Figure 3.8: An undirected graphical model over the random variables a, b, c, d, e.
p(a, b, c, d, e) = (1/Z) φ⁽¹⁾(a, b, c) φ⁽²⁾(b, d) φ⁽³⁾(c, e).  (3.56)