Bayesian networks: basic parameter learning
Machine Intelligence
Thomas D. Nielsen
September 2008
Estimation Example: Physical Measurements
The mass of an atomic particle is measured in repeated experiments.
Measurement result = true mass + random error.
Estimate of the true mass: the mean value of the normal distribution that best "fits" the data.
Estimation Example: Coin Tossing
Is the Euro fair? Toss a Euro coin 1000 times and count the number of heads and tails.
Result: heads: 521, tails: 479.
Estimate of the probability of the coin landing heads: the value that best "fits" the data: 521/1000.
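To make the coin example concrete, here is a minimal sketch (not part of the original slides) that simulates tosses of a possibly unfair coin and recovers the maximum likelihood estimate as the relative frequency of heads. The true probability 0.52 and the function name are illustrative assumptions.

```python
import random

def ml_estimate_heads(tosses):
    """Relative frequency of heads: the maximum likelihood estimate."""
    return sum(1 for t in tosses if t == "h") / len(tosses)

random.seed(0)
true_p = 0.52  # assumed slightly unfair Euro, for illustration only
tosses = ["h" if random.random() < true_p else "t" for _ in range(1000)]
print("heads:", tosses.count("h"), "tails:", tosses.count("t"))
print("ML estimate of P(heads):", ml_estimate_heads(tosses))
```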
Estimation: Classical
Structure of an Estimation Problem
Given: data produced by some random process that is characterized by one or several numerical parameters.
Wanted: infer the value of (some of) the parameters.
(Classical) method: obtain an estimate for a parameter via a function that maps possible data sets into the parameter space.
Estimation: Classical
Parametric Family
Let W be a set and Θ ⊆ R^k for some k ≥ 1. For every θ ∈ Θ let P_θ be a probability distribution on W. Then {P_θ | θ ∈ Θ} is called a parametric family (of distributions).
Example 1: W = {h, t}, Θ = [0, 1]. P_θ: the distribution with P(h) = θ (and P(t) = 1 − θ).
Example 2: W = {w_1, ..., w_k}, Θ = {θ = (p_1, ..., p_k) ∈ [0, 1]^k | ∑_i p_i = 1}. P_θ: the distribution with P(w_i) = p_i.
Example 3: W = R, Θ = R × R_+. For θ = (µ, σ) ∈ Θ: P_θ is the normal distribution with mean µ and standard deviation σ.
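As an illustration (not part of the original slides), the finite parametric families of Examples 1 and 2 can be written as functions mapping a parameter θ to a concrete distribution over W; representing a distribution as a plain dict is an assumption made only for this sketch.

```python
def bernoulli_family(theta):
    """Example 1: W = {h, t}, P(h) = theta, P(t) = 1 - theta."""
    assert 0.0 <= theta <= 1.0
    return {"h": theta, "t": 1.0 - theta}

def multinomial_family(outcomes, probs):
    """Example 2: W = {w_1, ..., w_k}, P(w_i) = p_i with sum(p_i) = 1."""
    assert abs(sum(probs) - 1.0) < 1e-9
    return dict(zip(outcomes, probs))

print(bernoulli_family(0.3))                                   # {'h': 0.3, 't': 0.7}
print(multinomial_family(["w1", "w2", "w3"], [0.2, 0.5, 0.3]))
```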
Estimation: Classical
Sample
A family X_1, ..., X_N of random variables is called independent identically distributed (iid) if the family is independent and P(X_i) = P(X_j) for all i, j.
A sample s_1, ..., s_N ∈ W of observations (or data items) is interpreted as the observed values of an iid family of random variables with distribution P(X_i) = P_θ.
Likelihood Function
Given a parametric family {P_θ | θ ∈ Θ} of distributions on W and a sample s = (s_1, ..., s_N) ∈ W^N, the function
  θ ↦ P_θ(s) := ∏_{i=1}^{N} P_θ(s_i),   resp.   θ ↦ log P_θ(s) = ∑_{i=1}^{N} log P_θ(s_i),
is called the likelihood function (resp. log-likelihood function) for θ given s.
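A minimal sketch (an assumed helper, not from the slides) of the log-likelihood for the Bernoulli family of Example 1, evaluated on an iid sample such as the coin-tossing data above:

```python
import math

def log_likelihood(theta, sample):
    """log P_theta(s) = sum_i log P_theta(s_i) for W = {'h', 't'} with P(h) = theta."""
    return sum(math.log(theta if s == "h" else 1.0 - theta) for s in sample)

sample = ["h"] * 521 + ["t"] * 479
print(log_likelihood(0.5, sample))    # log-likelihood of the fair coin
print(log_likelihood(0.521, sample))  # higher: theta = 0.521 fits the data better
```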
Estimation: Classical
Maximum Likelihood Estimator
Given: a parametric family and a sample s. Every θ* ∈ Θ with θ* = arg max_{θ ∈ Θ} P_θ(s) is called a maximum likelihood estimate for θ (given s).
Since the logarithm is a strictly monotone function, maximum likelihood estimates are also obtained by maximizing the log-likelihood: θ* = arg max_{θ ∈ Θ} log P_θ(s).
Estimation: Classical
Thumbtack Example
We have tossed a thumbtack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data.
[Figure: three candidate models M_0.1, M_0.2, M_0.3 with P(pinup) = 0.1, 0.2, 0.3, respectively.]
We can measure how well a model M_θ fits the data D using:
  P(D | M_θ) = P(pinup, pinup, pindown, ..., pinup | M_θ) = P(pinup | M_θ) · P(pinup | M_θ) · P(pindown | M_θ) · ... · P(pinup | M_θ)
This is also called the likelihood of M_θ given D.
We select the parameter θ̂ that maximizes the likelihood:
  θ̂ = arg max_θ P(D | M_θ) = arg max_θ ∏_{i=1}^{100} P(d_i | M_θ) = arg max_θ µ · θ^80 (1 − θ)^20.
By setting
  d/dθ [µ · θ^80 (1 − θ)^20] = 0
we get the maximum likelihood estimate θ̂ = 0.8.
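As a sanity check (not part of the original slides), the maximizer of θ^80 (1 − θ)^20 can also be located numerically by a simple grid search over θ; the grid resolution is an arbitrary choice for this sketch.

```python
import math

# Grid search for the theta that maximizes the thumbtack log-likelihood.
# Working in log-space avoids underflow of theta**80 * (1 - theta)**20.
def log_lik(theta, n_up=80, n_down=20):
    return n_up * math.log(theta) + n_down * math.log(1.0 - theta)

grid = [i / 1000 for i in range(1, 1000)]   # theta in (0, 1)
best = max(grid, key=log_lik)
print("numerical ML estimate:", best)       # agrees with the analytic 0.8
```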
Estimation: Classical
Maximum Likelihood Estimates for the Multinomial Distribution
Consider the family of multinomial distributions defined by W = {w_1, ..., w_k}, Θ = {θ = (p_1, ..., p_k) ∈ [0, 1]^k | ∑_i p_i = 1}, and P_θ: the distribution with P(w_i) = p_i.
For {P_θ | θ ∈ Θ} and s ∈ W^N there exists exactly one maximum likelihood estimate θ* = (p*_1, ..., p*_k), given by
  p*_i = (1/N) · |{j ∈ {1, ..., N} | s_j = w_i}|
[i.e. θ* is just the empirical distribution defined by the data on W].
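A small sketch (assumptions: the sample is a Python list and the estimate is returned as a dict) illustrating that the maximum likelihood estimate is just the empirical distribution of the sample:

```python
from collections import Counter

def ml_estimate(sample):
    """p*_i = |{j : s_j = w_i}| / N, the empirical distribution of the sample."""
    counts = Counter(sample)
    n = len(sample)
    return {w: c / n for w, c in counts.items()}

sample = ["w1", "w2", "w1", "w3", "w1", "w2"]
print(ml_estimate(sample))   # {'w1': 0.5, 'w2': 0.333..., 'w3': 0.166...}
```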
Estimation: Classical
Proof (for W = {w_1, w_2}):
  p*_1 = (1/N) · |{j ∈ {1, ..., N} | s_j = w_1}|
  p*_2 = (1/N) · |{j ∈ {1, ..., N} | s_j = w_2}|   (= 1 − p*_1)
Then
  log P_θ(s) = ∑_{j=1}^{N} log P_θ(s_j) = N · (p*_1 log(p_1) + p*_2 log(p_2)) = N · (p*_1 log(p_1) + (1 − p*_1) log(1 − p_1)).
Differentiated w.r.t. p_1:
  N · (p*_1 / p_1 − (1 − p*_1) / (1 − p_1))
Only root: p_1 = p*_1.
Estimation: Classical
Consistency
Let W = {w_1, ..., w_k}, and let the data s_1, s_2, ..., s_N be generated by the distribution P_θ with parameters θ = (p_1, ..., p_k). Then for all ε > 0 and i = 1, ..., k:
  lim_{N→∞} P_θ(|p*_i − p_i| ≥ ε) = 0
Note: p* is a function of s. The probability P_θ(|p*_i − p_i| ≥ ε) is the probability that sampling from P_θ yields a sample s such that the p* computed from s satisfies |p*_i − p_i| ≥ ε.
Similar consistency properties hold for many other types of maximum likelihood estimates.
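The consistency property can be illustrated empirically (illustrative sketch, not from the slides): as N grows, the empirical frequency p*_1 computed from a sample drawn from P_θ approaches the true p_1. The parameter values and sample sizes below are arbitrary choices.

```python
import random

random.seed(1)
p = [0.2, 0.5, 0.3]                  # assumed true parameters of P_theta
outcomes = ["w1", "w2", "w3"]

for n in [10, 100, 1000, 10000, 100000]:
    sample = random.choices(outcomes, weights=p, k=n)
    p1_star = sample.count("w1") / n
    print(f"N = {n:6d}   p*_1 = {p1_star:.4f}   |p*_1 - p_1| = {abs(p1_star - 0.2):.4f}")
```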
Estimation: Classical
Chebyshev's Inequality
A quantitative bound:
  P_θ(|p*_i − p_i| ≥ ε) ≤ p_i(1 − p_i) / (ε² N)
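A quick empirical comparison (illustrative sketch, not from the slides) of the actual deviation probability P_θ(|p*_1 − p_1| ≥ ε) with the Chebyshev bound p_1(1 − p_1)/(ε² N), estimated by repeated sampling; the values of p_1, N, ε, and the number of runs are arbitrary choices.

```python
import random

random.seed(2)
p1, n, eps, runs = 0.3, 200, 0.05, 5000

# Estimate P_theta(|p*_1 - p_1| >= eps) by drawing many samples of size n.
hits = 0
for _ in range(runs):
    sample = [1 if random.random() < p1 else 0 for _ in range(n)]
    p1_star = sum(sample) / n
    if abs(p1_star - p1) >= eps:
        hits += 1

empirical = hits / runs
chebyshev = p1 * (1 - p1) / (eps ** 2 * n)
print(f"empirical deviation probability: {empirical:.3f}")
print(f"Chebyshev bound:                 {chebyshev:.3f}")   # bound is looser
```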