Announcements
• Piazza started.
• Matlab Grader homework: email Friday, 2 (of 9) homeworks, due 21 April, binary graded.
• Jupyter homework?: translate Matlab to Jupyter; contact TA Harshul (h6gupta@eng.ucsd.edu) or me. I would like this to happen.
• "GPU" homework: NOAA climate data in Jupyter on datahub.ucsd.edu, 15 April.
• Projects: any language.
• Podcast might work eventually.
Today: Stanford CNN • Bernoulli • Gaussian 1.2 • Gaussian 2.3 • Decision theory 1.5 • Information theory 1.6
Monday: Stanford CNN, Linear models for regression 3
Non-parametric method: K-means
Coin estimate (Bishop 2.1)
• Binary variables $x \in \{0, 1\}$, with $p(x = 1 \mid \mu) = \mu$.
• Bernoulli distributed: $\mathrm{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}$ (2.2), with $\mathbb{E}[x] = \mu$ and $\operatorname{var}[x] = \mu (1 - \mu)$.
• N observations, likelihood:
$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n} \quad (2.5)$$
$$\ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N} \ln p(x_n \mid \mu) = \sum_{n=1}^{N} \{ x_n \ln \mu + (1 - x_n) \ln(1 - \mu) \} \quad (2.6)$$
• Max likelihood: $\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n$
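The Bernoulli log likelihood (2.6) and its maximizer can be checked numerically. The following is a minimal sketch; the function names and the coin-flip data are illustrative assumptions, not part of the lecture.

```python
import math

def bernoulli_log_likelihood(xs, mu):
    """Log likelihood ln p(D | mu) of binary observations xs, as in (2.6)."""
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in xs)

def bernoulli_ml(xs):
    """Maximum likelihood estimate of mu: the fraction of ones in the data."""
    return sum(xs) / len(xs)

# Hypothetical coin flips: 1 = heads, 0 = tails.
flips = [1, 0, 1, 1, 0, 1, 0, 1]
mu_ml = bernoulli_ml(flips)

# The log likelihood at mu_ml should be at least as large as at nearby values.
for mu in (0.5, mu_ml, 0.7):
    print(f"mu = {mu:.3f}  ln p(D|mu) = {bernoulli_log_likelihood(flips, mu):.4f}")
```

Note that the estimate depends on the data only through the number of heads, which is why a short run of identical outcomes gives an extreme estimate of 0 or 1.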
Coin estimate (Bishop 2.1)
• Bayes: posterior ∝ likelihood × prior, $p(x \mid y) \propto p(y \mid x)\, p(x)$
• Conjugate prior: $\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} \mu^{a - 1} (1 - \mu)^{b - 1}$
[Figure: Beta densities over µ for (a, b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4).]
[Figure: Bayes update in three panels over µ: prior, likelihood function, posterior.]
ML, MAP, Bayes
• ML: point estimate.
• MAP: point estimate (often in the literature ML = MAP).
• Bayes => probability distribution => from which all information can be obtained:
– MAP, median, error estimates
– Further analysis, such as sequential updating
– Disadvantage: not a point estimate.
[Figure: prior, likelihood function, posterior over µ.]
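The three kinds of answer can be compared on a small coin example. This is a sketch under stated assumptions: it uses the standard conjugate result that a Beta(a, b) prior combined with m heads and l tails gives a Beta(a + m, b + l) posterior, and the function names, prior values, and data are hypothetical.

```python
def beta_posterior(a, b, xs):
    """Conjugate update: Beta(a, b) prior plus binary data xs gives Beta(a + m, b + l)."""
    heads = sum(xs)
    tails = len(xs) - heads
    return a + heads, b + tails

def summarize(a, b, xs):
    """Return the ML estimate plus MAP, mean, and variance of the Beta posterior."""
    a_post, b_post = beta_posterior(a, b, xs)
    total = a_post + b_post
    return {
        "ml": sum(xs) / len(xs),
        # Posterior mode; this formula is valid only when a_post > 1 and b_post > 1.
        "map": (a_post - 1) / (total - 2),
        "mean": a_post / total,
        "variance": a_post * b_post / (total ** 2 * (total + 1)),
    }

# Hypothetical data: three heads in a row, with an assumed Beta(2, 2) prior.
result = summarize(2, 2, [1, 1, 1])
print(result)
```

With only three flips the ML estimate is pinned at 1, while the MAP estimate and posterior mean are pulled toward the prior, and the posterior variance gives an error estimate that a point estimate alone does not.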
Bayes Rule
$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{data})}$$
Rev’d Thomas Bayes (1702–1761)
• Bayes rule tells us how to do inference about hypotheses from data.
• Learning and prediction can be seen as forms of inference.
The Gaussian Distribution Gaussian Mean and Variance
Gaussian Parameter Estimation • Likelihood function • Maximum (Log) Likelihood
Curve Fitting Re-visited, Bishop 1.2.5
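Maximum Likelihood
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid y(x_n, \mathbf{w}), \beta^{-1} \right) \quad (1.61)$$
As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function. Substituting for the form of the Gaussian distribution, given by (1.46), we obtain the log likelihood function in the form
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) \quad (1.62)$$
Consider first the maximum likelihood solution for the polynomial coefficients, $\mathbf{w}_{\mathrm{ML}}$: maximizing (1.62) with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error. The noise precision then follows from
$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2 \quad (1.63)$$
Given estimates of $\mathbf{w}$ and $\beta$, we can predict
$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\left( t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \beta_{\mathrm{ML}}^{-1} \right) \quad (1.64)$$
Next, take a step towards a more Bayesian approach and introduce a prior.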
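The maximum likelihood recipe of (1.62) to (1.64) can be sketched with NumPy. The function names, the polynomial order, and the synthetic sine data are assumptions made for illustration; the least-squares fit stands in for maximizing (1.62) over w.

```python
import numpy as np

def fit_polynomial_ml(x, t, order):
    """Return (w_ml, beta_ml) for a polynomial y(x, w) = sum_j w_j x**j."""
    # Design matrix with columns 1, x, x**2, ..., x**order.
    Phi = np.vander(x, order + 1, increasing=True)
    # Maximizing (1.62) over w is the same as minimizing the sum of squares.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = Phi @ w_ml - t
    # (1.63): 1 / beta_ml is the mean squared residual.
    beta_ml = 1.0 / np.mean(residuals ** 2)
    return w_ml, beta_ml

def predictive(x_new, w_ml, beta_ml):
    """Mean and standard deviation of the Gaussian predictive distribution (1.64)."""
    Phi_new = np.vander(np.atleast_1d(x_new), len(w_ml), increasing=True)
    mean = Phi_new @ w_ml
    std = np.full_like(mean, beta_ml ** -0.5)
    return mean, std

# Hypothetical noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
w_ml, beta_ml = fit_polynomial_ml(x, t, order=3)
mean, std = predictive(0.25, w_ml, beta_ml)
```

In this plug-in prediction the standard deviation is the same at every input, since it comes only from the estimated noise precision.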
MAP: A Step towards Bayes (Bishop 1.2.5)
• Introduce a prior.
• Determine $\mathbf{w}_{\mathrm{MAP}}$ by minimizing the regularized sum-of-squares error.
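The slide's equations did not survive extraction, so the sketch below assumes the setup of Bishop 1.2.5: a zero-mean Gaussian prior on w with precision α, so that the MAP estimate minimizes (β/2) Σ {y(x_n, w) − t_n}² + (α/2) wᵀw. The function name and the single parameter `lam` (standing for α/β) are illustrative choices.

```python
import numpy as np

def fit_polynomial_map(x, t, order, lam):
    """MAP weights for a polynomial fit under an assumed Gaussian prior.

    Minimizes sum_n (y(x_n, w) - t_n)**2 + lam * w.T @ w, where lam = alpha / beta
    is the assumed ratio of prior precision to noise precision.
    """
    Phi = np.vander(x, order + 1, increasing=True)
    # Regularized normal equations: (Phi^T Phi + lam * I) w = Phi^T t.
    A = Phi.T @ Phi + lam * np.eye(order + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical data; lam = 0 recovers the unregularized least-squares fit.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.9, 2.1, 2.9])
w_ml = fit_polynomial_map(x, t, order=1, lam=0.0)
w_map = fit_polynomial_map(x, t, order=1, lam=10.0)
```

Increasing `lam` shrinks the weights toward zero, which is how the prior counteracts over-fitting when the polynomial order is high relative to the amount of data.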
Predictive Distribution
[Figure: true curve, data, and estimate +/- std dev.]
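Parametric Distributions
• Basic building blocks: $p(\mathbf{x} \mid \boldsymbol{\theta})$
• Need to determine $\boldsymbol{\theta}$ given $\{ \mathbf{x}_1, \ldots, \mathbf{x}_N \}$
• Representation: a point estimate $\boldsymbol{\theta}^{\star}$ or a distribution $p(\boldsymbol{\theta})$?
• Recall Curve Fitting
• We focus on Gaussians!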
The Gaussian Distribution
Central Limit Theorem
• The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows.
• Example: N uniform [0,1] random variables.
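The uniform example can be simulated with the standard library. This is a sketch under assumptions: the trial count, seed, and the use of excess kurtosis as a rough "Gaussianity" check are illustrative choices (a Gaussian has excess kurtosis 0, a single uniform variable has about -1.2).

```python
import random
import statistics

def simulate_means(n, trials=20000, seed=0):
    """Draw `trials` sample means, each over n uniform [0, 1] variables."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    m = statistics.fmean(samples)
    var = statistics.pvariance(samples, m)
    fourth = statistics.fmean([(s - m) ** 4 for s in samples])
    return fourth / var ** 2 - 3.0

for n in (1, 2, 10):
    means = simulate_means(n)
    print(f"N = {n:2d}  mean = {statistics.fmean(means):.3f}  "
          f"excess kurtosis = {excess_kurtosis(means):+.3f}")
```

As N grows the excess kurtosis moves toward zero, which is one numerical view of the histogram of means approaching a Gaussian shape.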
Geometry of the Multivariate Gaussian
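Moments of the Multivariate Gaussian (2)
• A general Gaussian requires D(D+1)/2 + D parameters: the symmetric covariance has D(D−1)/2 off-diagonal plus D diagonal entries, and the mean has D.
• Often we use D + D (diagonal covariance) or just D + 1 (isotropic covariance) parameters.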
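The parameter counts can be tallied directly. A minimal sketch follows, where the function name and the labels "full", "diagonal", and "isotropic" are illustrative choices; it counts the D mean entries plus the free covariance entries, using the fact that a symmetric D × D matrix has D(D−1)/2 off-diagonal and D diagonal free entries.

```python
def gaussian_parameter_count(d, covariance="full"):
    """Number of free parameters of a D-dimensional Gaussian (mean plus covariance)."""
    if covariance == "full":
        cov_params = d * (d - 1) // 2 + d  # off-diagonal plus diagonal entries
    elif covariance == "diagonal":
        cov_params = d                     # one variance per dimension
    elif covariance == "isotropic":
        cov_params = 1                     # a single shared variance
    else:
        raise ValueError(f"unknown covariance type: {covariance}")
    return cov_params + d                  # add the D entries of the mean

for kind in ("full", "diagonal", "isotropic"):
    print(kind, gaussian_parameter_count(100, kind))
```

The full covariance count grows quadratically with D, which is the practical reason for the restricted forms.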
Partitioned Conditionals and Marginals, page 89 • Conditional • Marginal
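ML for the Gaussian (1), Bishop 2.3.4
Given i.i.d. data $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^{\mathrm{T}}$, the log likelihood function is given by
$$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})$$
Useful matrix derivatives:
$$\frac{\partial}{\partial \mathbf{A}} \ln |\mathbf{A}| = \left( \mathbf{A}^{-1} \right)^{\mathrm{T}} \quad (\text{C.28})$$
$$\frac{\partial}{\partial \mathbf{A}} \operatorname{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^{\mathrm{T}} \quad (\text{C.24})$$
$$\frac{\partial}{\partial x} \left( \mathbf{A}^{-1} \right) = -\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial x} \mathbf{A}^{-1} \quad (\text{C.21})$$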
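Maximum Likelihood for the Gaussian
• Set the derivative of the log likelihood function to zero,
$$\frac{\partial}{\partial \boldsymbol{\mu}} \ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) = 0$$
• and solve to obtain
$$\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$$
• Similarly
$$\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}}) (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$$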
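Setting the derivatives to zero leads to the standard result that the ML mean is the sample mean and the ML covariance is the sample covariance with a 1/N normalization. A minimal NumPy sketch, with a hypothetical function name and toy data:

```python
import numpy as np

def gaussian_ml(X):
    """ML mean and covariance of a multivariate Gaussian from rows of X (shape N x D)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    centered = X - mu
    # Divides by N (the ML estimate), not N - 1 (the unbiased estimate).
    sigma = centered.T @ centered / X.shape[0]
    return mu, sigma

# Hypothetical 2-D data: the four corners of a square.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mu_ml, sigma_ml = gaussian_ml(X)
```

The same covariance is returned by `np.cov(X.T, bias=True)`; the default `np.cov` divides by N − 1 instead.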
Mixtures of Gaussians (Bishop 2.3.9)
Old Faithful geyser: The time between eruptions has a bimodal distribution, with the mean interval being either 65 or 91 minutes, and is dependent on the length of the prior eruption. Within a margin of error of ±10 minutes, Old Faithful will erupt either 65 minutes after an eruption lasting less than 2½ minutes, or 91 minutes after an eruption lasting more than 2½ minutes.
[Figure: single Gaussian (left) and mixture of two Gaussians (right) fitted to the Old Faithful data.]
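Mixtures of Gaussians (Bishop 2.3.9)
• Combine simple models into a complex model:
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is a component and $\pi_k$ is its mixing coefficient.
[Figure: example with K = 3.]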
Mixtures of Gaussians (Bishop 2.3.9)
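Mixtures of Gaussians (Bishop 2.3.9)
• Determining parameters π, µ, and Σ using maximum log likelihood:
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
• Log of a sum; no closed form maximum.
• Solution: use standard, iterative, numeric optimization methods or the expectation maximization (EM) algorithm (Chapter 9).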
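The "log of a sum" structure is easy to see in code. The sketch below evaluates the log likelihood of a one-dimensional Gaussian mixture for fixed parameters; it does not fit them. The function names, the toy waiting times, and the parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2); weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def mixture_log_likelihood(data, weights, means, variances):
    """Sum over data points of the log of the mixture density (a log of a sum)."""
    return sum(math.log(mixture_pdf(x, weights, means, variances)) for x in data)

# Hypothetical bimodal data, loosely in the spirit of eruption waiting times.
data = [50.0, 52.0, 54.0, 80.0, 82.0, 84.0]
single = mixture_log_likelihood(data, [1.0], [67.0], [683.0 / 3.0])
double = mixture_log_likelihood(data, [0.5, 0.5], [52.0, 82.0], [8.0 / 3.0, 8.0 / 3.0])
print(f"single Gaussian: {single:.2f}   two-component mixture: {double:.2f}")
```

Because the logarithm acts on the sum over components rather than on each Gaussian separately, the derivative with respect to any one component's parameters involves all the others, which is why an iterative method is needed.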
Entropy 1.6 Important quantity in • coding theory • statistical physics • machine learning
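Differential Entropy
• Put bins of width Δ along the real line.
• For fixed $\sigma^2$, the differential entropy is maximized when $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$, in which case $\mathrm{H}[x] = \frac{1}{2} \{ 1 + \ln(2\pi\sigma^2) \}$.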
The Kullback-Leibler Divergence
• p is the true distribution, q is the approximating distribution.
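For discrete distributions, entropy and the KL divergence take only a few lines. This sketch assumes the standard definitions H[p] = −Σ p_i ln p_i and KL(p‖q) = Σ p_i ln(p_i / q_i), in nats; the function names and example distributions are illustrative.

```python
import math

def entropy(p):
    """Entropy in nats of a discrete distribution p (terms with p_i = 0 contribute 0)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true distribution p and approximating distribution q.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(f"H[p] = {entropy(p):.4f}")
print(f"KL(p||q) = {kl_divergence(p, q):.4f}   KL(q||p) = {kl_divergence(q, p):.4f}")
```

The two directions give different values, so the divergence is not symmetric, and it is zero only when the two distributions agree.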