Assignment 3
Zahra Sheikhbahaee, Zeou Hu & Colin Vandenhof
February 2020

1 [2 points] Mixture of Bernoullis

A mixture of Bernoullis model is like the Gaussian mixture model which we’ve discussed in this course. Each of the mixture components consists of a collection of independent Bernoulli random variables. In general, a mixture model assumes the data are generated by the following process: first we sample $z$, and then we sample the observables $x$ from a distribution which depends on $z$, i.e.

$$p(x, z) = p(x \mid z)\, p(z) \tag{1}$$

In mixture models, $p(z)$ is always a multinomial distribution with parameter $\pi = \{\pi_1, \ldots, \pi_K\}$, the mixture weights, which satisfy

$$\sum_{k=1}^{K} \pi_k = 1, \qquad \pi_k \ge 0 \tag{2}$$

Consider a set of $N$ binary random vectors $x_i$ in a $D$-dimensional space, where $i = 1, \ldots, N$. Given the component $z_i = k$, each coordinate $x_{ij}$ is governed by a Bernoulli distribution with parameter $\theta_{jk}$:

$$p(x_i \mid z_i = k, \theta) = \prod_{j=1}^{D} \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{(1 - x_{ij})} \tag{3}$$

We can write the generative model of a mixture model as

$$\begin{aligned}
p(z \mid \pi) &= \mathrm{Multinomial}(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_k} \\
p(x \mid z, \theta) &= \prod_{k=1}^{K} \left[ \theta_k^{x} (1 - \theta_k)^{(1 - x)} \right]^{z_k}
\end{aligned} \tag{4}$$

The first distribution gives the mixture proportions, and $\pi_k$ is the weight of the $k$-th component. So the Bernoulli mixture model is given as

$$p(x) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \prod_{j=1}^{D} \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{(1 - x_{ij})} \tag{5}$$
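The two-stage generative process described above (sample $z$, then sample $x$ given $z$) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sample_bernoulli_mixture` is made up here, and the parameters are stored as `theta[k][j]` (component first, coordinate second), which is the transpose of the $\theta_{jk}$ indexing used in the text.

```python
import random

def sample_bernoulli_mixture(pi, theta, n, seed=0):
    """Draw n points from a mixture of Bernoullis.

    pi    : list of K mixture weights (non-negative, summing to 1)
    theta : theta[k][j] = P(x_j = 1 | z = k), a K x D nested list
            (assumed layout; the text writes this parameter as theta_jk)
    Returns (X, Z): X is an n x D list of 0/1 values, Z the sampled components.
    """
    rng = random.Random(seed)
    components = range(len(pi))
    X, Z = [], []
    for _ in range(n):
        # Step 1: sample the latent component z ~ Categorical(pi).
        k = rng.choices(components, weights=pi, k=1)[0]
        # Step 2: sample each coordinate independently,
        # x_j ~ Bernoulli(theta[k][j]).
        x = [1 if rng.random() < t else 0 for t in theta[k]]
        Z.append(k)
        X.append(x)
    return X, Z
```

For example, `sample_bernoulli_mixture([0.3, 0.7], [[0.9, 0.1], [0.2, 0.8]], 5)` returns five binary vectors of length 2 together with the component that generated each one.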
• Show the associated directed graphical model and write down the incomplete-data log likelihood. The complete-data log likelihood for this model can be written as

$$\ln p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\pi}, \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln \left\{ \prod_{k=1}^{K} \left[ \pi_k \prod_{j=1}^{D} p(x_{ij} \mid \theta_{jk}) \right]^{z_{ik}} \right\} \tag{6}$$

In order to derive the EM algorithm, we take the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variable $z$. Write down $Q(\xi; \xi^{(\text{old})})$, where $\xi = \{\theta, \pi\}$, and the posterior distribution of the latent variable $z$.

• Derive the updates for $\pi$ and $\theta$ in the M-step for ML estimation in terms of $E[z_{ik}]$, and write down $E[z_{ik}]$.

• Consider a mixture distribution $p(x)$ and show that

$$\begin{aligned}
E[x] &= \sum_{k=1}^{K} \pi_k \theta_k \\
\operatorname{cov}[x] &= \sum_{k=1}^{K} \pi_k \left\{ \Sigma_k + \theta_k \theta_k^T \right\} - E[x]\, E[x]^T
\end{aligned} \tag{7}$$

where $\Sigma_k = \operatorname{diag}[\theta_{jk}(1 - \theta_{jk})]$.

Hint: Solve the second equation in the general case by adding and subtracting a term which is a function of $E[x \mid k] = \theta_k$.

• We now consider a Bayesian model in which we impose priors on the parameters. We impose the natural conjugate priors, i.e., a Beta prior for each $\theta_{jk}$ and a Dirichlet prior for $\pi$:

$$\begin{aligned}
p(\pi \mid \alpha) &= \mathrm{Dir}(\pi \mid \alpha) \\
p(\theta_{jk} \mid a, b) &= \mathrm{Beta}(\theta_{jk} \mid a, b)
\end{aligned} \tag{8}$$

Show that the M-step for MAP estimation of a mixture of Bernoullis is given by

$$\begin{aligned}
\theta_{jk} &= \frac{\left( \sum_i E[z_{ik}]\, x_{ij} \right) + a - 1}{\left( \sum_i E[z_{ik}] \right) + a + b - 2} \\
\pi_k &= \frac{\left( \sum_i E[z_{ik}] \right) + \alpha_k - 1}{N + \sum_k \alpha_k - K}
\end{aligned} \tag{9}$$

Hint: For the maximization w.r.t. $\pi_k$ in the M-step, you need to use a Lagrange multiplier to enforce the constraint on $\pi$.
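One EM iteration for this model can be sketched as follows. The M-step implements the MAP updates of (9); the E-step uses the standard mixture-model responsibility, $E[z_{ik}] \propto \pi_k \prod_j \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{1 - x_{ij}}$, normalised over $k$. The function names and the `theta[k][j]` storage layout are assumptions made for this illustration, and the sketch assumes every $\pi_k$ and $\theta_{jk}$ lies strictly between 0 and 1 so that the logarithms are finite.

```python
import math

def e_step(X, pi, theta):
    """Responsibilities R[i][k] = E[z_ik] under the current parameters.

    theta[k][j] is the Bernoulli parameter of coordinate j in component k
    (assumed layout; the text writes it as theta_jk).
    """
    R = []
    for x in X:
        # Unnormalised log posterior of each component for this point.
        logp = []
        for k, pk in enumerate(pi):
            lp = math.log(pk)
            for xj, t in zip(x, theta[k]):
                lp += math.log(t) if xj else math.log(1.0 - t)
            logp.append(lp)
        m = max(logp)  # subtract the max before exponentiating, for stability
        w = [math.exp(lp - m) for lp in logp]
        s = sum(w)
        R.append([wk / s for wk in w])
    return R

def m_step_map(X, R, a, b, alpha):
    """MAP updates of equation (9).

    With a = b = 1 and every alpha_k = 1 the prior terms vanish and the
    updates reduce to the ML M-step.
    """
    N, D, K = len(X), len(X[0]), len(alpha)
    # Nk[k] = sum_i E[z_ik], the effective number of points in component k.
    Nk = [sum(R[i][k] for i in range(N)) for k in range(K)]
    theta = [[(sum(R[i][k] * X[i][j] for i in range(N)) + a - 1.0)
              / (Nk[k] + a + b - 2.0)
              for j in range(D)]
             for k in range(K)]
    denom = N + sum(alpha) - K
    pi = [(Nk[k] + alpha[k] - 1.0) / denom for k in range(K)]
    return pi, theta
```

Alternating `R = e_step(X, pi, theta)` and `pi, theta = m_step_map(X, R, a, b, alpha)` until the parameters stop changing gives a basic EM loop; choosing `a`, `b` and `alpha_k` greater than 1 keeps the estimates away from the boundary values 0 and 1.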
2 [2 points] Variational Lower Bound for the mixture of Bernoullis

In the mixture of Bernoullis, the multinomial distribution chooses the mixture component. One can assume the conditional probability of each observed component follows a Bernoulli distribution, as given in Equation 3. We have priors over $\pi$ and $\theta$:

$$\begin{aligned}
p(\pi \mid \alpha) &= \frac{\Gamma\!\left( \sum_k \alpha_k \right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \\
p(\theta_{jk} \mid a, b) &= \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)}\, \theta_{jk}^{a - 1} (1 - \theta_{jk})^{b - 1}
\end{aligned} \tag{10}$$

• Consider a variational distribution which factorizes between the latent variables and the parameters. Show that the lower bound has the following form:

$$\begin{aligned}
\mathcal{L} = {} & E[\ln P(\mathbf{x} \mid \mathbf{z}, \theta)] + E[\ln P(\mathbf{z} \mid \pi)] + E[\ln P(\pi \mid \alpha)] + E[\ln P(\theta \mid a, b)] \\
& - E[\ln q(\mathbf{z})] - E[\ln q(\pi)] - E[\ln q(\theta)]
\end{aligned} \tag{11}$$

• Let’s assume that the approximate distributions of the parameters of the model have the following form:

$$\begin{aligned}
q(\theta \mid \eta, \nu) &= \mathrm{Beta}(\eta, \nu) \\
q(\pi \mid \rho) &= \mathrm{Dir}(\rho) \\
q(z_k \mid \tau_k) &= \mathrm{Cat}(\tau_k)
\end{aligned} \tag{12}$$

Here $\rho$, $\tau$, $\eta$ and $\nu$ are variational parameters. Derive the variational update equations for the three variational distributions using the mean field approximation, which should yield

$$\begin{aligned}
\rho_k &= \alpha_k + \sum_{i=1}^{N} \tau_{ik} \\
\eta_{jk} &= a + \sum_{i=1}^{N} \tau_{ik}\, x_{ij} \\
\nu_{jk} &= b + \sum_{i=1}^{N} \tau_{ik}\, (1 - x_{ij}) \\
\tau_{ik} &\propto \exp\Bigg\{ \psi(\rho_k) - \psi\!\left( \sum_{k'=1}^{K} \rho_{k'} \right) + \sum_{j=1}^{D} x_{ij} \left[ \psi(\eta_{jk}) - \psi(\eta_{jk} + \nu_{jk}) \right] \\
&\qquad\qquad + \sum_{j=1}^{D} (1 - x_{ij}) \left[ \psi(\nu_{jk}) - \psi(\eta_{jk} + \nu_{jk}) \right] \Bigg\}
\end{aligned} \tag{13}$$
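The coordinate updates in (13) can be sketched numerically as below. This is an illustrative sketch rather than a reference solution: the function names are invented here, the variational parameters are stored as `eta[k][j]` and `nu[k][j]` (transposed relative to $\eta_{jk}$, $\nu_{jk}$), and the digamma function is approximated with a short recurrence-plus-asymptotic-series routine so that the block needs only the standard library.

```python
import math

def digamma(x):
    """Approximate digamma psi(x) for x > 0.

    Uses psi(x) = psi(x + 1) - 1/x to shift x up to at least 6, then the
    asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def update_parameters(X, tau, alpha, a, b):
    """rho, eta, nu updates of (13) given responsibilities tau[i][k]."""
    N, D, K = len(X), len(X[0]), len(alpha)
    rho = [alpha[k] + sum(tau[i][k] for i in range(N)) for k in range(K)]
    eta = [[a + sum(tau[i][k] * X[i][j] for i in range(N))
            for j in range(D)] for k in range(K)]
    nu = [[b + sum(tau[i][k] * (1 - X[i][j]) for i in range(N))
           for j in range(D)] for k in range(K)]
    return rho, eta, nu

def update_tau(X, rho, eta, nu):
    """tau update of (13), normalised over k for each data point."""
    K, D = len(rho), len(eta[0])
    psi_rho_sum = digamma(sum(rho))
    tau = []
    for x in X:
        log_tau = []
        for k in range(K):
            s = digamma(rho[k]) - psi_rho_sum
            for j in range(D):
                psi_total = digamma(eta[k][j] + nu[k][j])
                if x[j]:
                    s += digamma(eta[k][j]) - psi_total
                else:
                    s += digamma(nu[k][j]) - psi_total
            log_tau.append(s)
        m = max(log_tau)  # stabilise the exponentials
        w = [math.exp(v - m) for v in log_tau]
        total = sum(w)
        tau.append([v / total for v in w])
    return tau
```

Starting from some initial `tau` and alternating `update_parameters` with `update_tau` gives a simple mean-field iteration; a convergence check on the lower bound (11) is left out of this sketch.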
Hint: Use the following properties:

$$\begin{aligned}
E_{q(\theta)}[\ln \theta_{jk}] &= \psi(\eta_{jk}) - \psi(\eta_{jk} + \nu_{jk}) \\
E_{q(\theta)}[\ln (1 - \theta_{jk})] &= \psi(\nu_{jk}) - \psi(\eta_{jk} + \nu_{jk}) \\
E_{q(\pi)}[\ln \pi_k] &= \psi(\rho_k) - \psi\!\left( \sum_{k'=1}^{K} \rho_{k'} \right) \\
E_{q(z_k)}[z_k] &= \tau_k
\end{aligned} \tag{14}$$

where $\psi(\cdot)$ is the digamma function.

3 [2 points] Kernel methods

1. The k-nearest neighbors classifier assigns a point $\mathbf{x}$ to the majority class of its $k$ nearest neighbors in the training set. Assume that we use squared Euclidean distance to measure the distance to some point $\mathbf{x}_n$ in the training set, $\|\mathbf{x} - \mathbf{x}_n\|^2$. Reformulate this classifier for a nonlinear kernel $k$ using the kernel trick.

2. The file circles.csv contains a toy dataset. Each example has two features that represent its coordinates $(x_1, x_2)$ in 2D space. Points belong to one of 5 classes, which correspond to different circles centered at the origin. We would like to perform classification with an additional feature for the squared Euclidean distance to the origin. Write out the appropriate feature map $\phi((x_1, x_2))$ and kernel function $k(\mathbf{x}, \mathbf{x}')$.

3. Perform k-nearest neighbors classification with $k = 15$ using the kernel from (2) and the standard linear kernel. Compare accuracies over 10-fold cross validation. Which version gives better results?
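A kernelised nearest-neighbor vote can be sketched as follows. The sketch relies on the identity $\|\phi(\mathbf{x}) - \phi(\mathbf{x}_n)\|^2 = k(\mathbf{x}, \mathbf{x}) - 2 k(\mathbf{x}, \mathbf{x}_n) + k(\mathbf{x}_n, \mathbf{x}_n)$, so distances are computed from kernel evaluations only. Several things here are assumptions: `radius_kernel` is one possible reading of item 2 (the feature map $\phi((x_1, x_2)) = (x_1, x_2, x_1^2 + x_2^2)$), `make_circles` is a hypothetical stand-in for circles.csv whose actual contents are not reproduced here, and cross validation is omitted.

```python
import math
from collections import Counter

def linear_kernel(x, y):
    """Standard linear kernel k(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def radius_kernel(x, y):
    """k(x, y) = <phi(x), phi(y)> for phi(x) = (x1, x2, x1^2 + x2^2)."""
    return linear_kernel(x, y) + linear_kernel(x, x) * linear_kernel(y, y)

def kernel_knn_predict(x, train_X, train_y, kernel, k=15):
    """Majority vote over the k training points closest to x in feature space."""
    kxx = kernel(x, x)
    # Squared feature-space distance: k(x,x) - 2 k(x,xn) + k(xn,xn).
    dists = [(kxx - 2.0 * kernel(x, xn) + kernel(xn, xn), yn)
             for xn, yn in zip(train_X, train_y)]
    dists.sort(key=lambda d: d[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def make_circles(points_per_circle=20):
    """Hypothetical toy data: class r lies on the circle of radius r, r = 1..5."""
    X, y = [], []
    for r in range(1, 6):
        for t in range(points_per_circle):
            angle = 2.0 * math.pi * t / points_per_circle
            X.append((r * math.cos(angle), r * math.sin(angle)))
            y.append(r)
    return X, y
```

With `train_X, train_y = make_circles()`, a call such as `kernel_knn_predict((3 * math.cos(0.1), 3 * math.sin(0.1)), train_X, train_y, radius_kernel)` classifies a query point lying on the radius-3 circle; swapping in `linear_kernel` recovers ordinary Euclidean k-nearest neighbors.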
4 [2 points] Support Vector Machine

1. Recall the formulation of the soft-margin (linear) SVM:

$$\begin{aligned}
\underset{w,\, b,\, \xi}{\operatorname{argmin}} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, n \\
& \xi_i \ge 0, \quad i = 1, \ldots, n
\end{aligned} \tag{15}$$

In lecture, the support vector machine was introduced geometrically as finding the max-margin classifier. While this geometric interpretation provides useful intuition about how the SVM works, it is hard to relate to other machine learning algorithms such as logistic regression. In this exercise, we show that the soft-margin SVM is equivalent to minimizing a loss function (specifically, the hinge loss) with L2 regularization, which connects it to logistic regression and to the goal of binary classification. The hinge loss is defined as $V(y, f(x)) = (1 - y f(x))_+$, where $(s)_+ = \max(s, 0)$. Show that

$$\underset{w,\, b}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|w\|_2^2 \right\} \tag{16}$$

is equivalent to formulation (15) for some $C$, where $f(x) = w^T x + b$. What is the corresponding $C$ (in terms of $n$ and $\lambda$)?

2. In the previous question, we chose $V(y, f(x)) = (1 - y f(x))_+$ (the hinge loss) as our loss function; however, there are other reasonable loss functions that we can choose. For example, we can choose

$$V(y, f(x)) = \frac{1}{\log(2)} \log\left( 1 + e^{-y f(x)} \right)$$

which is usually called the logistic loss, and

$$V(y, f(x)) = \begin{cases} 0, & y f(x) \ge 0 \\ 1, & y f(x) < 0 \end{cases}$$

which is called the 0-1 loss.

Please plot the above three loss functions in one figure, with $y f(x)$ as the horizontal axis and $V(y, f(x))$ as the vertical axis. Explain your observations.

3. [Bonus] Answer the following questions as precisely as you can. What is (16) if we choose $V(y, f(x)) = \frac{1}{\log(2)} \log\left( 1 + e^{-y f(x)} \right)$? What is (16) if we choose

$$V(y, f(x)) = \begin{cases} 0, & y f(x) \ge 0 \\ 1, & y f(x) < 0 \end{cases} \; ?$$

(Long answers receive no score.)
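As a starting point for the figure in item 2, the three losses can be evaluated on a grid of margins $m = y f(x)$. The sketch below only tabulates the values; the function names and the grid are choices made for this illustration, and the resulting lists can be handed to whichever plotting library is preferred.

```python
import math

def hinge_loss(m):
    """Hinge loss (1 - m)_+ as a function of the margin m = y f(x)."""
    return max(1.0 - m, 0.0)

def logistic_loss(m):
    """Logistic loss log(1 + exp(-m)) / log(2).

    For negative m the expression is rewritten as
    (-m + log(1 + exp(m))) / log(2) to avoid overflow in exp.
    """
    if m >= 0:
        return math.log1p(math.exp(-m)) / math.log(2.0)
    return (-m + math.log1p(math.exp(m))) / math.log(2.0)

def zero_one_loss(m):
    """0-1 loss: 0 when m >= 0, 1 when m < 0."""
    return 0.0 if m >= 0 else 1.0

# Margins from -2.0 to 2.0 in steps of 0.5 (an arbitrary illustrative grid).
margins = [x / 2.0 for x in range(-4, 5)]
print("margin   hinge  logistic  zero-one")
for m in margins:
    print(f"{m:6.1f}  {hinge_loss(m):6.3f}  {logistic_loss(m):8.3f}  {zero_one_loss(m):8.1f}")
```

The $1/\log(2)$ factor in the logistic loss makes it equal to 1 at $y f(x) = 0$, the same value the hinge loss takes there, which makes the curves easier to compare on one set of axes.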