Introduction to Artificial Intelligence, 2018 — Part III
Intro. on Artificial Intelligence from the perspective of probability theory
Luo Zhiling (罗智凌), luozhiling@zju.edu.cn
College of Computer Science, Zhejiang University
http://www.bruceluo.net
OUTLINE
• Strategies
• Algorithm
• Applications
Strategies
• Loss in the objective function
• 0-1 loss
• Quadratic loss
• Absolute loss
• Logarithmic loss (log-likelihood loss)
  – MLE
  – MAP
Generative/Discriminative Model
• Generating procedure:
  – P5 ~ b(α)
  – P30 ~ b(β)
  – θ ~ Multi(P5, P30, γ)
  – G ~ b(θ)
• Probability table:
  P5  P30  G   Prob
  Y   Y    Y   0.173
  Y   Y    N   0.075
  Y   N    Y   0.116
  Y   N    N   0.121
  N   Y    Y   0.075
  N   Y    N   0.127
  N   N    Y   0.179
  N   N    N   0.133
• [figure: graphical model with α → P5, β → P30, (P5, P30, γ) → θ, θ → G]
• Generative model: P(G, P5, P30 | α, β, γ), i.e. it models the full joint P(P5, P30, G, α, β, γ)
• Discriminative model: P(G | P5, P30, α, β, γ)
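A small sketch of the generating procedure above. The slide does not spell out the semantics of P5, P30, or G, so the parameter values and the support of the latent θ below are invented purely to show how such a probability table could arise from the generative story; b(·) is read as a Bernoulli distribution.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(7)

# Illustrative (made-up) parameters: alpha, beta are Bernoulli parameters,
# gamma gives multinomial weights over candidate theta values for each
# (P5, P30) configuration, and G ~ Bernoulli(theta).
alpha, beta = 0.5, 0.4
theta_values = np.array([0.2, 0.5, 0.8])   # support of the latent theta
gamma = {
    (1, 1): [0.1, 0.2, 0.7],
    (1, 0): [0.2, 0.6, 0.2],
    (0, 1): [0.6, 0.3, 0.1],
    (0, 0): [0.3, 0.4, 0.3],
}

def generate(n):
    rows = []
    for _ in range(n):
        p5 = int(rng.random() < alpha)                        # P5 ~ b(alpha)
        p30 = int(rng.random() < beta)                        # P30 ~ b(beta)
        theta = rng.choice(theta_values, p=gamma[(p5, p30)])  # theta ~ Multi(P5, P30, gamma)
        g = int(rng.random() < theta)                         # G ~ b(theta)
        rows.append((p5, p30, g))
    return rows

# Empirical joint frequencies of (P5, P30, G) play the role of the table on the slide.
counts = Counter(generate(100_000))
for key in sorted(counts, reverse=True):
    print(key, round(counts[key] / 100_000, 3))
```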
Maximum Likelihood Estimation
• arg max P(G, P5, P30 | α, β, γ)
• P(G, P5, P30 | α, β, γ) = ∫ P(P5 | α) P(P30 | β) P(G | θ) P(θ | P5, P30, γ) dθ
• → arg min −log ( ∫ P(P5 | α) P(P30 | β) P(G | θ) P(θ | P5, P30, γ) dθ )
Maximum A Posteriori (MAP)
P(α, β, γ | P5, P30, G) = P(P5, P30, G, α, β, γ) / P(P5, P30, G)
  = [ ∫ P(P5 | α) P(P30 | β) P(G | θ) P(θ | P5, P30, γ) dθ · Q(α) Q(β) Q(γ) ] / P(P5, P30, G),
where Q(α), Q(β), Q(γ) are the priors on the parameters.
The posterior is a function of α, β, γ and can be written as f(α, β, γ), or f(α, β, γ; P5, P30, G).
Maximum A Posteriori (MAP)
• Log-likelihood loss: −log f(α, β, γ)
• Regularization (optional): λ (‖α‖ + ‖β‖ + ‖γ‖)
• Loss function (objective function):
  l = −log f(α, β, γ) + λ (‖α‖ + ‖β‖ + ‖γ‖)
  α*, β*, γ* = arg min l
• Solve with existing methods and tools such as stochastic gradient descent or hill climbing (MATLAB, Python); a minimal sketch follows.
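As a purely illustrative sketch of the recipe above: a single-parameter Bernoulli likelihood with an L2 penalty standing in for the prior term, minimized with a generic optimizer. The data, the penalty weight, and the sigmoid parameterization are assumptions made for the example, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up binary observations and regularization weight (lambda in the slides).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
lam = 0.1

def neg_log_posterior(w):
    # Unconstrained weight -> probability via the sigmoid, to keep 0 < p < 1.
    p = 1.0 / (1.0 + np.exp(-w[0]))
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    # Negative log-likelihood plus the regularization (prior) term.
    return -log_lik + lam * w[0] ** 2

# Any off-the-shelf solver works here; SGD or hill climbing would do as well.
result = minimize(neg_log_posterior, x0=np.array([0.0]))
print("estimated probability:", 1.0 / (1.0 + np.exp(-result.x[0])))
```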
MLE vs MAP
• MLE: arg max ∫ P(P5 | α) P(P30 | β) P(G | θ) P(θ | P5, P30, γ) dθ
• MAP: arg max ∫ P(P5 | α) P(P30 | β) P(G | θ) P(θ | P5, P30, γ) dθ · Q(α) Q(β) Q(γ)
  – Q(α) Q(β) Q(γ): the prior on the parameters
Understand LDA with MLE
Generative vs Discriminative Models
OUTLINE
• Strategies
• Algorithm
  – Gradient Descent (GD)
  – EM algorithm
  – Sampling algorithms
• Applications
Gradient Descent
Batch/Stochastic Gradient
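A minimal contrast between the two flavors on a made-up least-squares problem (data and step sizes are illustrative): batch gradient descent uses the gradient over the entire data set at every step, while stochastic gradient descent updates from one randomly chosen example at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linear-regression data: y = 2x + 1 + noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=200)
A = np.hstack([X, np.ones((200, 1))])   # add a bias column

def batch_gd(steps=500, lr=0.1):
    w = np.zeros(2)
    for _ in range(steps):
        grad = A.T @ (A @ w - y) / len(y)   # gradient over the full data set
        w -= lr * grad
    return w

def sgd(steps=5000, lr=0.05):
    w = np.zeros(2)
    for _ in range(steps):
        i = rng.integers(len(y))            # one example per update
        grad = (A[i] @ w - y[i]) * A[i]
        w -= lr * grad
    return w

print("batch GD:", batch_gd())   # both should approach [2, 1]
print("SGD:     ", sgd())
```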
Advanced Variants
• Momentum SGD
• Adagrad
  – Large learning rates for low-frequency parameters, small learning rates for high-frequency ones.
• Adadelta
  – An improvement on Adagrad: replaces the global sum of squared gradients with a local (windowed) one.
• Adam
  – Similar to Adagrad; adds the second moment of the gradients, making it more stable.
A minimal sketch of two of these update rules follows.
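The sketch below shows the momentum SGD and Adam update rules on a toy quadratic objective; the hyper-parameter values and the objective are illustrative, not from the slides.

```python
import numpy as np

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
def grad(w):
    return w

def momentum_sgd(w, steps=100, lr=0.1, beta=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)        # accumulate velocity
        w = w - lr * v                # move along the velocity
    return w

def adam(w, steps=100, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)              # first-moment estimate
    v = np.zeros_like(w)              # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

print(momentum_sgd(np.array([5.0, -3.0])))
print(adam(np.array([5.0, -3.0])))
```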
Expectation-Maximization algorithm
• Given a statistical model that generates a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data, L(θ; X) = p(X | θ) = ∫ p(X, Z | θ) dZ.
• The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:
  – Expectation step (E step): calculate the expected value of the log likelihood function, with respect to the conditional distribution of Z given X under the current estimate of the parameters θ(t):
    Q(θ | θ(t)) = E_{Z | X, θ(t)} [ log L(θ; X, Z) ]
  – Maximization step (M step): find the parameters that maximize this quantity:
    θ(t+1) = arg max_θ Q(θ | θ(t))
Sampling
• Conjugate-distribution-based sampling:
  1. The observation is a stochastic variable x drawn from a distribution φ with parameter μ.
  2. The parameter μ has a known prior distribution f with hyper-parameter ω.
  3. The pair (φ, f) belongs to one of the known conjugate families. For example, φ is a normal distribution and the prior f on its expectation (mean) is also a normal distribution.
• By choosing the prior cleverly according to the form of the conditional probability (likelihood) function, the posterior keeps the same functional form as the prior (a small worked example follows).
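A small worked example of the three conditions above, using the Beta-Bernoulli pair, one of the standard conjugate families; the hyper-parameters and observations below are made up.

```python
# Beta-Bernoulli conjugacy: a Beta(a, b) prior on the Bernoulli parameter
# stays a Beta distribution after observing data, so the posterior can be
# written down (and sampled from) without any numerical integration.
a, b = 2.0, 2.0                        # hyper-parameters of the Beta prior (illustrative)
observations = [1, 0, 1, 1, 1, 0, 1]   # made-up Bernoulli draws

heads = sum(observations)
tails = len(observations) - heads

# Posterior is Beta(a + heads, b + tails): the same family as the prior.
post_a, post_b = a + heads, b + tails
posterior_mean = post_a / (post_a + post_b)
print(f"posterior: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")
```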
Discrete distributions
Conjugate Priors
Conjugate priors
Gibbs Sampling
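A minimal Gibbs-sampling sketch. The target distribution (a correlated bivariate Gaussian) and the correlation value are chosen only because its full conditionals are easy to write down; this illustrates the idea and is not an algorithm taken from the slides.

```python
import numpy as np

# Gibbs sampling for a 2-D standard Gaussian with correlation rho: each step
# draws one coordinate from its exact conditional given the other.
rho = 0.8
n_samples = 5000
x, y = 0.0, 0.0
samples = []

rng = np.random.default_rng(0)
for _ in range(n_samples):
    # For this target, p(x | y) and p(y | x) are 1-D Gaussians.
    x = rng.normal(loc=rho * y, scale=np.sqrt(1 - rho ** 2))
    y = rng.normal(loc=rho * x, scale=np.sqrt(1 - rho ** 2))
    samples.append((x, y))

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])
```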
OUTLINE
• About AI
• Preliminaries about Bayesian
• Generative/Discriminative Model
• Applications
  – Markov Model
  – Markov Network
  – Neural Network
Markov Rule
• A discrete-time Markov chain is a sequence of random variables X1, X2, X3, ... with the Markov property, namely that the probability of moving to the next state depends only on the present state and not on the previous states.
• First-order Markov and p-order Markov.
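A quick sketch of the Markov property in code: a two-state discrete-time chain where each transition depends only on the current state. The transition matrix is made up for illustration.

```python
import numpy as np

# First-order Markov chain over two states. P[i, j] = probability of moving
# from state i to state j.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

rng = np.random.default_rng(42)
state = 0
trajectory = [state]
for _ in range(1000):
    # The next state depends only on the current state (the Markov property).
    state = rng.choice(2, p=P[state])
    trajectory.append(state)

# Long-run state frequencies approximate the stationary distribution.
print("empirical state frequencies:", np.bincount(trajectory) / len(trajectory))
```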
Random Field
• Markov Random Field
• Gibbs Random Field
• Conditional Random Field
• Gaussian Random Field
Markov Network
• Markov Chain
  [figure: the example variables (α, β, P5, P30, θ, G) unrolled over time as a chain]
Hidden Markov Model
• Based on the Markov chain rule.
• Applications:
  – Speech recognition
  – Gesture and handwriting recognition
  – Fault detection
Markov Network
• Hidden Markov Model
  [figure: the example model (P5, P30, θ, G) drawn as a hidden Markov model over several time steps]
Markov Random Field
• Information coding
• Population simulation models
  [figure: grid-structured Markov random field over P5, P30, G]
Neural Network
• Latent variable → hidden layer
• Automatic feature mixing (nonlinear mixing)
• Classification / regression
  [figure: P5 and P30 as inputs, θ nodes as the hidden layer, G as the output]
Mixtures of Gaussians
Gaussian mixture distribution
• Definition: p(x) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k)
• Introduce a K-dimensional binary random variable z = (z1, z2, …, zK)^T with a 1-of-K representation (the latent variable), where p(z_k = 1) = π_k.
• If z_k = 1, then p(x | z) = N(x | μ_k, Σ_k).
• Equivalent formulation of the Gaussian mixture: p(x) = ∑_z p(z) p(x | z) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k)
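A small sketch of the latent-variable formulation above: ancestral sampling from a one-dimensional mixture by first drawing the 1-of-K variable z and then x given z. The weights, means, and variances below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 1-D mixture: weights pi_k, means mu_k, standard deviations.
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

# Ancestral sampling: first draw the 1-of-K latent variable z, then x | z.
def sample_gmm(n):
    ks = rng.choice(len(pi), size=n, p=pi)   # z_k = 1 with probability pi_k
    return rng.normal(loc=mu[ks], scale=sigma[ks]), ks

x, ks = sample_gmm(10_000)
print("component frequencies:", np.bincount(ks) / len(ks))
print("sample mean:", x.mean())   # close to sum_k pi_k * mu_k = 1.5
```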
Gaussian mixture distribution
• Responsibility: γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x | μ_j, Σ_j)
Gaussian mixture distribution
The difficulty of estimating parameters in GMM by ML
• The log of the likelihood function of GMM: ln p(X | π, μ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }
• Issue #1: singularities
  – A component collapses onto a specific data point.
• Issue #2: identifiability
  – There are K! equivalent solutions in total.
• Issue #3: no closed-form solution
  – The derivatives of the log likelihood are complex.
Expectation-Maximization algorithm for GMM
• E step: evaluate the responsibilities (weighting factors) using the current parameter values:
  γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x_n | μ_j, Σ_j)
• M step: re-estimate the parameters using the responsibilities, with N_k = ∑_{n=1}^{N} γ(z_nk):
  – Solve μ_k: μ_k = N_k^{-1} ∑_{n=1}^{N} γ(z_nk) x_n
  – Solve Σ_k: Σ_k = N_k^{-1} ∑_{n=1}^{N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T
  – Solve π_k: π_k = N_k / N
• Each iteration will increase the log likelihood function.
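A compact sketch of the E/M updates above for a one-dimensional mixture of two Gaussians; the synthetic data, number of iterations, and initialization are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D data drawn from two Gaussians, used only to exercise the loop.
data = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(3.0, 1.0, 350)])

K = 2
pi = np.full(K, 1.0 / K)              # mixing coefficients
mu = rng.choice(data, size=K)         # initial means
var = np.full(K, data.var())          # initial variances

def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n).
    dens = np.stack([pi[k] * normal_pdf(data, mu[k], var[k]) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate parameters from the responsibility-weighted data.
    Nk = gamma.sum(axis=0)
    mu = (gamma * data[:, None]).sum(axis=0) / Nk
    var = (gamma * (data[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(data)

print("weights:", pi, "means:", mu, "variances:", var)
```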
Expectation-Maximization algorithm for GMM
EM algorithm for GMM: experiment
• The Old Faithful data set
EM algorithm for GMM: experiment
• Illustration of the EM algorithm using the Old Faithful data set, as also used to illustrate the K-means algorithm.
Luo Zhiling (罗智凌)
luozhiling@zju.edu.cn
http://www.bruceluo.net