Recent Advances in Learning SPARSE Structured I/O Models: models, algorithms, and applications

Eric Xing (epxing@cs.cmu.edu)
Machine Learning Dept. / Language Technology Inst. / Computer Science Dept.
Carnegie Mellon University
VLPR 2009 @ Beijing, China, 8/6/2009

Structured Prediction Problem
• Unstructured prediction
• Structured prediction
  – Part-of-speech tagging: "Do you want sugar in it?" ⇒ <verb pron verb noun prep pron>
  – Image segmentation
Classical Predictive Models
• Inputs:
  – a set of training samples D = {(x_i, y_i)}, i = 1..N, where x_i ∈ X and y_i ∈ Y
• Outputs:
  – a predictive function h: X → Y
• Examples:
  – Logistic Regression, Bayes classifiers: max-likelihood estimation
  – Support Vector Machines (SVM): max-margin learning
• Advantages of max-likelihood estimation:
  1. Full probabilistic semantics
  2. Straightforward Bayesian or direct regularization
  3. Hidden structures or generative hierarchy
• Advantages of max-margin learning:
  1. Dual sparsity: few support vectors
  2. Kernel tricks
  3. Strong empirical results

Structured Prediction Models
• Conditional Random Fields (CRFs) (Lafferty et al., 2001)
  – Based on logistic regression
  – Max-likelihood estimation (point estimate)
• Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)
  – Based on SVM
  – Max-margin learning (point estimate)
• Challenges:
  – SPARSE prediction model
  – Prior information on structures
  – Scalable to large-scale problems (e.g., 10^4 input/output dimensions)
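For concreteness, the CRF model contrasted above can be written as follows; this is the standard form, with the feature map f(x, y) and regularization constant λ as generic notation (not taken from the slides):

    p(y \mid x; w) \;=\; \frac{\exp\{w^\top f(x, y)\}}{\sum_{y'} \exp\{w^\top f(x, y')\}},
    \qquad
    \hat{w}_{\mathrm{CRF}} \;=\; \arg\max_w\ \sum_i \log p(y_i \mid x_i; w) \;-\; \lambda \|w\|_2^2

M3Ns keep the same linear score w^T f(x, y) but replace the log-likelihood with a structured large-margin criterion; the standard primal and dual forms are sketched after the M3N slide below.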
Outline
• Structured sparse regression
  – Graph-guided fused lasso: unlinked SNPs to trait networks (Kim and Xing, PLoS Genetics)
  – Temporally-smoothed graph regression: learning time-varying graphs (Ahmed and Xing, PNAS 2009; Kolar and Xing, under review, Annals of Statistics)
• Maximum entropy discrimination Markov networks
  – General theorems (Zhu and Xing, JMLR submitted)
  – Gaussian MEDN: reduction to M3N (Zhu, Xing and Zhang, ICML 08)
  – Laplace MEDN: a sparse M3N (Zhu, Xing and Zhang, ICML 08)
  – Partially observed MEDN (Zhu, Xing and Zhang, NIPS 08)
  – Max-margin/max-entropy topic model (Zhu, Ahmed, and Xing, ICML 09)

Max-Margin Learning Paradigms
• SVM → M3N: max-margin learning extended to structured outputs (e.g., labeling the handwritten word "brace")
• MED → MED-MN = SMED + "Bayesian" M3N
• MED-MN is both primal and dual sparse!
Primal and Dual Problems of M3Ns
• Primal problem (standard form sketched below)
  – Algorithms: cutting plane, sub-gradient, ...
• Dual problem
  – Algorithms: SMO, exponentiated gradient, ...
• So, M3N is dual sparse!

MLE versus Max-Margin Learning
• Likelihood-based estimation
  – Probabilistic (joint/conditional likelihood model)
  – Easy to perform Bayesian learning, and to incorporate prior knowledge, latent structures, missing data
  – Bayesian regularization!!
• Max-margin learning
  – Non-probabilistic (concentrates on the input-output mapping)
  – Not obvious how to perform Bayesian learning, consider priors, or handle missing data
  – Sound theoretical guarantees with limited samples

Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999)
• Model averaging
• The optimization problem (binary classification): sketched below
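The equations on these slides were figures; as a reference, the standard formulations are sketched here in my own notation, following Taskar et al. (2003) for M3N and a common fixed-margin-with-slack simplification of Jaakkola et al. (1999) for MED (the exact forms on the original slides may differ in scaling details):

    % M3N primal (margin-rescaled structured hinge), with
    % \Delta f_i(y) = f(x_i, y_i) - f(x_i, y) and structured loss \Delta\ell_i(y):
    \min_{w,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_i \xi_i
    \quad\text{s.t.}\quad
    w^\top \Delta f_i(y) \;\ge\; \Delta\ell_i(y) - \xi_i, \qquad \forall i,\ \forall y

    % M3N dual: one multiplier \alpha_i(y) per constraint; most are zero at the
    % optimum, which is the "dual sparsity" noted on the slide:
    \max_{\alpha \ge 0}\ \sum_{i,y}\alpha_i(y)\,\Delta\ell_i(y)
      \;-\; \tfrac{1}{2}\Big\|\sum_{i,y}\alpha_i(y)\,\Delta f_i(y)\Big\|_2^2
    \quad\text{s.t.}\quad \sum_y \alpha_i(y) = C, \qquad \forall i

    % MED (binary classification): learn a distribution p(w) instead of a point estimate:
    \min_{p(w),\,\xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + C\sum_i \xi_i
    \quad\text{s.t.}\quad
    \int p(w)\,y_i\,F(x_i; w)\,dw \;\ge\; 1 - \xi_i,\ \ \xi_i \ge 0, \qquad \forall i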
MaxEnt Discrimination Markov Network
• Structured MaxEnt Discrimination (SMED): learn a distribution over weights rather than a single weight vector (formulation sketched below)
• Feasible subspace of weight distributions: those satisfying the expected margin constraints
• Prediction: average over the distribution p of M3Ns

Solution to MaxEnDNet
• Theorem 1
  – Posterior distribution: the prior re-weighted by an exponential factor of the margin constraints (see below)
  – Dual optimization problem: over the multipliers α_i(y) attached to those constraints
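A sketch of the SMED program and the form of its solution, following the MaxEnDNet papers of Zhu and Xing; the slack penalty U(ξ) is written generically and the notation is mine rather than the slides':

    % SMED / MaxEnDNet: optimize over distributions p(w) under expected margin constraints,
    % where \Delta F_i(y; w) = F(x_i, y_i; w) - F(x_i, y; w) is the score gap:
    \min_{p(w),\,\xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
    \quad\text{s.t.}\quad
    \int p(w)\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\,dw \;\ge\; -\xi_i,
    \qquad \forall i,\ \forall y \ne y_i

    % Theorem 1 (form of the solution): with dual variables \alpha_i(y),
    p(w) \;=\; \frac{1}{Z(\alpha)}\; p_0(w)\,
    \exp\Big\{\sum_{i,y}\alpha_i(y)\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\Big\}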
Gaussian MaxEnDNet (Reduction to M3N)
• Theorem 2: assume a standard normal prior, p_0(w) = N(0, I)
  – Posterior distribution: a Gaussian centered at the M3N solution (sketched after the next slide)
  – Dual optimization: identical to the M3N dual
  – Predictive rule: identical to the M3N prediction rule
• Thus, MaxEnDNet subsumes M3Ns and admits all the merits of max-margin learning
• Furthermore, MaxEnDNet has at least three advantages ...

Three Advantages
• An averaging model: PAC-Bayes prediction error guarantee
• Entropy regularization: introducing useful biases
  – Standard normal prior => reduction to the standard M3N (we've seen it)
  – Laplace prior => posterior shrinkage effects (a sparse M3N)
• Integrating generative and discriminative principles
  – Incorporate latent variables and structures (PoMEN)
  – Semi-supervised learning (with partially labeled data)
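The Gaussian case of Theorem 2, in the same generic notation as above (a sketch of the published result, not a transcription of the slide):

    % With prior p_0(w) = N(0, I), the MaxEnDNet posterior is Gaussian,
    p(w) \;=\; \mathcal{N}(w \mid \mu,\, I),
    \qquad
    \mu \;=\; \sum_{i,y} \alpha_i(y)\,\Delta f_i(y),

    % the dual problem over \alpha is exactly the M3N dual, and prediction averages to
    h(x) \;=\; \arg\max_y\ \mathbb{E}_{p(w)}\big[w^\top f(x, y)\big]
           \;=\; \arg\max_y\ \mu^\top f(x, y)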
I: Generalization Guarantee
• MaxEnDNet is an averaging model
• Theorem 3 (PAC-Bayes bound): a high-probability bound on the expected structured error of the averaging predictor in terms of its empirical margin error plus a complexity term

II: Laplace MaxEnDNet (Primal Sparse M3N)
• Laplace prior (sketched below)
• Corollary 4: under a Laplace MaxEnDNet, the posterior mean of the parameter vector w is shrunk towards zero
• The Gaussian MaxEnDNet and the regular M3N have no such shrinkage; there, the posterior mean is the unshrunk M3N solution
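The Laplace prior referenced above, in a parameterization commonly used for LapMEDN (λ is the sparsity hyperparameter; assumed notation, not transcribed from the slide):

    p_0(w) \;=\; \prod_{k=1}^{K} \frac{\sqrt{\lambda}}{2}\,
    \exp\!\big(-\sqrt{\lambda}\,|w_k|\big)
    \;\;\propto\;\; \exp\!\big(-\sqrt{\lambda}\,\|w\|_1\big)

Its sharp peak at zero, like an L1 penalty, is what produces the shrinkage of the posterior mean noted in Corollary 4.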
Lap MEDN vs. L2 and L1 Regularization
• Corollary 5: LapMEDN corresponds to solving a primal optimization problem in which the usual L2 (M3N) or L1 penalty is replaced by a KL-norm penalty
• Figure: comparison of the L1 and L2 norms with the KL norm

Variational Learning of Lap MEDN
• The exact dual function is hard to optimize
• Using the hierarchical (scale-mixture) representation of the Laplace prior (identity sketched below), we obtain an upper bound
• We optimize this upper bound
• Why is it easier?
  – Alternating minimization leads to nicer optimization problems
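The hierarchical representation mentioned above is, I believe, the standard Gaussian scale-mixture identity for the Laplace density, with an exponential mixing density over a per-component variance τ_k:

    \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|}
    \;=\; \int_0^{\infty} \mathcal{N}\big(w_k \mid 0,\, \tau_k\big)\;
    \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\; d\tau_k

Alternating between the update of p(w) (a Gaussian-prior problem, as in the M3N reduction) and the variational terms over τ then yields the easier subproblems referred to on the slide.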
Experimental Results on OCR Datasets
• We randomly construct OCR100, OCR150, OCR200, and OCR250 subsets for 10-fold CV.

Feature Selection
Sensitivity to Regularization Constants
• L1-CRF and L2-CRF: regularization constants 0.001, 0.01, 0.1, 1, 4, 9, 16
• M3N and LapM3N: regularization constants 1, 4, 9, 16, 25, 36, 49, 64, 81
• L1-CRFs are much more sensitive to the regularization constant; the other models are more stable
• LapM3N is the most stable of all

III: Latent Hierarchical MaxEnDNet
• Web data extraction
  – Goal: extract Name, Image, Price, Description, etc., from a given data record
• Hierarchical labeling (figure): a record is labeled {Head} {Info Block} {Tail}; the info block splits into {Repeat block} and {Note} nodes, whose leaves carry attribute labels such as {image}, {name}, {price}, {desc}
• Advantages:
  o Computational efficiency
  o Long-range dependency
  o Joint extraction
Partially Observed MaxEnDNet (PoMEN)
• Now we are given partially labeled data: part of each structured label is observed and the rest is hidden
• PoMEN: learn a joint distribution over the weights and the hidden labels
• Prediction: average over both the weight distribution and the hidden labels

Alternating Minimization Algorithm
• Factorization assumption: the distribution over the weights w and the hidden labels z factorizes (sketched below)
• Alternating minimization:
  – Step 1: keep p(z) fixed, optimize over p(w)
    o Normal prior: an M3N problem (QP)
    o Laplace prior: a Laplace M3N problem (VB)
  – Step 2: keep p(w) fixed, optimize over p(z)
    o Equivalently reduced to an LP with a polynomial number of constraints
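A rough sketch of the two alternating steps under the factorization assumption; the notation (hidden part z_i of the i-th label, expected score gap) is mine, and the exact PoMEN constraint set may differ from the slides:

    % Factorization assumption
    p(w, \{z\}) \;=\; p(w)\,\prod_i p(z_i)

    % Step 1 (p(z) fixed): a MaxEnDNet-style problem for p(w) with features averaged over z,
    \min_{p(w),\,\xi}\ \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
    \quad\text{s.t.}\quad
    \mathbb{E}_{p(w)\,p(z_i)}\big[\Delta F_i(y, z_i; w)\big] \;\ge\; \Delta\ell_i(y) - \xi_i,
    \qquad \forall i,\ \forall y

    % Step 2 (p(w) fixed): the same constraints are linear in the p(z_i), so the update
    % of the p(z_i) reduces to a linear program, as stated on the slide.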
Record-Level Evaluations
• Overall performance:
  – Avg F1: average F1 over all attributes
  – Block instance accuracy: % of records whose Name, Image, and Price are all correct
• Attribute performance

IV: Max-Margin/Max-Entropy Topic Model (MED-LDA)
(Images from images.google.cn)
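For reference, the per-attribute F1 being averaged above is the usual harmonic mean of precision and recall (the standard definition, not something specific to these slides):

    F_1 \;=\; \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}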
LDA: A Generative Story for Documents
• Bag-of-words representation of documents
• Each word is generated by ONE topic
• Each document is a random mixture over topics
• Example:
  – Topic #1: image, jpg, gif, file, color, file, images, files, format
  – Topic #2: ground, wire, power, wiring, current, circuit
  – Document #1: gif jpg image current file color images ground power file current format file formats circuit gif images
  – Document #2: wire currents file format ground power image format wire circuit current wiring ground circuit images files ...
• Bayesian approach (LDA): the topics are the mixture components, with a Dirichlet prior on the mixture weights

LDA: Latent Dirichlet Allocation (Blei et al., 2003)
• Generative procedure (a runnable sketch follows below):
  – For each document d:
    o Sample a topic proportion
    o For each word:
      – Sample a topic
      – Sample a word
• Joint distribution: exact inference is intractable!
• Variational inference with a factorized variational distribution q:
  – Minimize the variational bound to estimate parameters and infer the posterior distribution
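To make the generative procedure concrete, here is a minimal sketch of LDA's generative story in Python/NumPy. All of the numbers (K topics, vocabulary size V, the Dirichlet hyperparameter alpha, and the topic-word matrix beta) are illustrative assumptions, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    K = 2          # number of topics (assumed for illustration)
    V = 8          # vocabulary size (assumed)
    alpha = 0.5    # symmetric Dirichlet hyperparameter (assumed)

    # beta[k] is the word distribution of topic k (each row sums to 1); assumed values
    beta = rng.dirichlet(np.ones(V), size=K)

    def generate_document(n_words):
        """LDA generative story for one document."""
        theta = rng.dirichlet(alpha * np.ones(K))   # sample a topic proportion
        words = []
        for _ in range(n_words):
            z = rng.choice(K, p=theta)              # sample a topic for this word
            w = rng.choice(V, p=beta[z])            # sample a word from that topic
            words.append(w)
        return theta, words

    theta, doc = generate_document(16)
    print("topic proportion:", np.round(theta, 2))
    print("word ids:", doc)

Inverting this story, i.e., inferring theta and the per-word topics z from the observed words, is the intractable posterior inference the slide refers to, which motivates the variational approximation.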