Computationally Efficient M-Estimation of Log-Linear Structure Models
Noah Smith, Doug Vail, and John Lafferty
School of Computer Science, Carnegie Mellon University
{nasmith,dvail2,lafferty}@cs.cmu.edu
Sketch of the Talk
A new loss function for supervised structured classification with arbitrary features:
• Fast & easy to train - no partition functions!
• Consistent estimator of the joint distribution
• Information-theoretic interpretation
• Some practical issues
• Speed & accuracy comparison
Log-Linear Models as Classifiers
Distribution: p_w(x, y) = exp(w · f(x, y)) / Z(w), where x is the input, y is the output, w are the parameters, w · f(x, y) is the dot product of weights and features, and Z(w) is the partition function.
Classification: ŷ = argmax_y w · f(x, y), i.e., maximize the score via dynamic programming, search, discrete optimization, etc.
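For illustration, a minimal sketch (not from the talk) of scoring and classification with a log-linear model. The candidate set, `feature_fn`, and sparse-dict feature representation are assumptions for readability; in structured prediction the argmax is computed by dynamic programming or search rather than enumeration.

```python
# Minimal sketch: scoring and classification in a log-linear model,
# assuming a hypothetical feature_fn(x, y) that returns a sparse
# {feature_name: value} dict and a small explicit candidate set.

def score(w, features):
    """Dot product w . f(x, y) over a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in features.items())

def classify(w, x, candidates, feature_fn):
    """Return the candidate output y with the highest score.
    (For structured outputs, replace enumeration with DP or search.)"""
    return max(candidates, key=lambda y: score(w, feature_fn(x, y)))
```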
Training Log-Linear Models
Maximum likelihood estimation: the pain is the partition function Z(w), which requires summing over all inputs and outputs.
Also, discriminative alternatives:
• conditional random fields (x-wise partition functions)
• maximum margin training (decoding during training)
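For context, the standard reason joint MLE is expensive (stated here for reference, not reconstructed from the slide image): both the log-likelihood and its gradient involve Z(w) and expectations under the model.

```latex
\log \mathcal{L}(\mathbf{w})
  = \sum_{i=1}^{n} \mathbf{w}^{\top}\mathbf{f}(x_i, y_i) \;-\; n \log Z(\mathbf{w}),
\qquad
\frac{\partial \log \mathcal{L}}{\partial w_j}
  = \sum_{i=1}^{n} f_j(x_i, y_i) \;-\; n\,\mathbb{E}_{p_{\mathbf{w}}}[f_j].
```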
Notational Variant
p_w(x, y) = q0(x, y) exp(w · f(x, y)) / Z(w), where q0 is “some other” (base) distribution.
Still log-linear.
Jeon and Lin (2006)
A new loss function for training:
ℓ(w) = (1/n) Σ_i exp(−w · f(x_i, y_i)) + E_{q0}[w · f(X, Y)]
The first term sums exponentiated, negated dot-product scores at the training examples; the second is an expectation under the base distribution q0.
Attractive Properties of the M-Estimator
• Computationally efficient.
• Convex: exp of an affine function of w is convex, the E_{q0}[w · f] term is linear in w, and a sum of convex and linear terms is convex.
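Spelling out the convexity argument with the loss as reconstructed above (a sketch of the decomposition, not the talk's own slide):

```latex
\ell(\mathbf{w})
  = \underbrace{\frac{1}{n}\sum_{i=1}^{n}
      \exp\!\big(-\mathbf{w}^{\top}\mathbf{f}(x_i, y_i)\big)}_{\exp \,\circ\, \text{affine: convex}}
  \;+\;
  \underbrace{\mathbf{w}^{\top}\,\mathbb{E}_{q_0}[\mathbf{f}(X, Y)]}_{\text{linear in } \mathbf{w}}.
```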
Statistical Consistency
• If the data were drawn from some distribution in the given family, parameterized by w*, then the estimate converges to w* as the amount of training data grows.
• True of MLE, pseudolikelihood, and the M-estimator.
  – Conditional likelihood is consistent for the conditional distribution.
Information-Theoretic Interpretation
• True model: p*
• A perturbation is applied to p*, resulting in q0
• Goal: recover the true distribution by correcting the perturbation.
[diagram contrasting how MLE and the Jeon & Lin ’06 estimator recover p*]
Minimizing KL Divergence
So far …
• An alternative objective function for log-linear models:
  – efficient to compute
  – convex and differentiable
  – easy to implement
  – consistent
• An interesting information-theoretic motivation.
Next …
• Practical issues
• Experiments
q0 Desiderata
• Fast to estimate
• Smooth
• Straightforward calculation of E_{q0}[f]
Here: a smoothed HMM.
  – See the paper for details on computing E_{q0}[f]: it reduces to solving a linear system.
In general, one can sample from q0 to estimate E_{q0}[f].
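As a concrete illustration of the "sample from q0" fallback (not the paper's linear-system computation), a minimal Monte Carlo sketch; `sample_q0` and `feature_fn` are hypothetical stand-ins for a sampler over (x, y) pairs and the feature function.

```python
# Monte Carlo estimate of E_{q0}[f], assuming a hypothetical sample_q0()
# that draws (x, y) pairs from the base model (e.g., forward sampling
# from an HMM) and a feature_fn returning sparse dicts.
from collections import defaultdict

def estimate_q0_feature_expectations(sample_q0, feature_fn, num_samples=100000):
    totals = defaultdict(float)
    for _ in range(num_samples):
        x, y = sample_q0()
        for name, value in feature_fn(x, y).items():
            totals[name] += value
    return {name: total / num_samples for name, total in totals.items()}
```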
Optimization
Quasi-Newton methods (L-BFGS, CG) can be used. The gradient is
∇ℓ(w) = −(1/n) Σ_i f(x_i, y_i) exp(−w · f(x_i, y_i)) + E_{q0}[f].
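A minimal training-loop sketch under the reconstruction above, assuming dense per-example feature vectors stacked in a matrix `F`, a precomputed vector `Eq0_f` of base-model feature expectations, and the quadratic penalty discussed on the next slide; a real implementation would use sparse features.

```python
# Sketch of M-estimation training with L-BFGS (scipy), assuming:
#   F      : (n, d) numpy array, F[i] = f(x_i, y_i) for training example i
#   Eq0_f  : (d,) numpy array, E_{q0}[f] precomputed (linear system or sampling)
#   c      : regularization constant (variance of the Gaussian prior)
import numpy as np
from scipy.optimize import minimize

def m_est_loss_and_grad(w, F, Eq0_f, c):
    s = np.exp(-F.dot(w))                         # exp(-w . f(x_i, y_i)) per example
    loss = s.mean() + w.dot(Eq0_f) + w.dot(w) / (2.0 * c)
    grad = -(F * s[:, None]).mean(axis=0) + Eq0_f + w / c
    return loss, grad

def train(F, Eq0_f, c=1.0):
    w0 = np.zeros(F.shape[1])
    result = minimize(m_est_loss_and_grad, w0, args=(F, Eq0_f, c),
                      jac=True, method="L-BFGS-B")
    return result.x
```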
Regularization
Problem: if we estimate E_{q0}[f_j] = 0 for some feature j, then w_j will diverge (the objective has no finite minimizer in that coordinate).
Fix: a quadratic regularizer, which can be interpreted as a 0-mean, c-variance, diagonal Gaussian prior on w; the maximum a posteriori analog for the M-estimator.
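Putting the pieces together, the regularized objective implied by the slides; the overall form is a reconstruction rather than a copy of the talk's equation, and the 1/(2c) constant follows from a 0-mean, c-variance Gaussian prior.

```latex
\ell_{c}(\mathbf{w})
  = \frac{1}{n}\sum_{i=1}^{n} \exp\!\big(-\mathbf{w}^{\top}\mathbf{f}(x_i, y_i)\big)
  \;+\; \mathbf{w}^{\top}\,\mathbb{E}_{q_0}[\mathbf{f}]
  \;+\; \frac{\lVert \mathbf{w} \rVert_2^2}{2c}.
```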
Experiments
• Data: CoNLL-2000 shallow parsing dataset
• Task: NP-chunking (via B-I-O labeling)
• Baseline / q0: smoothed MLE trigram HMM; each B-I-O label emits the word and its POS tag separately
• Quadratic regularization for the log-linear models, with c selected on held-out data
B-I-O Example
Profits/NNS/B  of/IN/O  franchises/NNS/B  have/VB/O  n’t/RB/O  been/VBN/O  higher/JJR/O  since/IN/O  the/DT/B  mid-1970s/NNS/I
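To make the labeling scheme concrete, a small helper sketch (not part of the talk) that recovers NP chunks from a B-I-O sequence, applied to the example sentence above.

```python
# Recover chunk spans from a B-I-O label sequence.
def bio_to_chunks(labels):
    """Return (start, end) index pairs (end exclusive) for each chunk."""
    chunks, start = [], None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                chunks.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                chunks.append((start, i))
            start = None
        # label == "I": continue the current chunk
    if start is not None:
        chunks.append((start, len(labels)))
    return chunks

labels = ["B", "O", "B", "O", "O", "O", "O", "O", "B", "I"]
words = "Profits of franchises have n’t been higher since the mid-1970s".split()
print([words[s:e] for s, e in bio_to_chunks(labels)])
# [['Profits'], ['franchises'], ['the', 'mid-1970s']]
```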
Experiments
Model     time (h:m:s)   precision   recall    F1
HMM          0:00:02        85.6       88.7    87.1
M-est.       1:01:37        88.9       90.4    89.6
MEMM         3:39:52        90.9       92.2    91.5
PL           9:34:52        91.9       91.8    91.8
CRF         64:18:24        94.0       93.7    93.9
Rich features (Sha & Pereira ’03).
Accuracy, Training Time, and c
[plot] Under-regularization hurts.
Generative/Discriminative vs. Features
[plot] More than additive.
18 Minutes Are Not Enough
• See the paper:
  – q0 experiments
  – a negative result: an attempt to “make it discriminative”
• WSJ section 22 dependency parsing:
  – generative baseline / q0 (≈ Klein & Manning ’03)
  – 85.2% → 86.4%
  – 2 million → 3 million features (≈ McDonald et al. ’05)
  – 4 hours of training per value of c
Ongoing & Future Work
• Discriminative training works better but takes longer.
  – Cases where discriminative training may be too expensive:
    • high-complexity inference (parsing)
    • very large n (MT?)
  – Is there an efficient estimator like this for the conditional distribution?
• Hidden variables increase complexity, too.
  – Use the M-estimator for the M step in EM?
  – Is there an efficient estimator like this that handles hidden variables?
Conclusion
• The M-estimator is:
  – fast to train (no partition functions)
  – easy to implement
  – statistically consistent
  – feature-empowered (like CRFs)
  – generative
• A new point on the spectrum of speed/accuracy/expressiveness tradeoffs.
Thanks!
How important is the choice of q0?
• MAP-trained HMM
• Empirical marginal
• Locally uniform model
  – uniform transitions (state diagram: 4 out-arcs each from B and I, 3 out-arcs from O)
  – no temporal effects
  – 0% precision and recall
q0 Experiments
q0                             select c to maximize   precision   recall    F1
baseline HMM (no M-est.)       —                          85.6      88.7    87.1
HMM                            F1                         88.9      90.4    89.6
empirical marginal             F1                         84.4      89.4    86.8
locally uniform transitions    F1                         72.9      57.6    64.3
locally uniform transitions    precision                  84.4      37.7    52.1
Negative Result: Input-Only Features
Idea: make the M-estimator “more discriminative” by including features of the words/tags only.
• Think of the model in two parts: improve the fit of one part by letting the input-only features do more of the “explanatory work” in the other.
→ Virtually no effect.