Statistics and the Scientific Study of Language
Mark Johnson
Brown University
ESSLLI 2005

Statistical revolution in computational linguistics
◮ Speech recognition
◮ Syntactic parsing
◮ Machine translation
What do they have to do with each other?
[Figure: syntactic parse accuracy by year, 1994–2006]

Statistical models in computational linguistics
◮ Supervised learning: structure to be learned is visible
  ◮ speech transcripts, treebank, proposition bank, translation pairs
  ◮ more information than available to a child
  ◮ annotation requires (linguistic) knowledge
  ◮ a more practical method of making information available to a computer than writing a grammar by hand
◮ Unsupervised learning: structure to be learned is hidden
  ◮ alien radio, alien TV

Outline
Why Statistics?
Learning probabilistic context-free grammars
Factoring learning into simpler components
The Janus-faced nature of computational linguistics
Conclusion
Chomsky's "Three Questions"
◮ What constitutes knowledge of language?
  ◮ grammar (universal, language specific)
◮ How is knowledge of language acquired?
  ◮ language acquisition
◮ How is knowledge of language put to use?
  ◮ psycholinguistics
(the last two questions are about inference)

The centrality of inference
◮ "poverty of the stimulus"
  ⇒ innate knowledge of language (universal grammar)
  ⇒ intricate grammar with rich deductive structure
◮ Statistics is the theory of optimal inference in the presence of uncertainty
◮ We can define probability distributions over structured objects
  ⇒ no inherent contradiction between statistical inference and linguistic structure
◮ probabilistic models are declarative
◮ probabilistic models can be systematically combined: P(X, Y) = P(X) P(Y | X)

Questions that statistical models might answer
◮ What information is required to learn language?
◮ How useful are different kinds of information to language learners?
  ◮ Bayesian inference can utilize prior knowledge
  ◮ the prior can encode "soft" markedness preferences and "hard" universal constraints
◮ Are there synergies between different information sources?
  ◮ Does knowledge of phonology or morphology make word segmentation easier?
◮ May provide hints about human language acquisition
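One familiar instance of this combination rule, added here purely as an illustration (it is not on the slide): a speech recognizer combines a separately built language model and acoustic model via the noisy-channel factorization
P(words, acoustics) = P(words) × P(acoustics | words),
so each component model can be designed and estimated independently and then composed declaratively.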
Outline
Why Statistics?
Learning probabilistic context-free grammars
Factoring learning into simpler components
The Janus-faced nature of computational linguistics
Conclusion

Probabilistic Context-Free Grammars
1.0  S → NP VP        1.0  VP → V
0.75 NP → George      0.25 NP → Al
0.6  V → barks        0.4  V → snores

(S (NP George) (VP (V barks)))   P = 1.0 × 0.75 × 1.0 × 0.6 = 0.45
(S (NP Al) (VP (V snores)))      P = 1.0 × 0.25 × 1.0 × 0.4 = 0.1

Estimating PCFGs from visible data
Training trees:
(S (NP rice) (VP grows))   (S (NP rice) (VP grows))   (S (NP corn) (VP grows))

Rule         Count   Rel Freq
S → NP VP    3       1
NP → rice    2       2/3
NP → corn    1       1/3
VP → grows   3       1

P((S (NP rice) (VP grows))) = 2/3     P((S (NP corn) (VP grows))) = 1/3
Relative frequency is the maximum likelihood estimator: it selects the rule
probabilities that maximize the probability of the trees (a code sketch follows below).

Estimating PCFGs from hidden data
◮ Training data consists of strings w alone
◮ Maximum likelihood selects the rule probabilities that maximize the marginal probability of the strings w
◮ Expectation maximization is a way of building hidden-data estimators out of visible-data estimators
  ◮ the parse trees of iteration i are the training data for the rule probabilities at iteration i + 1
◮ Each iteration is guaranteed not to decrease P(w) (but it can get trapped in local minima)
◮ This can be done without enumerating the parses
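As a concrete companion to the visible-data estimator just above, here is a minimal Python sketch (my addition, not code from the course) that computes the relative-frequency estimates for the rice/corn treebank; the nested-tuple tree encoding is just an assumption made for the example.

    from collections import Counter, defaultdict

    # Trees encoded as nested tuples: ("S", ("NP", "rice"), ("VP", "grows")).
    treebank = [
        ("S", ("NP", "rice"), ("VP", "grows")),
        ("S", ("NP", "rice"), ("VP", "grows")),
        ("S", ("NP", "corn"), ("VP", "grows")),
    ]

    def productions(tree):
        """Yield (lhs, rhs) productions for every local tree."""
        label, children = tree[0], tree[1:]
        yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
        for child in children:
            if not isinstance(child, str):
                yield from productions(child)

    counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n

    # Relative frequency = maximum likelihood estimate of each rule probability.
    for (lhs, rhs), n in sorted(counts.items()):
        print(lhs, "->", " ".join(rhs), n / lhs_totals[lhs])
    # NP -> corn 1/3, NP -> rice 2/3, S -> NP VP 1.0, VP -> grows 1.0, as in the table.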
Example: The EM algorithm with a toy PCFG

Initial rule probs:
rule            prob
· · ·           · · ·
VP → V          0.2
VP → V NP       0.2
VP → NP V       0.2
VP → V NP NP    0.2
VP → NP NP V    0.2
· · ·           · · ·
Det → the       0.1
N → the         0.1
V → the         0.1

"English" input:
the dog bites
the dog bites a man
a man gives the dog a bone
· · ·

"pseudo-Japanese" input:
the dog bites
the dog a man bites
a man the dog a bone gives
· · ·

[Figure: rule probabilities re-estimated from the "English" input over EM iterations 0–5
 (VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the)]
[Figure: geometric-average sentence probability of the "English" and of the "Japanese" data,
 on a log scale, over EM iterations 0–5]
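For reference, the re-estimation behind these curves is the standard EM update for PCFG rule probabilities (this summary is mine, not one of the slides). Writing $c_{A\to\beta}(t)$ for the number of times rule $A \to \beta$ is used in parse $t$, and $D$ for the training strings:

\[
E_\theta[c_{A\to\beta}] \;=\; \sum_{w \in D} \sum_{t} P_\theta(t \mid w)\, c_{A\to\beta}(t),
\qquad
\theta'_{A\to\beta} \;=\; \frac{E_\theta[c_{A\to\beta}]}{\sum_{\beta'} E_\theta[c_{A\to\beta'}]} .
\]

The expected counts can be computed with the inside–outside algorithm, which is how the iteration avoids enumerating the parses.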
[Figure: rule probabilities re-estimated from the "Japanese" input over EM iterations 0–5
 (VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the)]

Learning in the statistical paradigm
◮ The likelihood is a differentiable function of the rule probabilities
  ⇒ learning can involve small, incremental updates
◮ Learning structure (rules) is hard, but . . .
◮ Parameter estimation can approximate rule learning
  ◮ start with a "superset" grammar
  ◮ estimate the rule probabilities
  ◮ discard low-probability rules (a code sketch follows below)
◮ Parameters can be associated with other things besides rules (e.g., HeadInitial, HeadFinal)

Applying EM to real data
◮ the ATIS treebank consists of 1,300 hand-constructed parse trees
◮ ignore the words (in this experiment)
◮ about 1,000 PCFG rules are needed to build these trees

(S (VP (VB Show)
       (NP (PRP me))
       (NP (NP (PDT all)) (DT the) (JJ nonstop) (NNS flights)
           (PP (PP (IN from) (NP (NNP Dallas)))
               (PP (TO to) (NP (NNP Denver))))
           (ADJP (JJ early) (PP (IN in) (NP (DT the) (NN morning))))))
   (. .))

Experiments with EM
1. Extract the productions from the trees and estimate their probabilities to produce a PCFG.
2. Initialize EM with this treebank grammar and its MLE probabilities.
3. Apply EM (to the strings alone) to re-estimate the production probabilities.
4. At each iteration:
   ◮ measure the likelihood of the training data and the quality of the parses produced by each grammar
   ◮ test on the training data (so poor performance is not due to overlearning)
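The prune-and-renormalize step implied by "discard low-probability rules" might look like the following; this is my own sketch with an invented threshold and invented post-EM probabilities, not code from these experiments (renormalizing after pruning is also my choice).

    def prune(rule_probs, threshold=1e-3):
        """Drop rules whose probability is at or below the threshold, then
        renormalize so each left-hand side's rules again sum to one."""
        kept = {r: p for r, p in rule_probs.items() if p > threshold}
        totals = {}
        for (lhs, _), p in kept.items():
            totals[lhs] = totals.get(lhs, 0.0) + p
        return {(lhs, rhs): p / totals[lhs] for (lhs, rhs), p in kept.items()}

    # Hypothetical VP rule probabilities after several EM iterations:
    estimated = {
        ("VP", ("V", "NP")):       0.6200,
        ("VP", ("V",)):            0.3795,
        ("VP", ("NP", "NP", "V")): 0.0005,   # effectively unused -> discarded
    }
    print(prune(estimated))   # the two surviving rules, renormalized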
[Figure: log likelihood of the training strings (log P) over EM iterations 0–20]

[Figure: quality of the maximum-likelihood parses (precision and recall) over EM iterations 0–20]

Why does it work so poorly?
◮ Wrong data: grammar is a transduction between form and meaning
  ⇒ learn from form/meaning pairs
  ◮ exactly what contextual information is available to a language learner?
◮ Wrong model: PCFGs are poor models of syntax
◮ Wrong objective function: maximum likelihood makes the sentences as likely as possible, but syntax isn't intended to predict sentences (Klein and Manning)
◮ How can information about the marginal distribution of strings P(w) provide information about the conditional distribution of parses t given strings P(t | w)?
  ◮ we need additional linking assumptions about the relationship between parses and strings (a short note follows below)
  ◮ . . . but no one really knows!

Outline
Why Statistics?
Learning probabilistic context-free grammars
Factoring learning into simpler components
The Janus-faced nature of computational linguistics
Conclusion
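To spell out the linking question raised on the "Why does it work so poorly?" slide (this gloss is mine, not part of the slides): for a PCFG a parse determines its string, so

\[
P(t \mid w) \;=\; \frac{P(t)}{P(w)} \;=\; \frac{P(t)}{\sum_{t'\,:\;\mathrm{yield}(t') = w} P(t')} .
\]

The likelihood being maximized depends only on the sums $P(w)$, so how probability mass is divided among the parses of each string is constrained only indirectly, through the way the model ties parameters together; hence the need for additional assumptions linking parses to strings.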
Factoring the language learning problem
◮ Factor the language learning problem into linguistically simpler components
◮ Focus on components that might be less dependent on context and semantics (e.g., word segmentation, phonology)
◮ Identify relevant information sources (including prior knowledge, e.g., UG) by comparing models
◮ Combine components to produce more ambitious learners
◮ PCFG-like grammars are a natural way to formulate many of these components
Joint work with Sharon Goldwater and Tom Griffiths

Concatenative morphology

(Verb (Stem t a l k) (Suffix i n g))      Data = t a l k i n g

Verb → Stem Suffix
Stem → w        w ∈ Σ⋆
Suffix → w      w ∈ Σ⋆

◮ Morphological alternation provides primary evidence for phonological generalizations ("trucks" /s/ vs. "cars" /z/)
◮ Morphemes may also provide clues for word segmentation
◮ Algorithms for doing this already exist (e.g., Goldsmith)

Word Segmentation

(Utterance (Word t h e)
           (Utterance (Word d o g)
                      (Utterance (Word b a r k s))))      Data = t h e d o g b a r k s

Utterance → Word Utterance
Utterance → Word
Word → w        w ∈ Σ⋆

◮ Algorithms for word segmentation from this information already exist (e.g., Elman, Brent); a code sketch follows below
◮ It is likely that children perform some word segmentation before they know the meanings of words

PCFG components can be integrated

(Utterance (Words_N (N (Stem_N d o g) (Suffix_N s))
                    (Words_V (V (Stem_V b a r k) (Suffix_V)))))

Utterance → Words_S          S ∈ S
Words_S → S Words_T          S, T ∈ S
S → Stem_S Suffix_S
Stem_S → t                   t ∈ Σ⋆
Suffix_S → f                 f ∈ Σ⋆

(Here S and T range over a set S of lexical categories such as N and V.)
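To make the word-segmentation grammar concrete, here is a small Python sketch of mine (not from the slides): it enumerates the analyses that Utterance → Word Utterance | Word assigns to an unsegmented character string and scores each segmentation by the product of its word probabilities. The mini-lexicon and its probabilities are invented for illustration; the models cited on the slide (e.g., Elman, Brent) additionally learn the lexicon rather than assuming it.

    # Invented word probabilities standing in for a learned lexicon.
    word_prob = {"the": 0.2, "dog": 0.1, "barks": 0.05, "thedog": 0.001}

    def segmentations(s):
        """Yield (word_list, probability) for every way of cutting s into
        known words; the probability is the product of the word probabilities,
        mirroring Utterance -> Word Utterance | Word with Word -> w."""
        if s == "":
            yield [], 1.0
            return
        for i in range(1, len(s) + 1):
            w = s[:i]
            if w in word_prob:
                for rest, p in segmentations(s[i:]):
                    yield [w] + rest, word_prob[w] * p

    for seg, p in sorted(segmentations("thedogbarks"), key=lambda x: -x[1]):
        print(" ".join(seg), p)
    # "the dog barks" outscores "thedog barks" under these invented numbers.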