

  1. An introduction to computational psycholinguistics: Modeling human sentence processing
     Shravan Vasishth, University of Potsdam, Germany
     http://www.ling.uni-potsdam.de/~vasishth
     vasishth@acm.org
     September 2005, Bochum

     Probabilistic models (Crocker & Keller, 2005)
     • In ambiguous sentences, a preferred interpretation is immediately assigned, with later backtracking to reanalyze if necessary.
       (1) a. The horse raced past the barn fell. (Bever 1970)
           b. After the student moved the chair broke.
           c. The daughter of the colonel who was standing by the window.
     • On what basis do humans choose one interpretation over another?
     • One plausible answer: experience. (Can you think of some others?)
     • We've already seen an instance of how experience could determine parsing decisions: connectionist models.

  2. The role of linguistic experience
     • Experience: the number of times the speaker has encountered a particular entity in the past.
     • It's impractical to measure or quantify experience of a particular entity over the entire set of linguistic items a speaker has seen; but we can estimate it through (e.g.) corpora and norming studies (e.g., sentence completion). Consider the S/NP ambiguity of the verb "know":
       (2) The teacher knew . . .
     • There is a reliable correlation between corpus counts and norming studies (Lapata, Keller, & Schulte im Walde, 2001).
     • The critical issue is how the human processor uses this experience to resolve ambiguities, and at what level of granularity experience plays a role (lexical items, syntactic structures, verb frames).

     The granularity issue
     • It's clear that lexical frequencies play a role. But are frequencies used at the lemma level or the token level? (Roland and Jurafsky 2002)
     • Structural frequencies: do we use frequencies of individual phrase structure rules? Probabilistic modelers say: yes.

  3. Probabilistic grammars
     • Context-free grammar rules:
         S -> NP VP
         NP -> Det N
     • Probabilities associated with each rule, derived from a (treebank) corpus:
         1.0 S -> NP VP
         0.7 NP -> Det N
         0.3 NP -> Nplural
     • A normalization constraint on the PCFG: the probabilities of all rules with the same left-hand side must sum to 1 (see appendix A of Hale 2003):

         \forall i: \sum_j P(N^i \rightarrow \zeta^j) = 1    (1)

     • The probability of a parse tree is the product of the probabilities of the rules used in its derivation:

         P(t) = \prod_{(N \rightarrow \zeta) \in R} P(N \rightarrow \zeta)    (2)

     • Jurafsky (1996) has suggested that the probability of a grammar rule models the ease with which the rule can be accessed by the human sentence processor.
     • An example from Crocker and Keller (2005) shows how this setup can be used to predict parse preferences.
     • Further reading: Manning and Schütze, and Jurafsky and Martin.
     • NLTK demo (a sketch follows below).
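A minimal sketch of the kind of NLTK demo mentioned above, assuming a current NLTK installation. The toy grammar, its probabilities, and the example sentence are invented for illustration (they are not the Crocker & Keller example); note that each left-hand side's rule probabilities sum to 1, and that a parse's probability is the product of the rule probabilities used.

    # Toy PCFG in NLTK; the Viterbi parser returns the most probable parse.
    import nltk

    toy_pcfg = nltk.PCFG.fromstring("""
        S   -> NP VP       [1.0]
        NP  -> Det N       [0.7]
        NP  -> Det N PP    [0.3]
        VP  -> V NP        [0.6]
        VP  -> V NP PP     [0.4]
        PP  -> P NP        [1.0]
        Det -> 'the'       [1.0]
        N   -> 'teacher'   [0.4]
        N   -> 'book'      [0.4]
        N   -> 'desk'      [0.2]
        V   -> 'put'       [1.0]
        P   -> 'on'        [1.0]
    """)

    parser = nltk.ViterbiParser(toy_pcfg)
    sentence = "the teacher put the book on the desk".split()
    for tree in parser.parse(sentence):
        tree.pretty_print()
        print("P(t) =", tree.prob())   # product of the rule probabilities used

The sentence is attachment-ambiguous (the PP can attach to the VP or to the object NP); the parser resolves it toward whichever analysis has the higher probability, which is the sense in which rule probabilities model parse preferences.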

  4. Estimating the rule probabilities and parsing
     • Maximum likelihood estimation: the probability of a rule is estimated from the number of times it occurs in a treebank corpus (see the sketch below).
     • Expectation maximization: given a grammar and unannotated sentences, computes the set of rule probabilities that makes those sentences maximally likely.
     • Viterbi algorithm: computes the best (most probable) parse.

     Linking probabilities to processing difficulty
     • The goal is usually to model reading times or acceptability judgments.
     • Possible measures:
       – probability ratios of the alternative analyses
       – entropy reduction during incremental parsing: a very different approach from computing the probabilities of parses. Hale (2003): "cognitive load is related, perhaps linearly, to the reduction in the perceiver's uncertainty about what the producer meant." (p. 102)
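A hedged sketch of maximum likelihood estimation of rule probabilities, assuming NLTK and its small Penn Treebank sample are installed (nltk.download('treebank')): count the productions used in the treebank parses and normalize the counts per left-hand side, which is what nltk's induce_pcfg does.

    # Relative-frequency (maximum likelihood) estimation of PCFG rule
    # probabilities from a treebank sample.
    from nltk.corpus import treebank
    from nltk.grammar import Nonterminal, induce_pcfg

    productions = []
    for tree in treebank.parsed_sents():
        productions.extend(tree.productions())       # rules used in this parse

    pcfg = induce_pcfg(Nonterminal('S'), productions)  # normalize per left-hand side

    # Rules sharing a left-hand side now have probabilities that sum to 1.
    for prod in pcfg.productions(lhs=Nonterminal('NP'))[:10]:
        print(prod)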

  5. Some assumptions in Hale's approach
     • During comprehension, sentence understanders determine a syntactic structure for the perceived signal.
     • Producer and comprehender share the same grammar.
     • Comprehension is eager: no processing is deferred beyond the first point at which it could happen.
     We're now going to look at the mechanism he builds up, starting with the incredibly beautiful notion of entropy. In the following discussion I rely on (Schneider, 2005).

     An introduction to entropy
     Imagine a device D1 that can emit three symbols: A, B, C.
     • Before D1 emits anything, we are uncertain about which of the three possible symbols it will emit. We can quantify this uncertainty and, informally, say it is 3.
     • Now a symbol appears, say A. Our uncertainty decreases, to 2. In other words, we've received some information.
     • Information is a decrease in uncertainty.
     • Now suppose that another device D2 emits 1 or 2.
     • The composition D1 × D2 has 6 possible emissions: A1, A2, B1, B2, C1, C2. So what's our uncertainty now?
     • It would be nice to be able to talk about increases in uncertainty additively; hence the use of logarithms.
     • D1: log2(3) is the new measure of uncertainty; D2: log2(2); D1 × D2: log2(6) = log2(3) + log2(2). With base 2 the units of uncertainty are bits; base 10 (digits) and base e (nats, also called nits) are also possible.
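A quick numeric check, in Python, of the additivity point above: measured in bits, the uncertainty of the composed device D1 × D2 is the sum of the two individual uncertainties.

    from math import log2

    u_d1 = log2(3)          # D1: three equiprobable symbols A, B, C
    u_d2 = log2(2)          # D2: two equiprobable symbols 1, 2
    u_both = log2(3 * 2)    # D1 x D2: six outcomes A1 ... C2

    print(u_d1, u_d2, u_both)        # about 1.585, 1.0, 2.585 bits
    print(u_both - (u_d1 + u_d2))    # ~0.0: log2(6) = log2(3) + log2(2)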

  6. Question
     If a device emits only one symbol, what's the uncertainty? How many bits?

     Deriving the formula for uncertainty
     Let M be the number of possible symbol emissions. Uncertainty is then log2(M). (From now on, log means log2.)

       \log M = -\log(M^{-1})                    (3)
              = -\log(1/M)                       (4)
              = -\log(P)   [letting P = 1/M]     (5)

     Let P_i be the probabilities of the individual symbols, such that

       \sum_{i=1}^{M} P_i = 1                    (7)
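A tiny check of the derivation above for one arbitrary value of M (the value 8 is just an example):

    from math import log2

    M = 8
    P = 1 / M
    print(log2(M), -log2(1 / M), -log2(P))   # all three equal 3.0 bits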

  7. Surprise
     The surprise we experience when we see the i-th symbol is called "surprisal":

       u_i = -\log(P_i)                          (8)

     If P_i = 0 then u_i = ∞ . . . we are very surprised.
     If P_i = 1 then u_i = 0 . . . we are not surprised.

     Average surprisal for an infinite string of symbols: assume a string of length N over M symbols, and let the i-th symbol appear N_i times:

       N = \sum_{i=1}^{M} N_i                    (9)

     Since there are N_i occurrences of u_i, the average surprisal is:

       \frac{\sum_{i=1}^{M} N_i u_i}{\sum_{i=1}^{M} N_i} = \sum_{i=1}^{M} \frac{N_i}{N} u_i    (10)

     If we measure this for an infinite string of symbols, the relative frequency N_i/N approaches the probability P_i, so the average surprisal becomes:

       H = \sum_{i=1}^{M} P_i u_i                (11)
         = -\sum_{i=1}^{M} P_i \log P_i          (12)
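A small sketch of surprisal and average surprisal in Python. The three-symbol distribution and the sample string are invented for illustration; as the string gets longer, the average surprisal computed from relative frequencies approaches the entropy of the distribution (equations 10-12).

    from collections import Counter
    from math import log2

    probs = {'A': 0.5, 'B': 0.25, 'C': 0.25}

    def surprisal(symbol):
        """u_i = -log2(P_i): the rarer the symbol, the bigger the surprise."""
        return -log2(probs[symbol])

    string = "AABACABAACABBACA"           # a sample emission sequence
    counts = Counter(string)
    N = len(string)

    # Average surprisal = sum over symbols of (N_i / N) * u_i  (equation 10).
    avg_surprisal = sum((counts[s] / N) * surprisal(s) for s in counts)

    # Entropy H = -sum_i P_i log2 P_i  (equation 12).
    entropy = -sum(p * log2(p) for p in probs.values())

    print(f"average surprisal over the sample: {avg_surprisal:.3f} bits")
    print(f"entropy of the distribution:       {entropy:.3f} bits")   # 1.5 bits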

  8. That is Shannon's formula for uncertainty. Search on Google for Claude Shannon and you will find the original paper. Read it and memorize it.

     Exercise
     Suppose P_1 = P_2 = · · · = P_M. What is

       H = -\sum_{i=1}^{M} P_i \log P_i = ?      (14)

  9. Solution
     Suppose P_1 = P_2 = · · · = P_M. Then P_i = 1/M for i = 1, . . . , M.

       H = -\sum_{i=1}^{M} P_i \log P_i = -\left[\frac{1}{M}\log\frac{1}{M} + \cdots + \frac{1}{M}\log\frac{1}{M}\right]    (15)
         = -M \times \frac{1}{M}\log\frac{1}{M}                                                                             (16)
         = \log M                                                                                                           (17)

     Recall our earlier idea that if we have M outcomes we can express our uncertainty as log(M). Device D1 was log(3), for example.

     Another exercise
     Let p_1 = . . . , p_2 = . . . , · · · , p_M = . . . . What is the average uncertainty or entropy H?
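A short check of the solution (and of the earlier Question slide): for a uniform distribution over M symbols the entropy collapses to log2(M), and for M = 1 it is 0 bits.

    from math import log2

    def entropy(probs):
        """Shannon entropy in bits; terms with p = 0 contribute nothing."""
        return sum(-p * log2(p) for p in probs if p > 0)

    for M in (1, 2, 3, 8):
        uniform = [1 / M] * M
        print(M, entropy(uniform), log2(M))   # the last two columns agree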

  10. Hale's approach
      • Assume a grammar G.
      • Let T_G represent all the possible derivations of G; each derivation has a probability.
      • Let W be all the possible strings of G.
      Let's represent the information conveyed by the first i words of a sentence generated by G by I(T_G | W_{1···i}):

        I(T_G | W_{1\cdots i}) = H(T_G) - H(T_G | W_{1\cdots i})    (18)

      Hale's approach (continued)
      The above is just like our example with device D1, using the informal counts from before:

        I(D1 | A) = H(D1) - H(D1 | A)    (20)
                  = 3 - 2                (21)
                  = 1                    (22)

      This matches our intuition: once an A has been emitted, our uncertainty has been reduced by 1.

  11. Information conveyed by a particular word

        I(T_G | W_i = w_i) = H(T_G | w_{1\cdots i-1}) - H(T_G | w_{1\cdots i})    (23)

      This gives us the information a comprehender gets from a word.

      The entropy of a grammar symbol, say VP, is the sum of
      • the entropy of a single-rule rewrite decision
      • the expected entropy of any children of that symbol
      (Grenander 1967)

  12. The entropy of a grammar symbol
      Let the set of productions of a PCFG G be Q. For a left-hand side ξ, the rules rewriting it are Q(ξ). Example:
        VP -> V NP
        VP -> V
      Let h be a vector indexed by the symbols ξ_i, so that h(ξ_1) refers to, say, the first cell of the vector. For any grammar symbol, say ξ_1, we can compute the entropy of the rewrite decision using the usual formula:

        h(\xi_1) = -\sum_{r \in Q(\xi_1)} p_r \log p_r    (24)

      The entropy of a grammar symbol (continued)
      More generally:

        h(\xi_i) = -\sum_{r \in Q(\xi_i)} p_r \log p_r    (25)

      So the vector h now contains the single-rewrite entropy of each ξ_i.

  13. The entropy of a grammar symbol, say VP, is the sum of
      • the entropy of a single-rule rewrite decision ✔
      • the expected entropy of any children of that symbol
      Now we work on the second part.

      The expected entropy of any children of a symbol
      Suppose rule r rewrites a nonterminal ξ_i as n daughters. For example:
        Rule r1: VP -> V NP
        Rule r2: VP -> V
      Here ξ_1 = VP and Q(VP) = {r1, r2}, so

        H(VP) = h(VP) + [p_{r1}(H(V) + H(NP)) + p_{r2} H(V)]    (26)

  14. The expected entropy of any children of a symbol
      More generally, where rule r rewrites ξ_i as the daughters ξ_{r,1}, . . . , ξ_{r,n}:

        H(\xi_i) = h(\xi_i) + \sum_{r \in Q(\xi_i)} p_r [H(\xi_{r,1}) + \cdots + H(\xi_{r,n})]    (27)

      (A small worked sketch of this recursion follows below.)

      The information conveyed by a word

        I(T_G | W_i = w_i) = H(T_G | w_{1\cdots i-1}) - H(T_G | w_{1\cdots i})    (28)
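A worked sketch of equations 25 and 27 for a small, non-recursive toy PCFG (the grammar and probabilities are invented for illustration). For recursive grammars the naive recursion below does not terminate, and H has to be obtained by solving a linear system instead, as in appendix A of Hale (2003).

    from math import log2

    # Each nonterminal maps to (probability, daughters) pairs; anything that
    # never appears as a key is treated as a terminal with zero entropy.
    rules = {
        'S':  [(1.0, ['NP', 'VP'])],
        'NP': [(0.7, ['Det', 'N']), (0.3, ['N'])],
        'VP': [(0.6, ['V', 'NP']), (0.4, ['V'])],
    }

    def h(symbol):
        """Entropy of the single-rule rewrite decision at this symbol (eq. 25)."""
        if symbol not in rules:
            return 0.0
        return sum(-p * log2(p) for p, _ in rules[symbol])

    def H(symbol):
        """Rewrite entropy plus the expected entropy of the daughters (eq. 27)."""
        if symbol not in rules:
            return 0.0
        return h(symbol) + sum(p * sum(H(d) for d in daughters)
                               for p, daughters in rules[symbol])

    for nt in rules:
        print(f"h({nt}) = {h(nt):.3f} bits,  H({nt}) = {H(nt):.3f} bits")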

  15. Example
      The horse raced past the barn fell
      Initial conditional entropy of the parser state: H_G(S) = . . .
      Input "the": every sentence in the grammar begins with "the", so no information is gained and there is no reduction in entropy.
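A toy sketch of the entropy-reduction computation in this example. We pretend the grammar generates just four sentences, each with a single parse and a made-up probability, so the entropy over derivations equals the entropy over strings; the sentences and numbers are invented and are not Hale's grammar or his values. Conditioning on a growing prefix and renormalizing gives H(T_G | w_{1..i}), and the drop at each word is the information it conveys (equation 28). Note that the first "the" causes no reduction, exactly as in the slide.

    from math import log2

    parses = {
        "the horse raced past the barn fell": 0.1,
        "the horse raced past the barn":      0.4,
        "the horse fell":                     0.3,
        "the dog barked":                     0.2,
    }

    def entropy(probs):
        return sum(-p * log2(p) for p in probs if p > 0)

    def conditional_entropy(prefix):
        """H(T_G | w_1..i): entropy over the parses consistent with the
        prefix, with their probabilities renormalized to sum to 1."""
        consistent = [p for s, p in parses.items()
                      if s.split()[:len(prefix)] == prefix]
        total = sum(consistent)
        return entropy([p / total for p in consistent])

    sentence = "the horse raced past the barn fell".split()
    prev = conditional_entropy([])        # uncertainty before any input
    print(f"initial entropy: {prev:.3f} bits")
    for i in range(1, len(sentence) + 1):
        cur = conditional_entropy(sentence[:i])
        # Information conveyed by word i = the reduction in entropy.
        print(f"{sentence[i - 1]:>6}: H = {cur:.3f} bits,"
              f" reduction = {prev - cur:.3f}")
        prev = cur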

  16. [Figure: The horse raced]
      [Figure: The horse raced past the barn fell]

  17. [Figure: Subject relatives]
      [Figure: Object relatives]
