

0. Word Sense Disambiguation (WSD). Based on "Foundations of Statistical NLP" by C. Manning & H. Schütze, ch. 7, MIT Press, 2002

1. WSD Examples
They have the right to bear arms. (drept: entitlement)
The sign on the right was bent. (direcție: direction)
The plant is producing far too little to sustain its operation for more than a year. (fabrică: factory)
An overabundance of oxygen was produced by the plant in the third week of the study. (plantă: botanical plant)
The tank has a top speed of 70 miles an hour, which it can sustain for 3 hours. (tanc petrolier: oil tanker)
We cannot fill more gasoline in the tank. (rezervor de mașină: car fuel tank)
The tank is full of soldiers. (tanc de luptă: battle tank)
The tank is full of nitrogen. (recipient: container)
The parenthesized glosses are the Romanian translations that resolve each sense.

2. Plan
1. Supervised WSD
   1.1 A Naive Bayes learning algorithm for WSD
   1.2 An information-theoretic algorithm for WSD
2. Unsupervised WSD (word sense clustering)
   2.1 Word sense clustering: the EM algorithm
   2.2 Constraint-based WSD: "one sense per discourse, one sense per collocation": Yarowsky's algorithm
   2.3 Resource-based WSD
       2.3.1 Dictionary-based WSD: Lesk's algorithm
       2.3.2 Thesaurus-based WSD: Walker's algorithm; Yarowsky's algorithm

3. 1.1 Supervised WSD through Naive Bayesian Classification

$s' = \arg\max_{s_k} P(s_k \mid c) = \arg\max_{s_k} \frac{P(c \mid s_k)\,P(s_k)}{P(c)} = \arg\max_{s_k} P(c \mid s_k)\,P(s_k) = \arg\max_{s_k} [\log P(c \mid s_k) + \log P(s_k)] = \arg\max_{s_k} [\log P(s_k) + \sum_{w_j \in c} \log P(w_j \mid s_k)]$

where we used the Naive Bayes assumption:
$P(c \mid s_k) = P(\{w_j \mid w_j \in c\} \mid s_k) = \prod_{w_j \in c} P(w_j \mid s_k)$

Maximum likelihood estimation:
$P(w_j \mid s_k) = \frac{C(w_j, s_k)}{C(s_k)}$ and $P(s_k) = \frac{C(w, s_k)}{C(w)}$

where:
$C(w_j, s_k)$ = number of occurrences of word $w_j$ in contexts of $w$ used with the sense $s_k$,
$C(w, s_k)$ = number of occurrences of the ambiguous word $w$ with the sense $s_k$ (so $C(s_k) = C(w, s_k)$),
$C(w)$ = number of occurrences of the ambiguous word $w$,
all counted in the training corpus.

4. A Naive Bayes Algorithm for WSD

comment: training
for all senses $s_k$ of $w$ do
    for all words $w_j$ in the vocabulary do
        $P(w_j \mid s_k) = C(w_j, s_k) / C(s_k)$
    end
end
for all senses $s_k$ of $w$ do
    $P(s_k) = C(w, s_k) / C(w)$
end

comment: disambiguation
for all senses $s_k$ of $w$ do
    score$(s_k) = \log P(s_k)$
    for all words $w_j$ in the context window $c$ do
        score$(s_k)$ = score$(s_k) + \log P(w_j \mid s_k)$
    end
end
choose $s' = \arg\max_{s_k}$ score$(s_k)$
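A minimal Python sketch of the training and disambiguation loops above. The (context words, sense) input format, all function and variable names, and the add-one smoothing of the conditional counts are assumptions for illustration; the slides themselves use the unsmoothed MLE ratios.

```python
import math
from collections import defaultdict

def train_naive_bayes(instances):
    """instances: list of (context_words, sense) pairs for one ambiguous word w.
    Computes P(s_k) = C(w, s_k)/C(w) and smoothed estimates of P(w_j | s_k)."""
    sense_count = defaultdict(int)                              # C(w, s_k)
    word_sense_count = defaultdict(lambda: defaultdict(int))    # C(w_j, s_k)
    vocabulary = set()
    for context, sense in instances:
        sense_count[sense] += 1
        for word in context:
            word_sense_count[sense][word] += 1
            vocabulary.add(word)

    total = sum(sense_count.values())                           # C(w)
    prior = {s: c / total for s, c in sense_count.items()}
    cond = {s: {w: (word_sense_count[s][w] + 1) / (sense_count[s] + len(vocabulary))
                for w in vocabulary}
            for s in sense_count}
    return prior, cond

def disambiguate(context, prior, cond):
    """score(s_k) = log P(s_k) + sum over w_j in the context of log P(w_j | s_k)."""
    def score(sense):
        return math.log(prior[sense]) + sum(
            math.log(cond[sense][w]) for w in context if w in cond[sense])
    return max(prior, key=score)

# toy usage built on the 'tank' examples from slide 1 (hypothetical mini-corpus)
data = [(["fill", "gasoline"], "rezervor"),
        (["full", "soldiers"], "tanc de lupta")]
prior, cond = train_naive_bayes(data)
print(disambiguate(["gasoline"], prior, cond))                  # -> 'rezervor'
```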

5. 1.2 An Information-Theoretic Approach to WSD

Remark: the Naive Bayes classifier attempts to use information from all words in the context window to help the disambiguation decision. It does this at the cost of a somewhat unrealistic independence assumption.

The information-theoretic ("Flip-Flop") algorithm that follows does the opposite: it tries to find a single contextual feature that reliably indicates which sense of the ambiguous word is being used.

Empirical result: the Flip-Flop algorithm improved the accuracy of a machine translation system by 20%.

6. Example: highly informative indicators for three ambiguous French words

Ambiguous word   Indicator          Examples (value → sense)
prendre          object             mesure → to take; décision → to make
vouloir          tense              present → to want; conditional → to like
cent             word to the left   per → %; number → c. [money]

7. Notations

$t_1, \ldots, t_m$: translations of the ambiguous word (example: prendre → take, make, rise, speak)
$x_1, \ldots, x_n$: possible values of the indicator (example: mesure, décision, example, note, parole)

Mutual information:
$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$

Note: the Flip-Flop algorithm only disambiguates between 2 senses. For the more general case see [Brown et al., 1991a].
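As a quick illustration of the mutual information formula, the following sketch computes $I(X;Y)$ from a small joint distribution. The dictionary format and the toy probabilities are assumptions made purely for illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over x, y of p(x, y) * log( p(x, y) / (p(x) * p(y)) ).
    `joint` maps (x, y) pairs to probabilities that sum to 1."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# toy joint distribution over (indicator value, translation); numbers are illustrative only
joint = {("mesure", "take"): 0.4, ("decision", "make"): 0.4,
         ("mesure", "make"): 0.1, ("decision", "take"): 0.1}
print(mutual_information(joint))   # clearly positive: the indicator is informative
```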

8. Brown et al.'s WSD ("Flip-Flop") algorithm: finding indicators for disambiguation

find a random partition $P = \{P_1, P_2\}$ of $t_1, \ldots, t_m$
while (improving $I(P;Q)$) do
    find the partition $Q = \{Q_1, Q_2\}$ of $x_1, \ldots, x_n$ that maximizes $I(P;Q)$
    find the partition $P = \{P_1, P_2\}$ of $t_1, \ldots, t_m$ that maximizes $I(P;Q)$
end

9. "Flip-Flop" algorithm

Note: using the splitting theorem [Breiman et al., 1984], it can be shown that the Flip-Flop algorithm monotonically increases $I(P;Q)$.

Stopping criterion: $I(P;Q)$ no longer increases (significantly).

10. "Flip-Flop" algorithm: disambiguation

For an occurrence of the ambiguous word, determine the value $x_i$ of the indicator;
if $x_i \in Q_1$ then assign the occurrence sense 1;
if $x_i \in Q_2$ then assign the occurrence sense 2.

11. A running example

1. A randomly chosen partition $P = \{P_1, P_2\}$: P1 = {take, rise}, P2 = {make, speak}.

2. Maximizing $I(P;Q)$ over $Q$, using the (presumed) data
   take: a measure, notes, an example
   make: a decision, a speech
   rise, speak: to speak
   gives Q1 = {measure, note, example}, Q2 = {décision, parole}.

3. Maximizing $I(P;Q)$ over $P$: P1 = {take}, P2 = {make, rise, speak}.

Note: consider more than 2 'senses' to distinguish between {make, rise, speak}.
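A brute-force Python sketch of the Flip-Flop loop for small sets such as this running example. The joint-count dictionary, the exhaustive search over two-way partitions, and the toy counts in the usage lines are assumptions added for illustration, not part of Brown et al.'s original formulation.

```python
import math
from itertools import combinations

def two_way_partitions(items):
    """Yield every partition of `items` into two non-empty sets
    (the first element is pinned to the first side to avoid duplicates)."""
    items = list(items)
    for r in range(len(items)):
        for subset in combinations(items[1:], r):
            side1 = {items[0], *subset}
            side2 = set(items) - side1
            if side2:
                yield side1, side2

def partition_mi(joint, P, Q):
    """I(P;Q) for the 2x2 grouping of the (translation, indicator value) counts."""
    total = sum(joint.values())
    cell = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for (t, x), c in joint.items():
        cell[(0 if t in P[0] else 1, 0 if x in Q[0] else 1)] += c / total
    p = [cell[(0, 0)] + cell[(0, 1)], cell[(1, 0)] + cell[(1, 1)]]
    q = [cell[(0, 0)] + cell[(1, 0)], cell[(0, 1)] + cell[(1, 1)]]
    return sum(v * math.log(v / (p[a] * q[b]))
               for (a, b), v in cell.items() if v > 0)

def flip_flop(joint, translations, indicators, eps=1e-9):
    """Alternate the two maximization steps until I(P;Q) stops improving."""
    P = next(two_way_partitions(translations))     # arbitrary initial partition of t_1..t_m
    best = float("-inf")
    while True:
        Q = max(two_way_partitions(indicators), key=lambda q: partition_mi(joint, P, q))
        P = max(two_way_partitions(translations), key=lambda p: partition_mi(joint, p, Q))
        current = partition_mi(joint, P, Q)
        if current - best < eps:
            return P, Q
        best = current

# toy counts of (translation, indicator word) co-occurrences, invented for illustration
joint = {("take", "mesure"): 5, ("take", "note"): 3, ("take", "example"): 2,
         ("make", "decision"): 4, ("speak", "parole"): 3, ("rise", "parole"): 1}
P, Q = flip_flop(joint, {"take", "make", "rise", "speak"},
                 {"mesure", "note", "example", "decision", "parole"})
print(P, Q)
```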

12. 2. Unsupervised word sense clustering; 2.1 The EM algorithm

Notation:
$K$ is the number of desired senses;
$c_1, c_2, \ldots, c_I$ are the contexts of the ambiguous word in the corpus;
$w_1, w_2, \ldots, w_J$ are the words being used as disambiguating features.

Parameters of the model ($\mu$): $P(w_j \mid s_k)$ for $1 \le j \le J$, $1 \le k \le K$, and $P(s_k)$ for $1 \le k \le K$.

Given $\mu$, the log-likelihood of the corpus $C$ is computed as
$l(C \mid \mu) = \log \prod_{i=1}^{I} P(c_i) = \log \prod_{i=1}^{I} \sum_{k=1}^{K} P(c_i \mid s_k) P(s_k) = \sum_{i=1}^{I} \log \sum_{k=1}^{K} P(c_i \mid s_k) P(s_k)$

Note: to compute $P(c_i \mid s_k)$, use the Naive Bayes assumption: $P(c_i \mid s_k) = \prod_{w_j \in c_i} P(w_j \mid s_k)$.

13. Procedure:
1. Initialize the parameters of the model $\mu$ randomly.
2. While $l(C \mid \mu)$ is improving, repeat:
   a. E-step: estimate the (posterior) probability that the sense $s_k$ generated the context $c_i$:
      $h_{ik} = \frac{P(c_i \mid s_k)}{\sum_{k'=1}^{K} P(c_i \mid s_{k'})}$
   b. M-step: re-estimate the parameters $P(w_j \mid s_k)$ and $P(s_k)$ by way of MLE:
      $P(w_j \mid s_k) = \frac{\sum_{\{c_i \mid w_j \in c_i\}} h_{ik}}{Z_k}$ and $P(s_k) = \frac{\sum_{i=1}^{I} h_{ik}}{\sum_{k=1}^{K} \sum_{i=1}^{I} h_{ik}} = \frac{\sum_{i=1}^{I} h_{ik}}{I}$
      where $Z_k = \sum_{j=1}^{J} \sum_{\{c_i \mid w_j \in c_i\}} h_{ik}$ is a normalizing constant ensuring $\sum_{j} P(w_j \mid s_k) = 1$.
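A compact Python sketch of this EM procedure under the Naive Bayes context model. Two deviations from the slide are assumptions of the sketch: the E-step below multiplies in the prior $P(s_k)$ (the full posterior), and a fixed iteration count plus a tiny smoothing floor replace the "while $l(C \mid \mu)$ is improving" test; all names are illustrative.

```python
import math
import random
from collections import defaultdict

def em_sense_clustering(contexts, K, iterations=30, seed=0):
    """contexts: list of word lists around the ambiguous word; K: number of senses.
    Returns (P(s_k), P(w_j|s_k)) after EM with random initialization."""
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})

    # 1. random initialization of the model parameters mu
    prior = [1.0 / K] * K
    cond = []
    for _ in range(K):
        raw = {w: rng.random() for w in vocab}
        z = sum(raw.values())
        cond.append({w: v / z for w, v in raw.items()})

    for _ in range(iterations):
        # E-step: soft assignment h_ik of each context c_i to each sense s_k
        h = []
        for c in contexts:
            scores = [prior[k] * math.prod(cond[k][w] for w in c) for k in range(K)]
            z = sum(scores) or 1.0
            h.append([s / z for s in scores])

        # M-step: re-estimate P(s_k) and P(w_j|s_k) from the soft counts
        prior = [sum(h[i][k] for i in range(len(contexts))) / len(contexts)
                 for k in range(K)]
        for k in range(K):
            counts = defaultdict(float)
            for i, c in enumerate(contexts):
                for w in c:
                    counts[w] += h[i][k]
            z = sum(counts.values()) + 1e-9 * len(vocab)   # tiny floor avoids zero probabilities
            cond[k] = {w: (counts[w] + 1e-9) / z for w in vocab}
    return prior, cond

# toy usage on the 'tank' contexts of slide 1
prior, cond = em_sense_clustering([["fill", "gasoline"], ["full", "soldiers"],
                                   ["full", "nitrogen"]], K=2)
```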

14. 2.2 Constraint-based WSD: "one sense per discourse, one sense per collocation" [Yarowsky, 1995]

One sense per discourse: the sense of a target word is highly consistent within any given document.

One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order, and syntactic relationship.

15. Yarowsky's Algorithm: WSD by constraint propagation

comment: initialization
for all senses $s_k$ of $w$ do
    $F_k$ = the set of features (words) in the dictionary definition of $s_k$
    $E_k = \emptyset$
end

comment: one sense per collocation
while (at least one $E_k$ changed in the last iteration) do
    for all senses $s_k$ of $w$ do
        comment: identify the contexts $c_i$ bearing the sense $s_k$
        $E_k = \{c_i \mid \exists f_m \in F_k : f_m \in c_i\}$
    end
    for all senses $s_k$ of $w$ do
        comment: retain the features $f_m$ which best indicate the sense $s_k$
        $F_k = \{f_m \mid \forall n \ne k : \frac{P(s_k \mid f_m)}{P(s_n \mid f_m)} > \alpha\}$, where $P(s_i \mid f_m) = \frac{C(f_m, s_i)}{\sum_j C(f_m, s_j)}$
    end
end

16. Yarowsky's Algorithm (cont'd)

comment: one sense per discourse
for each document $d$ do
    determine the majority sense $s_k$ of $w$ in the document $d$:
    $s_k = \arg\max_{s_i} P(s_i)$, where $P(s_i) = \frac{\sum_{m \in F_i} C(f_m, s_i)}{\sum_j \sum_{m \in F_j} C(f_m, s_j)}$
    assign all occurrences of $w$ in the document $d$ the sense $s_k$
end
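A simplified Python sketch of the constraint-propagation loop of slides 15-16. The data structures, the iteration cap, and the omission of the final per-document majority step are simplifying assumptions of the sketch, not part of the slides.

```python
from collections import defaultdict

def yarowsky(contexts, seed_features, alpha=2.0, max_iter=20):
    """contexts: list of word sets containing the ambiguous word;
    seed_features: {sense: set of words from its dictionary definition}.
    alpha is the odds threshold from slide 15."""
    F = {s: set(fs) for s, fs in seed_features.items()}    # F_k
    E = {s: set() for s in F}                              # E_k, stored as context indices
    for _ in range(max_iter):
        # identify the contexts c_i bearing the sense s_k (some feature of s_k occurs in c_i)
        new_E = {s: {i for i, c in enumerate(contexts) if F[s] & c} for s in F}
        if new_E == E:
            break                                          # no E_k changed: stop
        E = new_E
        # C(f, s): number of contexts currently labelled with sense s that contain feature f
        count = defaultdict(int)
        for s, idxs in E.items():
            for i in idxs:
                for f in contexts[i]:
                    count[(f, s)] += 1
        # retain the features whose odds against every other sense exceed alpha
        for s in F:
            candidates = {w for i in E[s] for w in contexts[i]}
            F[s] = {f for f in candidates
                    if all(count[(f, s)] > alpha * count[(f, n)] for n in F if n != s)}
    return E, F
```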

17. 2.3 Resource-based WSD; 2.3.1 Dictionary-based WSD

Example: two senses of ash

sense               definition
s1: tree            D1: a tree of the olive family
s2: burned stuff    D2: the solid residue left when combustible material is burned

Disambiguation of ash using Lesk's algorithm (see next slide):

context                                                  s1   s2
This cigar burns slowly and creates a stiff ash.         0    1
The ash is one of the last trees to come into leaf.      1    0

18. Dictionary-based WSD: Lesk's algorithm

comment: scoring, given the context c
for all senses $s_k$ of $w$ do
    score$(s_k)$ = overlap$(D_k, \bigcup_{v_j \in c} E_{v_j})$
end

comment: disambiguation
choose $s' = \arg\max_{s_k}$ score$(s_k)$

where:
$D_k$ is the set of words occurring in the dictionary definition of the sense $s_k$ of $w$;
$E_{v_j}$ is the set of words occurring in the dictionary definition of $v_j$ (the union of all its sense definitions).
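A small Python sketch of Lesk's overlap scoring. The hand-abridged definition sets in the usage lines are invented for illustration; only the "ash" sentence itself comes from the previous slide.

```python
def lesk(sense_definitions, context_definitions, context_words):
    """sense_definitions: {sense: D_k, the set of words in its dictionary definition};
    context_definitions: {v_j: E_vj, the union of words over v_j's sense definitions}.
    Returns the sense whose definition overlaps most with the pooled context definitions."""
    pooled = set()
    for v in context_words:
        pooled |= context_definitions.get(v, set())
    return max(sense_definitions,
               key=lambda s: len(sense_definitions[s] & pooled))

# the 'ash' example from slide 17, with hand-abridged (invented) definition sets
senses = {"tree": {"tree", "olive", "family"},
          "burned stuff": {"solid", "residue", "combustible", "material", "burned"}}
context_defs = {"cigar": {"roll", "tobacco", "leaves", "burned", "smoking"}}
print(lesk(senses, context_defs, ["cigar", "burns", "slowly", "stiff"]))  # -> 'burned stuff'
```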

19. 2.3.2 Thesaurus-based WSD

Example:

Word       Sense                 Thesaurus category (Roget)
bass       musical senses        music
           fish                  animal, insect
star       space object          universe
           celebrity             entertainer
           star-shaped object    insignia
interest   curiosity             reasoning
           advantage             injustice
           financial             debt
           share                 property

20. Thesaurus-based WSD: Walker's algorithm

comment: given the context c
for all senses $s_k$ of $w$ do
    score$(s_k) = \sum_{w_j \in c} \delta(t(s_k), w_j)$
    comment: score = number of context words compatible with the thesaurus category of $s_k$
end

comment: disambiguation
choose $s' = \arg\max_{s_k}$ score$(s_k)$

where:
$t(s_k)$ is the thesaurus category of the sense $s_k$;
$\delta(t(s_k), w_j) = 1$ if $t(s_k)$ is one of the thesaurus categories for $w_j$, and 0 otherwise.
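A short Python sketch of Walker's category-counting score. The per-word thesaurus category sets in the usage lines are invented toy data; only the sense-to-category pairs for "bass" come from the table above.

```python
def walker(sense_category, word_categories, context_words):
    """sense_category: {sense: its thesaurus category t(s_k)};
    word_categories: {word: set of thesaurus categories listed for that word}.
    score(s_k) counts the context words compatible with t(s_k)."""
    def score(sense):
        t = sense_category[sense]
        return sum(1 for w in context_words if t in word_categories.get(w, set()))
    return max(sense_category, key=score)

# toy run for 'bass' with the Roget categories from the table above;
# the per-word category sets are invented for illustration
sense_category = {"musical senses": "music", "fish": "animal, insect"}
word_categories = {"guitar": {"music"}, "player": {"music", "sport"}, "lake": {"geography"}}
print(walker(sense_category, word_categories, ["guitar", "player", "lake"]))  # -> 'musical senses'
```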
