
Discriminative Training (February 19, 2013)

  1. Discriminative Training (February 19, 2013)

  2. Noisy Channels Again. [Diagram: an English source, modeled by $p(e)$.]

  3. Noisy Channels Again. [Diagram: the English source $p(e)$ feeds a channel $p(g \mid e)$ that produces German.]

  4. Noisy Channels Again. [Diagram: English source $p(e)$, channel $p(g \mid e)$ producing German, and a decoder.] The decoder computes
       $$e^* = \arg\max_e p(e \mid g) = \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)} = \arg\max_e p(g \mid e)\, p(e).$$

  5. Noisy Channels Again
       $$e^* = \arg\max_e p(e \mid g) = \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)} = \arg\max_e p(g \mid e)\, p(e)$$

  6. Noisy Channels Again
       $$\begin{aligned}
       e^* &= \arg\max_e p(e \mid g) \\
           &= \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)} \\
           &= \arg\max_e p(g \mid e)\, p(e) \\
           &= \arg\max_e \log p(g \mid e) + \log p(e) \\
           &= \arg\max_e \underbrace{\begin{bmatrix} 1 \\ 1 \end{bmatrix}^{\!\top}}_{w^\top}
              \underbrace{\begin{bmatrix} \log p(g \mid e) \\ \log p(e) \end{bmatrix}}_{h(g, e)}
       \end{aligned}$$

  8. Noisy Channels Again. Does this look familiar? The derivation above shows that the noisy channel is a linear model: $e^* = \arg\max_e w^\top h(g, e)$ with $w = (1, 1)^\top$ and $h(g, e) = (\log p(g \mid e), \log p(e))^\top$.
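
To make the reformulation concrete, here is a minimal Python sketch of decoding as $\arg\max_e w^\top h(g, e)$ with the noisy-channel weights $w = (1, 1)$; the candidate translations and their probabilities are invented for illustration and are not from the slides:

```python
import math

# Hypothetical candidate translations of a German input, each with made-up
# channel-model and language-model probabilities (for illustration only).
candidates = [
    ("man bites dog", {"p_g_given_e": 0.020, "p_e": 0.0010}),
    ("dog bites man", {"p_g_given_e": 0.020, "p_e": 0.0008}),
    ("man bite dog",  {"p_g_given_e": 0.030, "p_e": 0.0001}),
]

def h(probs):
    """Feature vector h(g, e) = (log p(g|e), log p(e))."""
    return [math.log(probs["p_g_given_e"]), math.log(probs["p_e"])]

# The plain noisy channel corresponds to the fixed weight vector w = (1, 1).
w = [1.0, 1.0]

def score(w, feats):
    """Linear model score: w . h(g, e)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

best_e, best_probs = max(candidates, key=lambda c: score(w, h(c[1])))
print(best_e)  # the argmax_e of w . h(g, e); here "man bites dog"
```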

  9. The Noisy Channel. [Figure: candidate translations plotted in a plane whose axes are $-\log p(g \mid e)$ and $-\log p(e)$.]

  10. As a Linear Model. [Figure: the same plane, now with a weight vector $\vec{w}$ that ranks the candidates by their projection onto it.]

  13. As a Linear Model. Improvement 1: reorient the weight vector $\vec{w}$ to find better translations. [Figure: the same plane with the weight vector rotated.]

  17. As a Linear Model. Improvement 2: add dimensions (new features) to make the points separable.

  18. Linear Models
       $$e^* = \arg\max_e w^\top h(g, e)$$
       • Improve the modeling capacity of the noisy channel in two ways:
         • Reorient the weight vector
         • Add new dimensions (new features)
       • Questions:
         • What features $h(g, e)$?
         • How do we set the weights $w$?
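
One way to make "add new dimensions" concrete is to store $h(g, e)$ sparsely, as a map from feature names to values, so new features can be added without touching the scoring code. A minimal sketch; the feature names and weight values below are invented for illustration:

```python
# Sparse feature representation: h(g, e) is a dict that stores only the
# features that fire for this candidate.
def dot(w, h):
    """Linear score w . h for dict-valued weight and feature vectors."""
    return sum(w.get(name, 0.0) * value for name, value in h.items())

# Hypothetical feature vector for one candidate translation.
h_example = {
    "log_p(g|e)": -3.9,        # channel model
    "log_p(e)": -6.9,          # language model
    "word_count": 3.0,         # number of words in the hypothesis
    "lex:Hund=>cat": 1.0,      # a lexical-choice feature
}

# Hypothetical weights; the dubious lexical choice is penalized.
w = {"log_p(g|e)": 1.0, "log_p(e)": 1.0, "word_count": 0.2, "lex:Hund=>cat": -0.5}

print(dot(w, h_example))   # -10.7
```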

  19. Example: the German source sentence Mann beißt Hund (“man bites dog”).

  20. Example: Mann beißt Hund (“man bites dog”); the relation to preserve is x BITES y, with x = Mann and y = Hund.

  21. [Figure: candidate translations of Mann beißt Hund, e.g. “man bites cat”, “man chase dog”, “man bite cat”, “man bite dog”, “dog bites man”, “man bites dog”, differing in lexical choice, grammar, and word order.]

  26. Feature Classes
       Lexical: Are lexical choices appropriate? bank = “River bank” vs. “Financial institution”

  27. Feature Classes
       Lexical: Are lexical choices appropriate? bank = “River bank” vs. “Financial institution”
       Configurational: Are semantic/syntactic relations preserved? “Dog bites man” vs. “Man bites dog”

  28. Feature Classes
       Lexical: Are lexical choices appropriate? bank = “River bank” vs. “Financial institution”
       Configurational: Are semantic/syntactic relations preserved? “Dog bites man” vs. “Man bites dog”
       Grammatical: Is the output fluent / well-formed? “Man bites dog” vs. “Man bite dog”

  29. What do lexical features look like? Example: the source Mann beißt Hund and a candidate translation built from the words man, bites, cat.

  31. What do lexical features look like? First attempt:
       $$\mathrm{score}(g, e) = w^\top h(g, e)$$
       $$h_{15{,}342}(g, e) = \begin{cases} 1, & \exists\, i, j : g_i = \text{Hund},\ e_j = \text{cat} \\ 0, & \text{otherwise} \end{cases}$$

  32. What do lexical features look like? The same first-attempt feature, but what if a cat is being chased by a Hund? The indicator fires whenever both words appear anywhere in g and e, even when they do not correspond to each other.

  33. What do lexical features look like? Latent variables enable more precise features:
       $$\mathrm{score}(g, e, a) = w^\top h(g, e, a)$$
       $$h_{15{,}342}(g, e, a) = \sum_{(i, j) \in a} \begin{cases} 1, & \text{if } g_i = \text{Hund},\ e_j = \text{cat} \\ 0, & \text{otherwise} \end{cases}$$
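
A minimal sketch of the alignment-conditioned feature above; the sentences, the alignment a, and the use of a named feature in place of index 15,342 are illustrative assumptions:

```python
def h_hund_cat(g, e, a):
    """Count the alignment links (i, j) in a that pair g[i] = 'Hund' with e[j] = 'cat'."""
    return sum(1 for (i, j) in a if g[i] == "Hund" and e[j] == "cat")

g = ["Mann", "beißt", "Hund"]
e = ["man", "bites", "cat"]
a = {(0, 0), (1, 1), (2, 2)}   # hypothetical alignment: Mann-man, beißt-bites, Hund-cat

print(h_hund_cat(g, e, a))     # 1: only the link (2, 2) pairs Hund with cat
```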

  34. Standard Features
       • Target-side features
         • log p(e) [n-gram language model]
         • Number of words in the hypothesis
         • Non-English character count
       • Source + target features
         • log relative frequency e|f of each rule [log #(e,f) - log #(f)]
         • log relative frequency f|e of each rule [log #(e,f) - log #(e)]
         • “Lexical translation” log probability e|f of each rule [≈ log p_model1(e|f)]
         • “Lexical translation” log probability f|e of each rule [≈ log p_model1(f|e)]
       • Other features
         • Count of rules/phrases used
         • Reordering pattern probabilities
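
The bracketed relative-frequency formulas above are simple log count ratios. A minimal sketch of computing them, assuming rule counts have already been extracted from a parallel corpus; the counts below are invented:

```python
import math
from collections import Counter

# Hypothetical rule counts #(e, f) and marginal counts #(f), #(e)
# (in practice these come from rule extraction over a parallel corpus).
pair_counts = Counter({("dog", "Hund"): 900, ("cat", "Hund"): 30})
f_counts = Counter({"Hund": 1000})
e_counts = Counter({"dog": 1200, "cat": 500})

def rel_freq_features(e, f):
    """The two log relative-frequency features from the bracketed formulas."""
    log_pair = math.log(pair_counts[(e, f)])
    return {
        "log_rf_e|f": log_pair - math.log(f_counts[f]),  # log #(e,f) - log #(f)
        "log_rf_f|e": log_pair - math.log(e_counts[e]),  # log #(e,f) - log #(e)
    }

print(rel_freq_features("dog", "Hund"))
```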

  35. Parameter Learning

  36. Hypothesis Space. [Figure: a feature space with axes $h_1$ and $h_2$.]

  37. Hypothesis Space. [Figure: hypotheses plotted as points in the ($h_1$, $h_2$) space.]

  38. Hypothesis Space. [Figure: reference translations shown in the same ($h_1$, $h_2$) space.]

  39. Preliminaries. We assume a decoder that computes
       $$\langle e^*, a^* \rangle = \arg\max_{\langle e, a \rangle} w^\top h(g, e, a)$$
       and K-best lists of that, that is,
       $$\{ \langle e_i^*, a_i^* \rangle \}_{i=1}^{K} = \arg\, i\text{th-max}_{\langle e, a \rangle}\; w^\top h(g, e, a).$$
       Standard, efficient algorithms exist for this.
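
Real decoders extract K-best lists from packed search structures; as a stand-in, this toy sketch just scores an explicitly enumerated candidate list and sorts it, to make the notation concrete (candidates, derivations, and features are invented):

```python
def score(w, h):
    """Linear score w . h for dict-valued weights and features."""
    return sum(w.get(k, 0.0) * v for k, v in h.items())

def k_best(candidates, w, K):
    """Return the K highest-scoring <e, a> pairs under w . h(g, e, a)."""
    return sorted(candidates, key=lambda c: score(w, c["h"]), reverse=True)[:K]

# Hypothetical candidates, each carrying its derivation a and feature map h.
candidates = [
    {"e": "man bites dog", "a": {(0, 0), (1, 1), (2, 2)}, "h": {"lm": -6.9, "tm": -3.9}},
    {"e": "dog bites man", "a": {(0, 2), (1, 1), (2, 0)}, "h": {"lm": -7.1, "tm": -3.9}},
    {"e": "man bite dog",  "a": {(0, 0), (1, 1), (2, 2)}, "h": {"lm": -9.2, "tm": -3.5}},
]
w = {"lm": 1.0, "tm": 1.0}
for rank, c in enumerate(k_best(candidates, w, K=2), start=1):
    print(rank, c["e"])
```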

  40. Learning Weights
       • Try to match the reference translation exactly
         • Conditional random field: maximize the conditional probability of the reference translations; “average” over the different latent variables
         • Max-margin: find the weight vector that separates the reference translation from the others by the maximal margin; maximal setting of the latent variables
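
As an illustration of the conditional-likelihood (CRF-style) option, here is a sketch of one gradient step in which the normalizer is approximated over a K-best list and latent variables are ignored; these simplifications, and all of the numbers, are assumptions made for the example rather than the exact method on the slide:

```python
import math

def score(w, h):
    return sum(w.get(k, 0.0) * v for k, v in h.items())

def crf_gradient(w, kbest, ref_index):
    """Gradient of log p(reference | g): h(reference) minus the expected feature
    vector under the model, with the expectation approximated over the K-best list."""
    scores = [score(w, c["h"]) for c in kbest]
    m = max(scores)
    probs = [math.exp(s - m) for s in scores]
    z = sum(probs)
    probs = [p / z for p in probs]
    grad = dict(kbest[ref_index]["h"])              # observed features
    for p, c in zip(probs, kbest):                  # minus expected features
        for k, v in c["h"].items():
            grad[k] = grad.get(k, 0.0) - p * v
    return grad

# One gradient-ascent step on a toy K-best list (hypothetical features;
# the first entry happens to match the reference).
kbest = [
    {"e": "man bites dog", "h": {"lm": -6.9, "tm": -3.9}},
    {"e": "dog bites man", "h": {"lm": -7.1, "tm": -3.9}},
]
w = {"lm": 1.0, "tm": 1.0}
g = crf_gradient(w, kbest, ref_index=0)
w = {k: w.get(k, 0.0) + 0.1 * g.get(k, 0.0) for k in set(w) | set(g)}
print(w)
```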

  42. Problems
       • These methods give “full credit” when the model exactly produces the reference and no credit otherwise
       • What is the problem with this?
         • There are many ways to translate a sentence
         • What if we have multiple reference translations?
         • What about partial credit?

  44. Cost-Sensitive Training
       • Assume we have a cost function that scores how good or bad a translation is: $\Delta(\hat{e}, E) \in [0, 1]$
       • Optimize the weight vector by making reference to this function
       • We will talk about two ways to do this
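
As an example of what such a cost function could look like, here is a toy stand-in based on clipped unigram precision against a set of references; real systems typically use a BLEU-derived cost, and this particular choice is only for illustration:

```python
from collections import Counter

def cost(hyp, refs):
    """Δ(ê, E) in [0, 1]: 0 is best, 1 is worst. A toy stand-in based on
    clipped unigram precision against a set of reference translations."""
    hyp_tokens = hyp.split()
    if not hyp_tokens:
        return 1.0
    hyp_counts = Counter(hyp_tokens)
    # Allow each hypothesis word at most its maximum count in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], count)
    matched = sum(min(count, max_ref[word]) for word, count in hyp_counts.items())
    return 1.0 - matched / len(hyp_tokens)

print(cost("man bites dog", ["man bites dog", "the man is biting the dog"]))  # 0.0
print(cost("dog chase cat", ["man bites dog"]))                               # ~0.67
```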

  45. K-Best List Example. [Figure: hypotheses in the ($h_1$, $h_2$) space together with a weight vector $\vec{w}$.]

  46. K-Best List Example. [Figure: the same space, with the hypotheses ranked #1 through #10 by their score under $\vec{w}$.]
