Bootstrapping via Graph Propagation
Max Whitney and Anoop Sarkar
Simon Fraser University, Natural Language Laboratory (SFU NatLangLab), http://natlang.cs.sfu.ca

Bootstrapping: semi-supervised (vs. supervised); single domain (vs. domain ...)


  1. Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999). Seed DL: 1.0 context: served (sense 1); 1.0 context: reads (sense 2). Label data (example sentences for the word "sentence"): "Full time should be served for each sentence ." / "The Liberals inserted a sentence of 14 words which reads :" / "The sentence for such an offence would be a term of imprisonment for one year ." / "Mr. Speaker , I have a question based on the very last sentence of the hon. member ." / ... Train DL and threshold: 1.0 context: served (sense 1); 1.0 context: reads (sense 2); .976 context: serv* (sense 1); .976 context: read* (sense 2); .969 next word: reads (sense 2); .969 next word: read* (sense 2); .955 previous word: his (sense 1); .955 previous word: hi* (sense 1); .955 context: inmate (sense 1); previous word: relevant ...

  2. (Animation frame; identical to slide 1.)

  3. Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999). Same loop as above (seed DL, label data, train DL, threshold), plus a final re-training step that applies no threshold before testing.
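
The three slides above describe the basic loop: apply the current decision list (DL) to label the data, train a new DL on the labelled examples, keep only rules above a score threshold, and finally re-train with no threshold. Below is a minimal Python sketch of that loop, assuming examples are represented as feature sets and rules are scored by a simple smoothed precision; the function names and scoring details are illustrative, not the exact formulation of (Yarowsky, 1995) or (Collins and Singer, 1999).

    from collections import Counter, defaultdict

    def train_dl(labelled, threshold=0.95, smoothing=0.1):
        """Score every (feature, label) pair on the currently labelled examples
        and keep rules whose smoothed precision clears the threshold."""
        counts = defaultdict(Counter)              # feature -> Counter over labels
        for features, label in labelled:
            for f in features:
                counts[f][label] += 1
        rules = []
        for f, label_counts in counts.items():
            total = sum(label_counts.values())
            for label, c in label_counts.items():
                score = (c + smoothing) / (total + smoothing * len(label_counts))
                if score >= threshold:
                    rules.append((score, f, label))
        return sorted(rules, reverse=True)         # highest-precision rules first

    def apply_dl(dl, features):
        """Label one example with the highest-ranked rule whose feature it contains."""
        for score, f, label in dl:
            if f in features:
                return label
        return None                                # no rule fires: leave unlabelled

    def yarowsky(examples, seed_dl, iterations=6, threshold=0.95):
        """examples: list of feature sets; seed_dl: list of (score, feature, label)."""
        dl = list(seed_dl)
        for _ in range(iterations):
            labelled = [(x, y) for x in examples
                        if (y := apply_dl(dl, x)) is not None]      # label data
            dl = sorted(set(seed_dl) | set(train_dl(labelled, threshold)),
                        reverse=True)                               # train DL + threshold
        # final re-training with no threshold (slide 3)
        labelled = [(x, y) for x in examples if (y := apply_dl(dl, x)) is not None]
        return train_dl(labelled, threshold=0.0)

The seed rules are kept in the list on every round here; how exactly seeds are retained and how rule scores are smoothed are design choices that vary between implementations.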

  4. Example decision list for the named entity task.
     Rank  Score     Feature                  Label
     1     0.999900  New-York                 loc.
     2     0.999900  California               loc.
     3     0.999900  U.S.                     loc.
     4     0.999900  Microsoft                org.
     5     0.999900  I.B.M.                   org.
     6     0.999900  Incorporated             org.
     7     0.999900  Mr.                      per.
     8     0.999976  U.S.                     loc.
     9     0.999957  New-York-Stock-Exchange  loc.
     10    0.999952  California               loc.
     11    0.999947  New-York                 loc.
     12    0.999946  court-in                 loc.
     13    0.975154  Company-of               loc.
     . . .
     Context features are indicated by italics; all others are spelling features. Seed rules are indicated by bold ranks.

  5. Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999). [Plot, advanced one frame per slide through slide 14: DL size / number of labelled train examples (left axis) and test accuracy (right axis) vs. iteration.] Iteration 0: 1 rule per sense (1.0 context: served; 1.0 context: reads); train 4 / 4.

  6. Iteration 1: 46 rules for sense 1 (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas*, ...) and 31 rules for sense 2 (1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads, ...); train 114 / 37.

  7. (Animation frame; identical to slide 6.)

  8. (Animation frame; identical to slide 6.)

  9. (Same as slide 6, with the test-accuracy curve highlighted.)

  10. Iteration 2: 854 rules for sense 1 (1.0 context: served, .998 next: .*, .998 next: ., .995 context: serv*, .995 context: prison*, ...) and 214 rules for sense 2 (1.0 context: reads, .991 context: read*, .984 context: read, .976 context: reads, .969 context: 11*, ...); train 238 / 56.

  11. Iteration 3: 1520 rules for sense 1 (1.0 context: served, .998 next: .*, .998 next: ., .960 context: life*, .960 context: life, ...) and 223 rules for sense 2 (1.0 context: reads, .991 context: read*, .984 context: read, .984 next: :*, .984 next: :, ...); train 242 / 49.

  12. Iteration 4: 1557 rules for sense 1 (1.0 context: served, .998 next: .*, .998 next: ., .996 context: life*, .996 context: life, ...) and 221 rules for sense 2 (1.0 context: reads, .991 context: read*, .984 context: read, .984 next: :*, .984 next: :, ...); train 247 / 49.

  13. Iteration 5: no change (1557 / 221 rules; train 247 / 49).

  14. Iteration 6: no change (1557 / 221 rules; train 247 / 49).

  15. Performance: Yarowsky 81.49 (% clean non-seeded accuracy, named entity).

  16. Vs. co-training. DL-CoTrain from (Collins and Singer, 1999): Yarowsky 81.49 vs. DL-CoTrain non-cautious 85.73 (% clean non-seeded accuracy, named entity).

  17. Vs. co-training (same scores as slide 16). Co-training needs two views, e.g.: ◮ adjacent words { next word: a, next word: about, next word: according, ... } ◮ context words { context: abolition, context: abundantly, context: accepting, ... }
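
As a rough illustration of the two-view idea, here is a hedged sketch of the DL-CoTrain alternation: the decision list trained on one view labels the data that trains the decision list for the other view. The view functions and the reuse of the earlier train_dl/apply_dl helpers are assumptions for illustration; this is not the exact algorithm of (Collins and Singer, 1999).

    def dl_cotrain(examples, view1, view2, seed_dl, train_dl, apply_dl, rounds=10):
        """view1/view2 map an example to its feature set for that view (e.g.
        adjacent-word vs. context-word features); train_dl/apply_dl are rule
        learning/application helpers such as the ones sketched earlier."""
        views = (view1, view2)
        dls = [list(seed_dl), list(seed_dl)]
        for _ in range(rounds):
            for v in (0, 1):
                other = 1 - v
                # label data with the *other* view's current decision list ...
                labelled = [(views[v](x), y) for x in examples
                            if (y := apply_dl(dls[other], views[other](x))) is not None]
                # ... and retrain this view's decision list on those labels
                dls[v] = train_dl(labelled)
        return dls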

  18. Vs. EM. EM algorithm from (Collins and Singer, 1999): Yarowsky 81.49 vs. EM 80.31 (% clean non-seeded accuracy, named entity).

  19. Vs. EM (same scores as slide 18). With Yarowsky we can exploit type-level information in the DL.

  20. Vs. EM. EM keeps expected counts on the data (x1, x2, x3, x4, x5, ...) and probabilities on the features (f1, f2, f3, f4, f5, ...).

  21. Vs. EM. Yarowsky, in contrast, keeps labelled training data (x1, x2, x3, x4, x5, ...) and a decision list over the features (f1, f2, f3, f4, f5, ...).

  22. Vs. EM. Yarowsky additionally keeps a trimmed DL, e.g. f1, f3, f5, ...

  23. Cautiousness. Can we improve decision list trimming?

  24. Cautiousness. ◮ (Collins and Singer, 1999) cautiousness: take the top n rules for each label, with n = 5, 10, 15, ... growing by iteration.

  25. Cautiousness (adds to slide 24). ◮ Yarowsky-cautious ◮ DL-CoTrain cautious
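
A minimal sketch of the cautious trimming step described above, reusing the (score, feature, label) rule tuples from the earlier sketch. Whether the seed rules count toward n, and how ties are broken, are assumptions made here for illustration.

    from collections import defaultdict

    def cautious_trim(rules, n, seed_dl=()):
        """Keep only the top n highest-scoring rules for each label, always
        retaining the seed rules; n grows (5, 10, 15, ...) across iterations."""
        by_label = defaultdict(list)
        for rule in rules:                       # rule = (score, feature, label)
            by_label[rule[2]].append(rule)
        kept = set(seed_dl)
        for label_rules in by_label.values():
            kept.update(sorted(label_rules, reverse=True)[:n])
        return sorted(kept, reverse=True)

    # Inside the bootstrapping loop, iteration `it` would use n = 5 * (it + 1):
    #     dl = cautious_trim(train_dl(labelled, threshold),
    #                        n=5 * (it + 1), seed_dl=seed_dl)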

  26. Yarowsky-cautious algorithm (Collins and Singer, 1999). [Plot, advanced one frame per slide through slide 47: DL size / number of labelled train examples (left axis) and test accuracy (right axis) vs. iteration.] Iteration 0: 1 rule per sense (1.0 context: served; 1.0 context: reads); train 4 / 4.

  27. Iteration 1: 6 rules per sense. Sense 1: 1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas*, ...; sense 2: 1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads, ... Train 25 / 12.

  28. Iteration 2: 11 rules per sense. Sense 1: 1.0 context: served, .995 context: serv*, .989 context: serve, .986 context: serving, .984 context: life*, ...; sense 2: 1.0 context: reads, .991 context: read*, .984 context: read, .976 context: reads, .969 next: from*, ... Train 62 / 20.

  29. Iteration 3: 16 rules per sense. Sense 1: 1.0 context: served, .996 context: life*, .996 context: life, .995 context: serv*, .995 context: prison*, ...; sense 2: 1.0 context: reads, .991 context: read*, .991 next: from*, .991 next: from, .984 context: read, ... Train 84 / 32.

  30. Iteration 4: 21 rules per sense. Sense 1: 1.0 context: served, .996 context: commut*, .996 context: life*, .996 context: life, .995 context: serv*, ...; sense 2: 1.0 context: reads, .991 context: read*, .991 next: from*, .991 next: from, .989 context: quot*, ... Train 100 / 36.

  31. Iteration 5: 26 rules per sense (same top rules); train 114 / 40.

  32. Iteration 6: 31 rules per sense; train 128 / 40.

  33. Iteration 7: 36 rules per sense; sense 1 adds context: year* (.965); train 139 / 40.

  34. Iteration 8: 41 rules per sense; context: year* now .969; train 139 / 48.

  35. Iteration 9: 46 rules per sense; train 139 / 51.

  36. Iteration 10: 51 rules per sense; train 146 / 53.

  37. Iteration 11: 56 rules per sense; train 156 / 54.

  38. Iteration 12: 61 rules per sense; train 159 / 57.

  39. Iteration 13: 66 rules per sense; train 159 / 58.

  40. Iteration 14: 71 rules per sense; train 163 / 58.

  41. Iteration 15: 76 rules per sense; train 165 / 58.

  42. Iteration 16: 81 rules per sense; train 166 / 58.

  43. Iteration 17: 86 rules per sense; train 169 / 58.

  44. Iteration 18: 91 rules per sense; train 170 / 58.

  45. Iteration 19: 96 rules per sense; train 170 / 58.

  46. Iteration 20: 101 rules per sense; train 172 / 59.

  47. (Animation frame; identical to slide 46.)

  48. Yarowsky-cautious vs. co-training and EM (% clean non-seeded accuracy, named entity): EM 80.31; DL-CoTrain non-cautious 85.73; Yarowsky non-cautious 81.49; DL-CoTrain cautious 90.49; Yarowsky-cautious 89.97 (the last two are statistically equivalent).

  49. (Same table.) ◮ Yarowsky performs well

  50. (Same table.) ◮ Yarowsky performs well ◮ Cautiousness is important

  51. (Same table.) ◮ Yarowsky performs well ◮ Cautiousness is important ◮ Yarowsky does not need views

  52. Did we really do EM right?

  53. Did we really do EM right? DL-CoTrain cautious 90.49; Yarowsky-cautious 89.97; EM 80.31; Hard EM 80.94; Online EM 83.89; Hard Online EM 80.49 (% clean non-seeded accuracy, named entity).

  54. (Same table.) Multiple runs of EM; variance of results: ◮ EM: ±.34 ◮ Hard EM: ±2.53 ◮ Online EM: ±.45 ◮ Hard Online EM: ±.68

  55. Yarowsky algorithm: (Abney, 2004)'s analysis. The Yarowsky algorithm lacks theoretical analysis.

  56. (Adds to slide 55.) ◮ (Abney, 2004) gives bounds for some variants (no cautiousness, no algorithm)

  57. (Adds to slide 56.) ◮ Basis for our work

  58. (Adds to slide 57.) Training examples x, labels j: ◮ "Full time should be served for each sentence ." ◮ "The Liberals inserted a sentence of 14 words which reads :" ◮ "They get a concurrent sentence with no additional time added to their sentence ." ◮ "The words tax relief appeared in every second sentence in the federal government's throne speech ." ... Labelling distributions φ_x(j): peaked for a labelled example x, uniform for an unlabelled example x.

  59. (Adds to slide 58.) Features f, labels j: ◮ context: reads ◮ context: served ◮ context: inmate ◮ next: the ◮ context: article ◮ previous: introductory ◮ previous: passing ◮ next: said ... Parameter distributions θ_f(j): normalized DL scores for feature f. The DL chooses arg max_j max_{f ∈ F_x} θ_f(j).

  60. (Adds to slide 59.) Alternative decision rule: arg max_j Σ_{f ∈ F_x} θ_f(j).
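
The two decision rules on slides 59-60, written out as a small sketch. Here theta is assumed to be a nested dict of normalized DL scores (feature -> label -> score), and the example values are invented.

    def dl_label_max(theta, features, labels):
        """Decision-list prediction: arg max_j max_{f in F_x} theta_f(j)
        (the label of the single best matching feature)."""
        return max(labels, key=lambda j: max(theta[f][j] for f in features))

    def dl_label_sum(theta, features, labels):
        """Alternative rule from slide 60: arg max_j sum_{f in F_x} theta_f(j)."""
        return max(labels, key=lambda j: sum(theta[f][j] for f in features))

    # theta maps feature -> {label: normalized DL score}; illustrative values only.
    theta = {"context: served": {"sense 1": 0.95, "sense 2": 0.05},
             "context: reads":  {"sense 1": 0.03, "sense 2": 0.97}}
    features = {"context: served", "context: reads"}
    print(dl_label_max(theta, features, ["sense 1", "sense 2"]))   # -> sense 2
    print(dl_label_sum(theta, features, ["sense 1", "sense 2"]))   # -> sense 2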

  61. Yarowsky algorithm: (Haffari and Sarkar, 2007)'s analysis. ◮ (Haffari and Sarkar, 2007) extend (Abney, 2004) to a bipartite graph representation (polytime algorithm; no cautiousness).

  62. (Adds to slide 61.) Bipartite graph with feature nodes θ_f1, θ_f2, ..., θ_f|F| on one side and example nodes φ_x1, φ_x2, ..., φ_x|X| on the other.

  63. (Adds to slide 62.) Features f carry parameter distributions θ_f(j).

  64. (Adds to slide 63.) Examples x carry labelling distributions φ_x(j).

  65. (Adds to slide 64.) Algorithm: fix one side, update the other.
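
A sketch of the "fix one side, update the other" scheme on this bipartite graph, using a simple neighbour-averaging update. The distributions are numpy arrays over labels; the specific averaging update is an illustrative assumption, not necessarily the exact update analysed by Haffari and Sarkar.

    from collections import defaultdict

    import numpy as np

    def normalize(p):
        """Clip away zeros and renormalize to a probability vector."""
        p = np.clip(p, 1e-12, None)
        return p / p.sum()

    def propagate(theta, phi, edges, seeds, iterations=10):
        """Holding the example distributions phi fixed, set each feature
        distribution theta_f to the normalized average of its neighbouring
        phi_x; then hold theta fixed and update the phi_x of unlabelled
        examples the same way.  Seed (labelled) examples keep their peaked
        distributions."""
        feat_nbrs = defaultdict(list)
        ex_nbrs = defaultdict(list)
        for f, x in edges:                       # edge = (feature, example)
            feat_nbrs[f].append(x)
            ex_nbrs[x].append(f)
        for _ in range(iterations):
            for f, nbrs in feat_nbrs.items():
                theta[f] = normalize(np.mean([phi[x] for x in nbrs], axis=0))
            for x, nbrs in ex_nbrs.items():
                if x not in seeds:
                    phi[x] = normalize(np.mean([theta[f] for f in nbrs], axis=0))
        return theta, phi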

  66. Objective Function. ◮ KL divergence between two probability distributions: KL(p || q) = Σ_i p(i) log ( p(i) / q(i) ).

  67. (Adds to slide 66.) ◮ Entropy of a distribution: H(p) = − Σ_i p(i) log p(i).

  68. (Adds to slide 67.) ◮ The objective function: K(φ, θ) = Σ_{(f_i, x_j) ∈ Edges} [ KL(θ_{f_i} || φ_{x_j}) + H(θ_{f_i}) + H(φ_{x_j}) ] + Regularizer.

  69. (Adds to slide 68.) ◮ Reduce uncertainty in the labelling distribution while respecting the labelled data.
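
The objective transcribed directly into code so the pieces are easy to see. Distributions are plain probability vectors assumed to be strictly positive, and the regularizer is left as a passed-in constant.

    import numpy as np

    def kl(p, q):
        """KL(p || q) = sum_i p(i) log(p(i) / q(i)); assumes strictly positive p, q."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(p * np.log(p / q)))

    def entropy(p):
        """H(p) = -sum_i p(i) log p(i); assumes strictly positive p."""
        p = np.asarray(p, float)
        return float(-np.sum(p * np.log(p)))

    def objective(theta, phi, edges, regularizer=0.0):
        """K(phi, theta): sum over graph edges (f, x) of
        KL(theta_f || phi_x) + H(theta_f) + H(phi_x), plus a regularizer term."""
        return sum(kl(theta[f], phi[x]) + entropy(theta[f]) + entropy(phi[x])
                   for f, x in edges) + regularizer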

  70. Generalized Objective Function. ◮ Bregman divergence between two probability distributions: B_ψ(p || q) = Σ_i [ ψ(p(i)) − ψ(q(i)) − ψ′(q(i)) (p(i) − q(i)) ]; with ψ(t) = t log t, B_ψ(p || q) = KL(p || q).

  71. (Adds to slide 70.) ◮ ψ-entropy of a distribution: H_ψ(p) = − Σ_i ψ(p(i)); with ψ(t) = t log t, H_ψ(p) = H(p).

  72. (Adds to slide 71.) ◮ The generalized objective function: K_ψ(φ, θ) = Σ_{(f_i, x_j) ∈ Edges} [ B_ψ(θ_{f_i} || φ_{x_j}) + H_ψ(θ_{f_i}) + H_ψ(φ_{x_j}) ] + Regularizer.
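
The generalized divergence and entropy written out, with the t log t generator checked against KL as on the slides; the variable names and the numeric example are invented for illustration.

    import numpy as np

    def bregman(p, q, psi, psi_prime):
        """B_psi(p || q) = sum_i psi(p(i)) - psi(q(i)) - psi'(q(i)) (p(i) - q(i))."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(psi(p) - psi(q) - psi_prime(q) * (p - q)))

    def psi_entropy(p, psi):
        """H_psi(p) = -sum_i psi(p(i))."""
        return float(-np.sum(psi(np.asarray(p, float))))

    # With psi(t) = t log t the Bregman divergence reduces to KL divergence
    # (and H_psi to Shannon entropy), matching the t log t case above.
    psi = lambda t: t * np.log(t)
    dpsi = lambda t: np.log(t) + 1.0
    p, q = np.array([0.7, 0.3]), np.array([0.5, 0.5])
    print(bregman(p, q, psi, dpsi))            # same value as ...
    print(float(np.sum(p * np.log(p / q))))    # ... KL(p || q)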

  73. Generalized Objective Function. [Figure: the Bregman divergence for one coordinate, showing ψ(p(i)) against the tangent line ψ(q(i)) + ψ′(q(i)) (p(i) − q(i)) taken at q(i), with a = ψ(p(i)) − ψ(q(i)) and b = ψ′(q(i)) (p(i) − q(i)); the divergence term is the gap a − b.]

  74. Variants from (Abney, 2004; Haffari and Sarkar, 2007): Yarowsky non-cautious 81.49; Yarowsky-cautious 89.97; Yarowsky-cautious sum 90.49; HaffariSarkar-bipartite avg-maj 79.69 (% clean non-seeded accuracy, named entity).

  75. Graph-based Propagation (Subramanya et al., 2010). Self-training with CRFs, looping over: seed data → train CRF → get posteriors → get types → graph propagate → label data.
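
A high-level sketch of that loop, following the stages named on the slide. All helper functions here (train_crf, crf_posteriors, aggregate_by_type, graph_propagate, decode_with_types) are hypothetical placeholders standing in for a CRF toolkit and a graph-propagation routine, not a real API.

    def self_train_with_graph(seed_data, unlabelled, graph, rounds=5):
        """Train a CRF, collect token-level posteriors on unlabelled data,
        aggregate them to n-gram types, smooth the type-level distributions
        over the similarity graph, and use the result to label data for the
        next round of CRF training."""
        labelled = list(seed_data)
        crf = None
        for _ in range(rounds):
            crf = train_crf(labelled)                          # train CRF
            posteriors = crf_posteriors(crf, unlabelled)       # get posteriors
            type_dists = aggregate_by_type(posteriors)         # get types
            type_dists = graph_propagate(type_dists, graph)    # graph propagate
            labelled = list(seed_data) + decode_with_types(    # label data
                crf, unlabelled, type_dists)
        return crf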
