Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

[Diagram: a loop — seed DL → label data → train DL (threshold) → label data → ...]

Seed DL:
  1.0  context: served  → sense 1
  1.0  context: reads   → sense 2

Labelled data (word-sense task, target word "sentence"):
◮ Full time should be served for each sentence .
◮ The Liberals inserted a sentence of 14 words which reads :
◮ The sentence for such an offence would be a term of imprisonment for one year .
◮ Mr. Speaker , I have a question based on the very last sentence of the hon. member .
. . .

Trained DL (rules above the threshold):
  1.0   context: served          → sense 1
  1.0   context: reads           → sense 2
  .976  context: serv*           → sense 1
  .976  context: read*           → sense 2
  .969  next word: reads         → sense 2
  .969  next word: read*         → sense 2
  .955  previous word: his       → sense 1
  .955  previous word: hi*       → sense 1
  .955  context: inmate          → sense 1
  ...   previous word: relevant
Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

[Diagram: the same seed DL → label data → train DL loop, followed by a final re-training with no threshold; the resulting DL is then applied to the test data. A short code sketch of the whole loop follows.]
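A minimal Python sketch of this loop, assuming unsmoothed rule precisions; the names (train_dl, apply_dl) and the cutoff THRESHOLD = 0.95 are illustrative assumptions, not the original implementation:

from collections import Counter

# Assumed score cutoff for accepting a rule into the DL during bootstrapping.
THRESHOLD = 0.95

def train_dl(labelled, threshold=None):
    """Build a decision list: one (score, feature, label) rule per feature/label pair."""
    feat_label, feat_total = Counter(), Counter()
    for features, label in labelled:
        for f in features:
            feat_label[(f, label)] += 1
            feat_total[f] += 1
    rules = [(n / feat_total[f], f, label)      # unsmoothed precision of rule f -> label
             for (f, label), n in feat_label.items()]
    if threshold is not None:
        rules = [r for r in rules if r[0] >= threshold]
    return sorted(rules, reverse=True)          # highest-scoring rules first

def apply_dl(rules, features):
    """Label an example with its highest-ranked matching rule, if any."""
    for score, f, label in rules:
        if f in features:
            return label
    return None

def yarowsky(seed_rules, unlabelled, iterations=20):
    rules = seed_rules
    for _ in range(iterations):
        # Label whatever the current DL matches ...
        labelled = [(x, apply_dl(rules, x)) for x in unlabelled]
        labelled = [(x, y) for (x, y) in labelled if y is not None]
        # ... then retrain, keeping only rules above the threshold.
        rules = train_dl(labelled, threshold=THRESHOLD)
    # Final re-training with no threshold: keep every rule.
    labelled = [(x, apply_dl(rules, x)) for x in unlabelled]
    return train_dl([(x, y) for (x, y) in labelled if y is not None])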
Example decision list for the named entity task

  Rank  Score     Feature                  Label
  1     0.999900  New-York                 loc.  (seed)
  2     0.999900  California               loc.  (seed)
  3     0.999900  U.S.                     loc.  (seed)
  4     0.999900  Microsoft                org.  (seed)
  5     0.999900  I.B.M.                   org.  (seed)
  6     0.999900  Incorporated             org.  (seed)
  7     0.999900  Mr.                      per.  (seed)
  8     0.999976  U.S.                     loc.
  9     0.999957  New-York-Stock-Exchange  loc.
  10    0.999952  California               loc.
  11    0.999947  New-York                 loc.
  12    0.999946  court-in                 loc.  (context)
  13    0.975154  Company-of               loc.  (context)
  . . .

Context features are marked "(context)"; all others are spelling features. Seed rules are marked "(seed)".
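Scores such as 0.999976 are smoothed precisions. A sketch of one standard smoothing in the spirit of Collins and Singer (1999) — the α = 0.1 constant and the exact formula are assumptions for illustration, not read off the slide:

def rule_score(count_f_label, count_f, num_labels, alpha=0.1):
    """Smoothed precision of the rule f -> label:
    (Count(f, label) + alpha) / (Count(f) + num_labels * alpha)."""
    return (count_f_label + alpha) / (count_f + num_labels * alpha)

# e.g. a hypothetical feature seen 7008 times, always with label loc., 3 labels:
# (7008 + 0.1) / (7008 + 0.3) ≈ 0.99997 — near 1 but never exactly 1.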
Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

[Animated figure: DL size and number of labelled training examples (left axis, 0–1800) and test accuracy (right axis, 0–1) against iteration (0–20), built up one iteration per frame.]

Iteration 0 (seeds): 1 rule per sense — 1.0 context: served → sense 1; 1.0 context: reads → sense 2. Labelled train examples: 4 + 4.

Iteration 1: 46 rules for sense 1 (1.0 context: served; .976 context: serv*; .976 context: served; .955 context: inmat*; .955 context: releas*; . . .) and 31 rules for sense 2 (1.0 context: reads; .976 context: read*; .976 context: reads; .969 next: read*; .969 next: reads; . . .). Labelled: 114 + 37.

Iteration 2: 854 rules for sense 1 (1.0 context: served; .998 next: .*; .998 next: .; .995 context: serv*; .995 context: prison*; . . .) and 214 rules for sense 2 (1.0 context: reads; .991 context: read*; .984 context: read; .976 context: reads; .969 context: 11*; . . .). Labelled: 238 + 56.

Iteration 3: 1520 rules for sense 1 (1.0 context: served; .998 next: .*; .998 next: .; .960 context: life*; .960 context: life; . . .) and 223 rules for sense 2 (1.0 context: reads; .991 context: read*; .984 context: read; .984 next: :*; .984 next: :; . . .). Labelled: 242 + 49.

Iteration 4: 1557 rules for sense 1 (1.0 context: served; .998 next: .*; .998 next: .; .996 context: life*; .996 context: life; . . .) and 221 rules for sense 2 (top rules as in iteration 3). Labelled: 247 + 49.

Iterations 5–6: the DLs (1557 and 221 rules) and the labelled set (247 + 49) no longer change — the non-cautious run has converged.
Performance

  Yarowsky  81.49

  % clean non-seeded accuracy (named entity)
Vs. co-training

DL-CoTrain from (Collins and Singer, 1999):

  Yarowsky                 81.49
  DL-CoTrain non-cautious  85.73

  % clean non-seeded accuracy (named entity)

Co-training needs two views, e.g.:
◮ adjacent words { next word: a, next word: about, next word: according, . . . }
◮ context words { context: abolition, context: abundantly, context: accepting, . . . }
Vs. EM

EM algorithm from (Collins and Singer, 1999):

  Yarowsky  81.49
  EM        80.31

  % clean non-seeded accuracy (named entity)

With Yarowsky we can exploit type-level information in the DL
Vs. EM

EM:
  Expected counts on data:   x_1, x_2, x_3, x_4, x_5, . . .
  Probabilities on features: f_1, f_2, f_3, f_4, f_5, . . .

Yarowsky:
  Labelled training data: x_1, x_2, x_3, x_4, x_5, . . .
  Decision list:          f_1, f_2, f_3, f_4, f_5, . . .
  Trimmed DL:             f_1, f_3, f_5, . . .
Cautiousness

Can we improve decision list trimming?
◮ (Collins and Singer, 1999) cautiousness: take only the top n rules for each label, with n = 5, 10, 15, . . . growing by iteration (sketched below)
◮ Yarowsky-cautious
◮ DL-CoTrain cautious
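A minimal sketch of the cautious trimming step, assuming the (score, feature, label) rule tuples from the earlier sketch; only the n = 5, 10, 15, . . . schedule is from the slide, the rest is illustrative:

def cautious_trim(rules, iteration, step=5):
    """Keep only the top n rules per label, n growing by `step` each iteration."""
    n = step * (iteration + 1)          # n = 5, 10, 15, ... by iteration
    kept = []
    for label in {lab for _, _, lab in rules}:
        per_label = sorted((r for r in rules if r[2] == label), reverse=True)
        kept.extend(per_label[:n])      # best-scoring rules for this label
    return sorted(kept, reverse=True)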
Yarowsky-cautious algorithm (Collins and Singer, 1999)

[Animated figure: DL size and number of labelled training examples (left axis, 0–400) and test accuracy (right axis, 0–1) against iteration (0–60), built up one iteration per frame.]

The rule cap grows by 5 per label per iteration, so the DL and the labelled set grow slowly:

  Iter.  Rules per label  Labelled (sense 1 + sense 2)
  0      1 (seeds)        4 + 4
  1      6                25 + 12
  2      11               62 + 20
  3      16               84 + 32
  4      21               100 + 36
  5      26               114 + 40
  6      31               128 + 40
  7      36               139 + 40
  8      41               139 + 48
  9      46               139 + 51
  10     51               146 + 53
  11     56               156 + 54
  12     61               159 + 57
  13     66               159 + 58
  14     71               163 + 58
  15     76               165 + 58
  16     81               166 + 58
  17     86               169 + 58
  18     91               170 + 58
  19     96               170 + 58
  20     101              172 + 59

Sample top rules at iteration 1 — sense 1: 1.0 context: served; .976 context: serv*; .976 context: served; .955 context: inmat*; .955 context: releas*; sense 2: 1.0 context: reads; .976 context: read*; .976 context: reads; .969 next: read*; .969 next: reads. By iteration 20 — sense 1: 1.0 context: served; .969 context: year*; .996 context: commut*; .996 context: life*; .996 context: life; sense 2: 1.0 context: reads; .991 context: read*; .991 next: from*; .991 next: from; .989 context: quot*.
Yarowsky-cautious vs. co-training and EM

  EM                       80.31
  Yarowsky non-cautious    81.49
  DL-CoTrain non-cautious  85.73
  DL-CoTrain cautious      90.49  } statistically
  Yarowsky-cautious        89.97  } equivalent

  % clean non-seeded accuracy (named entity)

◮ Yarowsky performs well
◮ Cautiousness is important
◮ Yarowsky does not need views
Did we really do EM right?

  DL-CoTrain cautious  90.49
  Yarowsky-cautious    89.97
  EM                   80.31
  Hard EM              80.94
  Online EM            83.89
  Hard Online EM       80.49

  % clean non-seeded accuracy (named entity)

Multiple runs of EM. Variance of results:
◮ EM: ±.34
◮ Hard EM: ±2.53
◮ Online EM: ±.45
◮ Hard Online EM: ±.68
Yarowsky algorithm: (Abney, 2004)'s analysis

The Yarowsky algorithm lacks a theoretical analysis
◮ (Abney, 2004) gives bounds for some variants (no cautiousness, no algorithm)
◮ Basis for our work

Training examples x, labels j:
◮ Full time should be served for each sentence .
◮ The Liberals inserted a sentence of 14 words which reads :
◮ They get a concurrent sentence with no additional time added to their sentence .
◮ The words tax relief appeared in every second sentence in the federal government's throne speech .
. . .

Labelling distributions φ_x(j): peaked for a labelled example x, uniform for an unlabelled example x.

Features f, labels j:
◮ context: reads
◮ context: served
◮ context: inmate
◮ next: the
◮ context: article
◮ previous: introductory
◮ previous: passing
◮ next: said
. . .

Parameter distributions θ_f(j): normalized DL scores for feature f.
The DL chooses arg max_j max_{f ∈ F_x} θ_f(j); an alternative is arg max_j Σ_{f ∈ F_x} θ_f(j) (both sketched below).
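A small sketch of the two decision rules just stated, assuming theta is a dict mapping each feature to its label distribution and F_x is the example's feature set (names are illustrative):

def dl_label(theta, F_x, labels):
    """DL rule: the single best-scoring matching rule decides the label."""
    return max(labels, key=lambda j: max(theta[f][j] for f in F_x))

def sum_label(theta, F_x, labels):
    """Alternative rule: sum the evidence over all matching features."""
    return max(labels, key=lambda j: sum(theta[f][j] for f in F_x))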
Yarowsky algorithm: (Haffari and Sarkar, 2007)'s analysis

◮ (Haffari and Sarkar, 2007) extend (Abney, 2004) to a bipartite graph representation (polytime algorithm; no cautiousness)

[Figure: bipartite graph with features f_1, . . ., f_|F| (parameter distributions θ_f(j)) on one side and examples x_1, . . ., x_|X| (labelling distributions φ_x(j)) on the other, an edge joining each example to its features.]

Algorithm: fix one side, update the other (a sketch follows).
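A sketch of the alternating scheme; the plain neighbour-averaging updates below are an illustrative instance under assumed data structures (the actual updates in Haffari and Sarkar (2007) are derived from the objective on the next slide):

import numpy as np

def update_phi(theta, example_feats):
    """Fix theta, set each labelling distribution phi_x to the average
    of the parameter distributions of x's features."""
    return {x: np.mean([theta[f] for f in feats], axis=0)
            for x, feats in example_feats.items()}

def update_theta(phi, feature_exs):
    """Fix phi, set each parameter distribution theta_f to the average
    of the labelling distributions of the examples containing f."""
    return {f: np.mean([phi[x] for x in exs], axis=0)
            for f, exs in feature_exs.items()}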
Objective Function

◮ KL divergence between two probability distributions:
  KL(p || q) = Σ_i p(i) log ( p(i) / q(i) )

◮ Entropy of a distribution:
  H(p) = − Σ_i p(i) log p(i)

◮ The Objective Function:
  K(φ, θ) = Σ_{(f_i, x_j) ∈ Edges} [ KL(θ_{f_i} || φ_{x_j}) + H(θ_{f_i}) + H(φ_{x_j}) ] + Regularizer

◮ Reduce uncertainty in the labelling distribution while respecting the labelled data
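A direct transcription of K(φ, θ) into code, assuming NumPy arrays on each node; the eps smoothing and the omitted regularizer are simplifications:

import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q), with a small eps to avoid log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def objective(theta, phi, edges):
    """K(phi, theta) summed over bipartite edges (f, x); regularizer omitted."""
    return sum(kl(theta[f], phi[x]) + entropy(theta[f]) + entropy(phi[x])
               for f, x in edges)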
Generalized Objective Function

◮ Bregman divergence between two probability distributions:
  B_ψ(p || q) = Σ_i [ ψ(p(i)) − ψ(q(i)) − ψ′(q(i)) (p(i) − q(i)) ]
  B_{t log t}(p || q) = KL(p || q)

◮ ψ-entropy of a distribution:
  H_ψ(p) = − Σ_i ψ(p(i))
  H_{t log t}(p) = H(p)

◮ The Generalized Objective Function:
  K_ψ(φ, θ) = Σ_{(f_i, x_j) ∈ Edges} [ B_ψ(θ_{f_i} || φ_{x_j}) + H_ψ(θ_{f_i}) + H_ψ(φ_{x_j}) ] + Regularizer
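A small sketch transcribing the Bregman formula and checking the stated reduction B_{t log t} = KL numerically (for p and q that each sum to 1, the linear terms cancel):

import numpy as np

def bregman(p, q, psi, dpsi):
    """B_psi(p || q) = sum_i psi(p_i) - psi(q_i) - psi'(q_i) (p_i - q_i)."""
    return float(np.sum(psi(p) - psi(q) - dpsi(q) * (p - q)))

psi = lambda t: t * np.log(t)        # psi(t) = t log t
dpsi = lambda t: np.log(t) + 1.0     # its derivative

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
assert np.isclose(bregman(p, q, psi, dpsi), np.sum(p * np.log(p / q)))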
Generalized Objective Function

[Figure: geometric view of the Bregman divergence. For each coordinate i, the curve ψ is drawn with its tangent at q(i); B_ψ accumulates the gap a − b between a = ψ(p(i)) − ψ(q(i)) and the tangent-line approximation b = ψ′(q(i)) (p(i) − q(i)), illustrated at two points p(i) and p′(i).]
Variants from (Abney, 2004; Haffari and Sarkar, 2007)

  Yarowsky non-cautious            81.49
  Yarowsky-cautious                89.97
  Yarowsky-cautious sum            90.49
  HaffariSarkar-bipartite avg-maj  79.69

  % clean non-seeded accuracy (named entity)
Graph-based Propagation (Subramanya et al., 2010)

Self-training with CRFs:

[Diagram: a loop — seed data → train CRF → get posteriors → get types → graph propagate → label data → back to train CRF.]
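A minimal sketch of one propagation step over type-level label distributions, in the spirit of Subramanya et al. (2010); the neighbour-averaging update and all names (q, W, r, mu, nu) are illustrative assumptions, not the paper's exact objective:

import numpy as np

def propagate(q, W, r, mu=0.5, nu=0.1, steps=10):
    """q: types x labels distributions; W: types x types similarity matrix
    (assumed to have no all-zero rows); r: CRF posteriors aggregated to
    types (the 'get types' step); u: uniform fallback distribution."""
    u = np.full(q.shape[1], 1.0 / q.shape[1])
    for _ in range(steps):
        neigh = W @ q / W.sum(axis=1, keepdims=True)   # neighbour average
        q = (r + mu * neigh + nu * u) / (1.0 + mu + nu)
        q /= q.sum(axis=1, keepdims=True)              # renormalize rows
    return q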