Light-Supervision of Structured Prediction Energy Networks
Andrew McCallum
with Pedram Rooshenas (Oregon PhD → UMass Postdoc), Aishwarya Kamath (UMass MS), David Belanger (UMass PhD → Google Brain), Greg Druck (UMass PhD → Yummly)
SPENs [2016] · Generalized Expectation [Mann; Druck 2010-12]
Light-Supervision: prior knowledge as Generalized Expectation ...which induces extra structural dependencies...
Structured Prediction: complex dependencies with SPENs
Chapter 1 Generalized Expectation
Learning from small amounts of labeled data
Leveraging unlabeled data
Family 1: Expectation Maximization [Dempster, Laird, Rubin, 1977]
Family 2: Graph-Based Methods [Szummer, Jaakkola, 2002] [Zhu, Ghahramani, 2002]
Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]
Family 4: Boundary in Sparse Region
• Transductive SVMs [Joachims, 1999]: sparsity measured by margin
• Entropy Regularization [Grandvalet & Bengio, 2005]: minimize label entropy
Family 5: Generalized Expectation Criteria [Mann, McCallum 2010; Druck, Mann, McCallum 2011; Druck, McCallum 2012]
Is a boundary in a sparse region really the best solution? GE instead matches model expectations to prior knowledge:
• label proportions (e.g., Student vs. Faculty): E[ p(y) ] vs. a label prior
• label-given-feature expectations: E[ p(y | f(x)) ]
[Figure: bar chart comparing target label proportions with model expectations.]
Expectations on Labels | Features — Classifying Baseball versus Hockey
Traditional: human effort goes into labeling instances, followed by (semi-)supervised training via maximum likelihood.
Generalized Expectation: human effort goes into brainstorming a few keywords (ball/puck, field/ice, bat/stick), expressed as expectations such as p(HOCKEY | "puck") = 0.9, followed by semi-supervised training via GE.
Labeling Features (~1000 unlabeled examples)
As more features are labeled (baseball: ball, batting, base, HR, Sox, Mets, runs, ...; hockey: goal, Oilers, Leafs, NHL, puck, Pens, Bruins, Penguins, Lemieux, ...), test accuracy climbs: 85% → 92% → 94.5% → 96%.
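As a concrete illustration of the input format, here is a minimal Python sketch of turning labeled features into GE target expectations. Giving the indicated label a fixed majority probability (0.9 here) follows the convention reported in Druck et al.'s feature-labeling work, but every name in this snippet is illustrative rather than taken from the original system.

```python
# Hypothetical labeled features for the baseball/hockey task.
LABELED_FEATURES = {"puck": "hockey", "nhl": "hockey",
                    "ball": "baseball", "batting": "baseball"}
LABELS = ["baseball", "hockey"]

def targets(labeled_features, labels, majority=0.9):
    """Map each labeled feature to a target label distribution g_hat:
    the indicated label gets `majority` probability, the rest is split
    evenly among the other labels."""
    rest = (1.0 - majority) / (len(labels) - 1)
    return {feat: [majority if lab == y else rest for y in labels]
            for feat, lab in labeled_features.items()}

# targets(LABELED_FEATURES, LABELS)["puck"] -> [0.1, 0.9]
```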
Accuracy per Human Effort
[Plot: test accuracy vs. labeling time in seconds — labeling features reaches higher accuracy per second of effort than labeling instances.]
Prior Knowledge
• Feature labels from humans (baseball/hockey classification): baseball — hit, braves, runs; hockey — puck, goal, nhl
• Many other sources: resources on the web; data from related tasks, e.g., bibliographic records such as: W. H. Enright. Improving the efficiency of matrix operations in the numerical solution of stiff ordinary differential equations. ACM Trans. Math. Softw., 4(2), 127-136, June 1978.
Generalized Expectation (GE)
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
x: input variables; y: output variables; g: constraint features — e.g., g(x,y) returns 1 if x contains "hit" and y is baseball.
Generalized Expectation (GE)
Assume a general CRF [Lafferty et al. 2001]:
p(y|x; θ) = (1/Z_{θ,x}) exp( θ^T f(x,y) )
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
f: model features; p(y|x;θ): model distribution — e.g., the model probability of baseball if x contains "hit".
Generalized Expectation (GE)
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
p̃(x): empirical distribution. The inner quantity can be defined as the model's probability that documents containing "hit" are labeled baseball.
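To make that inner quantity concrete, here is a minimal Python sketch of the model expectation E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] for a bag-of-words multinomial logistic model and a constraint feature that fires when a document contains "hit" and is labeled baseball. All array shapes and names are illustrative assumptions, not the original implementation.

```python
import numpy as np

def model_probs(theta, X):
    """p(y|x; theta) for a multinomial logistic model.
    X: (n_docs, n_feats) binary matrix; theta: (n_labels, n_feats)."""
    scores = X @ theta.T                          # (n_docs, n_labels)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

def ge_model_expectation(theta, X, feat_idx, label_idx):
    """E_{p~(x)}[ E_{p(y|x;theta)}[ g(x,y) ] ] where g(x,y) = 1 iff x
    contains feature `feat_idx` and y == `label_idx`. Normalized over
    documents containing the feature, this is the model's probability
    that such documents get the label."""
    mask = X[:, feat_idx] > 0                     # documents containing "hit"
    p = model_probs(theta, X[mask])
    return p[:, label_idx].mean()
```

The score function S then compares this number against the target (e.g., 0.9 for p(HOCKEY | "puck")).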
Generalized Expectation (GE)
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
S: score function over the (soft) expectation constraint — larger score if the model expectation matches prior knowledge.
Generalized Expectation (GE) Objective Function
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
r(θ): regularization.
GE Score Functions
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
Write ĝ for the target expectations and g_θ for the model expectations (e.g., the label distributions for "puck" and "hit").
Squared error: S_{ℓ2}(θ) = −‖ĝ − g_θ‖₂²
KL divergence: S_{KL}(θ) = −Σ_q ĝ_q log( ĝ_q / g_{θ,q} )
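Both score functions are a few lines of numpy. This sketch assumes ĝ and g_θ are aligned vectors of expectations (for KL, a proper distribution over labels for a single constraint feature), with a small epsilon added only for numerical safety:

```python
import numpy as np

def s_l2(g_hat, g_theta):
    """Squared-error GE score: closer to 0 (its maximum) when the
    model expectations match the targets."""
    return -np.sum((g_hat - g_theta) ** 2)

def s_kl(g_hat, g_theta, eps=1e-12):
    """Negative KL(g_hat || g_theta): both arguments are expectation
    vectors summing to 1, e.g., a label distribution for one feature."""
    return -np.sum(g_hat * np.log((g_hat + eps) / (g_theta + eps)))
```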
Estimating Parameters with GE
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
Violation term — squared error: v_i = −2(ĝ_i − g_{θ,i});  KL: v_i = ĝ_i / g_{θ,i}
∇_θ O(θ) = v^T ( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) f(x,y)^T ] − E_{p(y|x;θ)}[ g(x,y) ] E_{p(y|x;θ)}[ f(x,y)^T ] ] ) + ∇_θ r(θ)
The bracketed term is the estimated covariance between model features and constraint features.
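Here is a deliberately brute-force sketch of this gradient for the IID classification case, where f(x,y) is the usual label-conjoined feature vector so the covariance has a closed block form. The enumeration over labels is for clarity, not efficiency; the violation vector v comes from whichever score function is in use, and details such as normalizing over documents that contain the constraint feature are glossed over.

```python
import numpy as np

def softmax_probs(theta, x):
    """p(y|x; theta) for one document under the log-linear model."""
    s = theta @ x
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

def ge_gradient(theta, X, g_fn, v):
    """Gradient of the GE term for an IID classifier.
    g_fn(x, y) -> constraint feature vector of length k;
    f(x, y) places x in the block for label y, so for each document the
    covariance between g and f factorizes block-by-block:
      block y of Cov[g, f] = p(y) * (g(x,y) - E[g]) (outer) x."""
    n_labels, _ = theta.shape
    grad = np.zeros_like(theta)
    for x in X:
        p = softmax_probs(theta, x)                           # (n_labels,)
        g = np.stack([g_fn(x, y) for y in range(n_labels)])   # (n_labels, k)
        e_g = p @ g                                           # E[g], (k,)
        for y in range(n_labels):
            cov_y = p[y] * (g[y] - e_g)                       # (k,)
            grad[y] += (v @ cov_y) * x
    return grad / len(X)   # empirical expectation over documents
```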
Learning About Unconstrained Features
The trained model generalizes beyond the prior knowledge: constraints on "hit" and "puck", combined with unlabeled data, let GE learn through covariance about unconstrained features such as "run", "pitcher", "goal", and "NHL".
Generalized Expectation Criteria
Easy communication with domain experts:
• Inject domain knowledge into parameter estimation
• Like an "informative prior"...
• ...but rather than the "language of parameters" (difficult for humans to understand)...
• ...use the "language of expectations" (natural for humans)
IID Prediction
"Classification", e.g. logistic regression. Example: Spam Filtering.
[Figure: predicted labels Y (Spam / Not Spam) above observed emails X.]
Structured Prediction
e.g. "sequence labeling" — Chinese word segmentation: observed X = 中 国 人 民 ("Chinese People"); predicted Y = Start / Not-Start segmentation tags.
O(θ) = S( E_{p̃(x)}[ E_{p(y|x;θ)}[ g(x,y) ] ] ) + r(θ)
Linear-chain CRF GE gradient:
v^T Σ_i Σ_j Σ_y p(y_{i−1}, y_i, y_j | x; θ) g(x, y_j, j) f(x, y_{i−1}, y_i, i)^T
This requires marginals over three, non-consecutive positions.
[Figure: Chinese newspaper text illustrating the segmentation task.]
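To see why these marginals are the bottleneck, here is a deliberately naive enumeration of p(y_{i−1}, y_i, y_j | x; θ) for a tiny chain with transition-only potentials. It is exponential in the sequence length and purely illustrative; real implementations use dynamic programming, but the need for joint marginals over non-adjacent positions is what makes GE training on chains expensive.

```python
from itertools import product
import numpy as np

def three_position_marginal(log_potentials, i, j):
    """Brute-force p(y_{i-1}, y_i, y_j) for a linear chain whose score
    is the sum of transition log-potentials, shape (T-1, L, L).
    Assumes 1 <= i < T and 0 <= j < T. Exponential in T."""
    T = log_potentials.shape[0] + 1
    L = log_potentials.shape[1]
    marg = np.zeros((L, L, L))
    Z = 0.0
    for y in product(range(L), repeat=T):          # all L**T sequences
        w = np.exp(sum(log_potentials[t, y[t], y[t + 1]]
                       for t in range(T - 1)))
        Z += w
        marg[y[i - 1], y[i], y[j]] += w
    return marg / Z
```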
Natural Expectations Lead to Difficult Training Inference
"The AUTHOR field should be contiguous, appearing only once."
Example (citation field labeling): Anna Popescu (2004), "Interactive Clustering," Wei Li (Ed.), Learning Handbook, Athos Press, Souroti — with AUTHOR, EDITOR, LOCATION fields.
Enforcing such expectations requires marginals like p(y_{i−1}, y_i, y_j, y_k): the downfall of GE.
Chapter 2: Structured Prediction Energy Networks
A framework providing easier inference for complex dependencies?
Deep Learning + Structured Prediction
Structured Prediction
"Classification", e.g. logistic regression. Example: Spam Filtering.
ŷ = argmin_Y E(Y; X) = Σ (factor energies)
[Figure: predicted labels Y (Spam / Not Spam) connected through factors to observed emails X.]
Structured Prediction
e.g. "sequence labeling". Example: Chinese Word Segmentation (Y: Start / Not-Start tags over characters X = 中 国 人 民, "Chinese People").
ŷ = argmin_Y E(Y; X)
Defining E(Y; X) has traditionally required feature engineering.
Structured Prediction
"Hidden Unit Conditional Random Fields" [Maaten, Welling, Saul, AISTATS 2011]:
E(Y, Z; X) with hidden units Z_1 ... Z_4 learned in place of feature engineering — but the dependency structure over the Y's is still hand-specified.
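SPENs drop that restriction: the energy E(Y; X) is an arbitrary deep network over a continuous relaxation of Y, and prediction is gradient descent on that relaxation. Below is a minimal PyTorch sketch of this inference step, following the relaxation idea of Belanger & McCallum (2016); it assumes binary tags and a user-supplied differentiable `energy(y_bar, x)`, and all names and hyperparameters are illustrative.

```python
import torch

def spen_infer(energy, x, T, steps=50, lr=0.1):
    """Approximate argmin_Y E(Y; X) by gradient descent on a continuous
    relaxation y_bar in (0,1)^T, parameterized through a sigmoid so the
    iterates stay feasible. `energy` must return a scalar tensor."""
    logits = torch.zeros(T, requires_grad=True)
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e = energy(torch.sigmoid(logits), x)  # energy of relaxed labels
        e.backward()
        opt.step()
    return (torch.sigmoid(logits) > 0.5).long()  # round to discrete tags
```

Because inference is just backpropagation through the energy, adding complex dependencies means changing the network, not deriving a new dynamic program — the property that Chapter 1's GE training lacked.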