SLIDE 1 Hidden–Variable Models for Discriminative Reranking
Terry Koo and Michael Collins
{maestro|mcollins}@csail.mit.edu
SLIDE 2
Overview of reranking
The reranking approach
Use a baseline model to get the N-best candidates
“Rerank” the candidates using a more complex model
Parse reranking
Collins (2000): 88.2% ⇒ 89.8%
Charniak and Johnson (2005): 89.7% ⇒ 91.0%
Talk by Brooke Cowan in 7B: 83.6% ⇒ 85.1%
Also applied to
MT (Och and Ney, 2002; Shen et al., 2004)
NL Generation (Walker et al., 2001)
SLIDE 3
Representing NLP structures
Proper representation is critical to success
Hand–crafted feature vector representations
Φ(tree) = {0, 1, 2, 0, 0, 3, 0, 1}
Features defined through kernels
K(tree1, tree2) = Φ(tree1)·Φ(tree2)
This talk: A new approach using hidden variables
SLIDE 4
Two facets of lexical items
Different lexical items can have similar meanings, e.g. president and chairman
Clustering: president, chairman ∈ NounCluster4
A single lexical item can have different meanings, e.g. [river] bank vs [financial] bank
Refinement: bank1, bank2 ∈ bank
Model clusterings and refinements as hidden variables that support the reranking task
SLIDE 5
Highlights of the approach
Conditional log–linear model with hidden variables
Dynamic programming is used for training and decoding
Clustering and refinement done automatically using a discriminative criterion
SLIDE 6
Overview of talk
Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion
SLIDE 7
The parse reranking framework
Sentences si for 1 ≤ i ≤ n
s1: Pierre Vinken , 61 years old , will join ...
s2: Mr. Vinken is chairman of Elsevier N.V. ...
s3: Big Board Chairman John Phelan said yesterday ...
Each si has candidate parses ti,j for 1 ≤ j ≤ ni
ti,1 is the best candidate parse for si
SLIDE 8 The parse reranking framework
ti,j has phrase structure and dependency tree
[Figure: phrase-structure parse tree for “Mr. Vinken is chairman of Elsevier N.V.”]
SLIDE 9 The parse reranking framework
ti,j has phrase structure and dependency tree
[Figure: the same parse shown with its dependency tree]
SLIDE 10 Adding hidden variables
Hidden–value domains Hw(ti,j) for 1 ≤ w ≤ len(si)
[Figure: the parse tree with a hidden-value domain attached to each word, e.g. {NN1, NN2, NN3} for chairman, {VB1, VB2, VB3} for is, {IN1, IN2, IN3} for of, and {NNP1, NNP2, NNP3} for the proper nouns]
SLIDE 11 Adding hidden variables
Assignment h ∈ H1(ti,j) × ... × Hlen(si)(ti,j)
[Figure: the same tree with a single hidden value selected at each word, i.e. one assignment h]
SLIDE 12 Marginalized probability model
Φ(ti,j, h) produces a descriptive feature vector
Φ2(ti,j, h) = Count(chairman has hidden value NN1)
Φ13(ti,j, h) = Count(NNP2 is a direct object of VB1)
Φ19(ti,j, h) = Count(NN1 coordinates with NN2)
SLIDE 13 Marginalized probability model
Log–linear distribution over (ti,j, h) with parameters Θ:
p(ti,j, h | si, Θ) = exp(Φ(ti,j, h)·Θ) / Σ_{j′,h′} exp(Φ(ti,j′, h′)·Θ)
Marginalize over assignments h:
p(ti,j | si, Θ) = Σ_h p(ti,j, h | si, Θ)
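As a concrete illustration of the two formulas above, here is a brute-force sketch in Python (illustrative only: candidate_probs, the feature vectors, and the numbers are made up, and a real implementation replaces the enumeration over h with the dynamic programming of slide 25).

import numpy as np

def candidate_probs(phi, theta):
    """phi[j][k] is the feature vector Phi(t_ij, h_k) of candidate j under
    hidden assignment k; returns p(t_ij | s_i, theta) for every candidate."""
    # Unnormalized log-score of each (candidate, assignment) pair.
    scores = [np.array([v @ theta for v in assigns]) for assigns in phi]
    # Marginalize over hidden assignments with log-sum-exp ...
    log_cand = np.array([np.logaddexp.reduce(s) for s in scores])
    # ... then normalize over the candidate list.
    return np.exp(log_cand - np.logaddexp.reduce(log_cand))

# Toy usage: two candidates, two hidden assignments each, three features.
phi = [[np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])],
       [np.array([2.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])]]
theta = np.array([0.5, -0.2, 0.1])
print(candidate_probs(phi, theta))  # the two probabilities sum to 1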
SLIDE 14 Optimizing the parameters
Define loss as negative log-likelihood
L(Θ) = - Σ_{i=1..n} log p(ti,1 | si, Θ)
Minimize L(Θ) through gradient descent
∂L/∂Θ = - Σ_i Σ_h p(h | ti,1, si, Θ) Φ(ti,1, h) + Σ_{i,j} p(ti,j | si, Θ) Σ_h p(h | ti,j, si, Θ) Φ(ti,j, h)
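To make the two expectation terms concrete, here is a brute-force sketch of the gradient contribution of one sentence (a hypothetical illustration, not the authors' code; the exponential enumeration over h is exactly what belief propagation removes later in the talk).

import numpy as np

def nll_gradient(phi, theta):
    """Gradient of -log p(t_{i,1} | s_i, theta) for a single sentence.
    phi[j][k] is Phi(t_ij, h_k); candidate j = 0 plays the role of the
    best candidate parse t_{i,1}."""
    scores = [np.array([v @ theta for v in assigns]) for assigns in phi]
    log_z = np.logaddexp.reduce(np.concatenate(scores))
    joint = [np.exp(s - log_z) for s in scores]          # p(t_ij, h | s_i)
    # Second term: expected features under the full joint p(t, h | s).
    exp_all = sum(p * v for ps, vs in zip(joint, phi) for p, v in zip(ps, vs))
    # First term: expected features under p(h | t_{i,1}, s), i.e. the joint
    # renormalized within candidate 0.
    cond_best = joint[0] / joint[0].sum()
    exp_best = sum(p * v for p, v in zip(cond_best, phi[0]))
    return exp_all - exp_best                            # this sentence's dL/dTheta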
SLIDE 15
Overview of talk
Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion
SLIDE 16 Problems with efficiency
|H1(ti,j) × ... × Hlen(si)(ti,j)| grows exponentially, so training the model is intractable:
∂L/∂Θ = - Σ_i Σ_h p(h | ti,1, si, Θ) Φ(ti,1, h) + Σ_{i,j} p(ti,j | si, Θ) Σ_h p(h | ti,j, si, Θ) Φ(ti,j, h)
Decoding the model is also intractable:
p(ti,j | si, Θ) = Σ_h p(ti,j, h | si, Θ)
SLIDE 18
Locality constraint on features
Features have pairwise local scope on hidden variables
Features still have global scope on non-hidden information
Φ can be factored into local feature vectors, allowing dynamic programming
SLIDE 19
Local feature vectors
Define two kinds of local feature vector φ:
Single-variable vectors φ(ti,j, w, hw) look at a single hidden variable
Pairwise vectors φ(ti,j, u, v, hu, hv) look at two hidden variables in a dependency relationship
SLIDE 20 Local feature vectors
Φ(ti,j, h) looks at every hidden variable
[Figure: the tree with a hidden value at every word; Φ(ti,j, h) has scope over all of them]
SLIDE 21 Local feature vectors
φ(ti,j, chairman, NN3) only sees NN3
[Figure: the tree with only chairman's hidden value NN3 in scope]
SLIDE 22 Local feature vectors
φ(ti,j, chairman, of, NN3, IN2) sees NN3 and IN2
[Figure: the tree with the hidden values NN3 (chairman) and IN2 (of) in scope]
SLIDE 23 Local feature vectors
Rewrite global Φ as a sum over local φ
Φ(ti,j, h) = Σ_w φ(ti,j, w, hw) + Σ_{(u,v)} φ(ti,j, u, v, hu, hv), where (u, v) ranges over the dependencies in ti,j
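A minimal sketch of this factorization, assuming hypothetical callables phi_single and phi_pair for the local feature functions:

import numpy as np

def global_features(tree_deps, h, phi_single, phi_pair, dim):
    """Assemble Phi(t, h) from local vectors: one single-variable term per
    word and one pairwise term per dependency (u, v) in the tree."""
    total = np.zeros(dim)
    for w, hw in h.items():                   # sum over words: phi(t, w, h_w)
        total += phi_single(w, hw)
    for u, v in tree_deps:                    # sum over dependencies
        total += phi_pair(u, v, h[u], h[v])
    return total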
SLIDE 25 Applying belief propagation
New restrictions enable dynamic–programming approaches, e.g. belief propagation
BP generalizes the forward–backward algorithm from a chain to a tree
Runtime O(len(si) · H^2), H = max |Hw(ti,j)|
BP efficiently computes
Σ_h p(ti,j, h | si, Θ)
Σ_h p(h | ti,j, si, Θ) Φ(ti,j, h)
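A compact sketch of sum-product BP on the dependency tree, under the assumption that node and edge potentials are the exponentiated local scores exp(φ·Θ); all names are illustrative. Each edge costs O(H^2), giving the runtime above; the node marginals (together with edge marginals, computed analogously) are what the expected feature vectors in the gradient are assembled from.

import numpy as np

def tree_sum_product(root, children, node_pot, edge_pot):
    """Sum-product BP on a tree of hidden variables.
    node_pot[w]      : vector over Hw, exp(phi(t, w, a) . Theta)
    edge_pot[(w, c)] : matrix over Hw x Hc, exp(phi(t, w, c, a, b) . Theta)
    Returns (Z, marginals): Z sums the product of potentials over all joint
    assignments h, and marginals[w][a] = p(h_w = a | t, s, Theta)."""
    up, down, marginals = {}, {}, {}

    def upward(w):                          # leaves-to-root pass ("backward")
        msg = np.array(node_pot[w], dtype=float)
        for c in children.get(w, []):
            upward(c)
            msg *= edge_pot[(w, c)] @ up[c]     # sum out the child's value: O(H^2)
        up[w] = msg

    def downward(w):                        # root-to-leaves pass ("forward")
        incoming = {c: edge_pot[(w, c)] @ up[c] for c in children.get(w, [])}
        belief = node_pot[w] * down[w]
        for inc in incoming.values():
            belief = belief * inc
        marginals[w] = belief / belief.sum()
        for c in children.get(w, []):
            # Everything at w except c's own upward message (potentials are
            # strictly positive, so the division is safe).
            down[c] = edge_pot[(w, c)].T @ (belief / incoming[c])
            downward(c)

    upward(root)
    down[root] = np.ones(len(node_pot[root]))
    downward(root)
    return up[root].sum(), marginals

# Toy tree: word 0 governs words 1 and 2, each word has 2 hidden values.
children = {0: [1, 2]}
node_pot = {w: np.array([1.0, 2.0]) for w in range(3)}
edge_pot = {(0, 1): np.array([[1.0, 0.5], [0.5, 1.0]]),
            (0, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
Z, marg = tree_sum_product(0, children, node_pot, edge_pot)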
SLIDE 26
Overview of talk
Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion
SLIDE 27
Two areas for choice in the model
Definition of the hidden–value domains Hw(ti,j)
Definition of the feature vectors φ
SLIDE 28 Hidden–value domains
Lexical domains allow word refinement
[Figure: each word receives a domain of refined copies of itself, e.g. {chairman1, chairman2, chairman3} for chairman, {Mr.1, Mr.2, Mr.3} for Mr., and so on]
SLIDE 29 Hidden–value domains
Lexical domains allow word refinement
[Figure: the same lexical domains, with particular refined values highlighted]
SLIDE 30 Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: each word's domain contains subdivided part-of-speech tags, e.g. {NN1, ..., NN5} for chairman, {VB1, ..., VB5} for is, and {NNP1, ..., NNP5} for the proper nouns]
SLIDE 31 Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: the same part-of-speech domains, with particular subdivided tags highlighted]
SLIDE 32 Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: the same part-of-speech domains; words assigned the same subdivided tag are effectively clustered together]
SLIDE 33 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: content words receive supersense domains from WordNet, e.g. noun.person, and verb supersenses such as verb.stative, verb.social, and verb.possession for is]
SLIDE 34 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: the same supersense domains, with particular values highlighted]
SLIDE 35 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: the same supersense domains, another highlighting step]
SLIDE 36 Hidden–value domains
Supersense domains model WordNet ontology
(Ciaramita and Johnson, 2003; Miller et al., 1993)
[Figure: the same supersense domains, final highlighting step]
SLIDE 37
Hidden–value domains
Hidden–value domains that didn’t work well
Word clustering without part-of-speech subdivisions
WordNet hyper/hyponym ontology
Domains containing mixed values
SLIDE 38 Examples of features
The highest nonterminal headed by the word
[Figure: the parse tree with chairman's hidden value NN3 and its highest nonterminal NP(chairman) highlighted]
(NN3, Word=chairman, Highest Nonterminal=NP) ∈ φ(ti,j, chairman, NN3)
SLIDE 39 Examples of features
The governing rule
[Figure: the tree with the hidden values VB1 (is) and NN3 (chairman) and the governing rule VP → VB NP highlighted]
(VB1, NN3, Rule=VP → VB NP) ∈ φ(ti,j, is, chairman, VB1, NN3)
SLIDE 40
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
SLIDE 41
Experimental Setup
N-best lists generated by Collins parser, ni ≈ 30
Training set: WSJ sections 2–21
Development set: WSJ section 0
Test set: WSJ sections 22–24
SLIDE 42
Final test models
Two baseline models
The Collins (1999) base parser
The Collins (2000) reranker
Two mixed models
MIX combines clustering, refinement, and WordNet supersenses
MIX+ augments MIX with the features of the Collins (2000) reranker
SLIDE 43
Results on Sections 22–24
Model              LR     LP     F1
Collins parser     88.19  88.60  88.39
MIX                89.41  89.87  89.64
Collins reranker   89.46  90.07  89.76
MIX+               89.78  90.29  90.03
All comparisons except MIX vs. Collins reranker are significant at p ≤ 0.01 using the sign test
SLIDE 46
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
SLIDE 47
Previous work
Parsing approaches that use hidden variables
Riezler et al. (2002)
Clark and Curran (2004)
Matsuzaki et al. (2005)
Differences with our approach
Use of reranking
Definition of hidden variables
Use of belief propagation
SLIDE 48 Using packed representations
Candidates ti,j represented as a packed forest
Compact representation of many parse trees
Packed representation forces local scope
Features would become locally scoped on non-hidden information
Decoding becomes NP-hard, must approximate with Viterbi (cf. Matsuzaki et al., 2005):
argmax_{ti,j} p(ti,j | si, Θ) ≈ argmax_{ti,j, h} p(ti,j, h | si, Θ)
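A toy illustration (numbers made up) of why this is only an approximation: summing over hidden assignments and maximizing over them can prefer different candidates.

import numpy as np

# Hypothetical log-linear scores Phi(t, h) . Theta for two candidates,
# each with three hidden assignments.
scores = {"t1": np.array([2.0, 2.0, 2.0]),
          "t2": np.array([3.0, 0.0, 0.0])}

exact   = max(scores, key=lambda t: np.logaddexp.reduce(scores[t]))  # marginalize over h
viterbi = max(scores, key=lambda t: scores[t].max())                 # max over (t, h)
print(exact, viterbi)   # t1 vs t2: the approximation picks a different parse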
SLIDE 49
Empirical analysis of hidden values
The model makes hidden–value assignments on the basis of the reranking criterion
i.e. maximize log p(ti,1 | si, Θ)
The empirical distribution of assignments shows linguistically reasonable trends
SLIDE 50
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
SLIDE 51
Concluding remarks
The hidden–variable model defines a new representation for NLP structures
Conditional log–linear model with hidden variables
BP enables efficient and exact training and decoding
Significant improvement over Collins (2000)
SLIDE 52
Example
SLIDE 53
Example
I [will [give/VB2 an example]]
I expected [to [give/VB1 an example]]
I expected [to/TO4 [give an example]]
You expected [me [to/TO1,5 [give an example]]]