Hidden–Variable Models for Discriminative Reranking — Terry Koo and Michael Collins (PowerPoint presentation transcript)



SLIDE 1

Hidden–Variable Models for Discriminative Reranking

Terry Koo and Michael Collins

{maestro|mcollins}@csail.mit.edu

SLIDE 2

Overview of reranking

The reranking approach

Use a baseline model to get the N-best candidates
“Rerank” the candidates using a more complex model

Parse reranking

Collins (2000): 88.2% ⇒ 89.8%
Charniak and Johnson (2005): 89.7% ⇒ 91.0%
Talk by Brooke Cowan in 7B: 83.6% ⇒ 85.1%

Also applied to

MT (Och and Ney, 2002; Shen et al., 2004)
NL generation (Walker et al., 2001)
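The two-step recipe above can be sketched in a few lines. This is a toy illustration, not the paper's model: the candidate list, the bonus scores, and the `rerank` helper are all invented for the example.

```python
# Toy reranking sketch: a baseline system proposes N-best candidates;
# a richer model rescores them and returns the top one.
# All names and scores here are invented for illustration.

def rerank(candidates, score):
    """Return the candidate with the highest reranker score."""
    return max(candidates, key=score)

# Candidates are (parse, baseline_log_prob) pairs; the "richer model"
# simply adds a bonus for features the baseline cannot see.
candidates = [("parse_a", -10.2), ("parse_b", -10.5), ("parse_c", -11.0)]
bonus = {"parse_a": 0.0, "parse_b": 0.9, "parse_c": 0.1}

best = rerank(candidates, lambda c: c[1] + bonus[c[0]])  # -> ("parse_b", -10.5)
```

Note that reranking can only recover a better parse if the baseline already placed it somewhere in the N-best list.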

SLIDE 3

Representing NLP structures

Proper representation is critical to success
Hand–crafted feature vector representations

Φ(t) = {0, 1, 2, 0, 0, 3, 0, 1}

Features defined through kernels

K(t, t′) = Φ(t)·Φ(t′)

This talk: A new approach using hidden variables

SLIDE 4

Two facets of lexical items

Different lexical items can have similar meanings, e.g. president and chairman

Clustering: president, chairman ∈ NounCluster4

A single lexical item can have different meanings, e.g. [river] bank vs [financial] bank

Refinement: bank 1, bank 2 ∈ bank

Model clusterings and refinements as hidden variables that support the reranking task

SLIDE 5

Highlights of the approach

Conditional log–linear model with hidden variables
Dynamic programming is used for training and decoding
Clustering and refinement done automatically using a discriminative criterion

SLIDE 6

Overview of talk

Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion

SLIDE 7

The parse reranking framework

Sentences si for 1 ≤ i ≤ n

s1: Pierre Vinken , 61 years old , will join ... s2: Mr. Vinken is chairman of Elsevier N.V. ... s3: Big Board Chairman John Phelan said yesterday ...

Each si has candidate parses ti,j for 1 ≤ j ≤ ni

ti,1 is the best candidate parse for si

SLIDE 8

The parse reranking framework

ti,j has phrase structure and dependency tree

[Figure: phrase-structure tree and dependency tree for “Mr. Vinken is chairman of Elsevier N.V.”]

SLIDE 9

The parse reranking framework

ti,j has phrase structure and dependency tree

[Figure: a second candidate parse of the same sentence, with a different tree structure]

SLIDE 10

Adding hidden variables

Hidden–value domains Hw(ti,j) for 1 ≤ w ≤ len(si)

[Figure: the parse tree with a hidden-value domain attached to each word, e.g. {NN1, NN2, NN3} for chairman, {VB1, VB2, VB3} for is, {IN1, IN2, IN3} for of, and {NNP1, NNP2, NNP3} for each proper noun]

SLIDE 11

Adding hidden variables

Assignment h ∈ H1(ti,j) × ... × Hlen(si)(ti,j)

[Figure: one assignment h, selecting a single hidden value from each word’s domain]

SLIDE 12

Marginalized probability model

Φ(ti,j, h) produces a descriptive vector of feature-occurrence counts, e.g.

Φ2(ti,j, h) = Count(chairman has hidden value NN1)
Φ13(ti,j, h) = Count(NNP2 is a direct object of VB1)
Φ19(ti,j, h) = Count(NN1 coordinates with NN2)
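As a toy sketch of such a count vector (the feature names below are invented for illustration, not the paper's actual feature set), it can be kept sparse:

```python
from collections import Counter

# Sparse sketch of Phi(t, h): a Counter over human-readable feature
# names fired by the pair (t, h). Feature names are invented examples.
def feature_counts(fired):
    """fired: iterable of feature names observed in (t, h)."""
    return Counter(fired)

phi = feature_counts([
    "chairman=NN1",        # chairman takes hidden value NN1
    "dobj(VB1, NNP2)",     # NNP2 is a direct object of VB1
    "dobj(VB1, NNP2)",     # the same feature can fire more than once
])
```

A `Counter` returns 0 for unseen feature names, which matches the sparse-vector reading of Φ.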

SLIDE 13

Marginalized probability model

Log–linear distribution over (ti,j, h) with parameters Θ:

p(ti,j, h | si, Θ) = exp{Φ(ti,j, h)·Θ} / Σj′,h′ exp{Φ(ti,j′, h′)·Θ}

Marginalize over assignments h:

p(ti,j | si, Θ) = Σh p(ti,j, h | si, Θ)
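On a toy instance the two equations can be checked by brute-force enumeration. Everything below — the candidates, the domains, and the single feature in `theta` — is invented for illustration:

```python
import math
from itertools import product

def candidate_probs(candidates, domains, score):
    """p(t | s) = sum_h exp(score(t, h)) / sum_{t', h'} exp(score(t', h'))."""
    unnorm = {t: sum(math.exp(score(t, h)) for h in product(*domains[t]))
              for t in candidates}
    z = sum(unnorm.values())
    return {t: u / z for t, u in unnorm.items()}

# Two candidates, two words each, hidden domain {a, b} per word.
domains = {"t1": [["a", "b"], ["a", "b"]], "t2": [["a", "b"], ["a", "b"]]}
theta = {("t1", 0, "a"): 1.0}  # one feature favouring "a" on t1's first word

def score(t, h):
    # Phi(t, h) . Theta with 0/1 features keyed by (t, word index, value)
    return sum(theta.get((t, w, hw), 0.0) for w, hw in enumerate(h))

probs = candidate_probs(["t1", "t2"], domains, score)
```

The enumeration over `product(*domains[t])` is exactly the exponential sum that the next slides replace with dynamic programming.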

SLIDE 14

Optimizing the parameters

Define loss as the negative log-likelihood:

L(Θ) = − Σi=1..n log p(ti,1 | si, Θ)

Minimize L(Θ) through gradient descent:

∂L/∂Θ = − Σi Σh p(h | ti,1, si, Θ) Φ(ti,1, h) + Σi,j p(ti,j | si, Θ) Σh p(h | ti,j, si, Θ) Φ(ti,j, h)
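The gradient is a difference of two feature expectations: the model's expectation over all candidates minus the expectation under the gold candidate ti,1. A sparse sketch on toy data (the candidate names, assignment names, and feature vectors below are invented):

```python
def nll_gradient(gold, cand_probs, h_probs, feats):
    """dL/dTheta for one sentence, as a sparse dict.

    cand_probs[t] = p(t | s); h_probs[t][h] = p(h | t, s);
    feats[t][h]   = sparse feature vector Phi(t, h).
    """
    grad = {}
    def add(vec, scale):
        for f, v in vec.items():
            grad[f] = grad.get(f, 0.0) + scale * v
    for h, p in h_probs[gold].items():          # - E[Phi | gold candidate]
        add(feats[gold][h], -p)
    for t, pt in cand_probs.items():            # + E[Phi] under the model
        for h, p in h_probs[t].items():
            add(feats[t][h], pt * p)
    return grad

# Toy data: two candidates, one hidden assignment each.
feats = {"t1": {"h1": {"f": 1.0}}, "t2": {"h1": {"f": 3.0}}}
h_probs = {"t1": {"h1": 1.0}, "t2": {"h1": 1.0}}
grad = nll_gradient("t1", {"t1": 0.5, "t2": 0.5}, h_probs, feats)  # -> {"f": 1.0}
```

When the model already puts all its mass on the gold candidate, the two expectations coincide and the gradient vanishes, as it should at an optimum.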

SLIDE 15

Overview of talk

Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion

SLIDE 16

Problems with efficiency

|H1(ti,j) × ... × Hlen(si)(ti,j)| grows exponentially, so training the model is intractable:

∂L/∂Θ = − Σi Σh p(h | ti,1, si, Θ) Φ(ti,1, h) + Σi,j p(ti,j | si, Θ) Σh p(h | ti,j, si, Θ) Φ(ti,j, h)

Decoding the model is also intractable:

p(ti,j | si, Θ) = Σh p(ti,j, h | si, Θ)


SLIDE 18

Locality constraint on features

Features have pairwise local scope on hidden variables
Features still have global scope on non-hidden information
Φ can be factored into local feature vectors, allowing dynamic programming

SLIDE 19

Local feature vectors

Define two kinds of local feature vector φ:

Single-variable φ(ti,j, w, hw) looks at a single hidden variable
Pairwise φ(ti,j, u, v, hu, hv) looks at two hidden variables in a dependency relationship

SLIDE 20

Local feature vectors

Φ(ti,j, h) looks at every hidden variable

[Figure: the parse tree with every word’s hidden-value domain highlighted]

SLIDE 21

Local feature vectors

φ(ti,j, chairman, NN3) only sees NN3

[Figure: the parse tree with only chairman’s hidden-value domain {NN1, NN2, NN3} shown, NN3 highlighted]

SLIDE 22

Local feature vectors

φ(ti,j, chairman, of, NN3, IN2) sees NN3 and IN2

[Figure: the parse tree with the domains of chairman and of shown; NN3 and IN2 are highlighted, connected by a dependency]

SLIDE 23

Local feature vectors

Rewrite global Φ as a sum over local φ:

Φ(ti,j, h) = Σw ∈ ti,j φ(ti,j, w, hw) + Σ(u,v) ∈ D(ti,j) φ(ti,j, u, v, hu, hv)
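A sparse sketch of this factorization (the feature names are invented; `phi_word` and `phi_dep` stand in for the paper's single-variable and pairwise local vectors):

```python
def global_features(words, deps, h, phi_word, phi_dep):
    """Phi(t, h) = sum_w phi(w, h_w) + sum_{(u,v)} phi(u, v, h_u, h_v)."""
    total = {}
    def add(vec):
        for f, v in vec.items():
            total[f] = total.get(f, 0.0) + v
    for w in words:                   # single-variable local vectors
        add(phi_word(w, h[w]))
    for u, v in deps:                 # pairwise local vectors on dependencies
        add(phi_dep(u, v, h[u], h[v]))
    return total

# Toy local feature vectors: indicator features keyed by readable names.
phi_word = lambda w, hw: {f"{w}={hw}": 1.0}
phi_dep = lambda u, v, hu, hv: {f"dep({hu},{hv})": 1.0}

h = {"is": "VB1", "chairman": "NN3"}
phi = global_features(["is", "chairman"], [("is", "chairman")], h,
                      phi_word, phi_dep)
```

Because each summand touches at most two hidden variables, the expectations needed for training decompose over words and dependencies, which is what makes dynamic programming possible.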


SLIDE 25

Applying belief propagation

New restrictions enable dynamic–programming approaches, e.g. belief propagation

BP generalizes the forward–backward algorithm from a chain to a tree
Runtime is O(len(si)·H²), where H = maxw |Hw(ti,j)|

BP efficiently computes

Σh p(ti,j, h | si, Θ)  and  Σh p(h | ti,j, si, Θ) Φ(ti,j, h)
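For a chain of hidden variables (the simplest tree), the dynamic program is exactly forward–backward. A toy sketch checking the O(n·H²) recursion against brute-force enumeration; the pairwise scores in `theta_pair` are invented:

```python
import math
from itertools import product

H = ["a", "b"]                                   # hidden-value domain
theta_pair = {("a", "a"): 1.0, ("b", "b"): 1.0}  # toy pairwise scores

def psi(hu, hv):
    """Pairwise potential exp(phi(u, v, h_u, h_v) . Theta)."""
    return math.exp(theta_pair.get((hu, hv), 0.0))

def chain_partition(n):
    """Forward pass, O(n * |H|^2): msg[hv] = sum_hu msg[hu] * psi(hu, hv)."""
    msg = {h: 1.0 for h in H}
    for _ in range(n - 1):
        msg = {hv: sum(msg[hu] * psi(hu, hv) for hu in H) for hv in H}
    return sum(msg.values())

def brute_partition(n):
    """Exponential enumeration of all |H|^n assignments."""
    return sum(math.prod(psi(h[k], h[k + 1]) for k in range(n - 1))
               for h in product(H, repeat=n))
```

The two functions agree exactly; on a tree-shaped dependency structure the same message-passing idea applies, with messages sent from the leaves toward the root.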

SLIDE 26

Overview of talk

Motivation
Design
  General form of the model
  Training and decoding efficiently
  Creating specific instantiations
Results
Discussion
Conclusion

SLIDE 27

Two areas for choice in the model

Definition of the hidden–value domains Hw(ti,j) Definition of the feature vectors φ

SLIDE 28

Hidden–value domains

Lexical domains allow word refinement

[Figure: each word’s domain contains refined copies of the word itself, e.g. {chairman1, chairman2, chairman3} for chairman and {Elsevier1, Elsevier2, Elsevier3} for Elsevier]


SLIDE 30

Hidden–value domains

Part-of-speech domains allow word clustering

[Figure: each word’s domain contains subdivided copies of its part-of-speech tag, e.g. {NN1, ..., NN5} for chairman and {VB1, ..., VB5} for is]


SLIDE 33

Hidden–value domains

Supersense domains model WordNet ontology

(Ciaramita and Johnson, 2003; Miller et al., 1993)

[Figure: each word’s domain contains subdivided WordNet supersenses, e.g. {noun.person1, noun.person2, noun.person3} for the name words and {verb.stative1–3, verb.social1–3, verb.possession1–3} for is]


SLIDE 37

Hidden–value domains

Hidden–value domains that didn’t work well

Word clustering without part-of-speech subdivisions
WordNet hyper/hyponym ontology
Domains containing mixed values

SLIDE 38

Examples of features

The highest nonterminal

[Figure: a parse tree with chairman’s hidden value NN3 and its highest nonterminal NP(chairman) highlighted]

(NN3, Word=chairman, Highest Nonterminal=NP) ∈ φ(ti,j, chairman, NN3)

SLIDE 39

Examples of features

The governing rule

[Figure: the parse tree with the hidden values VB1 (is) and NN3 (chairman) and the governing rule VP → VB NP highlighted]

(VB1, NN3, Rule=VP → VB NP) ∈ φ(ti,j, is, chairman, VB1, NN3)

SLIDE 40

Overview of talk

Motivation
Design
Results
Discussion
Conclusion

SLIDE 41

Experimental Setup

N-best lists generated by the Collins parser, ni ≈ 30
Training set: WSJ sections 2–21
Development set: WSJ section 0
Test set: WSJ sections 22–24

SLIDE 42

Final test models

Two baseline models

The Collins (1999) base parser
The Collins (2000) reranker

Two mixed models

MIX combines clustering, refinement, and WordNet
MIX+ augments MIX with the features of the Collins reranker

SLIDE 43

Results on Sections 22–24

                  LR     LP     F1
Collins parser    88.19  88.60  88.39
MIX               89.41  89.87  89.64
Collins reranker  89.46  90.07  89.76
MIX+              89.78  90.29  90.03

All comparisons except MIX vs. Collins reranker are significant at p ≤ 0.01 using the sign test


SLIDE 46

Overview of talk

Motivation
Design
Results
Discussion
Conclusion

SLIDE 47

Previous work

Parsing approaches that use hidden variables

Riezler et al. (2002)
Clark and Curran (2004)
Matsuzaki et al. (2005)

Differences with our approach

Use of reranking
Definition of hidden variables
Use of belief propagation

SLIDE 48

Using packed representations

Candidates ti,j represented as a packed forest

Compact representation of many parse trees

Packed representation forces local scope

Features would become locally scoped on non-hidden information
Decoding becomes NP-hard, must approximate with Viterbi (cf. Matsuzaki et al., 2005):

argmax_{ti,j} p(ti,j | si, Θ) ≈ argmax_{ti,j, h} p(ti,j, h | si, Θ)
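A toy illustration of why this approximation can differ from exact decoding (the scores below are invented): a candidate with several medium-probability assignments can beat a candidate with one strong assignment under the marginal, while Viterbi picks the latter.

```python
# exp(Phi(t, h) . Theta) for each (t, h) pair -- invented toy scores.
scores = {
    ("t1", "h1"): 5.0, ("t1", "h2"): 5.0,   # two medium assignments
    ("t2", "h1"): 6.0, ("t2", "h2"): 1.0,   # one strong assignment
}

def exact_decode():
    """argmax_t sum_h score(t, h): requires marginalizing over h."""
    marg = {}
    for (t, _), s in scores.items():
        marg[t] = marg.get(t, 0.0) + s
    return max(marg, key=marg.get)

def viterbi_decode():
    """argmax_{t, h} score(t, h): the tractable approximation."""
    return max(scores, key=scores.get)[0]
```

Here the marginal favours t1 (5 + 5 = 10 > 6 + 1 = 7), but the single best-scoring pair belongs to t2, so the two decoders disagree.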

SLIDE 49

Empirical analysis of hidden values

The model makes hidden–value assignments on the basis of the reranking criterion

i.e. maximize log p(ti,1 | si, Θ)

The empirical distribution of assignments shows linguistically reasonable trends

SLIDE 50

Overview of talk

Motivation
Design
Results
Discussion
Conclusion

SLIDE 51

Concluding remarks

The hidden–variable model defines a new representation for NLP structures
Conditional log–linear model with hidden variables
BP enables efficient and exact training and decoding
Significant improvement over Collins (2000)

SLIDE 52

Example

SLIDE 53

Example

I [will [give/VB2 an example]]
I expected [to [give/VB1 an example]]
I expected [to/TO4 [give an example]]
You expected [me [to/TO1,5 [give an example]]]