

SLIDE 1

Convex relaxations for weakly supervised information extraction

Édouard Grave
Columbia University
edouard.grave@gmail.com

SLIDE 2

Information Extraction

Extract structured information from unstructured documents.

SLIDE 5

Example: named entity recognition

Detect and classify mentions of named entities in text.

The seven-month re-examination of why [U.S.]LOC forces were caught off-guard by the Japanese attack was done at the request of Sen. [Strom Thurmond]PER, R-[S.C.]LOC, chairman of the [Senate Armed Services Committee]ORG, and members of the [Kimmel]PER family.

Traditionally, detect mentions of:

  • people (PER),
  • locations (LOC),
  • organizations (ORG).
SLIDE 6

Example: named entity recognition

Detect and classify mentions of named entities in text.

The seven-month re-examination of why [U.S.]LOC forces were caught off-guard by the Japanese attack was done at the request of Sen. [Strom Thurmond]PER, R-[S.C.]LOC, chairman of the [Senate Armed Services Committee]ORG, and members of the [Kimmel]PER family.

Named entities can also be:

  • genes, cells, proteins, etc.
  • books, movies, games, etc.
  • laptops, phones, cameras, etc.
SLIDE 7

Example: entity linking

Link an entity mention (e.g. Michael Jordan) to a knowledge base.

SLIDE 10

Example: relation extraction

Extract binary relations between named entities from text.

During World War II, Turing worked for the Government Code and Cypher School (GC&CS) at Bletchley Park.

Employee(Alan Turing, GC&CS)
Contains(Bletchley Park, GC&CS)

SLIDE 13

Challenges of information extraction

Most state-of-the-art methods: supervised machine learning.

  • Needs (a lot of) labeled data:
  • expensive to obtain (needs expertise),
  • thousands of different kinds of entities / relations,
  • resources exist for English. But French? Spanish? Russian?
  • Not robust to domain shift:

Our distribution agreement with [Henry Schein]PER renews annually unless terminated by either party.

SLIDE 14

I. Relation extraction
SLIDE 15

Distant supervision for relation extraction

Craven and Kumlien (1999); Mintz et al. (2009)

Knowledge base:

  r      | e1           | e2
  BornIn | Lichtenstein | New York City
  DiedIn | Lichtenstein | New York City

Sentences:

  • Roy Lichtenstein was born in New York City, into an upper-middle-class family.
  • In 1961, Leo Castelli started displaying Lichtenstein's work at his gallery in New York.
  • Roy Lichtenstein died of pneumonia in 1997 in New York City.

SLIDE 17

Distant supervision for relation extraction

Craven and Kumlien (1999); Mintz et al. (2009)

Knowledge base:

  r      | e1           | e2
  BornIn | Lichtenstein | New York City
  DiedIn | Lichtenstein | New York City

Sentences (with latent labels):

  • Roy Lichtenstein was born in New York City, into an upper-middle-class family. → BornIn
  • In 1961, Leo Castelli started displaying Lichtenstein's work at his gallery in New York. → None
  • Roy Lichtenstein died of pneumonia in 1997 in New York City. → DiedIn

SLIDE 19

Multiple instance, multiple label learning

Bunescu and Mooney (2007); Riedel et al. (2010); Hoffmann et al. (2011); Surdeanu et al. (2012)

Entity pair (Lichtenstein, New York City), with labels BornIn and DiedIn:

  • Roy Lichtenstein was born in New York City.
  • Lichtenstein left New York to study in Ohio.

Notation:

  • N pair mentions, represented by vectors xn;
  • I entity pairs pi;
  • K relations;
  • Ein = 1 if pair mention n corresponds to entity pair i;
  • Rik = 1 if entity pair i verifies relation k.

SLIDE 20

Overview

Two-step procedure:

  1. Infer labels for each pair mention;
  2. Train a supervised instance-level relation extractor.

Goal: infer a binary matrix Y such that:

  • Ynk = 1 if pair mention n expresses relation k;
  • Ynk = 0 otherwise.

Approach based on discriminative clustering.

SLIDE 21

(a) Discriminative clustering

SLIDE 25

Discriminative clustering

Xu et al. (2004); Bach and Harchaoui (2007)

Given a loss function ℓ and a regularizer Ω:

  min_Y min_f ∑_{n=1}^{N} ℓ(yn, f(xn)) + Ω(f),   s.t. Y ∈ 𝒴.

SLIDE 26

(b) Weak supervision by constraining Y

SLIDE 28

Weak supervision by constraining Y

Each pair mention expresses exactly one relation:

  ∀n ∈ {1, ..., N},   ∑_{k=1}^{K+1} Ynk = 1.

SLIDE 31

Weak supervision by constraining Y

If entity pair i verifies relation k, then at least one pair mention n corresponding to pair i expresses that relation:

  ∀(i, k) such that Rik = 1,   ∑_{n=1}^{N} Ein Ynk ≥ 1.

(Ein = 1 if pair mention n corresponds to entity pair i.)

SLIDE 34

Weak supervision by constraining Y

If entity pair i does not verify relation k, then no pair mention n corresponding to pair i expresses that relation:

  ∀(i, k) such that Rik = 0,   ∑_{n=1}^{N} Ein Ynk = 0.

(Ein = 1 if pair mention n corresponds to entity pair i.)

SLIDE 36

Weak supervision by constraining Y

For a given entity pair i, at most c percent of its pair mentions are classified as None:

  ∀i ∈ {1, ..., I},   ∑_{n=1}^{N} Ein Yn(K+1) ≤ c ∑_{n=1}^{N} Ein.

SLIDE 37

Weak supervision by constraining Y

These constraints are equivalent to: Y1 = 1, (EY) ◦ S ≥ R.
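These matrix constraints can be sanity-checked on a toy instance. The sketch below follows the deck's notation for E, R, and Y; the deck does not spell out S, so the sign matrix used here (Sik = +1 where Rik = 1, −1 where Rik = 0, and 0 on the None column) is my assumption, and the c-percent constraint is left out:

```python
import numpy as np

# Toy instance (all values hypothetical): 3 pair mentions, 2 entity pairs,
# K = 2 relations plus the "None" column.
E = np.array([[1, 1, 0],         # entity pair 0 has pair mentions 0 and 1
              [0, 0, 1]])        # entity pair 1 has pair mention 2
R = np.array([[1, 0, 0],         # pair 0 verifies relation 0 only
              [0, 1, 0]])        # pair 1 verifies relation 1 only
S = np.where(R == 1, 1.0, -1.0)  # assumed sign matrix: +1 where Rik = 1, -1 elsewhere
S[:, -1] = 0.0                   # leave the None column unconstrained in this sketch

Y = np.array([[1, 0, 0],         # mention 0 labeled with relation 0
              [0, 0, 1],         # mention 1 labeled None
              [0, 1, 0]])        # mention 2 labeled with relation 1

assert np.all(Y.sum(axis=1) == 1)  # Y1 = 1: exactly one label per mention
assert np.all((E @ Y) * S >= R)    # (EY) . S >= R: distant-supervision constraints
```

With Sik = −1 and Rik = 0, the inequality forces (EY)ik ≤ 0, hence (EY)ik = 0 since E and Y are nonnegative, recovering the "no mention expresses k" constraint from the previous slides.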

SLIDE 38

(c) Problem formulation

SLIDE 40

Problem formulation

Using linear classifiers W ∈ R^{D×(K+1)} and the squared loss:

  min_{Y,W}  ½ ‖Y − XW‖_F² + (λ/2) ‖W‖_F²,
  s.t. Y ∈ {0, 1}^{N×(K+1)},  Y1 = 1,  (EY) ◦ S ≥ R.

Closed-form solution for W:

  W = (X⊤X + λI_D)⁻¹ X⊤Y.

SLIDE 42

Problem formulation

Replacing W by its optimal value:

  min_Y  ½ tr(Y⊤(XX⊤ + λI_N)⁻¹Y),
  s.t. Y ∈ {0, 1}^{N×(K+1)},  Y1 = 1,  (EY) ◦ S ≥ R.

This is a quadratic integer program, hard to solve in general.
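The elimination of W can be checked numerically. One detail: substituting the closed-form W gives the optimal value (λ/2)·tr(Y⊤(XX⊤ + λI_N)⁻¹Y); the trace form on the slide drops the constant factor λ, which does not affect the minimizer. A small numpy check on random data (shapes assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 20, 5, 3
X = rng.standard_normal((N, D))
Y = rng.standard_normal((N, K + 1))
lam = 0.1

# Closed-form minimizer of (1/2)||Y - XW||_F^2 + (lam/2)||W||_F^2:
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
direct = 0.5 * np.linalg.norm(Y - X @ W) ** 2 + 0.5 * lam * np.linalg.norm(W) ** 2

# Reduced objective after eliminating W (the trace form, times lam):
reduced = 0.5 * lam * np.trace(Y.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), Y))

assert np.isclose(direct, reduced)
```

The identity behind this is I_N − X(X⊤X + λI_D)⁻¹X⊤ = λ(XX⊤ + λI_N)⁻¹, a standard matrix-inversion-lemma computation.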

SLIDE 44

Convex relaxation

Relaxing the constraint Y ∈ {0, 1}^{N×(K+1)} into Y ∈ [0, 1]^{N×(K+1)}:

  min_Y  ½ tr(Y⊤(XX⊤ + λI_N)⁻¹Y),
  s.t. Y ∈ [0, 1]^{N×(K+1)},  Y1 = 1,  (EY) ◦ S ≥ R.

This is a convex quadratic program. It only depends on the kernel XX⊤.

SLIDE 45

Rounding

Given a solution Y of the relaxed problem, orthogonal projection onto

  {M ∈ {0, 1}^{N×(K+1)} | M1 = 1}.

This consists in taking the argmax along the rows of Y.
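Projection and rounding can be sketched in a few lines of numpy. This is a simplification for illustration, not the dual FISTA solver of the next slide: it keeps only the relaxed simplex constraints (Y ∈ [0, 1], Y1 = 1), drops (EY) ◦ S ≥ R, runs plain projected gradient on the trace objective, then rounds by row-wise argmax:

```python
import numpy as np

def project_rows_to_simplex(Y):
    """Euclidean projection of each row of Y onto the probability simplex."""
    N, K = Y.shape
    U = np.sort(Y, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    ind = np.arange(1, K + 1)
    rho = (U - css / ind > 0).sum(axis=1)        # active components per row
    theta = css[np.arange(N), rho - 1] / rho
    return np.maximum(Y - theta[:, None], 0.0)

def solve_and_round(Q, K1, steps=500):
    """Projected gradient on min_Y 0.5 tr(Y^T Q Y) with rows of Y on the
    simplex, followed by the rounding step: argmax along the rows."""
    N = Q.shape[0]
    Y = np.full((N, K1), 1.0 / K1)
    step = 1.0 / np.linalg.norm(Q, 2)            # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        Y = project_rows_to_simplex(Y - step * (Q @ Y))
    labels = Y.argmax(axis=1)                    # rounding: projection onto {0,1} rows, M1 = 1
    return Y, labels
```

For the relaxed problem of the previous slide one would take Q = (XX⊤ + λI_N)⁻¹ and K1 = K + 1.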

SLIDE 46

Optimization

We optimize the dual because:

  • there is no matrix inverse (easy to compute the gradient),
  • the constraints are simpler (easy to project onto).

We use an accelerated projected gradient algorithm (FISTA). Overall complexity: O(NFK), where:

  • N is the number of pair mentions (sentences);
  • F is the average number of features;
  • K is the number of classes.
SLIDE 47

(d) Experiments

SLIDE 48

Experiments: dataset

Dataset introduced by Riedel et al. (2010):

  • Articles from the New York Times corpus.
  • Entities extracted using the Stanford named entity recognizer.
  • Entity mentions aligned to Freebase using a string match.

There are:

  • 52 relations,
  • 4,200 entity pairs,
  • 120,000 pair mentions.
SLIDE 49

Experiments: features

We use the features proposed by Mintz et al. (2009):

  • Lexical features, such as:
  • the sequence of words between the entities;
  • a window of k words before/after the first/second entity;
  • the corresponding part-of-speech tags.
  • Syntactic features, such as:
  • the path in the dependency tree between the two entities;
  • the neighbors of the two entities that are not on the path.
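As a rough illustration of the lexical templates, here is what extracting them might look like; the actual templates of Mintz et al. (2009) are richer (e.g. they conjoin entity types and part-of-speech tags), and the feature names below are made up:

```python
def lexical_features(tokens, e1_span, e2_span, k=2):
    """tokens: list of words; e1_span / e2_span: (start, end) token indices."""
    s1, t1 = e1_span
    s2, t2 = e2_span
    between = tokens[t1:s2]                       # words between the two entities
    feats = ["BETWEEN=" + "_".join(between)]
    feats += ["BEFORE_E1=" + w for w in tokens[max(0, s1 - k):s1]]  # window before first entity
    feats += ["AFTER_E2=" + w for w in tokens[t2:t2 + k]]           # window after second entity
    return feats

toks = "During World War II , Turing worked for the Government Code and Cypher School".split()
print(lexical_features(toks, (5, 6), (9, 14)))
```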
SLIDE 52

Experiments: results

[Figure: Precision/recall curves for Mintz et al. (2009), Hoffmann et al. (2011), Surdeanu et al. (2012), and this work, on the Riedel et al. (2010) dataset, for the task of aggregate extraction.]

SLIDE 53

Experiments: results

[Figure: Precision/recall curves per relation (/location/location/contains, /people/person/place_lived, /people/person/nationality, /people/person/place_of_birth, /business/person/company) for our method, for the task of aggregate extraction, on the Riedel et al. (2010) dataset.]

SLIDE 54

Experiments: results

[Figure: Precision/recall curves for Hoffmann et al. (2011) and this work, for the task of sentential extraction, on the manually labeled dataset of Hoffmann et al. (2011).]

SLIDE 55

II. Named entity classification
SLIDE 58

Motivation

Extract named entities from technical text (e.g. financial reports).

Limitations of state-of-the-art NER:

  • lack of labeled data for technical domains,
  • suffers from domain shift.

Examples of errors (from financial reports of healthcare companies):

  • Henry Schein classified as Person,
  • Aspen classified as Location.

Both are healthcare companies. In some domains, named entities are (almost) unambiguous.

SLIDE 61

Bootstrapping for named entity extraction

Riloff and Jones (1999); Collins and Singer (1999)

Seed list: Merck, Endosense.

Sentences:

  • In August 2014, Merck acquired Idenix for approximately $3.85 billion in cash.
  • St. Jude Medical acquired Endosense for $171 million in net cash consideration.

Patterns: "[COMPANY] acquired", "acquired [COMPANY]".

SLIDE 64

Overview

Input: a seed list of entities and unlabeled text.

  1. Extract potential named entity mentions (sequences of contiguous tokens tagged as NNP or NNPS);
  2. Try to match each mention to a seed, using exact string matching;
  3. Train a multiclass classifier using those examples.

Step 3 uses positive and unlabeled examples only; there are no negative examples. This is an instance of PU learning (learning from positive and unlabeled examples).
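Step 2 is simple enough to sketch directly; the seed lists and mentions below are illustrative:

```python
# Label mentions by exact string match against the seed lists;
# the seeds, class names, and mentions here are made up for illustration.
seeds = {"Merck": "COMPANY", "Endosense": "COMPANY", "Lipitor": "DRUG"}

mentions = ["Merck", "Idenix", "St. Jude Medical", "Lipitor"]

P, U, labels = [], [], {}
for n, mention in enumerate(mentions):
    if mention in seeds:      # exact string match to a seed
        P.append(n)
        labels[n] = seeds[mention]
    else:                     # no match: unlabeled, *not* negative (PU setting)
        U.append(n)

assert P == [0, 3] and U == [1, 2]
```

Unmatched mentions like Idenix end up in U rather than being treated as negatives, which is exactly what makes step 3 a PU-learning problem.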

SLIDE 66

Notation

  • xn: vectors describing the named entity mentions;
  • P: set of indices of positive examples;
  • U: set of indices of unlabeled examples;
  • cn: labels of the positive examples (K + 1 corresponds to Other).

Infer a binary matrix Y ∈ {0, 1}^{N×(K+1)} such that:

  • Ynk = 1 if named entity mention n is of type k,
  • Ynk = 0 otherwise,

and a corresponding classifier f such that f(xn) = yn.

SLIDE 67

Weak supervision by constraining Y

Each named entity mention belongs to exactly one class:

  ∀n ∈ {1, ..., N},   ∑_{k=1}^{K+1} Ynk = 1.

SLIDE 68

Weak supervision by constraining Y

For positive examples, Y agrees with the distant supervision labels: ∀n ∈ P, Yn,cn = 1.

SLIDE 69

Weak supervision by constraining Y

Impose that the percentage of unlabeled examples classified as Other is at least p:

  ∑_{n ∈ U} Yn(K+1) ≥ pN.

SLIDE 70

Problem formulation

Using linear classifiers W ∈ R^{D×(K+1)} and the squared loss:

  min_{Y,W}  ½ ‖Y − XW‖_F² + (λ/2) ‖W‖_F²,
  s.t. Y ∈ {0, 1}^{N×(K+1)},  Y ∈ 𝒴.

This is a quadratic integer program, hard to solve in general.

SLIDE 71

Convex relaxation

Relaxing the constraint Y ∈ {0, 1}^{N×(K+1)} into Y ∈ [0, 1]^{N×(K+1)}:

  min_{Y,W}  ½ ‖Y − XW‖_F² + (λ/2) ‖W‖_F²,
  s.t. Y ∈ [0, 1]^{N×(K+1)},  Y ∈ 𝒴.

This problem is jointly convex in Y and W.
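Because the relaxed problem is jointly convex, simple alternating minimization converges to a global optimum. A numpy sketch under simplifying assumptions: positive rows are clamped to their seed label, unlabeled rows are constrained only to the simplex (the relaxed {0, 1} rows with row sums 1), and the "at least p classified as Other" constraint is omitted. Both steps are exact: W has the ridge closed form, and the Y-step reduces to projecting the rows of XW onto the simplex:

```python
import numpy as np

def project_row_to_simplex(y):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1] + 1
    return np.maximum(y - css[rho - 1] / rho, 0.0)

def alternate(X, P, c, K1, lam=0.1, iters=50):
    """Alternating minimization for the jointly convex relaxed problem
    (simplified: the p-percent Other constraint is dropped)."""
    N, D = X.shape
    Y = np.full((N, K1), 1.0 / K1)
    for n, cls in zip(P, c):         # clamp positives to their distant label
        Y[n] = 0.0
        Y[n, cls] = 1.0
    for _ in range(iters):
        # W-step: ridge closed form (X^T X + lam I)^{-1} X^T Y.
        W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
        # Y-step: argmin of 0.5 ||Y - XW||^2 over the constraints is the
        # row-wise projection of XW onto the simplex (positives stay fixed).
        Z = X @ W
        for n in range(N):
            if n not in P:
                Y[n] = project_row_to_simplex(Z[n])
    return Y, W
```

Each step solves its subproblem exactly, so the objective decreases monotonically; joint convexity is what guarantees this reaches a global minimum rather than a local one.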

SLIDE 72

Experiments: dataset and features

Data: financial reports of healthcare companies.

Seed lists:

  • 578 publicly traded healthcare companies,
  • 200 most searched drugs on the website www.rxlist.com.

Features:

  • lowercased tokens of the mention,
  • window of k words to the left/right of the mention,
  • k ancestors (with syntactic roles) in the dependency tree,
  • vectorial representation of the mention.
SLIDE 73

Experiments: results

               |     Companies      |       Drugs
               |  P     R     F1    |  P     R     F1
  Stanford NER | N/A   52.6   N/A   | N/A   N/A   N/A
  String match | 98.9  44.2   61.1  | 100   32.3  48.8
  SVM (asym)   | 87.0  92.8   89.8  | 86.5  79.2  82.7
  This work    | 82.9  95.8   88.9  | 87.4  94.0  90.6

SLIDE 74

Experiments: results

[Figure: Influence of the parameter p on precision, recall, and F1, for Companies and Drugs.]

SLIDE 76

Conclusion

Distant supervision for information extraction:

  • based on discriminative clustering, using distant supervision as constraints;
  • leads to a convex formulation;
  • competitive with the state of the art on relation extraction.

Work in progress:

  • faster optimization methods for our approach;
  • kernelization of our methods;
  • generalization to ambiguous named entities.
SLIDE 78

References

  • A convex relaxation for weakly supervised relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • Weakly supervised named entity classification. In Proceedings of the Workshop on Automated Knowledge Base Construction (AKBC), 2014.

Code: available on my webpage, but it is "research code". I might release more general code in the future.

SLIDE 79

Thank you for your attention!

SLIDE 80

References I

Bach, F. and Harchaoui, Z. (2007). Diffrac: a discriminative and flexible framework for clustering. In NIPS.

Bunescu, R. and Mooney, R. (2007). Learning to extract relations from the web using minimal supervision. In Annual Meeting of the Association for Computational Linguistics.

Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In EMNLP.

Craven, M. and Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. In ISMB.

Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L., and Weld, D. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1.

SLIDE 81

References II

Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 2.

Riedel, S., Yao, L., and McCallum, A. (2010). Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases. Springer.

Riloff, E. and Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI.

Surdeanu, M., Tibshirani, J., Nallapati, R., and Manning, C. (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. (2004). Maximum margin clustering. In NIPS.