Convex relaxations for weakly supervised information extraction

Edouard Grave
Columbia University
edouard.grave@gmail.com
Information Extraction
Extract structured information from unstructured documents.
Example: named entity recognition
Detect and classify mentions of named entities in text.

The seven-month re-examination of why U.S. forces were caught off-guard by the Japanese attack was done at the request of Sen. Strom Thurmond, R-S.C., chairman of the Senate Armed Services Committee, and members of the Kimmel family.
Example: named entity recognition
Detect and classify mentions of named entities in text.

The seven-month re-examination of why [U.S.]LOC forces were caught off-guard by the Japanese attack was done at the request of Sen. [Strom Thurmond]PER, R-[S.C.]LOC, chairman of the [Senate Armed Services Committee]ORG, and members of the [Kimmel]PER family.

Traditionally, detect mentions of:
- people (PER),
- locations (LOC),
- organizations (ORG).
Example: named entity recognition
Named entities can also be:
- genes, cells, proteins, etc.
- books, movies, games, etc.
- laptops, phones, cameras, etc.
Example: entity linking
Link an entity mention (e.g. Michael Jordan) to a knowledge base.
Example: relation extraction
Extract binary relations between named entities from text.

During World War II, Turing worked for the Government Code and Cypher School (GC&CS) at Bletchley Park.
Example: relation extraction
Extract binary relations between named entities from text.

During World War II, Turing worked for the Government Code and Cypher School (GC&CS) at Bletchley Park.

Employee(Alan Turing, GC&CS)
Contains(Bletchley Park, GC&CS)
Challenges of information extraction
Most state-of-the-art methods: supervised machine learning.
- Needs (a lot of) labeled data:
- expensive to obtain (need expertise),
- thousands of different kinds of entities / relations,
- resources exist for English. But French? Spanish? Russian?
- Not robust to domain shift:
Our distribution agreement with [Henry Schein]PER renews annually unless terminated by either party.
I. Relation extraction
Distant supervision for relation extraction
Craven and Kumlien (1999); Mintz et al. (2009)
Knowledge base:

  r       e1            e2
  BornIn  Lichtenstein  New York City
  DiedIn  Lichtenstein  New York City

Sentences and latent labels:

- Roy Lichtenstein was born in New York City, into an upper-middle-class family. → BornIn
- In 1961, Leo Castelli started displaying Lichtenstein's work at his gallery in New York. → None
- Roy Lichtenstein died of pneumonia in 1997 in New York City. → DiedIn
Multiple instance, multiple label learning
Bunescu and Mooney (2007); Riedel et al. (2010); Hoffmann et al. (2011); Surdeanu et al. (2012)
(Lichtenstein, New York City):
- Roy Lichtenstein was born in New York City.
- Lichtenstein left New York to study in Ohio.
Candidate labels: BornIn, DiedIn.

Notation:
- N pair mentions, represented by vectors x_n;
- I entity pairs p_i;
- K relations;
- E_in = 1 if pair mention n corresponds to entity pair i;
- R_ik = 1 if entity pair i verifies relation k.
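As a concrete illustration, the supervision matrices E and R can be built as follows. This is a minimal sketch on a hypothetical toy knowledge base and mention list (not the paper's data):

```python
import numpy as np

# Hypothetical toy setup: N = 4 pair mentions, I = 2 entity pairs, K = 2 relations.
# mention_pair[n] is the index i of the entity pair that mention n refers to.
mention_pair = [0, 0, 1, 1]
# kb[i] is the set of relation indices k that entity pair i verifies.
kb = {0: {0}, 1: {1}}
N, I, K = 4, 2, 2

# E[i, n] = 1 if pair mention n corresponds to entity pair i.
E = np.zeros((I, N))
for n, i in enumerate(mention_pair):
    E[i, n] = 1

# R[i, k] = 1 if entity pair i verifies relation k.
R = np.zeros((I, K))
for i, rels in kb.items():
    for k in rels:
        R[i, k] = 1
```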
Overview
Two-step procedure:
1. Infer labels for each pair mention;
2. Train a supervised instance-level relation extractor.
Goal: infer a binary matrix Y such that:
- Ynk = 1 if pair mention n expresses relation k;
- Ynk = 0 otherwise.
Approach based on discriminative clustering.
(a) Discriminative clustering
Discriminative clustering
Xu et al. (2004); Bach and Harchaoui (2007)
Given a loss function ℓ and a regularizer Ω:

  min_Y min_f  Σ_{n=1}^N ℓ(y_n, f(x_n)) + Ω(f),   s.t. Y ∈ 𝒴.
(b) Weak supervision by constraining Y
Weak supervision by constraining Y

Each pair mention expresses exactly one relation:

  ∀n ∈ {1, ..., N},   Σ_{k=1}^{K+1} Y_nk = 1.
Weak supervision by constraining Y

If entity pair i verifies relation k, then at least one pair mention n corresponding to pair i expresses that relation:

  ∀(i, k) such that R_ik = 1,   Σ_{n=1}^N E_in Y_nk ≥ 1.

(Recall E_in = 1 if pair mention n corresponds to entity pair i.)
Weak supervision by constraining Y

If entity pair i does not verify relation k, then no pair mention n corresponding to pair i expresses that relation:

  ∀(i, k) such that R_ik = 0,   Σ_{n=1}^N E_in Y_nk = 0.
Weak supervision by constraining Y

For a given entity pair i, at most c percent of its pair mentions are classified as None:

  ∀i ∈ {1, ..., I},   Σ_{n=1}^N E_in Y_n(K+1) ≤ c Σ_{n=1}^N E_in.
Weak supervision by constraining Y
These constraints are equivalent to:

  Y1 = 1,   (EY) ∘ S ≥ R.
(c) Problem formulation
Problem formulation
Using linear classifiers W ∈ R^{D×(K+1)} and the squared loss:

  min_{Y,W}  (1/2) ‖Y − XW‖²_F + (λ/2) ‖W‖²_F,

  s.t. Y ∈ {0,1}^{N×(K+1)},   Y1 = 1,   (EY) ∘ S ≥ R.
Problem formulation

Closed-form solution for W:

  W = (X⊤X + λ I_D)⁻¹ X⊤ Y.
Problem formulation
Replacing W by its optimal value:

  min_Y  (1/2) tr( Y⊤ (XX⊤ + λ I_N)⁻¹ Y ),

  s.t. Y ∈ {0,1}^{N×(K+1)},   Y1 = 1,   (EY) ∘ S ≥ R.
This is a quadratic integer program. Hard to solve in general.
Convex relaxation
Relaxing the constraints Y ∈ {0,1}^{N×(K+1)} into Y ∈ [0,1]^{N×(K+1)}:

  min_Y  (1/2) tr( Y⊤ (XX⊤ + λ I_N)⁻¹ Y ),

  s.t. Y ∈ [0,1]^{N×(K+1)},   Y1 = 1,   (EY) ∘ S ≥ R.

This is a convex quadratic program.
It only depends on the kernel XX⊤.
Rounding
Given a solution Y of the relaxed problem, we take its orthogonal projection onto:

  { M ∈ {0,1}^{N×(K+1)} | M1 = 1 }.

This amounts to taking the argmax along the rows of Y.
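In code, this rounding step is essentially a one-liner. A sketch with made-up soft labels:

```python
import numpy as np

# Rows of the relaxed solution are soft class memberships in [0, 1].
Y_relaxed = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])

# Projecting onto {M in {0,1}^{N x (K+1)} | M1 = 1} is a row-wise argmax,
# followed by one-hot encoding.
labels = Y_relaxed.argmax(axis=1)
Y_rounded = np.eye(Y_relaxed.shape[1])[labels]
print(labels.tolist())   # → [0, 2]
```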
Optimization
We optimize the dual because:
- there is no matrix inverse (easy to compute the gradient),
- constraints are simpler (easy to project on the constraints).
We use an accelerated projected gradient algorithm (FISTA). Overall complexity: O(NFK).
- N is the number of pair mentions (sentences);
- F is the average number of features;
- K is the number of classes.
(d) Experiments
Experiments: dataset
Dataset introduced by Riedel et al. (2010):
- Articles from the New York Times corpus.
- Entities extracted using the Stanford named entity recognizer.
- Entity mentions aligned to Freebase using a string match.
There are
- 52 relations,
- 4,200 entity pairs,
- 120,000 pair mentions.
Experiments: features
We use the features proposed by Mintz et al. (2009):
- Lexical features, such as:
- sequence of words between entities;
- window of k words before/after the first/second entity;
- the corresponding part-of-speech tags;
- Syntactic features, such as:
- path in the dependency tree between the two entities;
- neighbors of the two entities that are not in the path.
Experiments: results

[Figure: Precision/recall curves for different methods (Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012; this work) on the Riedel et al. (2010) dataset, for the task of aggregate extraction.]
Experiments: results
[Figure: Precision/recall curves per relation for our method (/location/location/contains, /people/person/place_lived, /people/person/nationality, /people/person/place_of_birth, /business/person/company), for the task of aggregate extraction, on the Riedel et al. (2010) dataset.]
Experiments: results
[Figure: Precision/recall curves for the task of sentential extraction, comparing Hoffmann et al. (2011) and this work, on the manually labeled dataset of Hoffmann et al. (2011).]
II. Named entity classification
Motivation
Extract named entities from technical text (e.g. financial reports). Limitations of state-of-the-art NER:
- lack of labeled data for technical domains,
- suffers from domain shift.
Examples of errors (from financial reports of healthcare companies):
- Henry Schein classified as Person,
- Aspen classified as Location.
Both are healthcare companies. For some domains, named entities are (almost) unambiguous.
Bootstrapping for named entity extraction
Riloff and Jones (1999); Collins and Singer (1999)
Seed list: Merck, Endosense.

Sentences:
- In August 2014, Merck acquired Idenix for approximately $3.85 billion in cash.
- St. Jude Medical acquired Endosense for $171 million in net cash consideration.
Patterns extracted:
- [COMPANY] acquired
- acquired [COMPANY]
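The seed-matching idea can be sketched as follows. This is a toy illustration using naive whitespace tokenization, not the actual system's tagger-based mention extraction:

```python
seeds = ["Merck", "Endosense"]
sentences = [
    "In August 2014, Merck acquired Idenix for approximately $3.85 billion in cash.",
    "St. Jude Medical acquired Endosense for $171 million in net cash consideration.",
]

# For each seed occurrence, record the word immediately before and after
# as simple contextual patterns (e.g. "[COMPANY] acquired").
patterns = set()
for sent in sentences:
    tokens = sent.replace(",", "").rstrip(".").split()
    for n, tok in enumerate(tokens):
        if tok in seeds:
            if n + 1 < len(tokens):
                patterns.add(f"[COMPANY] {tokens[n + 1]}")
            if n > 0:
                patterns.add(f"{tokens[n - 1]} [COMPANY]")
```

Among the extracted patterns are the two shown on the slide, "[COMPANY] acquired" and "acquired [COMPANY]"; a real system would then score and filter such candidates.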
Overview
Input: a seed list of entities and unlabeled text.

1. Extract potential named entity mentions (sequences of contiguous tokens tagged as NNP or NNPS);
2. Try to match each mention to a seed, using exact string matching;
3. Train a multiclass classifier on those examples.

Step 3 has positive and unlabeled examples only; no negative examples. This is an instance of PU-learning (learning from positive and unlabeled examples).
Notations
- x_n: vectors describing the named entity mentions;
- P: set of indices of positive examples;
- U: set of indices of unlabeled examples;
- c_n: labels of the positive examples (K + 1 corresponds to Other).

Infer a binary matrix Y ∈ {0,1}^{N×(K+1)} such that:
- Y_nk = 1 if named entity mention n is of type k,
- Y_nk = 0 otherwise;
and a corresponding classifier f such that f(x_n) = y_n.
Weak supervision by constraining Y
Each named entity mention belongs to exactly one class:

  ∀n ∈ {1, ..., N},   Σ_{k=1}^{K+1} Y_nk = 1.
Weak supervision by constraining Y
For positive examples, Y agrees with distant supervision:

  ∀n ∈ P,   Y_{n,c_n} = 1.
Weak supervision by constraining Y
Impose that at least a proportion p of examples are classified as Other:

  Σ_{n∈U} Y_n(K+1) ≥ p N.
Problem formulation
Using linear classifiers W ∈ R^{D×(K+1)} and the squared loss:

  min_{Y,W}  (1/2) ‖Y − XW‖²_F + (λ/2) ‖W‖²_F,

  s.t. Y ∈ {0,1}^{N×(K+1)},   Y ∈ 𝒴.

This is a quadratic integer program. Hard to solve in general.
Convex relaxation
Relaxing the constraints Y ∈ {0,1}^{N×(K+1)} into Y ∈ [0,1]^{N×(K+1)}:

  min_{Y,W}  (1/2) ‖Y − XW‖²_F + (λ/2) ‖W‖²_F,

  s.t. Y ∈ [0,1]^{N×(K+1)},   Y ∈ 𝒴.

This problem is jointly convex in Y and W.
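Because the relaxed problem is jointly convex, one simple (though not necessarily fastest) approach is alternating minimization: a closed-form ridge update in W, then a projected update in Y. The sketch below is a deliberate simplification, not the paper's solver: it enforces only the box constraint and the pinned positive labels, not the full constraint set 𝒴:

```python
import numpy as np

def alternating_min(X, Y0, pinned, lam=1.0, n_iter=100):
    """Minimize 0.5*||Y - XW||_F^2 + 0.5*lam*||W||_F^2 over W and a
    partially constrained Y (box [0, 1]; entries in `pinned` stay fixed).
    `pinned` is a dict {(n, k): value} coming from distant supervision."""
    D = X.shape[1]
    Y = Y0.copy()
    for _ in range(n_iter):
        # Exact minimizer in W for fixed Y (ridge regression).
        W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
        # Minimizer in Y for fixed W, projected onto the simplified
        # constraint set: clip to [0, 1], then restore pinned labels.
        Y = np.clip(X @ W, 0.0, 1.0)
        for (n, k), v in pinned.items():
            Y[n, k] = v
    return Y, W

# Tiny run on hypothetical random data: mention 0 is pinned to class 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
Y0 = np.full((6, 3), 1.0 / 3)
Y, W = alternating_min(X, Y0, pinned={(0, 0): 1.0}, n_iter=20)
```

A full implementation would also project onto Y1 = 1 and the Other-proportion constraint, which is part of why the authors work with the dual.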
Experiments: dataset and features
Data: financial reports of healthcare companies.

Seed lists:
- 578 publicly traded healthcare companies,
- 200 most searched drugs on the website www.rxlist.com.
Features:
- lowercased tokens of the mention,
- window of k words to the left/right of the mention,
- k ancestors (with syntactic roles) in the dependency tree,
- vectorial representation of the mention.
Experiments: results
              |      Companies      |        Drugs
              |  P     R     F1     |  P     R     F1
Stanford NER  |  N/A   52.6  N/A    |  N/A   N/A   N/A
String match  |  98.9  44.2  61.1   |  100   32.3  48.8
SVM (asym)    |  87.0  92.8  89.8   |  86.5  79.2  82.7
This work     |  82.9  95.8  88.9   |  87.4  94.0  90.6
Experiments: results
[Figure: Influence of the parameter p on precision, recall and F1, for Companies (left) and Drugs (right).]
Conclusion
Distant supervision for information extraction:
- based on discriminative clustering, using distant supervision as constraints;
- leads to a convex formulation;
- competitive with the state of the art on relation extraction.
Work in progress:
- faster optimization methods for our approach;
- kernelization of our methods;
- generalize to ambiguous named entities.
References
A convex relaxation for weakly supervised relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Weakly supervised named entity classification. In Proceedings of the Workshop on Automated Knowledge Base Construction (AKBC), 2014.

Code: available on my webpage, but it is "research code". More general code might be released in the future.
Thank you for your attention!
References I
Bach, F. and Harchaoui, Z. (2007). DIFFRAC: a discriminative and flexible framework for clustering. In NIPS.

Bunescu, R. and Mooney, R. (2007). Learning to extract relations from the web using minimal supervision. In ACL.

Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In EMNLP.

Craven, M. and Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. In ISMB.

Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L., and Weld, D. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.