Interpretability and Robustness for Multi-Hop QA
Mohit Bansal
(MRQA-EMNLP 2019 Workshop)
Multihop-QA's Diverse Requirements
1. Interpretability and Modularity
2. Multiple Reasoning Chains Assembling
3. Adversarial Shortcut Robustness
4. Scalability and Data Augmentation
5. Commonsense/External Knowledge
Question: “Which NFL team represented the AFC at Super Bowl 50?”
Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers …
Answer: “Denver Broncos”
[Rajpurkar et al., 2016]
[Figure: BiDAF architecture. From bottom to top: Character Embed Layer (Char-CNN), Word Embed Layer (GLOVE), Contextual Embed Layer (LSTM), Attention Flow Layer (Query2Context and Context2Query Attention), Modeling Layer (LSTM), and Output Layer (Dense + Softmax for Start, LSTM + Softmax for End). Inputs are the context words x1 … xT and the query words q1 … qJ; the contextual embeddings are h1 … hT and u1 … uJ, the attention outputs are g1 … gT, and the modeling outputs are m1 … mT.]
[Seo et al., 2017]
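The two attention directions named in the BiDAF figure can be sketched in a few lines. This is a minimal illustration, not the published implementation: it assumes a plain dot-product similarity (the published model learns a trainable similarity function), and the helper names `bidaf_attention`, `weighted_sum`, and the toy vectors are made up for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def bidaf_attention(H, U):
    """H: T context vectors, U: J query vectors (same dimension)."""
    # Similarity matrix S[t][j]. Assumption: plain dot product.
    S = [[dot(h, u) for u in U] for h in H]
    # Context2Query: each context word attends over the query words.
    c2q = [weighted_sum(softmax(row), U) for row in S]
    # Query2Context: a softmax over the per-row maxima weights the context
    # words that are most similar to some query word.
    b = softmax([max(row) for row in S])
    q2c = weighted_sum(b, H)
    return c2q, q2c, b

# Toy inputs (hypothetical): 3 context vectors, 2 query vectors.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, 2.0]]
c2q, q2c, b = bidaf_attention(H, U)
```

In the full model the attended vectors are combined with the contextual embeddings before the modeling layer; that combination step is left out here.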
[Yang et al., 2018]
Question: “What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?”
Context: Kasper Schmeichel is a Danish professional footballer ... He is the son of former Manchester United and Danish international goalkeeper Peter Schmeichel.
Context: Peter Bolesław Schmeichel is a Danish former professional footballer … was voted the IFFHS World's Best Goalkeeper in 1992 …
Bridge Entity: Kasper Schmeichel -> Peter Schmeichel
Answer: World’s Best Goalkeeper
[Yang et al., 2018]
Question: “Were Scott Derrickson and Ed Wood of the same nationality?”
Context: Scott Derrickson is an American director ...
Context: Edward Wood Jr. was an American filmmaker ...
Comparison: Scott Derrickson -> America; Ed Wood -> America
Answer: Yes
To achieve this, we need:
to use for each sub-question).
[Jiang and Bansal, EMNLP 2019]
Neural Modular Networks (NMNs) were originally proposed for Visual Question Answering (VQA), including the VQA and CLEVR datasets (Andreas et
The original NMN controllers are usually trained with RL. Hu et al. (2018) proposed a stack-based NMN with soft module execution to avoid non-differentiability in optimization
every step.
from/onto a stack.
Inputs: Question emb: u, Sub-question vector: c, Context emb: h
Module Name         Input Attention   Output Type   Implementation Details
Find(u, c, h)       (None)            Attention
Relocate(u, c, h)   a1                Attention
Compare(u, c, h)    a1, a2            Yes/No
NoOp(u, c, h)       (None)            (None)        (None)
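The module table can be read as a small stack machine: Find pushes an attention, Relocate pops one and pushes a new one, Compare pops two and returns Yes/No, and NoOp does nothing. The sketch below illustrates only those stack mechanics. It is an assumption-heavy toy: the module bodies are placeholder formulas (not the paper's), the layout is executed hard, one module per step, rather than as the soft weighted mixture of all modules described in the talk, and the question embedding u is omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(att, h):
    """Attention-weighted sum of the context vectors."""
    return [sum(a * v[d] for a, v in zip(att, h)) for d in range(len(h[0]))]

# Toy module bodies (assumptions made for illustration only).
def find(c, h, stack):
    stack.append(softmax([dot(v, c) for v in h]))      # push a new attention

def relocate(c, h, stack):
    a1 = stack.pop()                                    # pop one attention
    bridge = attend(a1, h)
    query = [ci + bi for ci, bi in zip(c, bridge)]
    stack.append(softmax([dot(v, query) for v in h]))   # push the shifted attention

def compare(c, h, stack):
    a2, a1 = stack.pop(), stack.pop()                   # pop two attentions
    return "yes" if dot(attend(a1, h), attend(a2, h)) > 0.5 else "no"

def noop(c, h, stack):
    return None

MODULES = {"find": find, "relocate": relocate, "compare": compare, "noop": noop}

def run_layout(layout, sub_questions, h):
    """Run one module per controller step; return (answer, attention stack)."""
    stack, answer = [], None
    for name, c in zip(layout, sub_questions):
        out = MODULES[name](c, h, stack)
        if out is not None:
            answer = out
    return answer, stack

# Hypothetical context embeddings and sub-question vectors.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
bridge_answer, bridge_stack = run_layout(
    ["find", "relocate"], [[5.0, 0.0], [0.0, 5.0]], h)
cmp_answer, cmp_stack = run_layout(
    ["find", "find", "compare"], [[5.0, 0.0], [5.0, 0.0], [0.0, 0.0]], h)
```

For a bridge-style layout the answer span would be read off the attention left on the stack; for a comparison layout the stack ends empty and Compare supplies the Yes/No answer.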
[Figure: At each step, the Controller (an RNN) reads the question and outputs a sub-question and module weights over all modules (find, rel, cmp, nop); the Modular Network runs the modules over a Stack of Attention (Push, Push, Pop). Example: Q: Were Scott Derrickson and Ed Wood of the same nationality? Context: Scott Derrickson is an American director. Edward Wood Jr. was an American filmmaker. Prediction: Yes]
                 Dev F1   Test F1
BiDAF Baseline   57.19    55.81
Original NMN     40.28    39.90
Our NMN          63.35    62.71
            Bridge F1   Comparison F1
Our NMN     64.49       57.20
            60.13       58.10
            64.46       56.00
*All models are evaluated on our dev set.
Train            Reg     Reg     Adv     Adv
Eval             Reg     Adv     Reg     Adv
BiDAF Baseline   43.12   34.00   45.12   44.65
Our NMN          50.13   44.70   49.33   49.25
Question: What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Context: Kiss and Tell is a 1945 American comedy film starring then 17-year-old Shirley Temple as Corliss Archer. ... Shirley Temple Black was an American actress, ..., and also served as Chief of Protocol of the United States.
[Figure: controller attention over the question and the context at Step 1 and Step 2.]
decomposition/attention.
[Figure: controller attention at Ctrl Step 1, Ctrl Step 2, and Ctrl Step 3.]
Comparison question: “Were Scott Derrickson and Ed Wood of the same nationality?”
Bridge question: “What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?”
Bridge-Type: Find -> Relocate: 99.9%
Comparison Yes/No: Find -> Find -> Compare: 4.8% / 96.9%; Find -> Relocate -> Compare: 63.8% / 0%
BERT models (& non-BERT NMNs), but while maintaining BERT-style numbers.
Still several challenges and a long way to go, e.g., more complex MultihopQA datasets with more hops, more types of reasoning behaviors, etc.!
See Yichen’s full talk on Nov 7 at 10:30am!
[Welbl et al. 2018]
[Jiang, Joshi, Chen, Bansal, ACL 2019a]
large pool of documents
reasoning chain to predict the answer
predict the answer
Document Explorer (DE): Iteratively selects relevant documents and represents multiple reasoning chains in a tree structure
Answer Proposer (AP): Proposes a candidate answer from every ancestor-aware root-to-leaf chain in the reasoning tree
Evidence Assembler (EA): Extracts key sentences from every reasoning chain and combines them to make a unified prediction
[Figure: EPAr architecture. The DE, a hierarchical key-value memory network (keys, values, softmax, sampling), starts from the query subject and builds a document-reasoning tree; the AP proposes a candidate from each chain; the EA collects a sentence containing each proposed candidate, together with a sentence containing the query subject, into a synthesized context, from which the final prediction is made.]
Figure 2: The full architecture of our 3-module system EPAr, with the Document Explorer (DE, left), Answer
WikiHop MedHop
chains is recovered in the top-k root-to-leaf paths in the reasoning tree
Still several challenges and a long way to go, e.g., more complex MultihopQA datasets with more hops, longer and more numerous reasoning chains, etc.!
[Jiang and Bansal, ACL 2019]
Question: “What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?”
Reasoning Chain: Kasper Schmeichel (Question Entity) -son of-> Peter Schmeichel (Bridge Entity) -voted as-> World’s Best Goalkeeper (Answer)
Reasoning Shortcut: [Placeholder] -voted as-> World’s Best Goalkeeper (Answer)
Question: “What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?”
Context:
Peter Bolesław Schmeichel is a Danish former professional footballer ..., and was voted the IFFHS World's Best Goalkeeper in 1992 and 1993.
Edson Arantes do Nascimento is a retired Brazilian professional footballer. In 1999, he was voted World Player
Kasper Hvidt is a Danish retired handball goalkeeper, ... also voted as Goalkeeper of the Year March 20, 2009, [Missing: 1992, IFFHS]
The answer can be directly inferred by word-matching the documents to maximum
Min et al., 2019; Chen & Durrett, 2019
Question: “What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?”
Context: Peter Bolesław Schmeichel is a Danish former professional footballer ..., and was voted the IFFHS World's Best Goalkeeper in 1992 and 1993.
Adversarial Document: … professional footballer ..., and was voted the IFFHS World's Best Defender in 1992 and 1993.
A model exploiting the reasoning shortcut will now find two plausible answers!
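The adversarial document in this example can be approximated with a simple substitution: copy the answer-bearing document and swap the true answer for a fake one. The sketch below is a guess at the mechanics, not the authors' generation procedure. The function name `make_adversarial_doc` is hypothetical, and the optional title swap (with the placeholder `"Some Other Person"`) is an assumption added so the copy does not look like a second document about the same entity.

```python
def make_adversarial_doc(doc, answer, fake_answer, title=None, fake_title=None):
    """Return a copy of doc in which the answer span is replaced by a fake answer."""
    if answer not in doc:
        raise ValueError("answer span not found in document")
    adv = doc.replace(answer, fake_answer)
    # Assumption: optionally rename the title entity as well.
    if title and fake_title:
        adv = adv.replace(title, fake_title)
    return adv

doc = ("Peter Bolesław Schmeichel is a Danish former professional footballer ..., "
       "and was voted the IFFHS World's Best Goalkeeper in 1992 and 1993.")
adv = make_adversarial_doc(
    doc,
    answer="World's Best Goalkeeper",
    fake_answer="World's Best Defender",
    title="Peter Bolesław Schmeichel",
    fake_title="Some Other Person",   # hypothetical placeholder title
)
```

A word-matching model sees the same question keywords (IFFHS, 1992, voted) in both documents, so only a model that follows the bridge entity can tell which answer is correct.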
Train \ Eval       Eval = Regular   Eval = Adv
Train = Regular    89.44            44.67
Train = Adv        89.03            80.14
evaluated on the adversarial data.
* Exact-Match scores between 2 golden documents and 2 retrieved documents
in adversarial evaluation.
not to take the reasoning shortcut, but there is still a remaining drop in performance.
Train \ Eval       Eval = Regular   Eval = Adv
Train = Regular    43.12            34.00
Train = Adv        45.12            44.65
evaluated on the adversarial data.
shortcut instead of performing multi-hop reasoning.
* Exact-Match scores
EM score in adversarial evaluation.
it not to take the reasoning shortcut, but there is still big room for improvement.
documents
[Cubuk et al., 2018] [Niu and Bansal, EMNLP 2019]
How do we automatically generate the best adversaries without manual design? Our AutoAugment model consists of a controller and a target model. The controller first samples a policy that transforms the original data to augmented data, on which the target model retrains. After training, the target model is evaluated to obtain the performance on the validation set. This performance is then fed back to the controller as the reward signal.
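The sample, retrain, evaluate, reward loop described above can be sketched with a tiny controller. This is a toy under stated assumptions: the controller is a softmax over a fixed list of candidate policies (the real controller is a sequence model), `reward_fn` stands in for "retrain the target model on augmented data and measure validation performance", and the update is plain REINFORCE without a baseline. All names here are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def train_controller(candidate_policies, reward_fn, steps=200, lr=0.5, seed=0):
    """Learn a distribution over candidate policies from scalar rewards."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidate_policies)
    indices = list(range(len(logits)))
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(indices, weights=probs)[0]   # controller samples a policy
        reward = reward_fn(candidate_policies[k])    # stand-in for retrain + validate
        # REINFORCE: d log p(k) / d logit_i = 1[i == k] - p_i
        for i in indices:
            grad = (1.0 if i == k else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)

# Toy reward (assumption): only one policy helps validation performance.
policies = ["stopword dropout", "paraphrase", "stammer"]
probs = train_controller(policies, lambda p: 1.0 if p == "paraphrase" else 0.0)
```

In practice each reward requires retraining the target model, which is why RL speed and reward sparsity come up later in the talk as open challenges.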
Figure 1: The controller samples a policy to perturb the training data. After training on the augmented inputs, the model feeds the performance back as reward.
Figure 3: AutoAugment controller. An input-agnostic controller corresponds to the lower part of the figure. It samples a list of … an encoder (upper part) that takes in the source inputs of the data.
Ribeiro et al., 2018; Zhao et al., 2018
Policy Hierarchy and Search Space:
Number of Changes (the maximum # of times allowed to perform the operation), and Probability of applying that operation.
Dropout, Paraphrase, Grammar Errors, and Stammer.
Subdivision of Operations:
nuanced combinations of operations, divide Stopword Dropout into 7 categories: Noun, Adposition, Pronoun, Adverb, Verb, Determiner, and Other.
(verb inflected/base form confusion).
Figure 2: Example of a sub-policy applied to a source … paraphrases the input twice with probability 0.7.
Sub-policy: Op1: (P, 2, 0.7), Op2: (G, 1, 0.4)
Source: I have three beautiful kids.
Op1 applied (0.7): I have three lovely children. Op1 not applied (0.3): I have three beautiful kids.
Op2 applied (0.4): I have three lovely child. / I have three beautiful kid. Op2 not applied (0.6): I have three lovely children. / I have three beautiful kids.
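A sub-policy such as Op1: (P, 2, 0.7), Op2: (G, 1, 0.4) can be applied as a sequence of (operation, number of changes, probability) triples. The sketch below reproduces the Figure 2 example under simplifying assumptions: the paraphrase and grammar-error operations are toy lookup tables invented for this example, and `apply_sub_policy` is a hypothetical helper, not the authors' code.

```python
import random

# Toy lookup tables (assumptions) standing in for real perturbation operations.
PARAPHRASES = {"beautiful": "lovely", "kids": "children"}
GRAMMAR_ERRORS = {"children": "child", "kids": "kid"}
OPS = {"P": PARAPHRASES, "G": GRAMMAR_ERRORS}

def apply_lookup(tokens, table, max_changes):
    """Replace at most max_changes tokens that appear in the lookup table."""
    out, changes = [], 0
    for tok in tokens:
        if changes < max_changes and tok in table:
            out.append(table[tok])
            changes += 1
        else:
            out.append(tok)
    return out

def apply_sub_policy(sentence, sub_policy, rng):
    """Apply each (operation, num_changes, probability) triple in order."""
    tokens = sentence.rstrip(".").split()
    for op, num_changes, prob in sub_policy:
        if rng.random() < prob:          # apply the operation with its probability
            tokens = apply_lookup(tokens, OPS[op], num_changes)
    return " ".join(tokens) + "."

rng = random.Random(0)
source = "I have three beautiful kids."
both = apply_sub_policy(source, [("P", 2, 1.0), ("G", 1, 1.0)], rng)
neither = apply_sub_policy(source, [("P", 2, 0.0), ("G", 1, 0.0)], rng)
grammar_only = apply_sub_policy(source, [("P", 2, 0.0), ("G", 1, 1.0)], rng)
sampled = apply_sub_policy(source, [("P", 2, 0.7), ("G", 1, 0.4)], rng)
```

The probabilities are forced to 1.0 or 0.0 in the first three calls so each branch of the figure can be checked; the last call samples a branch with the figure's probabilities.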
task (Lowe et al., 2015); REINFORCE (Williams, 1992; Sutton et al., 2000) to train the controller.
We also conducted human studies on MTurk, comparing each of the input-agnostic/aware models with the VHRED baseline and All-operations from Niu and Bansal (2018).
Table 1: Activity, Entity F1 results reported by previous work, the All-operations and AutoAugment models. Table 2: Human evaluation results on comparisons among the baseline, All-
Table 4: Top 3 policies on the validation set and their test performances. Operations: R=Random Swap, D=Stopword Dropout, P=Paraphrase, G=Grammar Errors, S=Stammer. Universal tags: n=noun, v=verb, p=pronoun, adv=adverb, adp=adposition.
Still several challenges: better AutoAugment algorithms for RL speed, reward sparsity,
Nov 5, 3:30pm for more details!
it has not seen before (via automatic question generation)!
semantics-enhanced rewards (QPP = Question Paraphrasing Probability & QAP = Question Answering Probability) and introduce a QA-based QG evaluation method.
[Zhang and Bansal, EMNLP 2019]
[Figure: The QG agent samples a question; the environment (QPC and QA models) returns the reward (QPP & QAP).]
QPC example (0.46):
Context: ...the university first offered graduate degrees , in the form of a master of arts ( ma ) , in the 1854 – 1855 academic year ...
Groundtruth (gt): in what year was a master of arts course first
Generated (gen): when did the university begin offering a master
QA example (0.94, 1987):
Context: ...in 1987 , when some students believed that the observer began to show a conservative bias , a liberal newspaper , common sense was published...
Generated (gen): in what year did common sense begin publication ?
QPP reward: $p_{qpc}(\text{is\_para} = \text{true} \mid q^{gt}, q^{gen})$
QAP reward: $p_{qa}(a \mid q^{gen}, \text{context})$, where $q^{gen} \sim p_{qg}(q \mid a, \text{context})$
[Figure: The QG model produces model-generated questions from new or existing paragraphs; a data filter based on question answering probability selects among them; the QA model is trained on these together with human-labeled questions on existing paragraphs.]
Example paragraph: .. in 1987, when some students believed that the observer began to show a conservative bias, a liberal newspaper, common sense was published …
Example questions: when did the observer begin to show a conservative bias? / in what year did the student paper common sense begin publication?
Augment QA dataset with QG-generated examples (Generate from Existing Articles, and Generate from New Articles)
(1) QAP filter: To filter out poorly-generated examples; filter synthetic examples with QAP < 𝜁.
(2) Mixing mini-batch training: To make sure that the gradients from ground-truth data are not … data with half mini-batch synthetic data.
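The two augmentation steps can be sketched as a filter followed by a batch mixer. This is a minimal sketch, assuming each example is a dict with a precomputed `qap` score and that every mini-batch is half ground-truth and half synthetic; the function names, the dict fields, and all scores except the 0.94 taken from the earlier QA example are hypothetical.

```python
import random

def qap_filter(synthetic, threshold):
    """Drop synthetic examples whose QAP is below the threshold (zeta)."""
    return [ex for ex in synthetic if ex["qap"] >= threshold]

def mixed_batches(ground_truth, synthetic, batch_size, rng):
    """Yield mini-batches made of half ground-truth and half synthetic examples."""
    half = batch_size // 2
    gt = ground_truth[:]
    rng.shuffle(gt)
    for start in range(0, len(gt) - half + 1, half):
        gt_half = gt[start:start + half]
        syn_half = rng.sample(synthetic, half)   # needs len(synthetic) >= half
        yield gt_half + syn_half

synthetic = [
    {"question": "s1", "qap": 0.94, "source": "syn"},
    {"question": "s2", "qap": 0.20, "source": "syn"},
    {"question": "s3", "qap": 0.80, "source": "syn"},
    {"question": "s4", "qap": 0.10, "source": "syn"},
]
ground_truth = [{"question": "g%d" % i, "source": "gt"} for i in range(4)]

kept = qap_filter(synthetic, threshold=0.5)
batches = list(mixed_batches(ground_truth, kept, batch_size=4, rng=random.Random(0)))
```

Fixing the ground-truth share of every batch keeps a large synthetic pool from dominating the gradient, which is the concern the mixing step addresses.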
Still several challenges: need higher diversity in generated questions, better/automatic filters for semi-supervised QA,
[Bauer, Wang, and Bansal, EMNLP 2018]
Question: "What is the connection between Esther and Lady Dedlock?"
Answers: "Mother and daughter." / "Mother and illegitimate child."
Context:
"Sir Leicester Dedlock and his wife Lady Honoria live on his estate at Chesney Wold.."
"..Unknown to Sir Leicester, Lady Dedlock had a lover .. before she married and had a daughter with him.."
"..Lady Dedlock believes her daughter is dead. The daughter, Esther, is in fact alive.."
"..Esther sees Lady Dedlock at church and talks with her later at Chesney Wold though neither woman recognizes their connection.."
[Figure: ConceptNet paths connecting concepts from the question and context, including lady, wife, marry, mother, daughter, child, church, house, person, and lover.]
8).
[Figure: Model architecture. Context and Query are encoded with Bi-LSTMs and combined through BiDAF Attention; each Reasoning Layer contains a NOIC Reasoning Cell that takes the Commonsense Relations ($w_1^{CS}, w_2^{CS}, \ldots, w_l^{CS}$) and has Commonsense and Bypass paths.]
commonsense, and essentially learn when to fill ‘gaps’ of reasoning and with what information
(BERT).
sparsity, etc.
Webpage: http://www.cs.unc.edu/~mbansal/
Email: mbansal@cs.unc.edu
UNC-NLP Lab: http://nlp.cs.unc.edu/
Postdoc Openings!!: ~mbansal/postdoc-advt-unc-nlp.pdf