Attending knowledge facts with BERT-like models in Question-Answering: disappointing results and some explanations
Guillaume Le Berre & Philippe Langlais
33rd Canadian Conference on Artificial Intelligence, May 13 to May 15

Question Answering: a challenging task

Why question answering?
- A core task of Natural Language Processing
- One of the most challenging tasks for deep learning models
- Necessary to achieve true AI interactions with humans
- Many benchmarks are available

Extractive question answering example: SQuAD

- Each question is paired with a reference text
- Models are required to select a span of text containing the answer
- Modern deep learning models have reached human performance

Model                                  EM     F1
Human Performance                      86.8   89.4
SA-Net on Albert (ensemble)            89.7   93.0
Retro-Reader (ensemble)                90.6   93.0
ALBERT + DAAF + Verifier (ensemble)    90.4   92.8

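To make the extractive setting concrete, here is a minimal sketch using the Hugging Face transformers question-answering pipeline. This is an off-the-shelf model, not one of the leaderboard systems above, and the question/context pair is purely illustrative.

```python
from transformers import pipeline

# Off-the-shelf extractive QA: the model selects a span from the reference text.
qa = pipeline("question-answering")

result = qa(
    question="What impacts an object's ability to reflect light?",
    context="The color of an object's surface impacts its ability to reflect light.",
)
print(result["answer"], result["score"])  # predicted span and its confidence
```
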
OpenBookQA

- Multiple-choice questions (4 answer choices) with no reference text
- Models need to rely on general world knowledge
- Human performance not yet reached despite recent improvements
- The dataset comes with science and common-knowledge facts

Model               Accuracy
Human Performance   0.92
UnifiedQA           0.87
TTTTT               0.83
KF+SIR              0.80

AI2 Reasoning Challenge (ARC)

- Similar to OpenBookQA
- Multiple-choice questions (4 answer choices) with no reference text
- Models need to rely on general world knowledge
- Divided into two parts: "Easy" and "Challenge"

Model                Accuracy
UnifiedQA            0.79
FreeLB-RoBERTa       0.68
arcRoberta, erenup   0.67

General knowledge: learned...

- Most current state-of-the-art models do not use the common-knowledge facts provided
- Major drawbacks:
  - Low generalization capacity
  - Requires a lot of data

...vs. extracted from a database

Teaching a model how to search for information in a database:
- possibly allows easier generalization, by adding domain-specific information to the database
- requires less annotated data

Pretrained models: BERT

- Transformer model
- Pretrained on BookCorpus and Wikipedia
- Provides a contextual embedding of the words in a sentence

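As an illustration (not from the slides), a minimal sketch of obtaining BERT's contextual word embeddings with the Hugging Face transformers library; the checkpoint name and example sentence are arbitrary choices.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "Deep sea animals live in the deep ocean."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per (sub)word token in the sentence.
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```
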
Sentence-BERT (SBERT)

- Additional pretraining on the SNLI dataset
- Given sentences A and B, learn to predict entailment, neutral, or contradiction
- SBERT is intended to capture the semantics of sentences

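For comparison, a minimal sketch of computing sentence embeddings with the sentence-transformers package; the NLI-trained checkpoint named below is an assumption, not necessarily the one used by the authors.

```python
from sentence_transformers import SentenceTransformer

# Assumption: an SBERT checkpoint fine-tuned on NLI data; the paper's exact model may differ.
sbert = SentenceTransformer("bert-base-nli-mean-tokens")

facts = [
    "The color of an object affects how it reflects light.",
    "Deep sea animals live deep in the ocean.",
]
embeddings = sbert.encode(facts)  # one fixed-size vector per sentence
print(embeddings.shape)           # (2, 768)
```
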
Model CAT: vanilla BERT

- The question (plus any additional knowledge facts) is concatenated with each answer choice
- Each question/answer sequence is embedded using BERT Base
- The embeddings are sent through a few linear layers to get a scalar score for each answer choice (see the sketch below)
- The model is trained using a cross-entropy loss
- With this setup on OpenBookQA, BERT Base and Large are expected to reach around 55% and 60% accuracy respectively

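A minimal sketch of this CAT setup, assuming a simple two-layer scoring head on top of the [CLS] embedding; the authors' exact head architecture and hyperparameters are not given on the slide, and the example question is illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class CatScorer(nn.Module):
    """Embed each concatenated question/answer sequence with BERT and score it."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Assumption: a small two-layer head; the authors' exact head is not described.
        self.head = nn.Sequential(nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] vector of each sequence as its embedding.
        cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls).squeeze(-1)  # one scalar score per answer choice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = CatScorer()

question = "What impacts an object's ability to reflect light?"
choices = ["color pallete", "weights", "height", "smell"]
enc = tokenizer([question] * len(choices), choices, padding=True, return_tensors="pt")

scores = model(enc["input_ids"], enc["attention_mask"])  # shape: (4,)
loss = nn.functional.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))  # gold answer = A
```
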
Experiment 1: Biases in OpenBookQA

To understand what part of the question BERT uses when answering, we removed parts of the questions (no knowledge facts given):

                           Accuracy
Full question (baseline)   55.8%
Last 4 tokens only         52.0%
Without the question       51.2%

- The model is thus able to differentiate between right and wrong answers using information inherent to the answers themselves
- Similar biases exist in ARC: accuracy of around 36% on both "Easy" and "Challenge" without the question

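A sketch of the "last 4 tokens only" ablation; the slide does not say whether words or WordPiece tokens are counted, so the WordPiece version below is an assumption.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def last_k_tokens(question, k=4):
    """Keep only the last k tokens of the question (Experiment 1 ablation)."""
    ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    return tokenizer.decode(ids[-k:])

print(last_k_tokens("What impacts an object's ability to reflect light?"))
```
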
Potential biases

We have identified 2 biases in the dataset:
- Right answers are on average longer
- Right answers generally contain less frequent words

Dummy models that select the longest answer or the one with the least frequent word obtain 33% and 37% accuracy respectively (see the sketch below).

Question: What impacts an object's ability to reflect light?
Answer choices: A: color palette  B: weights  C: height  D: smell
Last 4 tokens: ...ability to reflect light?

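A sketch of the two dummy baselines, assuming word frequencies are taken from some reference corpus (the slide does not say which one is used).

```python
def longest_answer(choices):
    """Dummy baseline 1: pick the longest answer choice (~33% accuracy on OpenBookQA)."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))

def rarest_word_answer(choices, word_freq):
    """Dummy baseline 2: pick the answer containing the least frequent word (~37% accuracy).

    word_freq: dict mapping word -> frequency in some reference corpus (assumption).
    """
    def min_freq(answer):
        return min(word_freq.get(w.lower(), 0) for w in answer.split())
    return min(range(len(choices)), key=lambda i: min_freq(choices[i]))

choices = ["color palette", "weights", "height", "smell"]
print(longest_answer(choices))  # index of the longest answer choice
```
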
Model ATT: attention over facts

- Concatenating the knowledge facts to the question often results in long sequences and thus high memory usage
- Using an attention mechanism over the BERT embeddings of the knowledge facts makes it possible to use more complex architectures and to pre-compute the embeddings in advance (see the sketch below)

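A minimal sketch of the ATT idea, using simple dot-product attention between a question embedding and pre-computed fact embeddings; the authors' exact attention architecture is not detailed on the slide, so this is only illustrative.

```python
import torch
import torch.nn.functional as F

def attend_over_facts(question_emb, fact_embs):
    """question_emb: (d,) sentence embedding of the question/answer sequence.
    fact_embs: (n_facts, d) embeddings of the knowledge facts, possibly pre-computed offline.
    Returns a (d,) weighted summary of the facts."""
    scores = fact_embs @ question_emb      # dot-product attention scores, shape (n_facts,)
    weights = F.softmax(scores, dim=0)     # attention distribution over the facts
    return weights @ fact_embs             # weighted combination of fact embeddings

# Toy usage with random vectors standing in for (S)BERT embeddings.
q = torch.randn(768)
facts = torch.randn(10, 768)
print(attend_over_facts(q, facts).shape)   # torch.Size([768])
```
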
Experiment 2: Semantic significance

In this second experiment, we compare the representations provided by BERT and SBERT when applying an attention mechanism.

For each question we have:
- Gold fact: a particular science fact that is relevant
- Other facts: a list of 9 facts automatically selected by word overlap (see the sketch below)

We compare 2 setups:
- CAT: vanilla setup in which the knowledge facts are concatenated to the questions
- ATT: attention setup in which the facts are embedded with BERT first and then used in an attention mechanism

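A sketch of how the "other facts" could be retrieved by word overlap; the exact scoring used by the authors (e.g. stop-word handling, tie-breaking) is not specified, so this naive version is an assumption.

```python
def word_overlap(question, fact):
    """Number of distinct words shared by the question and a fact (no stop-word removal)."""
    return len(set(question.lower().split()) & set(fact.lower().split()))

def top_k_facts(question, facts, k=9):
    """Return the k facts with the highest word overlap with the question."""
    return sorted(facts, key=lambda f: word_overlap(question, f), reverse=True)[:k]
```
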
Results

With model CAT, the results of BERT and SBERT are similar:
- accuracy of 55.8% and 53.2% respectively with no additional knowledge facts
- increasing to 64.5% and 63.6% respectively when adding the gold fact

With model ATT:
- with BERT, the model is unable to use the additional knowledge, even when the gold fact is provided (alone or among the other facts)
- with SBERT, we observe some improvement compared to an SBERT model with no knowledge facts (accuracy of 55% with only the gold fact and 54.8% with the gold fact among other facts)

For SBERT, keeping only the end of the question (last 4 tokens) improves the results when using additional knowledge facts:
- model ATT with SBERT then reaches more than 61% accuracy with only the gold fact and nearly 57% with the gold fact given among other facts

Conclusion

- The results of machine learning models on OpenBookQA must be put in perspective
- It becomes increasingly important to understand how deep learning models make their decisions
- This offers an opportunity to work on bias reduction for question answering

The End