
Simple and Effective Multi-Paragraph Reading Comprehension - PowerPoint PPT Presentation


  1. Simple and Effective Multi-Paragraph Reading Comprehension Christopher Clark and Matt Gardner

  2. Neural Question Answering Question: “What color is the sky?” Passage: “Air is made mainly from molecules of nitrogen and oxygen. These molecules scatter the blue colors of sunlight more effectively than the green and red colors. Therefore, a clean sky appears blue.”

  3. Fast Progress on Paragraph Datasets [Chart: accuracy on SQuAD 1.1 (y-axis 40 to 90), plotted monthly from Jun-16 to Jun-18]

  4. What Next?

  5. Open Question Answering Question: “What color is the sky?” → Document Retrieval → Relevant Text → Model → Answer Span: “Blue”

  6. Challenge: Scaling Models to Documents § Modern reading comprehension models have many layers and parameters § The trend is continuing in this direction, for example with the use of large language models § Efficiency drops as paragraph length increases due to long RNN chains or transformer/self-attention modules § This limits the model to processing short paragraphs

  7. Two Possible Approaches • Pipelined Systems • Select a single paragraph from the input, and run the model on that paragraph • Confidence Systems • Run the model on many paragraphs from the input, and have it assign a confidence score to its result on each paragraph (e.g., 0.83, 0.68, 0.29)

  8. This Work Improved Pipeline Method • Improve several of the key design decisions that arise when training on document-level data Improved Confidence Method • Study ways to train models to produce correct confidence scores

  9. Pipeline Method: Paragraph Selection § Train a shallow linear model to select the best paragraphs § Features include TF-IDF, word occurrences, and the paragraph’s position within the document § If there is just one document, TF-IDF alone is effective § Improves the chance of selecting an answer-containing paragraph from 83.0% to 85.1% on TriviaQA Web
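The TF-IDF feature can be illustrated with a short sketch (this approximates only that one feature, not the authors' learned linear selector; the helper name is my own):

```python
# Minimal sketch: rank paragraphs by TF-IDF cosine similarity to the question.
# The paper's selector is a learned linear model that also uses word
# occurrences and paragraph position; this shows only the TF-IDF signal.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_paragraphs(question, paragraphs):
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the paragraphs, then project the question into the same space.
    para_vecs = vectorizer.fit_transform(paragraphs)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, para_vecs).ravel()
    # Highest-scoring paragraphs first.
    return sorted(zip(scores, paragraphs), key=lambda x: -x[0])

paragraphs = [
    "Air is made mainly from molecules of nitrogen and oxygen.",
    "These molecules scatter the blue colors of sunlight more effectively "
    "than the green and red colors. Therefore, a clean sky appears blue.",
]
for score, p in rank_paragraphs("What color is the sky?", paragraphs):
    print(f"{score:.3f}  {p[:60]}")
```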

  10. Pipeline Method: Noisy Supervision Document-level data can be expected to be distantly supervised: Question: “Which British general was killed at Khartoum in 1885?” Passage: “In February 1884 Gordon returned to the Sudan to evacuate Egyptian forces. Rebels broke into the city, killing Gordon and the other defenders. The British public reacted to his death by acclaiming ‘Gordon of Khartoum’, a saint. However, historians have since suggested that Gordon defied orders and…”

  11. Pipeline Method: Noisy Supervision § Need a training objective that can handle multiple (noisy) answer spans § Use the summed objective from Kadlec et al. (2016), which optimizes the log of the summed probability of all answer spans § Remains agnostic to how probability mass is distributed among the answer spans
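A minimal sketch of the summed objective, assuming the model produces one logit per candidate span and a mask marks every span whose text matches the (distantly supervised) answer:

```python
import numpy as np

def summed_objective_loss(span_logits, is_answer_span):
    """Negative log of the summed probability over all (noisy) answer spans.

    span_logits:    1-D array of model scores, one per candidate span.
    is_answer_span: boolean array marking spans whose text matches the answer.
    """
    # Softmax over all candidate spans.
    logits = span_logits - span_logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # Sum probability mass over every matching span; the model is free to
    # distribute that mass among them however it likes.
    answer_mass = probs[is_answer_span].sum()
    return -np.log(answer_mass)

# Toy example: spans 1 and 3 both match the distantly supervised answer text.
logits = np.array([0.2, 2.0, -1.0, 1.5])
mask = np.array([False, True, False, True])
print(summed_objective_loss(logits, mask))
```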

  12. Pipeline Method: Model § Construct a fast, competitive model § Use some key ideas from prior work: bidirectional attention, self-attention, character embeddings, variational dropout § Also add learned tokens marking document and paragraph starts § < 5 hours to train for 26 epochs on SQuAD
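One of those details, the learned start markers, can be sketched as follows (dimensions and names are illustrative only, not the authors' implementation):

```python
import numpy as np

# Sketch: prepend a learned "document start" or "paragraph start" embedding to
# each paragraph's word embeddings, so the model can tell where paragraphs
# begin even when they are processed out of context. Dimensions are toy-sized.
dim = 4
doc_start = np.random.randn(dim)    # a learned parameter in the real model
para_start = np.random.randn(dim)   # a learned parameter in the real model

def embed_paragraph(word_embeddings, is_first_paragraph):
    marker = doc_start if is_first_paragraph else para_start
    return np.vstack([marker, word_embeddings])

para = np.random.randn(7, dim)      # 7 word embeddings
print(embed_paragraph(para, is_first_paragraph=False).shape)  # (8, 4)
```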

  13. Confidence Methods § We can derive confidence scores from the logit scores the model gives to each span, i.e., the scores before the softmax operator is applied § Without re-training, this can work poorly
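A rough sketch of this baseline (my illustration; `span_scorer` is a hypothetical stand-in for the trained model): run the model on each paragraph independently and keep the answer with the highest raw span logit, since per-paragraph softmax probabilities are not comparable across paragraphs.

```python
# Un-retrained confidence baseline: compare raw (pre-softmax) span logits
# across paragraphs and return the globally highest-scoring span.
def best_answer_by_logit(question, paragraphs, span_scorer):
    best_span, best_logit = None, float("-inf")
    for para in paragraphs:
        for span, logit in span_scorer(question, para):   # [(span_text, logit), ...]
            if logit > best_logit:
                best_span, best_logit = span, logit
    return best_span, best_logit

# Toy stand-in scorer: every word is a candidate span with a made-up logit.
def toy_scorer(question, para):
    return [(w, float(len(w))) for w in para.split()]

print(best_answer_by_logit("What color is the sky?",
                           ["the sky appears blue", "nitrogen and oxygen"],
                           toy_scorer))
```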

  14. Example from SQuAD Question: “When is the Members Debate held?” Model Extraction: “…majority of the Scottish electorate voted for it in a referendum to be held on 1 March 1979 that represented at least…” Correct Answer: “Immediately after Decision Time a ‘Members Debate’ is held, which lasts for 45 minutes…”

  15. Learning Well-Calibrated Confidence Scores § Train the model on both answer-containing and non-answer-containing paragraphs and use a modified objective function § Merge: Concatenate sampled paragraphs together § No-Answer: Process paragraphs independently, and allow the model to place probability mass on a “no-answer” output § Sigmoid: Assign an independent probability to each span using the sigmoid operator § Shared-Norm: Process paragraphs independently, but compute the span probabilities with a softmax normalized over the spans in all paragraphs
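A minimal sketch of the Shared-Norm objective's key step (my illustration): each paragraph is scored independently, but the softmax denominator is shared across the candidate spans of all paragraphs, so the resulting probabilities are directly comparable between paragraphs.

```python
import numpy as np

def shared_norm_probs(per_paragraph_logits):
    """Softmax over the union of candidate spans from every paragraph.

    per_paragraph_logits: list of 1-D arrays, one array of span logits per
    paragraph (each produced by running the model on that paragraph alone).
    """
    all_logits = np.concatenate(per_paragraph_logits)
    all_logits -= all_logits.max()                    # numerical stability
    probs = np.exp(all_logits) / np.exp(all_logits).sum()
    # Split back so each paragraph keeps its (now comparable) span probabilities.
    sizes = [len(l) for l in per_paragraph_logits]
    return np.split(probs, np.cumsum(sizes)[:-1])

# Two paragraphs: scores now share one scale, unlike per-paragraph softmax.
p1, p2 = shared_norm_probs([np.array([2.0, 0.5]), np.array([1.0, -0.5, 3.0])])
print(p1, p2)
```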

  16. Results

  17. Datasets • TriviaQA: Dataset of trivia questions and related documents found by web search • Includes three settings: Web (a single document for each question), Wiki (multiple Wikipedia documents for each question), and Unfiltered (multiple documents for each question) • SQuAD: Turker-generated questions about Wikipedia articles • We use the questions paired with the entire article • Manual annotation shows most (90%) of the questions remain answerable given the full document they were generated from

  18. Pipeline Method: Results on TriviaQA Web (EM) • TriviaQA Baseline: 41.08 • Our Baseline: 50.21 • +TF-IDF: 53.41 • +Sum: 56.22 • +TF-IDF +Sum: 57.2 • +Model +TF-IDF +Sum: 61.1 • Baseline implementation: uses BiDAF as the model, selects paragraphs by truncating documents, and selects answer spans randomly • 72.14 EM / 81.05 F1 on SQuAD; 78.58 EM / 85.83 F1 with contextualized word embeddings (Peters et al., 2017)

  19. TriviaQA Leaderboard (Exact Match Scores)
  Model | Web-All | Web-Verified | Wiki-All | Wiki-Verified
  Best leaderboard entry (“mingyan”) | 68.65 | 82.44 | 66.56 | 74.83
  Leaderboard entry (“dirkweissen”) | 64.60 | 67.46 | 77.63 | 72.77
  Shared-Norm (Ours) | 66.37 | 79.97 | 63.99 | 67.98
  Dynamic Integration of Background Knowledge (Weissenborn et al., 2017a) | 50.56 | 63.20 | 48.64 | 53.42
  Neural Cascades (Swayamdipta et al., 2017) | 53.75 | 63.20 | 51.59 | 58.90
  MnemonicReader (Hu et al., 2017) | 46.65 | 56.96 | 46.94 | 54.45
  SMARNET (Chen et al., 2017) | 51.11 | 40.87 | 42.41 | 50.51

  20. Error Analysis • Manually annotated 200 errors made by the TriviaQA Web model • 40.5% are due to noise or lack of context in the relevant documents • The remaining errors break down as shown on the next slide

  21. [Pie chart: breakdown of remaining errors] Sentence Reading 35% • Answer indirectly stated 20% • Paragraph Reading 18% • Document Coreference 14% • Part of answer extracted 7% • Missing background knowledge 6%

  22. Building an Open Question Answering System • Use Bing web search and a Wikipedia entity linker to locate relevant documents • Extract the top 12 paragraphs, as ranked by the linear paragraph ranker • Use the model trained on TriviaQA Unfiltered to find the final answer
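A high-level sketch of how these pieces could fit together (the component functions are stubs I made up to show the data flow, not the authors' API):

```python
# Hypothetical glue code for the open QA pipeline described above. In the real
# system, retrieval is Bing web search plus a Wikipedia entity linker, ranking
# is the linear paragraph ranker, and the reader is the shared-norm model
# trained on TriviaQA Unfiltered.
def retrieve_documents(question):
    """Stub: return raw document texts for the question."""
    return ["Air is made mainly from molecules of nitrogen and oxygen. "
            "A clean sky appears blue."]

def rank_paragraphs(question, paragraphs):
    """Stub: the linear ranker; here, keep the original order."""
    return paragraphs

def qa_model(question, paragraphs):
    """Stub: the shared-norm reading comprehension model."""
    return "blue"

def answer_question(question, top_k=12):
    docs = retrieve_documents(question)
    paragraphs = [p for doc in docs for p in doc.split("\n\n")]
    top = rank_paragraphs(question, paragraphs)[:top_k]   # top 12 paragraphs
    return qa_model(question, top)

print(answer_question("What color is the sky?"))
```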

  23. Demo

  24. Curated TREC Results (Accuracy) • YodaQA with Bing (Baudis, 2015): 37.18 • YodaQA (Baudis, 2015): 34.26 • DrQA + DS (Chen et al., 2017a): 25.7 • S-Norm (ours): 53.31

  25. Thank You Demo: https://documentqa.allenai.org/ GitHub: https://github.com/allenai/document-qa
