  1. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. Published in EMNLP 2016. Presented by Jiaming Shen, April 17, 2018.

  2. SQuAD = Stanford Question Answering Dataset. Online challenge: https://rajpurkar.github.io/SQuAD-explorer/

  3. Overall contribution • A benchmark dataset with: • Appropriate difficulty • A principled curation process • Detailed data analysis

  4. Outline • What are the QA datasets prior to SQuAD? • What does SQuAD look like? • How is SQuAD created? • What are the properties of SQuAD? • How well can we do on SQuAD?

  5. What are the QA datasets prior to SQuAD?

  6. Related Datasets • Type I: Complex reading comprehension datasets • Type II: Open-domain QA datasets • Type III: Cloze datasets

  7. Type I: Complex Reading Comprehension Datasets • Require commonsense knowledge -> very challenging • Dataset sizes are too small

  8. Type II: Open-domain QA Datasets • Open-domain QA: answer a question from a large collection of documents. • WikiQA: only sentence selection • TREC-QA: free-form answers -> hard to evaluate

  9. Type III: Cloze Datasets • Automatically generated -> large scale • Limitations are discussed in the ACL 2016 best paper.

  10. What does SQuAD look like?

  11. SQuAD Dataset Format • An example passage with one QA pair (figure)

  12. SQuAD Dataset Format • One passage can have multiple question-answer pairs. • In total, 100,000+ QA pairs from 23,215 passages.
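
For concreteness, a minimal sketch of reading the released SQuAD v1.1 JSON (the nesting is data -> paragraphs -> qas -> answers; the file name below is the standard release name):

```python
import json

# Minimal sketch of reading the SQuAD v1.1 JSON release.
with open("train-v1.1.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]      # the passage text
        for qa in paragraph["qas"]:         # multiple QA pairs per passage
            question = qa["question"]
            for ans in qa["answers"]:
                # Each answer is a span: its text plus a character
                # offset into the context.
                assert context[ans["answer_start"]:].startswith(ans["text"])
```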

  13. How is SQuAD created?

  14. SQuAD Dataset Collection • The collection consists of three steps: • Step 1: Passage curation • Step 2: Question-answer collection • Step 3: Additional answer collection

  15. Step 1: Passage Curation • Select the top 10,000 articles of English Wikipedia based on Wikipedia’s internal PageRank scores. • Randomly sample 536 articles from these 10,000. • Extract paragraphs longer than 500 characters from the 536 articles -> 23,215 paragraphs. • The train/dev/test sets are split at the article level. • The train/dev sets are released; the test set is held out.
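
A minimal sketch of this curation step, under stated assumptions: `articles` is a hypothetical list of (title, paragraphs) pairs drawn from the top-10,000 set, and the 80/10/10 article-level split follows the paper; this is illustrative, not the authors' code:

```python
import random

MIN_CHARS = 500  # paragraphs shorter than this are discarded

def curate(articles, seed=0):
    # `articles`: hypothetical list of (title, [paragraph, ...]) pairs.
    rng = random.Random(seed)
    sampled = rng.sample(articles, 536)
    # Split at the article level (80/10/10 per the paper), so no article
    # contributes paragraphs to more than one split.
    rng.shuffle(sampled)
    n = len(sampled)
    splits = {
        "train": sampled[: int(0.8 * n)],
        "dev": sampled[int(0.8 * n) : int(0.9 * n)],
        "test": sampled[int(0.9 * n) :],
    }
    # Keep only sufficiently long paragraphs within each split.
    return {
        name: [(t, p) for t, paras in arts for p in paras if len(p) >= MIN_CHARS]
        for name, arts in splits.items()
    }
```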

  16. Step 2: Question-Answer Collection • Collected via crowdsourcing. • Crowdworkers must have a 97% HIT acceptance rate, more than 1,000 HITs completed, and be located in the US or Canada. • Workers spend 4 minutes on each paragraph, asking up to 5 questions and highlighting each answer in the text.

  17. Step 2: Question-Answer Collection

  18. Step 3: Additional Answer Collection • For each question in the dev/test sets, collect at least two additional answers. • Why do this? • It makes evaluation more robust. • It allows assessing human performance.

  19. What are the properties of SQuAD?

  20. Data Analysis • Diversity in answers • Reasoning required to answer questions • Syntactic divergence

  21. Diversity in Answers • 67.4% of answers are not named entities, and many are not even noun phrases -> can be challenging.

  22. Reasoning required to answer questions

  23. Syntactic divergence • For each anchor (a word-lemma pair shared by the question and the answer sentence), compute the edit distance between the unlexicalized dependency paths from the anchor in the question and in the sentence; the syntactic divergence is the minimum edit distance over all possible anchors.
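
A minimal sketch of the edit-distance core of this measure; dependency paths are taken as already-extracted label sequences (producing them with a parser, and enumerating anchors, is assumed):

```python
def edit_distance(a, b):
    # Levenshtein distance between two unlexicalized dependency paths,
    # each a sequence of (relation, direction) labels.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def syntactic_divergence(anchor_paths):
    # `anchor_paths`: one (question_path, sentence_path) pair per anchor;
    # the divergence is the minimum edit distance over all anchors.
    return min(edit_distance(q, s) for q, s in anchor_paths)
```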

  24. Syntactic divergence • Histogram of syntactic divergence (figure)

  25. How well can we do on SQuAD?

  26. “Baseline” method • Candidate answer generation: take constituents from a constituency parse as candidate spans. • Feature extraction • Train a logistic regression model to score the candidates.

  27. Features help to pick the correct sentence, resolve lexical variations, and resolve syntactic variations.
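
A minimal sketch of this candidate-scoring pipeline; the features below are toy stand-ins for the paper's actual feature groups (lexicalized and dependency-tree-path features, among others), and scikit-learn is an assumed dependency:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(question, span):
    # Toy stand-ins for the paper's feature groups.
    q_words = set(question.lower().split())
    s_words = set(span.lower().split())
    return {
        "overlap": len(q_words & s_words),  # lexical match with the question
        "span_len": len(span.split()),
        "starts_cap": span[:1].isupper(),
    }

def train(candidates):
    # `candidates`: list of (question, span, is_correct) triples, where the
    # spans are constituents proposed by a constituency parser (assumed).
    vec = DictVectorizer()
    X = vec.fit_transform(features(q, s) for q, s, _ in candidates)
    y = [label for _, _, label in candidates]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return vec, clf
```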

  28. Evaluation • After normalizing answers (ignoring punctuation and articles), two metrics are used: • Exact Match (EM) • Macro-averaged F1: for each question, take the maximum F1 over all ground-truth answers, then average over questions.
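
A minimal sketch of these two metrics in the style of the official evaluation script; the normalization (lowercasing, stripping punctuation and the articles a/an/the) follows the v1.1 evaluate script:

```python
import re
import string
from collections import Counter

def normalize(s):
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def score(pred, golds):
    # Per question, take the maximum over all ground-truth answers;
    # corpus EM/F1 are averages of these per-question maxima.
    em = max(float(normalize(pred) == normalize(g)) for g in golds)
    return em, max(f1(pred, g) for g in golds)
```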

  29. Experiment results • Overall results. For the SQuAD v1.1 test set: EM 82.304, F1 91.221.

  30. Experiment results • Performance stratified by answer type

  31. Experiment results • Performance stratified by syntactic divergence

  32. Experiment results • Performance with feature ablations

  33. Summary • SQuAD is a machine reading comprehension QA dataset. • SQuAD consists of 100,000+ QA pairs. • SQuAD was constructed via crowdsourcing. • SQuAD drives the field forward.

  34. Thanks! Q & A
