commonsense benchmarks

  1. Commonsense benchmarks, or how to measure that your model is actually doing some commonsense reasoning

  2. How do you know that a model is doing commonsense reasoning?

  3. How do you know that a model is doing commonsense reasoning? Unsupervised: • Observe behavior • Probe representations • etc.

  4. How do you know that a model is doing commonsense reasoning? Unsupervised: • Observe behavior • Probe representations • etc. Benchmarks: knowledge-specific tests (w/ or w/o training data)

  5. How do you know that a model is doing commonsense reasoning? Unsupervised: • Observe behavior • Probe representations • etc. Benchmarks: knowledge-specific tests (w/ or w/o training data). QA format: easy to evaluate (e.g., accuracy)
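To make "easy to evaluate" concrete, here is a minimal sketch of accuracy on a multiple-choice QA benchmark. The function name `accuracy` and the toy answer indices are illustrative assumptions, not part of the slides.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions whose predicted answer index equals the gold index."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy data: one chosen answer index per question.
gold = [1, 0, 2, 1]
predictions = [1, 0, 0, 1]
score = accuracy(predictions, gold)
print(f"accuracy = {score:.2f}")
```

Because every question has exactly one gold answer, no answer matching or human judgment is needed at evaluation time, which is the appeal of the QA format.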

  6. Step 1: Determine type of reasoning (https://leaderboard.allenai.org/)

  7. Step 1: Determine type of reasoning: abductive reasoning (https://leaderboard.allenai.org/)

  8. Step 1: Determine type of reasoning: visual commonsense reasoning, abductive reasoning (https://leaderboard.allenai.org/)

  9. Reasoning about Social Situations

  10. Reasoning about Social Situations. Alex spilt food all over the floor and it made a huge mess. What will Alex want to do next?

  11. Reasoning about Social Situations. Alex spilt food all over the floor and it made a huge mess. What will Alex want to do next? • run around in the mess • mop up the mess

  12. Reasoning about Social Situations. Alex spilt food all over the floor and it made a huge mess. What will Alex want to do next? • run around in the mess (less likely) • mop up the mess (more likely)

  13. Knowledge tested in Social IQa: ATOMIC. Event: PersonX spills ___ all over the floor. • Causes: X needed to (fall over, drink too much); X wanted to (no intent) • Stative: X is seen as (clumsy, careless) • Effects: has effect on X (gets dirty, slip on the spill); X will feel (embarrassed, upset); X will want (clean it up, get a broom)

  14. Step 2: Choosing a benchmark size. Small scale vs. large scale: • Creation: expert-curated vs. crowdsourced/automatic • Coverage: limited coverage vs. large coverage • Training: dev/test only vs. training/dev/test • Budget: expert time costs vs. crowdsourcing costs

  15. Step 2: Choosing a benchmark size. Small scale vs. large scale: • Creation: expert-curated vs. crowdsourced/automatic • Coverage: limited coverage vs. large coverage • Training: dev/test only vs. training/dev/test • Budget: expert time costs vs. crowdsourcing costs. Examples: Winograd Schema Challenge (WSC), Choice of Plausible Alternatives (COPA)

  16. Small commonsense benchmarks. Winograd Schema Challenge (WSC): 273 examples. Choice of Plausible Alternatives (COPA): 500 dev, 500 test. • The city councilmen refused the demonstrators a permit because they advocated violence. Who is “they”? (a) The city councilmen (b) The demonstrators • The city councilmen refused the demonstrators a permit because they feared violence. Who is “they”? (a) The city councilmen (b) The demonstrators

  17. Small commonsense benchmarks. Winograd Schema Challenge (WSC): 273 examples. Choice of Plausible Alternatives (COPA): 500 dev, 500 test. • I hung up the phone. What was the cause of this? (a) The caller said goodbye to me. (b) The caller identified himself to me. • The toddler became cranky. What happened as a result? (a) Her mother put her down for a nap. (b) Her mother fixed her hair into pigtails.

  18. Step 2: Choosing a QA benchmark size. Small scale vs. large scale: • Creation: expert-curated vs. crowdsourced/automatic • Coverage: limited coverage vs. large coverage • Training: dev/test only vs. training/dev/test • Budget: expert time costs vs. crowdsourcing costs. Challenge: how to collect positive/negative answers?

  19. Challenge of collecting unlikely answers

  20. Challenge of collecting unlikely answers. Goal: negative answers have to be plausible but unlikely

  21. Challenge of collecting unlikely answers. Goal: negative answers have to be plausible but unlikely • Automatic matching? • Random negative sampling won’t work: the answers are too topically different • “Smart” negative sampling isn’t effective either

  22. Challenge of collecting unlikely answers. Goal: negative answers have to be plausible but unlikely • Automatic matching? • Random negative sampling won’t work: the answers are too topically different • “Smart” negative sampling isn’t effective either • Need a better solution… maybe we can ask crowd workers?
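A minimal sketch of random negative sampling, to show why it tends to produce topically unrelated answers. Only the Alex example comes from the slides; the other two contexts, the `random_negatives` helper, and the fixed seed are hypothetical.

```python
import random

# Hypothetical toy pool; only the first context appears in the slides.
examples = [
    {"context": "Alex spilt food all over the floor and it made a huge mess.",
     "question": "What will Alex want to do next?",
     "correct_answers": ["mop up", "give up and order take out"]},
    {"context": "Jordan missed the last bus home.",
     "question": "What will Jordan want to do next?",
     "correct_answers": ["call a taxi", "walk home"]},
    {"context": "Sam won the chess tournament.",
     "question": "How will Sam feel afterwards?",
     "correct_answers": ["proud", "relieved"]},
]


def random_negatives(examples, index, k, rng):
    """Sample up to k negatives for examples[index] from other contexts' correct answers."""
    target = examples[index]
    pool = [
        answer
        for i, ex in enumerate(examples)
        if i != index
        for answer in ex["correct_answers"]
        if answer not in target["correct_answers"]
    ]
    return rng.sample(pool, min(k, len(pool)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = random_negatives(examples, 0, 2, rng)
print(negatives)
```

Whatever is sampled here comes from a different situation (a missed bus, a chess tournament), so a model can reject it on topic alone without any commonsense reasoning.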

  23. Collecting answers from crowdworkers. Context and question: Alex spilt food all over the floor and it made a huge mess. WHAT HAPPENS NEXT: What will Alex want to do next?

  25. Collecting answers from crowdworkers. Context and question: Alex spilt food all over the floor and it made a huge mess. WHAT HAPPENS NEXT: What will Alex want to do next? Free-text response, handwritten ✔ and ✘ answers: ✔ mop up ✔ give up and order take out ✘ leave the mess ✘ run around in the mess

  26. Collecting answers from crowdworkers. Context and question: Alex spilt food all over the floor and it made a huge mess. WHAT HAPPENS NEXT: What will Alex want to do next? Free-text response, handwritten ✔ and ✘ answers: ✔ mop up ✔ give up and order take out ✘ leave the mess ✘ run around in the mess. Problem: handwritten unlikely answers are too easy to detect

  27. Problem: annotation artifacts

  28. Problem: annotation artifacts • Models can exploit artifacts in handwritten incorrect answers • Exaggerations, off-topic, overly emotional, etc. • See Schwartz et al. 2017, Gururangan et al. 2018, Zellers et al. 2018, etc.

  29. Problem: annotation artifacts • Models can exploit artifacts in handwritten incorrect answers • Exaggerations, off-topic, overly emotional, etc. • See Schwartz et al. 2017, Gururangan et al. 2018, Zellers et al. 2018, etc. • Seemingly “super-human” performance by large pretrained LMs (BERT, GPT, etc.)

  31. How to make unlikely answers robust to annotation artifacts?

  32. How to make unlikely answers robust to annotation artifacts? • Social IQa, CommonsenseQA: modified answer collection

  33. How to make unlikely answers robust to annotation artifacts? • Social IQa, CommonsenseQA: modified answer collection • HellaSwag & AF-lite: adversarial filtering of artifacts

  34. Question-Switching Answers (Social IQa). Context: Alex spilt food all over the floor and it made a huge mess. Original question (WHAT HAPPENS NEXT): What will Alex want to do next? ✔ mop up ✔ give up and order take out ✘ have slippery hands ✘ get ready to eat

  35. Question-Switching Answers (Social IQa). Context: Alex spilt food all over the floor and it made a huge mess. Original question (WHAT HAPPENS NEXT): What will Alex want to do next? ✔ mop up ✔ give up and order take out ✘ have slippery hands ✘ get ready to eat. Question-switching question (WHAT HAPPENED BEFORE): What did Alex need to do before this?

  36. Question-Switching Answers (Social IQa). Context: Alex spilt food all over the floor and it made a huge mess. Original question (WHAT HAPPENS NEXT): What will Alex want to do next? ✔ mop up ✔ give up and order take out. Question-switching question (WHAT HAPPENED BEFORE): What did Alex need to do before this? ✔ have slippery hands ✔ get ready to eat. The correct answers to the switched question are carried over to the original question as incorrect answers: ✘ have slippery hands ✘ get ready to eat

  37. Question-Switching Answers (Social IQa). Context: Alex spilt food all over the floor and it made a huge mess. Original question (WHAT HAPPENS NEXT): What will Alex want to do next? ✔ mop up ✔ give up and order take out ✘ have slippery hands ✘ get ready to eat. Question-switching question (WHAT HAPPENED BEFORE): What did Alex need to do before this? ✔ have slippery hands ✔ get ready to eat
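The question-switching idea can be sketched as follows: for each context, the correct answers to one question become the incorrect answers to the other question about the same context. The function name `question_switching_negatives` and the record layout are assumptions for illustration; this is not the Social IQa collection pipeline itself.

```python
from collections import defaultdict


def question_switching_negatives(annotations):
    """Pair each question with correct answers to *other* questions on the same context."""
    by_context = defaultdict(dict)
    for item in annotations:
        by_context[item["context"]][item["question"]] = item["correct_answers"]

    examples = []
    for context, answers_by_question in by_context.items():
        for question, correct in answers_by_question.items():
            # Correct answers to a different question about the same context
            # are on-topic, but wrong for this question.
            incorrect = [
                answer
                for other_question, other_answers in answers_by_question.items()
                if other_question != question
                for answer in other_answers
                if answer not in correct
            ]
            examples.append({
                "context": context,
                "question": question,
                "correct": list(correct),
                "incorrect": incorrect,
            })
    return examples


context = "Alex spilt food all over the floor and it made a huge mess."
annotations = [
    {"context": context,
     "question": "What will Alex want to do next?",
     "correct_answers": ["mop up", "give up and order take out"]},
    {"context": context,
     "question": "What did Alex need to do before this?",
     "correct_answers": ["have slippery hands", "get ready to eat"]},
]

for example in question_switching_negatives(annotations):
    print(example["question"], "->", example["incorrect"])
```

Because the negatives were written by workers as correct answers, they should carry fewer of the stylistic cues that mark deliberately wrong handwritten answers.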

  38. Comparing incorrect/correct answers’ styles. Scale: from more stylistically similar to correct answers to more stylistically different from correct answers

  39. Comparing incorrect/correct answers’ styles. [Bar chart: effect size when comparing to correct answers (y-axis, 0.00 to 0.45; higher means more stylistically different from correct, lower means more stylistically similar) for Arousal, Dominance, and Valence; series: handwritten incorrect answers vs. question-switching answers]

  40. Comparing incorrect/correct answers’ styles. [Bar chart: effect size when comparing to correct answers (y-axis, 0.00 to 0.45; higher means more stylistically different from correct, lower means more stylistically similar) for Arousal, Dominance, and Valence; series: handwritten incorrect answers vs. question-switching answers] Question-switching answers are more stylistically similar to correct answers
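The slides do not say which effect-size statistic is plotted. As one common choice, the sketch below uses Cohen's d with a pooled standard deviation; that choice, the `cohens_d` helper, and the toy per-answer scores are all assumptions made for illustration, not data from the talk.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical per-answer scores on one style dimension (e.g., arousal).
correct = [0.50, 0.55, 0.45, 0.60]
handwritten_incorrect = [0.70, 0.80, 0.65, 0.75]
question_switching = [0.52, 0.56, 0.47, 0.58]

d_handwritten = abs(cohens_d(correct, handwritten_incorrect))
d_switching = abs(cohens_d(correct, question_switching))
print(f"handwritten vs. correct:        |d| = {d_handwritten:.2f}")
print(f"question-switching vs. correct: |d| = {d_switching:.2f}")
```

A smaller absolute effect size means the incorrect answers are harder to tell apart from correct ones on that style dimension, which is the property the chart attributes to question-switching answers.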

  41. CommonsenseQA: pivot on knowledge graphs (Talmor et al. 2019)