Commonsense benchmarks, or how to measure whether your model is actually doing some commonsense reasoning
How do you know that a model is doing commonsense reasoning?

Unsupervised:
• Observe behavior
• Probe representations
• etc.

Benchmarks:
• Knowledge-specific tests (w/ or w/o training data)
• QA format: easy to evaluate (e.g., accuracy)
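Multiple-choice QA makes evaluation a straightforward comparison of predicted and gold answer indices. A minimal sketch (the example predictions are illustrative, not from any real benchmark):

```python
def accuracy(predictions, gold):
    """Fraction of questions where the predicted choice matches the gold choice."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Illustrative predictions: each value is an index into that question's answer options
preds = [1, 0, 2, 1]
gold = [1, 0, 0, 1]
print(accuracy(preds, gold))  # 0.75
```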
Step 1: Determine the type of reasoning
• Abductive reasoning
• Visual commonsense reasoning
https://leaderboard.allenai.org/
Reasoning about Social Situations

Alex spilt food all over the floor and it made a huge mess.
What will Alex want to do next?
• run around in the mess (less likely)
• mop up the mess (more likely)
Knowledge tested in SOCIAL IQA: ATOMIC

[ATOMIC graph for the event "PersonX spills ___ all over the floor"]
Causes (X needed to, X wanted to, X is seen as): stative, no intent — fall over, clumsy, drink too much, careless
Effects: gets dirty, slip on the spill (has effect on X); embarrassed, upset (X will feel); clean it up, get a broom (X will want)
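In code, an ATOMIC-style entry is just a mapping from an event to typed inference dimensions. A minimal sketch with the spill example (the dimension keys follow ATOMIC's relation names; the grouping and values here are illustrative):

```python
# Hypothetical in-memory representation of one ATOMIC event node
atomic_entry = {
    "event": "PersonX spills ___ all over the floor",
    "causes": {
        "xNeed": ["none (stative)"],       # X needed to: no intent
        "xIntent": ["none (no intent)"],   # X wanted to
    },
    "effects": {
        "xEffect": ["gets dirty", "slips on the spill"],  # has effect on X
        "xReact": ["embarrassed", "upset"],               # X will feel
        "xWant": ["clean it up", "get a broom"],          # X will want
    },
}

# A benchmark question probes one dimension, e.g. "What will X want to do next?"
print(atomic_entry["effects"]["xWant"])  # ['clean it up', 'get a broom']
```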
Step 2: Choosing a benchmark size

            Small scale          Large scale
Creation:   Expert-curated       Crowdsourced/automatic
Coverage:   Limited coverage     Large coverage
Training:   Dev/test only        Training/dev/test
Budget:     Expert time costs    Crowdsourcing costs

Small-scale examples: Winograd Schema Challenge (WSC), Choice of Plausible Alternatives (COPA)
Small commonsense benchmarks

Winograd Schema Challenge (WSC) — 273 examples:
The city councilmen refused the demonstrators a permit because they advocated violence. Who is "they"?
(a) The city councilmen
(b) The demonstrators
The city councilmen refused the demonstrators a permit because they feared violence. Who is "they"?
(a) The city councilmen
(b) The demonstrators
Choice of Plausible Alternatives (COPA) — 500 dev, 500 test:
I hung up the phone. What was the cause of this?
(a) The caller said goodbye to me.
(b) The caller identified himself to me.
The toddler became cranky. What happened as a result?
(a) Her mother put her down for a nap.
(b) Her mother fixed her hair into pigtails.
Step 2 (cont.): Choosing a QA benchmark size
Challenge: how to collect positive/negative answers?
Challenge of collecting unlikely answers

Goal: negative answers have to be plausible but unlikely
• Automatic matching?
  • Random negative sampling won't work: answers are too topically different
  • "Smart" negative sampling isn't effective either
• Need a better solution… maybe we can ask crowd workers?
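Why random negative sampling fails is easy to see in code: a negative drawn from another question's answer pool rarely shares vocabulary with the context, so a crude bag-of-words overlap already separates positives from negatives. A toy sketch (all contexts and answers are made up for illustration):

```python
import random

def word_overlap(context, answer):
    """Crude topicality signal: count of shared lowercase tokens."""
    return len(set(context.lower().split()) & set(answer.lower().split()))

context = "Alex spilt food all over the floor and it made a huge mess"
positive = "mop up the mess on the floor"
# Negatives sampled at random from unrelated questions' answer pools
negative_pool = ["buy a new guitar", "fly to Paris", "study for the exam"]
negative = random.choice(negative_pool)

# The positive shares topical words with the context; random negatives share
# almost none, so a model can "solve" the question without commonsense reasoning.
print(word_overlap(context, positive), word_overlap(context, negative))
```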
Collecting answers from crowdworkers

Context and question:
Alex spilt food all over the floor and it made a huge mess.
WHAT HAPPENS NEXT: What will Alex want to do next?

Free-text response — handwritten ✔ and ✘ answers:
✔ mop up
✔ give up and order take out
✘ leave the mess
✘ run around in the mess

Problem: handwritten unlikely answers are too easy to detect
Problem: annotation artifacts
• Models can exploit artifacts in handwritten incorrect answers
  • Exaggerations, off-topic content, overly emotional language, etc.
  • See Schwartz et al. 2017, Gururangan et al. 2018, Zellers et al. 2018, etc.
• Result: seemingly "super-human" performance by large pretrained LMs (BERT, GPT, etc.)
How to make unlikely answers robust to annotation artifacts?
• SOCIAL IQA, COMMONSENSE QA: modified answer collection
• HellaSwag & AF-lite: adversarial filtering of artifacts
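Adversarial filtering can be sketched as a loop: score examples with a weak discriminator, discard the ones it solves from surface cues alone, and repeat. The sketch below substitutes a trivial shortest-answer heuristic for the trained discriminator used in the real method, just to show the loop's shape:

```python
def adversarial_filter(examples, is_easy, rounds=3):
    """Iteratively drop examples that a weak discriminator solves from surface cues."""
    for _ in range(rounds):
        kept = [ex for ex in examples if not is_easy(ex)]
        if len(kept) == len(examples):
            break  # converged: nothing trivially solvable remains
        examples = kept
    return examples

# Toy stand-in for a trained discriminator: always guess the shortest answer.
# An example is "easy" if that surface heuristic already finds the gold label.
def shortest_answer_heuristic(ex):
    answers, gold = ex
    guess = min(range(len(answers)), key=lambda i: len(answers[i]))
    return guess == gold

data = [
    (["mop up", "run around in the mess screaming wildly"], 0),  # easy: gold is shortest
    (["have slippery hands", "get ready to eat"], 0),            # not solvable by length
]
filtered = adversarial_filter(data, shortest_answer_heuristic)
print(len(filtered))  # 1
```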
Question-Switching Answers (SOCIAL IQA)

Context: Alex spilt food all over the floor and it made a huge mess.

Original question — WHAT HAPPENS NEXT:
What will Alex want to do next?
✔ mop up
✔ give up and order take out
✘ have slippery hands
✘ get ready to eat

Question-switching answers — WHAT HAPPENED BEFORE:
What did Alex need to do before this?
✔ have slippery hands
✔ get ready to eat

The correct answers to the switched question become the incorrect answers to the original question.
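Assembling an item from question-switching answers can be sketched as: take the correct answers crowdworkers wrote for a different question about the same context, and reuse them as the incorrect options for the original question (the strings below are just the running Alex example):

```python
# Correct answers collected for each question about the same context
context = "Alex spilt food all over the floor and it made a huge mess."
answers = {
    "What will Alex want to do next?": ["mop up", "give up and order take out"],
    "What did Alex need to do before this?": ["have slippery hands", "get ready to eat"],
}

def build_item(question, other_question):
    """One MC item: gold answers for `question`, plus the other question's
    gold answers serving as plausible-but-unlikely negatives."""
    options = answers[question] + answers[other_question]
    labels = [True] * len(answers[question]) + [False] * len(answers[other_question])
    return context, question, options, labels

_, q, options, labels = build_item(
    "What will Alex want to do next?",
    "What did Alex need to do before this?",
)
print(options)
```

Because every negative was written as a *correct* answer to some question, it stays on-topic and stylistically natural, which is exactly what defeats the artifact-exploiting shortcuts.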
Comparing incorrect/correct answers' styles

[Bar chart: effect sizes on Arousal, Dominance, and Valence when comparing incorrect answers to correct answers; handwritten incorrect answers show larger effect sizes (more stylistically different from correct) than question-switching answers (more stylistically similar)]

Question-switching answers are more stylistically similar to correct answers.
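The effect size in a comparison like this is a standardized mean difference (e.g. Cohen's d) between style scores (arousal, dominance, valence) of correct vs. incorrect answers. A minimal sketch on made-up score lists (not the paper's data):

```python
import statistics

def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled std."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Made-up valence scores: correct answers vs. two kinds of incorrect answers
correct = [0.60, 0.55, 0.58, 0.62, 0.57]
handwritten = [0.30, 0.35, 0.28, 0.33, 0.31]  # stylistically different
switched = [0.58, 0.54, 0.61, 0.56, 0.59]     # stylistically similar

# Question-switching negatives yield a much smaller effect size
print(cohens_d(correct, handwritten) > cohens_d(correct, switched))  # True
```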
COMMONSENSE QA: pivot on knowledge graphs (Talmor et al. 2019)