Commonsense benchmarks, or how to measure whether your model is actually doing some commonsense reasoning
How do you know that a model is doing commonsense reasoning?

Unsupervised:
• Observe behavior
• Probe representations
• etc.

Benchmarks:
• Knowledge-specific tests (w/ or w/o training data)
• QA format: easy to evaluate (e.g., accuracy)
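Multiple-choice QA makes evaluation a straightforward comparison of predicted and gold answer indices. A minimal sketch (the example predictions are illustrative, not from any real benchmark):

```python
def accuracy(predictions, gold):
    """Fraction of questions where the predicted choice matches the gold choice."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Illustrative predictions: each value is an index into that question's answer options
preds = [1, 0, 2, 1]
gold = [1, 0, 0, 1]
print(accuracy(preds, gold))  # 0.75
```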
Step 1: Determine the type of reasoning
• Abductive reasoning
• Visual commonsense reasoning
https://leaderboard.allenai.org/
Reasoning about Social Situations

Alex spilt food all over the floor and it made a huge mess.
What will Alex want to do next?
• run around in the mess (less likely)
• mop up the mess (more likely)
Knowledge tested in SOCIAL IQA: ATOMIC

[ATOMIC graph for the event "PersonX spills ___ all over the floor"]
Causes (X needed to, X wanted to, X is seen as): stative, no intent — fall over, clumsy, drink too much, careless
Effects: gets dirty, slip on the spill (has effect on X); embarrassed, upset (X will feel); clean it up, get a broom (X will want)
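In code, an ATOMIC-style entry is just a mapping from an event to typed inference dimensions. A minimal sketch with the spill example (the dimension keys follow ATOMIC's relation names; the grouping and values here are illustrative):

```python
# Hypothetical in-memory representation of one ATOMIC event node
atomic_entry = {
    "event": "PersonX spills ___ all over the floor",
    "causes": {
        "xNeed": ["none (stative)"],       # X needed to: no intent
        "xIntent": ["none (no intent)"],   # X wanted to
    },
    "effects": {
        "xEffect": ["gets dirty", "slips on the spill"],  # has effect on X
        "xReact": ["embarrassed", "upset"],               # X will feel
        "xWant": ["clean it up", "get a broom"],          # X will want
    },
}

# A benchmark question probes one dimension, e.g. "What will X want to do next?"
print(atomic_entry["effects"]["xWant"])  # ['clean it up', 'get a broom']
```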
Step 2: Choosing a benchmark size

            Small scale          Large scale
Creation:   Expert-curated       Crowdsourced/automatic
Coverage:   Limited coverage     Large coverage
Training:   Dev/test only        Training/dev/test
Budget:     Expert time costs    Crowdsourcing costs

Small-scale examples: Winograd Schema Challenge (WSC), Choice of Plausible Alternatives (COPA)
Small commonsense benchmarks

Winograd Schema Challenge (WSC) — 273 examples:
The city councilmen refused the demonstrators a permit because they advocated violence. Who is "they"?
(a) The city councilmen
(b) The demonstrators
The city councilmen refused the demonstrators a permit because they feared violence. Who is "they"?
(a) The city councilmen
(b) The demonstrators
Choice of Plausible Alternatives (COPA) — 500 dev, 500 test:
I hung up the phone. What was the cause of this?
(a) The caller said goodbye to me.
(b) The caller identified himself to me.
The toddler became cranky. What happened as a result?
(a) Her mother put her down for a nap.
(b) Her mother fixed her hair into pigtails.
Step 2 (cont.): Choosing a QA benchmark size
Challenge: how to collect positive/negative answers?
Challenge of collecting unlikely answers

Goal: negative answers have to be plausible but unlikely
• Automatic matching?
  • Random negative sampling won't work: answers are too topically different
  • "Smart" negative sampling isn't effective either
• Need a better solution… maybe we can ask crowd workers?
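Why random negative sampling fails is easy to see in code: a negative drawn from another question's answer pool rarely shares vocabulary with the context, so a crude bag-of-words overlap already separates positives from negatives. A toy sketch (all contexts and answers are made up for illustration):

```python
import random

def word_overlap(context, answer):
    """Crude topicality signal: count of shared lowercase tokens."""
    return len(set(context.lower().split()) & set(answer.lower().split()))

context = "Alex spilt food all over the floor and it made a huge mess"
positive = "mop up the mess on the floor"
# Negatives sampled at random from unrelated questions' answer pools
negative_pool = ["buy a new guitar", "fly to Paris", "study for the exam"]
negative = random.choice(negative_pool)

# The positive shares topical words with the context; random negatives share
# almost none, so a model can "solve" the question without commonsense reasoning.
print(word_overlap(context, positive), word_overlap(context, negative))
```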
Collecting answers from crowdworkers

Context and question:
Alex spilt food all over the floor and it made a huge mess.
WHAT HAPPENS NEXT: What will Alex want to do next?

Free-text response — handwritten ✔ and ✘ answers:
✔ mop up
✔ give up and order take out
✘ leave the mess
✘ run around in the mess

Problem: handwritten unlikely answers are too easy to detect
Problem: annotation artifacts
• Models can exploit artifacts in handwritten incorrect answers
  • Exaggerations, off-topic content, overly emotional language, etc.
  • See Schwartz et al. 2017, Gururangan et al. 2018, Zellers et al. 2018, etc.
• Result: seemingly "super-human" performance by large pretrained LMs (BERT, GPT, etc.)
How to make unlikely answers robust to annotation artifacts?
• SOCIAL IQA, COMMONSENSE QA: modified answer collection
• HellaSwag & AF-lite: adversarial filtering of artifacts
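Adversarial filtering can be sketched as a loop: score examples with a weak discriminator, discard the ones it solves from surface cues alone, and repeat. The sketch below substitutes a trivial shortest-answer heuristic for the trained discriminator used in the real method, just to show the loop's shape:

```python
def adversarial_filter(examples, is_easy, rounds=3):
    """Iteratively drop examples that a weak discriminator solves from surface cues."""
    for _ in range(rounds):
        kept = [ex for ex in examples if not is_easy(ex)]
        if len(kept) == len(examples):
            break  # converged: nothing trivially solvable remains
        examples = kept
    return examples

# Toy stand-in for a trained discriminator: always guess the shortest answer.
# An example is "easy" if that surface heuristic already finds the gold label.
def shortest_answer_heuristic(ex):
    answers, gold = ex
    guess = min(range(len(answers)), key=lambda i: len(answers[i]))
    return guess == gold

data = [
    (["mop up", "run around in the mess screaming wildly"], 0),  # easy: gold is shortest
    (["have slippery hands", "get ready to eat"], 0),            # not solvable by length
]
filtered = adversarial_filter(data, shortest_answer_heuristic)
print(len(filtered))  # 1
```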
Question-Switching Answers (SOCIAL IQA)

Context: Alex spilt food all over the floor and it made a huge mess.

Original question — WHAT HAPPENS NEXT:
What will Alex want to do next?
✔ mop up
✔ give up and order take out
✘ have slippery hands
✘ get ready to eat

Question-switching answers — WHAT HAPPENED BEFORE:
What did Alex need to do before this?
✔ have slippery hands
✔ get ready to eat

The correct answers to the switched question become the incorrect answers to the original question.
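Assembling an item from question-switching answers can be sketched as: take the correct answers crowdworkers wrote for a different question about the same context, and reuse them as the incorrect options for the original question (the strings below are just the running Alex example):

```python
# Correct answers collected for each question about the same context
context = "Alex spilt food all over the floor and it made a huge mess."
answers = {
    "What will Alex want to do next?": ["mop up", "give up and order take out"],
    "What did Alex need to do before this?": ["have slippery hands", "get ready to eat"],
}

def build_item(question, other_question):
    """One MC item: gold answers for `question`, plus the other question's
    gold answers serving as plausible-but-unlikely negatives."""
    options = answers[question] + answers[other_question]
    labels = [True] * len(answers[question]) + [False] * len(answers[other_question])
    return context, question, options, labels

_, q, options, labels = build_item(
    "What will Alex want to do next?",
    "What did Alex need to do before this?",
)
print(options)
```

Because every negative was written as a *correct* answer to some question, it stays on-topic and stylistically natural, which is exactly what defeats the artifact-exploiting shortcuts.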
Comparing incorrect/correct answers' styles

[Bar chart: effect sizes on Arousal, Dominance, and Valence when comparing incorrect answers to correct answers; handwritten incorrect answers show larger effect sizes (more stylistically different from correct) than question-switching answers (more stylistically similar)]

Question-switching answers are more stylistically similar to correct answers.
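The effect size in a comparison like this is a standardized mean difference (e.g. Cohen's d) between style scores (arousal, dominance, valence) of correct vs. incorrect answers. A minimal sketch on made-up score lists (not the paper's data):

```python
import statistics

def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled std."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Made-up valence scores: correct answers vs. two kinds of incorrect answers
correct = [0.60, 0.55, 0.58, 0.62, 0.57]
handwritten = [0.30, 0.35, 0.28, 0.33, 0.31]  # stylistically different
switched = [0.58, 0.54, 0.61, 0.56, 0.59]     # stylistically similar

# Question-switching negatives yield a much smaller effect size
print(cohens_d(correct, handwritten) > cohens_d(correct, switched))  # True
```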
COMMONSENSE QA: pivot on knowledge graphs (Talmor et al. 2019)