“Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding
Ben Zhou, Daniel Khashabi*, Qiang Ning*, Dan Roth
*Currently affiliated with AI2
Temporal Common Sense
- Humans assume information when reading
  - Not explicitly mentioned
  - Related to time
- Happens all the time
  - To better understand the storyline and beyond
Temporal Common Sense
My friend Bill went to Duke University in North Carolina. With a degree in CS, he joined Google MTV as a software engineer. As a huge basketball fan, he has attended all 3 NBA finals since then. He also plans to visit Duke regularly as an alumnus to attend their home games.
Temporal Common Sense
Implicit temporal facts in the passage about Bill, with the phenomena each one illustrates:
- College: about 4 years, starting at the age of 18 (Duration, Typical Time)
- Bill in North Carolina: about 4 years (Duration, Stationarity)
- Duke in North Carolina: always (expected) (Stationarity)
- Join Google: after college graduation (Ordering)
- NBA Finals: every year (Frequency)
- Visit alma mater: 0-2 times per year, 0-2 days each time (Frequency, Duration)
- Attend basketball games: a few hours (Duration)
Temporal Commonsense
Questions a reader can answer about the passage about Bill:
- Q: How old is Bill?
  - A: Around 25.
  - R: 3 + 4 + 18
- Q: How long will it take Bill to fly to Duke?
  - A: A few (1-5) hours.
  - R: Duke is always in NC, Bill is now in CA
- Q: How often would he visit Duke in the future?
  - A: A few (<5) times a year.
- Q: Which one happened first, went or joined?
  - A: Went.
* Humans infer temporal common sense that helps them better understand the story.
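The reasoning "3 + 4 + 18" behind the first answer can be spelled out as a small sketch. The function and constant names below are hypothetical; the numbers come from the annotations in the talk (college starts around 18, lasts about 4 years, and the NBA Finals happen once a year).

```python
# Hypothetical names; the values are the commonsense estimates from the talk.
TYPICAL_COLLEGE_START_AGE = 18   # Typical Time
TYPICAL_COLLEGE_YEARS = 4        # Duration
NBA_FINALS_PER_YEAR = 1          # Frequency


def estimate_age(finals_attended_since_graduation: int) -> int:
    """Estimate Bill's age from the number of NBA Finals attended since graduating."""
    years_since_graduation = finals_attended_since_graduation // NBA_FINALS_PER_YEAR
    return TYPICAL_COLLEGE_START_AGE + TYPICAL_COLLEGE_YEARS + years_since_graduation


print(estimate_age(3))
```

Each term corresponds to one phenomenon (Typical Time, Duration, Frequency), which is why a single question can require several kinds of temporal common sense at once.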
Our Contribution
- MC-TACO 🌯 (multiple choice temporal common-sense):
  - A dataset that focuses on temporal commonsense
  - Input: "He went to Duke University. How long did it take him to graduate?" with a candidate answer such as "4 years"
  - Task: Decide whether each answer is plausible.
    - Example candidates for the question above: 4 years, 10 days, 3.5 years, 16 hours, 1 century
    - Each candidate has a gold label and a prediction (✔ or ✗); the example prediction scores F1: 66.7, Exact Match: 0.0
  - Metrics:
    - Exact Match: the percentage of questions for which all candidates are predicted correctly
    - F1: the F1 score of the “plausible” class
    - Reading comprehension means being able to answer any question regarding a piece of text; Exact Match means being able to label all candidate answers of a question
  - Statistics:
    - 1,893 questions
    - 13,225 question-answer pairs
  - Conclusion: current systems are not enough to solve this.
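The two metrics can be sketched as follows. This is a minimal illustration, not the official evaluator: the function names are hypothetical, and the gold and predicted labels in the usage example are assumed for illustration only (they are chosen so the scores match the slide's F1 of 66.7 and Exact Match of 0.0, but the slide's actual marks may differ). Whether F1 is aggregated per question or over all pairs is not stated in the talk; the sketch scores one question at a time.

```python
from typing import Dict, List


def f1_plausible(gold: List[bool], pred: List[bool]) -> float:
    """F1 of the "plausible" class for one question's candidate answers."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g and p)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred)
    recall = true_pos / sum(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(gold: Dict[str, List[bool]], pred: Dict[str, List[bool]]) -> float:
    """Fraction of questions whose candidates are ALL labeled correctly."""
    hits = sum(1 for q in gold if gold[q] == pred[q])
    return hits / len(gold)


# Assumed labels for: 4 years, 10 days, 3.5 years, 16 hours, 1 century
question = "How long did it take him to graduate?"
gold = {question: [True, False, True, False, False]}
pred = {question: [True, True, True, True, False]}

print(round(100 * f1_plausible(gold[question], pred[question]), 1))
print(100 * exact_match(gold, pred))
```

Exact Match is the stricter metric: a single wrong candidate makes the whole question count as a miss, even when F1 on that question is fairly high.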
MC-TACO: Construction
Source sentence: He joined Google as a software engineer after graduating from college.
- Step 0: Source Sentence Generation
  - Randomly sample sentences
- Step 1: Question Generation
  - Ask people to write questions that are
    - A) temporal
    - B) non-extractive (to require commonsense)
  - Ask for one “plausible” answer
- Example questions:
  - How long did he stay in college? (Duration) Answer: 4 years
  - Will he work at Google for the rest of his life? (Duration, Stationarity) Answer: No
MC-TACO: Construction
Source sentence: He joined Google as a software engineer after graduating from college.
- Step 2: Question Verification
  - 2 additional verifications on each question
  - 100% agreement
  - We also ask for
    - 1 “plausible” answer
    - 1 “implausible” answer
- Example verifications:
  - How long did he stay in college? Temporal? Yes. Non-extractive? Yes.
  - What did he do after college? Temporal? Yes. Non-extractive? No.
MC-TACO: Construction
Source sentence: He joined Google as a software engineer after graduating from college.
- Step 3: Candidate Answer Expansion
  - Seed answers from steps 1 and 2
  - Expand candidates automatically
    - Perturbations
    - Information Retrieval
- Example candidates:
  - How long did he stay in college? 4 years; 6 years; 11 days; …
  - What happened after he started working? He started making money. He started a factory. He contributed to public services. …
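The talk does not spell out the perturbation rules, so the following is only a sketch of one plausible approach: vary the number or the unit of a seed duration such as "4 years". The function name, the unit list, and the replacement numbers are all assumptions made for illustration.

```python
import re
from typing import List, Sequence

# Assumed unit vocabulary; the actual rules used for MC-TACO are not given in the talk.
UNITS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years", "centuries"]


def perturb_duration(seed: str, numbers: Sequence[int] = (2, 6, 11)) -> List[str]:
    """Generate candidate answers by changing the number or the unit of a seed."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s+([A-Za-z]+)", seed.strip())
    if match is None:
        return []  # not a "<number> <unit>" answer; leave it to other methods
    value, unit = match.group(1), match.group(2)
    candidates = [f"{n} {unit}" for n in numbers if str(n) != value]
    candidates += [f"{value} {u}" for u in UNITS if u != unit]
    return candidates


print(perturb_duration("4 years"))
```

Free-text seeds such as "He started making money." fall through this rule, which is where a retrieval-based expansion would be needed instead.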
MC-TACO: Construction
Source sentence: He joined Google as a software engineer after graduating from college.
- Step 4: Answer Labeling
  - Each answer is labeled by 4 different annotators
  - Either “likely” or “unlikely”
  - Enforce 100% agreement
    - Eliminate marginal answers with “intermediate” probability
- Example candidates to label:
  - How long did he stay in college? 4 years; 6 years; 11 days; …
  - What happened after he started working? He started making money. He started a factory. He contributed to public services. …
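The agreement filter in Step 4 can be sketched as below. The function name and the annotator votes in the usage example are hypothetical; only the rule itself (4 annotators, "likely" or "unlikely", keep unanimous answers) comes from the slide.

```python
from typing import Dict, List

NUM_ANNOTATORS = 4  # from the slide: each answer is labeled by 4 annotators


def keep_unanimous(votes: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only answers on which all annotators chose the same label."""
    kept = {}
    for answer, labels in votes.items():
        if len(labels) == NUM_ANNOTATORS and len(set(labels)) == 1:
            kept[answer] = labels[0]
    return kept


# Hypothetical votes for "How long did he stay in college?"
votes = {
    "4 years": ["likely", "likely", "likely", "likely"],
    "6 years": ["likely", "unlikely", "likely", "unlikely"],
    "11 days": ["unlikely", "unlikely", "unlikely", "unlikely"],
}
print(keep_unanimous(votes))
```

Dropping split votes removes borderline answers, so every remaining candidate is either clearly plausible or clearly implausible.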
Results
[Bar chart: F1 and Exact Match per system, with Human F1 and Human Exact Match shown as reference lines]
System: F1 / Exact Match
- Naïve Best: 49.8 / 17.4
- ESIM + GloVe: 50.3 / 20.9
- ESIM + ELMo: 54.9 / 26.4
- BERT: 66.1 / 39.6
- BERT + Unit Normalization: 69.9 / 42.7
- RoBERTa (post publication): 72.3 / 43.6
Chart callouts: 13% drop; 40% drop; 26% improvement over +GloVe; surface association; unit normalization example: 3 weeks -> 0.75 months
Systems:
- ESIM: Enhanced LSTM for Natural Language Inference (Chen et al., 2016)
- GloVe: Global Vectors for Word Representation (Pennington et al., 2014)
- ELMo: Deep contextualized word representations (Peters et al., 2018)
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)
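The unit normalization example on the chart (3 weeks -> 0.75 months) can be reproduced with a simple conversion table. This is a sketch under stated assumptions: the function name and table are hypothetical, the month is taken as 4 weeks because that is what the slide's example implies, and how the BERT + Unit Normalization system chooses its target unit is not described in the talk.

```python
# Seconds per unit. The 4-week month is an assumption implied by the slide's
# "3 weeks -> 0.75 months" example, not a calendar-accurate value.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 4 * 7 * 86400,
    "year": 365 * 86400,
}


def convert(value: float, unit: str, target_unit: str) -> float:
    """Express a duration given in `unit` in terms of `target_unit`."""
    return value * SECONDS_PER_UNIT[unit] / SECONDS_PER_UNIT[target_unit]


print(convert(3, "week", "month"))
```

Putting durations on a common scale lets a model compare "3 weeks" and "0.75 months" as the same quantity rather than as unrelated strings.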
Summary
- Define 5 temporal commonsense phenomena
- Present MC-TACO, a QA dataset focused on temporal commonsense
- Show that existing systems are not enough to solve it
- Encourage further research
- Thanks!
Leaderboard: https://leaderboard.allenai.org/mctaco/
GitHub (data, baseline, evaluator): https://github.com/CogComp/MCTACO