Self-Critical Reasoning for Robust Visual Question Answering
Jialin Wu and Raymond J. Mooney
Visual Question Answering (VQA)
• Common VQA system
[Figure: the question "What utensil is pictured?" and the visual feature set $\mathbf{V}$ extracted from the original image feed the answer prediction, which scores Knife (0.72) and Fork (0.66).]
Capture superficial statistical correlations between QA pairs
VQA system: "I won't bother to look at the image; I can answer your question by just looking at the question."
[Figure: given the original image and the question "What utensil is pictured?", the VQA system answers "Knife". Bar chart "Training Answer Distribution" (y-axis 0 to 100) over the answers knife and fork.]
Force VQA to focus on what humans focus on
• Extract a proposal set of objects that humans focus on.
[Figure: a human textual explanation ("There is a fork near the cake.") OR a human visual explanation yields the proposal object set.]
Force VQA to focus on what humans focus on
• Require the gradient for the correct answer to take its largest value on at least one of the extracted objects.
[Figure: the influence $\nabla_{\mathbf{V}} P(\text{fork} \mid Q, \mathbf{V})$ is compared against the proposal object set to form the Influence Strengthen Loss.]
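The constraint on this slide can be sketched as a hinge loss. The snippet below is a minimal illustration, not the authors' implementation: it assumes a per-object scalar influence score has already been computed (for example, the gradient of the correct answer's probability summed over that object's feature dimensions), and the function name `influence_strengthen_loss` and the toy numbers are hypothetical.

```python
def influence_strengthen_loss(sensitivity, proposal):
    """Hinge loss that is zero when some proposal object out-ranks every other object.

    sensitivity: list of floats, one per detected object (assumed input: the
                 influence of that object on the correct answer's probability).
    proposal:    set of indices of the objects that humans focus on.
    """
    if not proposal:
        return 0.0  # nothing to enforce without a human-derived proposal set
    others = [j for j in range(len(sensitivity)) if j not in proposal]
    per_proposal = []
    for i in proposal:
        # Penalize every non-proposal object that is more influential than object i.
        violation = sum(max(0.0, sensitivity[j] - sensitivity[i]) for j in others)
        per_proposal.append(violation)
    # Only "at least one" proposal object has to dominate, so keep the best case.
    return min(per_proposal)


# Toy example with four objects; humans focus on object 2.
before = influence_strengthen_loss([0.30, 0.10, 0.05, 0.20], proposal={2})
after = influence_strengthen_loss([0.10, 0.05, 0.40, 0.20], proposal={2})
print(before, after)
```

Taking the minimum over proposal objects reflects the "at least one" wording: the loss vanishes as soon as a single human-attended object is the most influential one, rather than requiring every proposal object to out-rank the rest.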
Results
• Compared to the baseline model on the VQA-CP dataset
• The VQA-CP dataset is deliberately split so that the train and test sets follow very different distributions
[Bar chart: VQA scores (y-axis 38 to 53) on "All" questions for Baseline and Ours (infl).]
Oversensitivity to the most common objects
VQA system: "I can focus on the fork, but I still think it is a knife."
[Figure: for the question "What utensil is pictured?", the VQA system answers "Knife". Panels show the focused objects for answer "fork" and the focused objects for answer "knife".]
Criticizing the false influential object
• Find the most influential object for the correct answer using gradients
[Figure: the question "What utensil is pictured?" and the visual feature set $\mathbf{V}$ from the original image feed the answer prediction, which scores Knife (0.72) and Fork (0.66). Explaining the prediction "fork" with $\nabla_{\mathbf{V}} P(\text{fork} \mid Q, \mathbf{V})$ selects the most influential object from the proposal object set, which comes from the human textual explanation ("There is a fork near the cake.") OR the human visual explanation.]
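The gradient-based search for the most influential object can be illustrated with a toy model. The sketch below is an assumption-laden stand-in: `toy_vqa` is a hypothetical two-answer model with the question held fixed, the gradient is approximated by central differences instead of automatic differentiation, and summing the gradient over feature dimensions is just one way to get a scalar influence per object.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_vqa(features, weights):
    """Hypothetical stand-in for P(answer | Q, V) with the question Q held fixed.

    features: list of per-object feature vectors (the visual feature set V).
    weights:  one weight vector per candidate answer.
    """
    logits = []
    for w in weights:
        # Each object adds a squashed score, so objects differ in influence.
        scores = [math.tanh(sum(wd * vd for wd, vd in zip(w, v))) for v in features]
        logits.append(sum(scores))
    return softmax(logits)


def sensitivities(features, weights, answer, eps=1e-5):
    """Approximate sum_d dP(answer)/dv_i[d] for each object i by central differences."""
    result = []
    for i, v in enumerate(features):
        total = 0.0
        for d in range(len(v)):
            plus = [list(u) for u in features]
            minus = [list(u) for u in features]
            plus[i][d] += eps
            minus[i][d] -= eps
            diff = toy_vqa(plus, weights)[answer] - toy_vqa(minus, weights)[answer]
            total += diff / (2 * eps)
        result.append(total)
    return result


def most_influential(features, weights, answer):
    s = sensitivities(features, weights, answer)
    return max(range(len(s)), key=s.__getitem__)


# Answers: 0 = "fork", 1 = "knife"; three objects with 2-d features (toy values).
weights = [[1.0, 0.0], [0.0, 1.0]]
features = [[0.1, 2.0], [0.2, 0.1], [3.0, 0.0]]
star = most_influential(features, weights, answer=0)
print(star)
```

In a real system the same quantity would come from backpropagating the correct answer's probability to the object features, which costs one backward pass instead of two forward passes per feature dimension.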
Criticizing the false influential object
• Force the object to contribute more to the correct answer.
[Figure: the question "What utensil is pictured?" and the visual feature set $\mathbf{V}$ from the original image feed the answer prediction, which scores Knife (0.72) and Fork (0.66). The most influential object is selected from the proposal object set (human textual explanation "There is a fork near the cake." OR human visual explanation) by explaining the prediction "fork" with $\nabla_{\mathbf{V}} P(\text{fork} \mid Q, \mathbf{V})$. The Self Critical Loss compares this with the explanation of the prediction "knife", $\nabla_{\mathbf{V}} P(\text{knife} \mid Q, \mathbf{V})$.]
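One way to read this slide as a loss is sketched below. It is a simplified illustration under stated assumptions, not the authors' exact formulation: the per-object influence scores for the correct and the wrongly predicted answer are taken as given inputs, only a single wrong answer is criticized, and the hinge form and the name `self_critical_loss` are hypothetical.

```python
def self_critical_loss(sens_correct, sens_wrong):
    """Penalize the key object when it supports the wrong answer more than the correct one.

    sens_correct: per-object influence on the correct answer (e.g. "fork").
    sens_wrong:   per-object influence on the predicted wrong answer (e.g. "knife").
    Both are assumed inputs standing in for gradient-based influence scores.
    """
    # Step 1: pick the most influential object for the correct answer.
    star = max(range(len(sens_correct)), key=sens_correct.__getitem__)
    # Step 2: hinge on how much more that object influences the wrong answer.
    return max(0.0, sens_wrong[star] - sens_correct[star])


# Object 1 is the most influential object for "fork" in both toy cases.
loss_bad = self_critical_loss([0.05, 0.30, 0.10], [0.10, 0.45, 0.02])
loss_ok = self_critical_loss([0.05, 0.30, 0.10], [0.10, 0.20, 0.02])
print(loss_bad, loss_ok)
```

Minimizing this term pushes the selected object to contribute more to "fork" than to "knife", which is the behavior the slide describes.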
Our self-critical approach
VQA system: "Oh, yes, the utensil should be a fork."
[Figure: for the question "What utensil is pictured?", the VQA system now answers "Fork".]
Results
• Compared to the baseline model on the VQA-CP dataset
[Bar chart: VQA scores (y-axis 38 to 52) on "All" questions for Baseline, Ours (infl), and Ours (infl + crit).]