  1. Generating Visual Explanations. Lisa Anne Hendricks et al. (Mar 2016), UC Berkeley. Presenter: Anurag Patil

  2. Outline 1. Motivation 2. The Problem and Importance 3. The Approach a. The Relevance Loss b. The Discriminative Loss 4. Dataset 5. Experiments and Results 6. Critique

  3. Motivation. Explainable AI: why should we care about it? Explainability is about trust: it is important to know why our self-driving car decided to slam on the brakes. Explanations are required for regulatory compliance in certain industries, e.g. medical diagnosis and the Equal Credit Opportunity Act in the US. Explanations can facilitate model validation and debugging: models learn associative (not necessarily causal) patterns in the training data, and explanations can reveal spurious associations. But there is a tradeoff between performance and explainability.

  4. Motivation: Explainable Models. Two broad ideas: 1. Introspection explanation systems, which explain how a model determines its final output (e.g. "This is a Western Grebe because filter 2 has a high activation"). 2. Justification explanation systems, which produce sentences detailing how the visual evidence is compatible with the system output (e.g. "This is a Western Grebe because it has red eyes"). Here, we look at justification explanation systems because they are better suited for non-experts, and we apply the idea of explainability to classification by visual systems.

  5. The Problem and Importance. Description: a sentence based only on visual information (image captioning systems). Visual explanation: a sentence that details why a certain category is appropriate for a given image, while mentioning only image-relevant features.

  6. The Approach. Condition language generation on both the image and the predicted class label; other captioning models condition only on visual features. To do this, use a fine-grained recognition pipeline plus a novel loss function that includes class-discriminative information. Challenge: class specificity is a global sentence property, i.e. words like "black" or "red eye" are not very class discriminative on their own, but the entire sentence "This is an all black bird with a bright red eye" is class specific to the Bronzed Cowbird. Typical loss functions optimize only for sentence alignment between the generated and ground-truth sentences.

  7. Note on LRCN

  8. Model Inputs : [image, category label, ground truth sentence]

  9. Proposed Loss: combines a relevance loss and a discriminative loss. - The relevance loss (L_R) corresponds to image relevance. - The discriminative loss, expressed through the expected reward E[R_D(w̃)], corresponds to class relevance.
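
A compact way to write the combined objective, reconstructed from the description above. The sign convention here is mine: L_R is taken as a negative log-likelihood so that the whole expression is minimized, and λ is the weight balancing the two terms.

```latex
% Combined objective: relevance loss minus the weighted expected
% discriminative reward over sentences sampled from the model.
\min_{W} \; L_R(W) \;-\; \lambda \, \mathbb{E}_{\tilde{w} \sim p(w \mid I, C)}\!\left[ R_D(\tilde{w}) \right]
```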

  10. Relevance Loss: N = the batch size, w_t = ground truth word, I = image, C = category. - Produces sentences that correspond to the image content. - Does not explicitly encourage generated sentences that are both image relevant and category specific. - Class labels: the average hidden state of another, separate LSTM that generates word sequences conditioned on images only (averaged across all sequences for each class in the train set).
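
The relevance-loss formula itself appeared as an image on the slide; a reconstruction from the variable definitions above, written as a negative average log-likelihood so it is minimized (the slides describe L_R as a log-likelihood, so the sign convention here is mine):

```latex
% Negative average log-likelihood of the ground-truth words, conditioned on
% the previously generated words, the image I, and the category C.
L_R = -\,\frac{1}{N} \sum_{n=0}^{N-1} \sum_{t=0}^{T-1}
      \log p\!\left(w_{t+1} \mid w_{0:t}, I_n, C_n\right)
```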

  11. Discriminative Loss. Notation: p(w | I, C) = the model's estimated conditional distribution; w̃ = a description sampled from the LSTM (p(w | I, C)); R_D(w̃) = reward for the sampled description; E[R_D(w̃)] = estimate of the expected reward. - Based on a reinforcement learning paradigm: Agent = the LSTM; Environment = the previously generated words; Action = predict the next word based on the policy and the environment; Policy = defined by the LSTM weights W. - Reward: R_D(w̃) = p_D(C | w̃), where p_D(C | w̃) is a pretrained sentence classifier applied to sentences sampled from the LSTM (p_L(w)). - The accuracy of this pretrained classifier is not important (22%).
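
A minimal sketch of the reward computation under the RL framing above, assuming a PyTorch-style setup; the names (sentence_classifier, sampled_words, class_idx) are illustrative, not from the authors' code:

```python
import torch

@torch.no_grad()  # the reward is treated as a constant w.r.t. the LSTM weights W
def discriminative_reward(sentence_classifier, sampled_words, class_idx):
    """R_D(w~) = p_D(C | w~) for a batch of sampled sentences.

    sampled_words: (N, T) tensor of word ids sampled from p(w | I, C)
    class_idx:     (N,) tensor with the target category index C per example
    """
    logits = sentence_classifier(sampled_words)            # (N, num_classes)
    probs = torch.softmax(logits, dim=-1)
    # Probability that the pretrained sentence classifier assigns to class C.
    return probs[torch.arange(probs.size(0)), class_idx]   # shape (N,)
```

The no_grad decorator reflects the point on the next slide: R_D(w̃) itself is not differentiated with respect to W; only log p(w̃ | I, C) is, via REINFORCE.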

  12. Minimizing the loss. - Since the expectation over descriptions, E[R_D(w̃)], is intractable, use Monte Carlo sampling from the LSTM, p(w | I, C). - p(w | I, C) is a discrete distribution. - To avoid differentiating R_D(w̃) w.r.t. W, use the REINFORCE property. - The final gradient to update the weights W combines both terms, where log p(w̃) = log likelihood of the sampled description and L_R = log likelihood of the ground truth description.
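
The REINFORCE (score-function) identity referenced above, written out for the expected-reward term; with a single Monte Carlo sample w̃, the expectation is replaced by one draw (reconstruction; notation follows the slides):

```latex
% Score-function trick: R_D is treated as a constant, so only
% log p(w~ | I, C) is differentiated with respect to the weights W.
\nabla_W \, \mathbb{E}_{\tilde{w} \sim p(w \mid I, C)}\!\left[ R_D(\tilde{w}) \right]
  = \mathbb{E}_{\tilde{w}}\!\left[ R_D(\tilde{w}) \, \nabla_W \log p(\tilde{w} \mid I, C) \right]
  \approx R_D(\tilde{w}) \, \nabla_W \log p(\tilde{w} \mid I, C)
```

Under the minimization convention used earlier, the final gradient for W is then the gradient of L_R minus λ times R_D(w̃) times the gradient of log p(w̃ | I, C).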

  13. Dataset. Caltech-UCSD Birds: 200 classes of North American bird species | 11,788 images | 5 captions per image. - Every image belongs to a class, so each sentence and image is associated with a single label. - The captions give descriptive details about each bird class. - They do not explain why an image belongs to a certain class.

  14. Experiments. Baseline and ablation models: - Description model: generates sentences conditioned only on images (equivalent to LRCN). - Definition model: generates sentences using only the image label as input. - Explanation-label: not trained with the discriminative loss. - Explanation-discriminative: not conditioned on the predicted class. Metrics: - Image relevance: METEOR, CIDEr. - Class relevance: class similarity score, class rank.

  15. Results. Small gains in the automatic evaluation metrics for image relevance, but huge gains in the class-relevance metrics.

  16. Results: Comparison of Explanations, Baselines, and Ablations. - Green: correct, Yellow: mostly correct, Red: incorrect. - 'Red eye' is a class-relevant attribute.

  17. Results: Comparison of Explanations and Definitions. - The definition model can produce sentences that are not image relevant.

  18. Results: Comparison of Explanations and Descriptions. - Both models generate visually correct sentences. - 'Black head' is one of the most prominent distinguishing properties of this vireo type.

  19. Critique – The Good ● Motivation: ○ Novel motivation of making models more explainable to non-experts. ● Explanation model: ○ Novel loss function that includes a global sentence property. ○ The loss function also has wide, generic applicability. ● Ablation study: ○ Performed an ablation study of all the important model components and gives reasoning behind the model design decisions.

  20. Critique – The not so good ● Motivation: ○ What if the underlying feature in the network was not identifying the red eye, but instead identifying that there is a bird flying over water? There is no way you would know. ● Dataset: ○ Every image belongs to a class, so each sentence and image is associated with a single label. ● Explanation model: ○ Could the variance of the REINFORCE gradient estimate be reduced by including a baseline? ○ Could other reward functions, based on class similarity or class rank, be used? ○ Could attention layers be used to combine text and image features? ● Missing details: ○ Why didn't the accuracy of the LSTM sentence classifier matter? ● Evaluation methodology: ○ No comparison with other SOTA image captioning models. ● Human evaluation improvements: ○ Ask evaluators to include a reason for why a given sentence was ranked higher.

  21. References - Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell, Generating Visual Explanations, European Conference on Computer Vision (ECCV), 2016

  22. Additional Examples
