Adversarial Reward Learning for Visual Storytelling
Xin Wang, Wenhu Chen, Yuan-Fang Wang, William Yang Wang
Maria Fabiano
Outline
1. Motivation
2. AREL Model Overview
3. Policy Model
4. Reward Model
5. AREL Objective
6. Data
7. Training and Testing
8. Evaluation
9. Critique
Motivation
The authors explore how well a computer can create a story from a set of images. Prior to this paper, little research had been done in visual storytelling. Visual storytelling requires a deeper understanding of images than image captioning: the model must understand more complicated visual scenarios, relate sequential images, and associate implicit concepts in the images (e.g., emotions).
Motivation
Problems with previous storytelling approaches
● RL
  ○ Hand-crafted rewards (e.g., METEOR) are too biased or too sparse to drive the policy search
  ○ Fail to learn implicit semantics (coherence, expressiveness, etc.)
  ○ Require extensive feature and reward engineering
● GANs
  ○ Prone to unstable or vanishing gradients
● IRL
  ○ Maximum-margin approaches, probabilistic approaches
AREL Model Overview
AREL: Adversarial REward Learning
● Policy model: produces the story sequence from an image sequence
● Reward model: learns an implicit reward function from human-annotated stories and sampled predictions
● The two models are trained alternately via SGD
Policy Model
Takes an image sequence and sequentially chooses words from the vocabulary to create a story (see the sketch below).
● Images go through a pre-trained CNN
● Encoder (bidirectional GRUs) extracts high-level features of the images
● Five decoders (single-layer GRUs with shared weights) create five substories
● The substories are concatenated into the full story
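A minimal PyTorch-style sketch of the policy model as this slide describes it; module names, dimensions, greedy decoding, and the <bos> index are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    """Sketch: CNN features -> bi-GRU encoder -> five weight-sharing GRU decoders."""
    def __init__(self, vocab_size, feat_dim=2048, hidden=512, embed=512):
        super().__init__()
        # Image features are assumed to come from a pre-trained CNN (e.g., ResNet).
        self.encoder = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, embed)
        # One decoder cell, reused for all five substories (shared weights).
        self.decoder = nn.GRUCell(embed + 2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image_feats, max_len=30):
        # image_feats: (batch, 5, feat_dim) -- one feature vector per photo
        ctx, _ = self.encoder(image_feats)               # (batch, 5, 2*hidden)
        batch = image_feats.size(0)
        stories = []
        for i in range(5):                               # five substories
            h = ctx.new_zeros(batch, self.decoder.hidden_size)
            word = image_feats.new_zeros(batch, dtype=torch.long)  # assumed <bos> = 0
            substory = []
            for _ in range(max_len):
                inp = torch.cat([self.word_embed(word), ctx[:, i]], dim=-1)
                h = self.decoder(inp, h)
                word = self.out(h).argmax(dim=-1)        # greedy here; sampling/beam search in practice
                substory.append(word)
            stories.append(torch.stack(substory, dim=1))
        # Concatenate the five substories into the full story
        return torch.cat(stories, dim=1)
```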
(Partial) Reward Model
Aims to derive a human-like reward from human-annotated stories and sampled predictions (a rough sketch follows).
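A rough sketch of one plausible shape for the reward model: word embeddings run through a 1-D CNN, pooled, fused with the image feature, and projected to a scalar reward per substory. The pooling choice, layer sizes, and fusion are assumptions (the pooling type is left unspecified, as noted on the critique slide).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch: score a substory against its image with a scalar reward."""
    def __init__(self, vocab_size, feat_dim=2048, embed=300, channels=128):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed)
        # 1-D convolution over the word sequence of a substory.
        self.conv = nn.Conv1d(embed, channels, kernel_size=3, padding=1)
        self.img_proj = nn.Linear(feat_dim, channels)
        self.score = nn.Linear(channels, 1)

    def forward(self, words, image_feat):
        # words: (batch, seq_len); image_feat: (batch, feat_dim)
        e = self.word_embed(words).transpose(1, 2)     # (batch, embed, seq_len)
        h = torch.relu(self.conv(e))
        h = h.max(dim=2).values                        # pooling type is an assumption
        r = self.score(h + self.img_proj(image_feat))  # fuse text and image
        return r.squeeze(-1)                           # scalar reward per substory
```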
Adversarial Reward Learning: Reward Boltzmann Distribution
● W = story
● R_θ = reward function
● Z_θ = partition function (a normalizing constant)
● p_θ = approximate data distribution
We achieve the optimal reward function R* when the Reward-Boltzmann distribution p_θ equals the actual data distribution p*.
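Written out from the definitions above, the Reward-Boltzmann distribution takes the standard Boltzmann form (a reconstruction from the slide's own definitions, not a quote from the paper):

```latex
p_\theta(W) = \frac{\exp\big(R_\theta(W)\big)}{Z_\theta},
\qquad
Z_\theta = \sum_{W} \exp\big(R_\theta(W)\big)
```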
Adversarial Reward Learning
We want the Reward-Boltzmann distribution p_θ to get close to the actual data distribution p*.
● Adversarial objective: a min-max two-player game. Maximize the similarity of p_θ with the empirical distribution p_e while minimizing the similarity of p_θ with the data generated by the policy π_β. Meanwhile, π_β wants to maximize its similarity with p_θ.
● Distribution similarity is measured using KL-divergence.
● The objective of the reward model is to distinguish between human-annotated stories and machine-generated stories.
  ○ Minimize KL-divergence with p_e and maximize KL-divergence with π_β
● The objective of the policy is to create stories indistinguishable from human-written stories.
  ○ Minimize KL-divergence with p_θ
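One way to write the min-max game this slide describes, reconstructed from its wording rather than quoted from the paper: the reward model picks θ to pull p_θ toward the empirical distribution p_e and push it away from the policy distribution p_{π_β}, while the policy picks β to pull p_{π_β} toward p_θ.

```latex
\max_{\beta}\;\min_{\theta}\;
\Big[\,\mathrm{KL}\!\left(p_e \,\|\, p_\theta\right)
      - \mathrm{KL}\!\left(p_{\pi_\beta} \,\|\, p_\theta\right)\Big]
```

Minimizing the bracket over θ drives p_θ toward p_e and away from p_{π_β}; maximizing it over β (which only affects the second term) drives p_{π_β} toward p_θ.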
Data
● VIST dataset of Flickr photos aligned to stories
● One sample is a story for five images from a photo album
● The same album is paired with five different stories as references
● Vocabulary of 9,837 words (words have to appear more than three times in the training set)
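A small illustrative sketch of the vocabulary rule stated above (keep words that appear more than three times in the training set); the special tokens and tokenization are assumptions.

```python
from collections import Counter

def build_vocab(training_stories, min_count=4):
    """Keep words that appear more than three times (i.e., count >= 4)."""
    counts = Counter(w for story in training_stories for w in story.split())
    # Special tokens are an assumption; the slide only gives the frequency rule.
    vocab = ["<pad>", "<bos>", "<eos>", "<unk>"]
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}
```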
Training and Testing
1. Create a baseline model, XE-ss (cross-entropy loss with scheduled sampling), with the same architecture as the policy model
   a. Scheduled sampling uses a sampling probability to decide whether the next decoder input is the ground-truth word or the model's own prediction (sketch below)
2. Use XE-ss to initialize the policy model
3. Train with the AREL framework
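To make the scheduled-sampling step concrete, a hedged sketch of the per-step decision; the schedule for growing the sampling probability is an assumption.

```python
import random

def next_input_token(gold_token, model_token, sampling_prob):
    """Scheduled sampling: with probability `sampling_prob`, feed the model's own
    prediction back in as the next input; otherwise feed the ground-truth token."""
    return model_token if random.random() < sampling_prob else gold_token

# The sampling probability is typically increased over training,
# starting near 0 (mostly teacher forcing) and growing toward some cap.
```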
Training and Testing
● Objective of the policy model: maximize similarity with p_θ
● Objective of the reward model: distinguish between human-generated and machine-generated stories
● Alternate between training the policy and the reward using SGD (see the schematic below)
  ○ N = 50 or 100
● For testing, the policy uses beam search to create the story
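A schematic of the alternating training described above. It is purely illustrative: `policy.sample`, `policy.log_prob`, and the loss expressions are placeholder interfaces standing in for the actual policy-gradient and reward objectives, and N = 50 or 100 is taken from the slide.

```python
def train_arel(policy, reward, data_loader, policy_opt, reward_opt, iters, N=50):
    """Alternate between policy and reward updates every N mini-batches."""
    train_policy = True
    for step, (images, human_story) in enumerate(data_loader):
        if step > 0 and step % N == 0:
            train_policy = not train_policy              # switch roles every N steps
        sampled_story = policy.sample(images)            # machine-generated story
        if train_policy:
            # Policy step: raise the likelihood of stories the reward model favors.
            loss = -(reward(sampled_story, images)
                     * policy.log_prob(sampled_story, images)).mean()
            policy_opt.zero_grad(); loss.backward(); policy_opt.step()
        else:
            # Reward step: score human stories above machine-generated ones.
            loss = (reward(sampled_story, images) - reward(human_story, images)).mean()
            reward_opt.zero_grad(); loss.backward(); reward_opt.step()
        if step >= iters:
            break

# At test time, the policy decodes with beam search instead of sampling.
```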
Automatic Evaluation
● AREL achieves SOTA on all metrics except ROUGE; however, these gains are very small and very close to those of the baseline model and a vanilla GAN.
[Table: per-metric gain of AREL and the range covered by recent methods]
Human Evaluation
AREL greatly outperforms all other models in human evaluations:
● Turing test
● Relevance
● Expressiveness
● Concreteness
[Figure: comparison of Turing test results]
Critique: The “Good”
● AREL: a novel framework of adversarial reward learning for storytelling
● SOTA on the VIST dataset under automatic metrics
● Empirically shows that automatic metrics are not great for training or evaluation
● Comprehensive human evaluation via Mechanical Turk
  ○ Better results on relevance, expressiveness, and concreteness
  ○ Clear description of how the human evaluation was conducted
Critique: The “Not so Good”
● Motivation: interesting problem to solve, but what are the practical applications?
  ○ Limited to five photos per story
● XE-ss: not mentioned until the evaluation section, yet it initializes AREL
● Partial rewards: more discussion and motivation needed for this approach
● Missing details
  ○ Type of pooling in the reward model is not specified (average? max?)
  ○ Is the pre-trained ResNet fine-tuned?
● Data bias (gender and event): the model amplifies the largest majority’s influence
● Small gain on automatic evaluation metrics, and XE-ss performs similarly to AREL; no direct human-evaluation comparison between AREL and previous methods
● Human evaluation improvements
  ○ Ask evaluators why they judged a sentence to be machine-generated or not
  ○ Use rankings instead of pairwise comparisons
● Decoder shared weights: maybe there is something specific about an image’s position that requires different weights (e.g., the structure of a narrative: setting, problem, rising action, climax, falling action, resolution)