Slide Credits: Agrawal
Kolmogorov-Smirnov test, captions vs. questions+answers: p < 0.001
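A minimal sketch of running such a two-sample Kolmogorov-Smirnov test with scipy.stats.ks_2samp. The toy lists below are stand-ins; the slide does not say exactly which statistic (e.g. word frequencies or sentence lengths) was compared, so treat the sample contents as an assumption.

```python
from scipy.stats import ks_2samp

# Toy stand-ins: in practice these would be some per-word or per-sentence
# statistic computed from captions vs. from question+answer text.
caption_lengths = [10, 11, 12, 10, 13, 11, 12, 14, 10, 11]
qa_lengths = [4, 6, 5, 7, 5, 6, 4, 5, 6, 7]

statistic, p_value = ks_2samp(caption_lengths, qa_lengths)
print(f"KS statistic = {statistic:.3f}, p = {p_value:.3g}")
# A p-value < 0.001, as on the slide, indicates the two distributions differ significantly.
```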
Model details:
- LSTM: one hidden layer; output size 1024
- MLP: 2-hidden-layer fully connected network, 1000 units per layer, dropout(0.5), tanh
- Word embedding size: 300
- Trained end-to-end with cross-entropy loss
- Deeper LSTM: two hidden layers; output 2048 -> fc + tanh -> 1024
- Input vocabulary: all question words
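A minimal PyTorch sketch of the deeper-LSTM question channel described above. The slide gives the 300-dim word embeddings, two LSTM layers, and the 2048 -> fc+tanh -> 1024 reduction; the 512 hidden units per layer (so that concatenating the final hidden and cell states of both layers yields 2048 dims) is an assumption, and all class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Deeper LSTM question channel: 300-d word embeddings -> 2-layer LSTM,
    concatenate final hidden and cell states (2 layers x 2 x 512 = 2048-d),
    then fc + tanh down to a 1024-d question embedding."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, out_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc = nn.Linear(4 * hidden_dim, out_dim)  # h and c from both layers

    def forward(self, question_tokens):
        # question_tokens: (batch, seq_len) integer word indices over the question vocabulary
        embedded = torch.tanh(self.embed(question_tokens))
        _, (h, c) = self.lstm(embedded)            # h, c: (2, batch, hidden_dim)
        state = torch.cat([h, c], dim=0)           # (4, batch, hidden_dim)
        state = state.transpose(0, 1).reshape(question_tokens.size(0), -1)  # (batch, 2048)
        return torch.tanh(self.fc(state))          # (batch, 1024)
```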
2-Channel VQA Model (diagram): the image passes through a CNN (convolution + pooling layers, fully connected) to a 4096-dim image embedding; the question (e.g. "How many horses are in this image?") is encoded into a 1024-dim question embedding; the two channels are combined via a fully connected MLP + non-linearity and fed to a softmax over the top-K answers.
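A minimal PyTorch sketch of the 2-channel classifier, combining the hyperparameters listed above. The slide does not say how the two embeddings are fused; this sketch projects both to a common 1024-dim space and multiplies them element-wise, which is one common choice and should be read as an assumption.

```python
import torch
import torch.nn as nn

class TwoChannelVQA(nn.Module):
    """Two-channel VQA classifier: a 4096-d image feature (CNN fc output) and a
    1024-d question embedding are mapped into a common space, fused, and passed
    through a 2-hidden-layer MLP producing logits over the top-K answers."""
    def __init__(self, img_dim=4096, q_dim=1024, common_dim=1024,
                 mlp_hidden=1000, num_answers=1000, dropout=0.5):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.q_proj = nn.Linear(q_dim, common_dim)
        self.classifier = nn.Sequential(
            nn.Linear(common_dim, mlp_hidden), nn.Tanh(), nn.Dropout(dropout),
            nn.Linear(mlp_hidden, mlp_hidden), nn.Tanh(), nn.Dropout(dropout),
            nn.Linear(mlp_hidden, num_answers),   # logits; softmax applied inside the loss
        )

    def forward(self, img_feat, q_feat):
        # Fuse the two channels (element-wise multiplication is an assumed choice here).
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(fused)  # trained end-to-end with cross-entropy
```

The two ablations on the next slides correspond to feeding only one of the channels: language-alone keeps the question embedding and drops the image feature, vision-alone keeps only the 4096-dim image feature.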
Ablation #1: Language-alone (diagram): the image channel is removed; only the 1024-dim question embedding feeds the MLP and the softmax over the top-K (1k) answers.
Ablation #2: Vision-alone (diagram): the question channel is removed; only the 4096-dim image embedding feeds the MLP and the softmax over the top-K answers.
Current Leaderboard
Questions & Discussion & Demo