Visually Grounded Meaning Representation Qi Huang Ryan Rock
Outline 1. Motivation 2. Visually Grounded Autoencoders 3. Constructing Visual Attributes 4. Constructing Textual Attributes 5. Experiment: Similarity 6. Experiment: Categorization 7. Takeaway & Critiques
Motivation ● Word embeddings are “disembodied”: they represent word meaning only as the statistical pattern of its surrounding context, without grounding in real-world referents, which are usually accessible in modalities other than text ● Problem: the distribution of word representations overfits (and is fundamentally limited by) the training corpus’ statistical patterns; the learned representations cannot generalize well
Motivation Question: Can we use information from other modalities as input when building word representations?
Motivation ● Solution: Cognitive Science research shows that semantic attributes can represent multisensory information about a word that is not present in the surrounding text. ○ Example: apples are “green”, “red”, “round”, “shiny”. ● Model: a stacked-autoencoder-based model that learns high-level meaning representations by mapping words and images (both represented as attributes) into a common hidden space ● Verification: experiments on word similarity and concept categorization
Autoencoders An autoencoder is an unsupervised feed-forward neural network trained to reconstruct a given input from its hidden representation. Encoder: map input vector x to a hidden representation h. Decoder: reconstruct x (producing output y) from the hidden representation h. Minimize the reconstruction loss between x and y.
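The slide's formulas did not survive extraction; the following is the standard autoencoder formulation consistent with the description above (the symbols W, b, W′, b′, the nonlinearity g, and the squared-error loss are the usual conventions and are assumed here, not copied from the slide):

```latex
% Encoder: nonlinear map from input x to hidden code h
h = g(W x + b)
% Decoder: reconstruction y of the input from h
y = g(W' h + b')
% Reconstruction loss (squared error; cross-entropy is also common)
L(x, y) = \lVert x - y \rVert^{2}
```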
Denoising in autoencoders Denoising: reconstruct the clean input given a corrupted input. For example, randomly mask out some elements of the input. Effect: the model learns to activate knowledge about a concept when exposed to partial information
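A minimal sketch of the masking corruption described above (NumPy; the mask rate of 0.2 and the toy dimensions are illustrative assumptions, not values from the paper):

```python
import numpy as np

def mask_corrupt(x, mask_rate=0.2, rng=None):
    """Randomly zero out a fraction of the input elements (denoising corruption)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= mask_rate   # keep roughly 80% of the elements
    return x * mask

# Training pair for a denoising autoencoder: corrupted input, clean target
x_clean = np.random.rand(10)
x_noisy = mask_corrupt(x_clean)
# the autoencoder is trained to map x_noisy -> x_clean
```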
Stacked autoencoders ● A stacked autoencoder is essentially a collection of autoencoders “stacked” on top of each other ● To initialize the weights of each layer, train the autoencoders one at a time, feeding one autoencoder’s hidden encoding as the next autoencoder’s input ● Fine-tune the model end-to-end afterwards with an unsupervised training criterion (global reconstruction) or a supervised criterion
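A minimal sketch of greedy layer-wise pretraining as described above (PyTorch; layer sizes, epoch count, and the sigmoid/MSE choices are illustrative assumptions):

```python
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, hid_dim, epochs=10, lr=1e-3):
    """Train one autoencoder layer; return its encoder and the encoded data."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        h = torch.sigmoid(enc(data))
        recon = torch.sigmoid(dec(h))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return enc, torch.sigmoid(enc(data)).detach()

# Greedy layer-wise pretraining: each layer is trained on the previous layer's codes
x = torch.rand(64, 100)             # toy data; dimensions are arbitrary assumptions
enc1, h1 = pretrain_layer(x, 100, 50)
enc2, h2 = pretrain_layer(h1, 50, 25)
# enc1, enc2 now initialize a 2-layer stacked autoencoder for end-to-end fine-tuning
```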
Visually Grounded Autoencoders Train text autoencoder and image autoencoder with two hidden layers separately
Visually Grounded Autoencoders Feed their respective encoding as input to obtain the bimodal encoding
Visually Grounded Autoencoders Fine-tune the whole model with the global reconstruction loss, plus label prediction as a supervised signal. The bimodal encoding is used as the final word representation
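A rough sketch of the bimodal architecture described in the last three slides (PyTorch; only one hidden layer per modality is shown for brevity, and the layer sizes, attribute dimensions, classification head, and loss weighting are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BimodalSAE(nn.Module):
    """Textual and visual attribute vectors -> shared bimodal code -> reconstructions + label."""
    def __init__(self, text_dim, img_dim, uni_dim=100, bi_dim=100, n_classes=10):
        super().__init__()
        self.text_enc = nn.Sequential(nn.Linear(text_dim, uni_dim), nn.Sigmoid())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, uni_dim), nn.Sigmoid())
        self.bi_enc = nn.Sequential(nn.Linear(2 * uni_dim, bi_dim), nn.Sigmoid())
        self.bi_dec = nn.Sequential(nn.Linear(bi_dim, 2 * uni_dim), nn.Sigmoid())
        self.text_dec = nn.Linear(uni_dim, text_dim)
        self.img_dec = nn.Linear(uni_dim, img_dim)
        self.classifier = nn.Linear(bi_dim, n_classes)   # label prediction used during fine-tuning

    def forward(self, x_text, x_img):
        h = self.bi_enc(torch.cat([self.text_enc(x_text), self.img_enc(x_img)], dim=-1))
        ht, hi = self.bi_dec(h).chunk(2, dim=-1)
        return h, self.text_dec(ht), self.img_dec(hi), self.classifier(h)

# h is the bimodal encoding used as the word representation; fine-tuning minimizes a
# global reconstruction loss on (rec_text, rec_img) plus a classification loss on logits.
model = BimodalSAE(text_dim=300, img_dim=400)            # attribute dimensions are placeholders
h, rec_text, rec_img, logits = model(torch.rand(8, 300), torch.rand(8, 400))
```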
Some training details: ● Weights of each AE are tied: the decoder reuses the transposed encoder weights (W′ = Wᵀ) ● Denoising for the image modality: treat x itself as the “corrupted” input, and take as the “clean” target the centroid of the embeddings of multiple images containing that object
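A small sketch of these two details (PyTorch; variable names, dimensions, and the mean-as-centroid choice are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

W = nn.Parameter(torch.randn(100, 414) * 0.01)   # single shared weight matrix (tied)
b_enc, b_dec = nn.Parameter(torch.zeros(100)), nn.Parameter(torch.zeros(414))

def tied_autoencode(x):
    h = torch.sigmoid(F.linear(x, W, b_enc))          # encoder uses W
    return torch.sigmoid(F.linear(h, W.t(), b_dec))   # decoder uses W transposed (tied)

# Image-modality denoising: each per-image vector is the "corrupted" input,
# the centroid over all images of the concept is the "clean" target.
imgs = torch.rand(37, 414)                 # visual vectors of the images for one concept
clean_target = imgs.mean(dim=0)            # centroid
loss = F.mse_loss(tied_autoencode(imgs), clean_target.expand_as(imgs))
```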
Constructing visual & textual attribute representations In our case, the two modalities are unified in their representation by natural language attributes (in vector form). Goal: allow generalization to new instances for which no training examples are available
Constructing visual attribute representation ● VISA dataset, built from the McRae feature norms and images from ImageNet that depict McRae’s concepts ● McRae feature norms: a collection of concepts with vectorized representations, where each entry corresponds to a property of that concept ● 541 concepts represented by 700K images; each concept has between 5 (“prune”) and 2,149 (“closet”) images
Constructing visual attribute representation The concepts and attributes essentially form a bipartite graph
Constructing visual attribute representation Train an SVM classifier for each attribute: images with that attribute serve as positive examples and the rest as negative examples, operating on images processed by a hand-crafted feature extractor. A concept’s visual attribute representation is the average of the visual vectors of all its images
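A minimal sketch of this per-attribute classifier setup (scikit-learn; the random features, data shapes, and the use of decision scores as vector entries are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# img_feats: (n_images, d) hand-crafted features; attr_labels: (n_images, n_attrs) 0/1
img_feats = np.random.rand(500, 128)
attr_labels = np.random.randint(0, 2, size=(500, 10))

# One binary SVM per attribute
svms = [LinearSVC().fit(img_feats, attr_labels[:, a]) for a in range(attr_labels.shape[1])]

def image_visual_vector(feat):
    """Per-image visual attribute vector: one score per attribute classifier."""
    return np.array([svm.decision_function(feat[None, :])[0] for svm in svms])

# A concept's visual representation: average over the vectors of its images
concept_image_ids = [3, 17, 42]    # hypothetical image indices for one concept
concept_vec = np.mean([image_visual_vector(img_feats[i]) for i in concept_image_ids], axis=0)
```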
Constructing textual attribute representation ● Text attributes obtained from Strudel, “a distributional model akin to other vector-based models except that collocates of a concept are established by relations to other concepts interpreted as properties” ● Vector representation: each element represents the strength of association of a concept–attribute pair ● Attributes are automatically discovered by the Strudel algorithm from the training corpus ● Word embeddings from a continuous skip-gram model are also used for comparison
Visual Attributes & Textual Attributes Comparison
Experiments: Similarity and Categorization
Experiment 1: Similarity ● Create dataset by pairing concrete McRae nouns and assigning semantic and visual similarity scores from 1 to 5
Comparison Models ● Compare against ○ Kernelized canonical correlation analysis (kCCA) ○ Deep canonical correlation analysis (DCCA) ○ SVD projection ○ Bimodal SAE ○ Unimodal SAE
CCA ● Find projections that maximally correlate two random vectors ○ CCA: linear projections ○ kCCA: nonlinear projections ○ DCCA: nonlinear deep projections ● Vectors ○ X1: textual attributes ○ X2: visual attributes
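A minimal sketch of the linear CCA baseline (scikit-learn; the dimensions and the choice of 50 components are arbitrary assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

X1 = np.random.rand(541, 200)   # textual attribute vectors (one row per concept)
X2 = np.random.rand(541, 100)   # visual attribute vectors

cca = CCA(n_components=50)
cca.fit(X1, X2)
T1, T2 = cca.transform(X1, X2)  # maximally correlated projections of the two views
# A bimodal representation can then be formed, e.g. by concatenating T1 and T2.
```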
SVD on tAttrib + vAttrib ● Create a matrix of all objects’ textual + visual attributes (one object per row) ○ Compute the SVD ○ Use the right singular vectors to project the attributes
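A minimal sketch of this SVD projection (NumPy; the attribute dimensions and the target rank k = 100 are arbitrary assumptions):

```python
import numpy as np

tAttrib = np.random.rand(541, 200)    # textual attributes, one concept per row
vAttrib = np.random.rand(541, 100)    # visual attributes

M = np.hstack([tAttrib, vAttrib])     # each row: one object's textual + visual attributes
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 100                               # target rank
proj = M @ Vt[:k].T                   # project objects via the top-k right singular vectors
```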
Bruni’s SVD ● Collect a textual co-occurrence matrix from ukWaC and WaCkypedia ● Collect visual information through SIFT bag-of-visual-words on ESP ● Harvest text-based and image-based semantic vectors for target words ● Concatenate textual and visual information by row ● Take the SVD and project objects to lower-rank approximations
Similarity Metric ● Take the cosine of the angle between vectors ● Calculate the correlation coefficient against human similarity judgments
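A minimal sketch of this evaluation (NumPy/SciPy; Spearman's ρ is assumed as the correlation coefficient, and the word vectors, pairs, and ratings below are hypothetical toy data):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# word_vecs: concept -> representation; pairs and human_scores come from the similarity dataset
word_vecs = {"apple": np.random.rand(100), "pear": np.random.rand(100), "car": np.random.rand(100)}
pairs = [("apple", "pear"), ("apple", "car"), ("pear", "car")]
human_scores = [4.5, 1.2, 1.0]            # hypothetical 1-5 similarity ratings

model_scores = [cosine(word_vecs[a], word_vecs[b]) for a, b in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```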
Similarity Results ● Bimodal SAE (skip-gram, vAttrib) gets the best results ○ Semantic: 0.77 ○ Visual: 0.66
Experiment 2: Categorization ● Unimodal classifiers exist – ResNet ● Can bimodal classifiers perform better? ● Use the same comparison models
Experiment 2: Categorization ● Create a graph ○ node: object ○ edge: semantic or visual similarity weight ● Use Chinese Whispers algorithm
Chinese Whispers ● Gets its name from the ‘Chinese Whispers’ or ‘telephone’ game ○ The original message is iteratively distorted ● Nodes iteratively take on the class with the maximum weight in their neighborhood
Chinese Whispers
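A minimal sketch of the Chinese Whispers update described above (pure Python; the graph representation, iteration count, and toy graph are illustrative assumptions):

```python
import random
from collections import defaultdict

def chinese_whispers(edges, n_iters=20, seed=0):
    """edges: dict mapping node -> {neighbor: similarity weight}. Returns node -> cluster label."""
    rng = random.Random(seed)
    labels = {node: node for node in edges}          # start with one class per node
    nodes = list(edges)
    for _ in range(n_iters):
        rng.shuffle(nodes)                           # visit nodes in random order
        for node in nodes:
            weight_per_class = defaultdict(float)
            for nbr, w in edges[node].items():
                weight_per_class[labels[nbr]] += w
            if weight_per_class:                     # adopt the strongest class in the neighborhood
                labels[node] = max(weight_per_class, key=weight_per_class.get)
    return labels

# Toy similarity graph: two tight groups weakly connected
g = {"apple": {"pear": 0.9, "car": 0.1}, "pear": {"apple": 0.9},
     "car": {"truck": 0.8, "apple": 0.1}, "truck": {"car": 0.8}}
print(chinese_whispers(g))
```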
Categorization Results ● Compare classifications against AMT categories ○ Use F-score – harmonic mean of precision and recall ● Variables ○ s: class ○ n: size ○ h: cluster
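The slide's formula did not survive extraction; below is a standard clustering F-score consistent with the variables above (this exact form is an assumption, not copied from the paper):

```latex
% Precision/recall of cluster h with respect to class s, where n_{sh} is the number of
% items of class s in cluster h, n_h the cluster size, and n_s the class size:
P(s, h) = \frac{n_{sh}}{n_h}, \qquad R(s, h) = \frac{n_{sh}}{n_s}, \qquad
F(s, h) = \frac{2\,P(s,h)\,R(s,h)}{P(s,h) + R(s,h)}
% Overall score: weighted best match over clusters, with n the total number of items
F = \sum_{s} \frac{n_s}{n} \max_{h} F(s, h)
```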
Categorization Results ● Bimodal SAE (skip-gram, vAttrib) performs best ○ F-score of 0.48
Conclusion ● Bimodal SAE performs ‘inductive inference’ ○ Induce representation for missing mode by learning statistical redundancy between modes ○ Predict reasonable, interpretable textual attributes given visual attributes ■ jellyfish: swim, fish, ocean ■ currant: fruit, ripe, cultivate
Critiques ● Evaluation mainly on unimodal or bimodal data ○ No evaluation of inductive inference ● Skip-gram + vAttrib outperforms tAttrib + vAttrib ● Visual attributes are bootstrapped with SVMs ○ This bottlenecks SAE performance on SVM performance
Critiques ● Textual “attributes” and visual attributes do not share the same vocabulary -- intentional or a mistake? Are these really textual attributes, or just co-occurring words? ● Why not extract image features using neural networks instead of a specialized feature extractor? ● Data quality & size -- two authors manually inspected the concept-visual attribute relations, with only 500+ concepts ● If this is only a proof of concept -- can it generalize well to abstract concepts?