Toward Artificial Synesthesia: Linking Images and Sounds via Words
Han Xiao and Thomas Stibor
Fakultät für Informatik, Technische Universität München
{xiao,stibor}@in.tum.de
December 10, 2010
Synesthesia
Synesthesia: a perceptual experience in which a stimulus in one modality gives rise to an experience in a different sensory modality. Examples:
• A picture of a golden beach might stimulate hearing, evoking the sound of waves crashing against the shore.
• The sound of a baaing sheep might evoke the image of a green hillside.
Images and sounds represent distinct modalities; however, both modalities can capture the same underlying concept.
Explicit/Implicit Linking between Images and Sounds
• Explicit: images and sounds are directly associated (without intermediate links).
• Implicit: images and sounds are not directly associated; they are linked together by another intermediate, but obscure, modality.
[Figure: implicit linking example — J.S. BACH (COMPOSER) and VIOLIN (INSTRUMENT) connected via the intermediate concepts STRING and VIOLINIST]
Natural language is based on visual and auditory stimuli ⇒ link images and sounds with text.
Related Work
Domain: linking an image with associated text (e.g. image annotation, multimedia information retrieval, object recognition).
• Probability of associating words with image grids [Hironobu et al., 1999].
• Predicting words from images [Barnard et al., 2003].
• Modeling the generative process of image regions and words in the same latent space [Blei et al., 2003].
• Jointly modeling image, class label and annotations (supervised topic model) [Wang et al., 2009].
Consider images and text as two different languages: linking images and words can then be viewed as translating from a visual vocabulary to a textual vocabulary.
Inspiration: probabilistic models for text/image analysis (LDA, Corr-LDA).
Representation: bag-of-words model of images and text.
Input Representation and Preprocessing
Build a visual vocabulary and an auditory vocabulary for representing images and sounds as bags of words.
Image representation:
• Divide each image into patches and compute a SIFT descriptor (128 dim.) for each patch.
• Quantize all SIFT descriptors in the collection using k-means; the centroids of the learned clusters compose the visual vocabulary.
Sound representation:
• Cut each sound snippet into frames (sequences of 1024 audio samples).
• For each frame, compute Mel-Frequency Cepstral Coefficients (MFCCs).
• Each sound snippet is thus represented as a set of 25-dimensional feature vectors.
• Cluster all feature vectors in the collection using k-means to obtain the auditory words.
A sketch of the visual-vocabulary step is shown below.
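A minimal sketch of the visual-vocabulary pipeline, assuming OpenCV and scikit-learn are available. The slides describe dense image patches; for brevity this sketch uses SIFT keypoint detection instead, and the function names are our own:

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def build_visual_vocab(images, vocab_size=500):
    """Cluster 128-dim SIFT descriptors from all images; the cluster
    centroids form the visual vocabulary."""
    sift = cv2.SIFT_create()
    descs = []
    for img in images:  # grayscale numpy arrays
        _, d = sift.detectAndCompute(img, None)
        if d is not None:
            descs.append(d)
    return KMeans(n_clusters=vocab_size, n_init=4).fit(np.vstack(descs))

def image_to_bow(km, img):
    """Represent one image as a bag-of-visual-words histogram."""
    sift = cv2.SIFT_create()
    _, d = sift.detectAndCompute(img, None)
    words = km.predict(d)  # nearest centroid index per descriptor
    return np.bincount(words, minlength=km.n_clusters)
```

The auditory vocabulary would be built the same way, with MFCC frame vectors in place of SIFT descriptors.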
Notations
An annotated image $\mathcal{I}$ consists of $M$ visual words and $N$ textual words (annotations),
$$\mathcal{I} = \{\underbrace{v_1, \ldots, v_M}_{\text{visual words}};\ \underbrace{w_1, \ldots, w_N}_{\text{annotations}}\}.$$
A captioned sound snippet $\mathcal{S}$ consists of $M$ auditory words and $N$ textual words,
$$\mathcal{S} = \{\underbrace{u_1, \ldots, u_M}_{\text{auditory words}};\ \underbrace{w_1, \ldots, w_N}_{\text{sound tags}}\}.$$
The training collection is $\mathcal{T} = \{\mathcal{I}_1, \ldots, \mathcal{I}_K;\ \mathcal{S}_1, \ldots, \mathcal{S}_L\}$, with $K$ annotated images and $L$ tagged sounds.
Denote by $W_i$ the vocabulary of image annotations and by $W_s$ the vocabulary of sound tags. The complete textual vocabulary is $W = W_i \cup W_s$.
Linking Images and Sounds via Text
Image composition: given an un-annotated image $I^* \notin \mathcal{T}$, estimate the conditional probability $p(S \mid I^*)$ for every sound snippet $S \in \mathcal{T}$.
Sound illustration: given an un-tagged sound $S^* \notin \mathcal{T}$, estimate the conditional probability $p(I \mid S^*)$ for every image $I \in \mathcal{T}$.
Problem: we cannot estimate $p(S \mid I^*)$ and $p(I \mid S^*)$ directly, as no explicit correspondences exist.
Idea: "translate" the image into natural language text, then "translate" the text into sound, that is
$$p(S \mid I^*) \approx \sum_{w' \in W_s} \sum_{w \in W_i} p(S \mid w')\, p(w' \mid w)\, p(w \mid I^*),$$
$$p(I \mid S^*) \approx \sum_{w' \in W_i} \sum_{w \in W_s} p(I \mid w')\, p(w' \mid w)\, p(w \mid S^*).$$
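Once the three component distributions are available as matrices, the double sum for image composition reduces to two matrix-vector products. A sketch under that assumption (all matrix names are our own, not from the slides):

```python
import numpy as np

def score_sounds(P_S_given_wp, P_wp_given_w, p_w_given_Istar):
    """p(S | I*) ~= sum_{w' in W_s} sum_{w in W_i} p(S|w') p(w'|w) p(w|I*).

    P_S_given_wp:   (num_sounds, |W_s|) matrix of p(S | w')
    P_wp_given_w:   (|W_s|, |W_i|) word-relatedness matrix p(w' | w)
    p_w_given_Istar: (|W_i|,) vector p(w | I*) for the unseen image
    """
    p_wp = P_wp_given_w @ p_w_given_Istar   # marginalize over w in W_i
    scores = P_S_given_wp @ p_wp            # marginalize over w' in W_s
    return scores / scores.sum()            # normalize over candidate sounds
```

Sound illustration is symmetric: swap the roles of $W_i$ and $W_s$ and score training images instead of sounds.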
Modeling Images/Text and Sounds/Text with Corr-LDA
Generative process of an annotated image $\mathcal{I} = \{v_1, \ldots, v_M;\ w_1, \ldots, w_N\}$:
1. Draw topic proportions $\theta \sim \text{Dirichlet}(\alpha)$.
2. For each visual word $v_m$, $m \in \{1, \ldots, M\}$:
   a. Draw topic assignment $z_m \mid \theta \sim \text{Multinomial}(\theta)$.
   b. Draw visual word $v_m \mid z_m \sim \text{Multinomial}(\pi_{z_m})$.¹
3. For each textual word $w_n$, $n \in \{1, \ldots, N\}$:
   a. Draw discrete indexing variable $y_n \sim \text{Uniform}(1, \ldots, M)$.
   b. Draw textual word $w_n \sim \text{Multinomial}(\beta_{z_{y_n}})$.
[Graphical model: Corr-LDA plate diagram with variables α, θ, z, v, π, y, w, β and plates M, D, K]
Exchanging the visual words $v_m$ for auditory words $u_m$ yields the generative process for modeling sounds and text.
¹ The original Corr-LDA uses a multivariate Gaussian.
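The generative process above translates almost line by line into sampling code. A sketch with NumPy, where the parameter shapes are our assumption (K topics, discrete vocabularies as in the slides' multinomial variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_annotated_image(alpha, pi, beta, M, N):
    """Sample one annotated image from the Corr-LDA generative process.

    alpha: (K,) Dirichlet prior
    pi:    (K, V_vis) per-topic distributions over visual words
    beta:  (K, V_txt) per-topic distributions over textual words
    """
    theta = rng.dirichlet(alpha)                    # 1. topic proportions
    z = rng.choice(len(alpha), size=M, p=theta)     # 2a. topic per visual word
    v = [rng.choice(pi.shape[1], p=pi[zm]) for zm in z]  # 2b. visual words
    y = rng.integers(0, M, size=N)                  # 3a. uniform index into 1..M (0-based)
    w = [rng.choice(beta.shape[1], p=beta[z[yn]]) for yn in y]  # 3b. caption words
    return v, w
```

Each caption word's topic is tied via $y_n$ to the topic of a concrete visual word, which is what makes the image-to-text correspondence explicit.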
Modeling Images/Text and Sounds/Text with Corr-LDA (cont.)
The trained model gives the distributions of interest, $p(I \mid w)$ and $p(w \mid I^*)$, where $I \in \mathcal{T}$ and $I^* \notin \mathcal{T}$. Specifically, the distribution over words conditioned on an unseen image is approximated by
$$p(w \mid I^*) \approx \sum_{m=1}^{M} \sum_{z_m} p(z_m \mid \theta)\, p(w \mid z_m, \beta).$$
Using Bayes' rule for $p(I \mid w)$ gives
$$p(I \mid w) = \frac{p(w \mid I)\, p(I)}{\sum_{I' \in \mathcal{T}} p(w \mid I')\, p(I')},$$
where
$$p(I) = p(\theta \mid \alpha) \prod_{m=1}^{M} p(z_m \mid \theta)\, p(v_m \mid z_m, \pi) \prod_{n=1}^{N} p(y_n \mid M)\, p(w_n \mid z_{y_n}, \beta).$$
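The Bayes inversion is a one-liner once $p(w \mid I)$ and $p(I)$ have been computed for every training image. A sketch, assuming both are precomputed into arrays (our own naming):

```python
import numpy as np

def p_image_given_word(P_w_given_I, p_I, w_idx):
    """p(I | w) by Bayes' rule over the training collection.

    P_w_given_I: (num_images, |W|) per-image word distributions p(w | I)
    p_I:         (num_images,) prior p(I), e.g. uniform over training images
    w_idx:       index of the query word w
    """
    joint = P_w_given_I[:, w_idx] * p_I
    return joint / joint.sum()  # normalize over all I' in the collection
```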
Modeling Text
Recall:
$$p(S \mid I^*) \approx \sum_{w' \in W_s} \sum_{w \in W_i} p(S \mid w')\, p(w' \mid w)\, p(w \mid I^*),$$
$$p(I \mid S^*) \approx \sum_{w' \in W_i} \sum_{w \in W_s} p(I \mid w')\, p(w' \mid w)\, p(w \mid S^*).$$
Remaining problem: estimate $p(w' \mid w)$, the semantic relatedness between two words.
Approach: an LDA model on the data set $\mathcal{D}$ containing only the captions of all images and sounds.
Generative process of a document (caption) $D \in \mathcal{D}$:
1. Draw topic proportions $\theta \sim \text{Dirichlet}(\alpha)$.
2. For each textual word $w_n$, $n \in \{1, \ldots, N\}$:
   a. Draw topic assignment $z_n \mid \theta \sim \text{Multinomial}(\theta)$.
   b. Draw textual word $w_n \mid z_n \sim \text{Multinomial}(\beta_{z_n})$.
Two sets of parameters are estimated: $\Theta_D = p(z \mid D)$ (mixing proportions over topics) and $\beta = p(w \mid z)$ (word distributions over topics).
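The slides do not specify an implementation; as one possibility, both parameter sets can be estimated with an off-the-shelf LDA library such as gensim. A minimal sketch with a toy stand-in corpus of tokenized captions:

```python
from gensim import corpora, models

# toy stand-in for the caption-only data set D
captions = [["beach", "waves", "sand"], ["sheep", "hillside", "grass"]]

dictionary = corpora.Dictionary(captions)
bows = [dictionary.doc2bow(doc) for doc in captions]
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, alpha="auto")

beta = lda.get_topics()                       # (num_topics, |W|): p(w | z)
theta_D = lda.get_document_topics(bows[0])    # p(z | D) for one caption
```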
Modeling Text (cont.)
Given the trained LDA model, the word relatedness between $w$ and $w'$ is
$$p_{\text{LDA}}(w \mid w') = \frac{1}{C} \sum_{z_n} p(w \mid z_n)\, \frac{n_{w'}}{n_{z_n}}\, p(w' \mid z_n),$$
where $n_{w'}$ is the number of occurrences of $w'$ in $\mathcal{D}$, $n_{z_n}$ is the number of words assigned to topic $z_n$, and $C$ is a normalization factor.
The relatedness is calculated on a small data set (problematic), hence $p(w \mid w')$ is smoothed using the WordNet dictionary.
Exemplary outputs of the word relatedness from LDA and WordNet:
[Figure: bar charts of $p_{\text{WordNet}}(w \mid \text{rain})$ and $p_{\text{LDA}}(w \mid \text{rain})$]
$$p(w \mid w') = \sigma\, p_{\text{LDA}}(w \mid w') + (1 - \sigma)\, p_{\text{WordNet}}(w \mid w'),$$
where $\sigma$ is the smoothing parameter.
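A sketch of both steps, assuming the relatedness formula as reconstructed above (the exact arrangement of the counts is hard to recover from the slide, so treat this as an approximation) and precomputed topic-word distributions and counts:

```python
import numpy as np

def relatedness_lda(beta, n_w, n_z, w_idx, wp_idx):
    """Unnormalized p_LDA(w | w'): sum_z p(w|z) (n_{w'}/n_z) p(w'|z).

    beta: (K, |W|) topic-word distributions; n_w: word counts in D;
    n_z: (K,) number of words assigned to each topic.
    Divide by C = sum over all w for a fixed w' to normalize.
    """
    return np.sum(beta[:, w_idx] * (n_w[wp_idx] / n_z) * beta[:, wp_idx])

def relatedness(p_lda, p_wordnet, sigma=0.5):
    """Smoothed relatedness: sigma * p_LDA + (1 - sigma) * p_WordNet."""
    return sigma * p_lda + (1.0 - sigma) * p_wordnet
```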
Putting Everything in a Probabilistic Framework #������� #������ ����� %���� $������ ����� $������ %���� ����� ������� ������� ������� ������� ������� ������� ������� ���������� ����� �������� ��������� ���� ��������� ���� ����� ���� � ��� ����� ���� � ��� �� �� ������ ���� �������� ���� ������� ������� ������� ������� ������� ������� &������������ ����� ����� ���� ���� ���� �� ������������� ������� !���"�� ������� ����� ����� &�������� ����� &�������� ����� H.Xiao and T.Stibor (TUM) Artificial Synesthesia December 10, 2010 12 / 21