
Generating Comments Synchronized with Musical Audio Signals by a Joint Probabilistic Model of Acoustic and Textual Features. Kazuyoshi Yoshii, Masataka Goto. National Institute of Advanced Industrial Science and Technology (AIST).


  1. Generating Comments Synchronized with Musical Audio Signals by a Joint Probabilistic Model of Acoustic and Textual Features
     - Kazuyoshi Yoshii, Masataka Goto
     - National Institute of Advanced Industrial Science and Technology (AIST)
     - System name: MusicCommentator

  2. Background
     - Importance of expressing music in language
     - Language is an understandable common medium for human communication
     - Figure: snapshot from Nico Nico Douga (an influential video-sharing service in Japan), showing free-form tags given to the entire clip and short comments associated with temporal positions within the clip (examples: "Pretty cool!", "I am impressed", "Good arrangement")
     - Users can feel as if they are enjoying the clip together, although they gave their comments at different times in the real world

  3. Emerging Phenomenon in Japan
     - Commenting itself becomes entertainment
     - Commenting is an advanced form of collaboration
     - Users add effects to the video by giving comments
     - Commenting is a casual way of exhibiting creativity
     - Temporal comments strengthen a sense of togetherness
     - Users can feel as if they enjoy the clip all together and collaborate to create something at the same time
     - Called pseudo-synchronized communication
     - Figures: temporal comments and barrage; sophisticated ASCII art

  4. Motivation
     - Facilitate human communication by developing a computer that can express music in language
     - Mediated by human-machine interaction
     - Hypothesis: linguistic expression is based on learning
     - Linguistic expressions of various musical properties are learned through communication using language
     - Humans acquire a sense of what temporal events could be annotated in music clips
     - Figure: unseen musical audio signal → linguistic expression (giving comments)

  5. Approach
     - Propose a computational model of commenting that associates music and language
     - Give comments based on machine learning techniques
     - Train a model from many musical audio signals that have been given comments by many users
     - Generate suitable comments at appropriate temporal positions of an unseen audio signal
     - Figure: unseen musical audio signal → linguistic expression (giving comments)

  6. Key Features
     - Deal with temporally allocated comments
       - Our study: give comments to appropriate temporal positions in a target music clip
       - Conventional studies: provide tags for an entire clip (impression-word tags, genre tags)
     - Generate comments as sentences
       - Our study: concatenate an appropriate number of words in an appropriate order
       - Conventional studies: only select words in a vocabulary
         - Word orders are not taken into account
         - Slots of template sentences are filled with words
     - Figure: a conventional template sentence of the form "This is a ... song and has a ... mood." with slots filled by tags such as "energetic" and "rock", contrasted with our output "I am impressed with the cool playing!"

  7. Applications to Entertainment
     - Semantic clustering & segmentation of music
       - The performance could be improved by using features of both music and comments
       - Users can selectively enjoy their favorite segments
     - Linguistic interfaces for manipulating music
       - Segment-based retrieval & recommendation could be manipulated by using language
       - Retrieval & recommendation results could be explained by using language
     - Figure: a clip segmented and labeled "Quiet intro", "Beautiful voice", "Interlude", "Nice guitar"

  8. Problem Statement
     - Learning phase
       - Input: audio signals of music clips; attached user comments
       - Output: commenting model
     - Commenting phase
       - Input: audio signal of a target clip; attached user comments; commenting model
       - Output: comments that have suitable lengths and contents and are allocated at appropriate temporal positions

  9. Feature Extraction
     - Extract features from each frame
     - Acoustic features (figure label: 256 ms)
       - Timbre feature: 28 dim.
         - Mel-frequency cepstrum coefficients (MFCCs): 13 dim.
         - Energy: 1 dim.
         - Dynamic property: 13+1 dim.
     - Textual features (figure label for comment features: 3000 ms)
       - Comment content: 2000 dim. (average bag-of-words per comment)
       - Comment density: 1 dim. (number of user comments)
       - Comment length: 1 dim. (average number of words per comment)
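
A minimal sketch of how the two scalar textual features on this slide could be computed per frame. It assumes that comments arrive as `(time_ms, words)` pairs and that the 3000 ms label is the textual frame length; the function name and the data layout are hypothetical, not taken from the slides.

```python
from collections import defaultdict

FRAME_MS = 3000  # assumed textual frame length (the slide's "3000[ms]" label)


def textual_frame_features(comments, clip_ms, frame_ms=FRAME_MS):
    """Return one (density, mean_length) pair per frame.

    comments: list of (time_ms, list_of_words) pairs (hypothetical layout).
    density: number of comments posted inside the frame.
    mean_length: average number of words per comment in the frame
                 (0.0 when the frame has no comments).
    """
    n_frames = (clip_ms + frame_ms - 1) // frame_ms  # ceiling division
    lengths = defaultdict(list)
    for time_ms, words in comments:
        if 0 <= time_ms < clip_ms:
            lengths[time_ms // frame_ms].append(len(words))

    features = []
    for frame in range(n_frames):
        frame_lengths = lengths.get(frame, [])
        density = len(frame_lengths)
        mean_length = sum(frame_lengths) / density if density else 0.0
        features.append((density, mean_length))
    return features
```

For example, three comments at 0.5 s, 2.9 s and 3.1 s in a 9 s clip give two comments in the first frame, one in the second and none in the third.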

  10. Bag-of-Words Feature
     - Step 1, morphological analysis: identify the part-of-speech and basic form of each word
     - Step 2, screening: remove auxiliary words
       - Symbols / ASCII art
       - Conjunctions, interjections
       - Particles, auxiliary verbs
     - Step 3, assimilation: assimilate same-content words
       - Do not distinguish words that have the same part-of-speech and basic form
       - Example: "take" = "took" = "taken"
     - Step 4, counting: count the number of occurrences of each word
       - The dimension of bag-of-words features is equal to the vocabulary size
     - Worked example: "He played the guitar (^_^)" → (1) He+played+the+guitar+(^_^) → (2) He played guitar → (3) he play guitar → (4) he:1 play:1 guitar:1
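
The four steps on this slide can be sketched as follows. This is only an illustration: the slides do not name a morphological analyzer, so a tiny hand-written `LEXICON` stands in for one, and the `article` class is added to the removed classes only so that the English example drops "the" as on the slide. All names here are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real morphological analyzer:
# surface form -> (part of speech, basic form).
LEXICON = {
    "he": ("pronoun", "he"),
    "played": ("verb", "play"),
    "the": ("article", "the"),
    "guitar": ("noun", "guitar"),
    "(^_^)": ("symbol", "(^_^)"),
}

# Word classes dropped in the screening step. "article" is an assumption
# made to reproduce the English example on the slide.
REMOVED_POS = {"symbol", "conjunction", "interjection", "particle",
               "auxiliary", "article"}


def bag_of_words(comment, vocabulary):
    """Return word counts for one comment, ordered like `vocabulary`.

    vocabulary: list of (part of speech, basic form) pairs, so the
    feature dimension equals the vocabulary size.
    """
    tokens = comment.lower().split()
    # Step 1: morphological analysis (lookup in the toy lexicon).
    analysed = [LEXICON.get(tok, ("unknown", tok)) for tok in tokens]
    # Step 2: screening of auxiliary words, symbols and ASCII art.
    kept = [item for item in analysed if item[0] not in REMOVED_POS]
    # Steps 3 and 4: words sharing part of speech and basic form are
    # counted under the same key.
    counts = Counter(kept)
    return [counts.get(entry, 0) for entry in vocabulary]
```

With the vocabulary `[("pronoun", "he"), ("verb", "play"), ("noun", "guitar")]`, the slide's example sentence yields one count per entry.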

  11. Commenting Model
     - Three requirements
       - All features can be simultaneously modeled
       - Temporal sequences of features can be modeled
       - All features share a common dynamical behavior
     - → Extend the hidden Markov model (HMM)
     - Figure (graphical model): a state sequence in music clip $n$, $z_{t-1}^{(n)}, z_t^{(n)}, z_{t+1}^{(n)}$, where each state emits
       - acoustic features $a_t^{(n)}$ (MFCCs and energy, dynamic property): Gaussian mixture model (GMM)
       - bag-of-words feature $w_t^{(n)}$: multinomial
       - comment density $d_t^{(n)}$: Gaussian
       - comment length $l_t^{(n)}$: Gaussian
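
A sketch of the per-state output probability implied by this model: each state scores a frame by adding the log-probabilities of its four feature streams. Diagonal covariances for the GMM, the dictionary layout and all function names are assumptions; the multinomial coefficient is left out because it does not depend on the state.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_gmm_diag(x, weights, means, variances):
    """Log density of a GMM with diagonal covariances (an assumption)."""
    comps = [
        math.log(w) + sum(log_gauss(xi, mi, vi)
                          for xi, mi, vi in zip(x, mu, var))
        for w, mu, var in zip(weights, means, variances)
    ]
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))


def log_multinomial_kernel(counts, probs):
    """Sum of count * log(prob); the state-independent coefficient is dropped."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)


def log_emission(frame, state):
    """log p(o_t | state) with o_t = {a_t, w_t, d_t, l_t}."""
    return (log_gmm_diag(frame["a"], *state["gmm"])
            + log_multinomial_kernel(frame["w"], state["word_probs"])
            + log_gauss(frame["d"], *state["density"])
            + log_gauss(frame["l"], *state["length"]))
```

Because the four streams only interact through the shared state, they can be modeled and updated separately while still following one common state sequence.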

  12. MusicCommentator
     - Comment generation based on machine learning
     - Consistent with the maximum likelihood (ML) principle
     - Learning phase: music clips (audio signals and user comments associated with temporal positions) go through feature extraction to train a joint probabilistic model of acoustic and textual features; morphological analysis of the comments yields a general language model (uni-gram, bi-gram, tri-gram)
     - Commenting phase: a target music clip (audio signal and existing user comments) goes through feature extraction, then outlining (determine contents & positions), then assembling (generate sentences); the figure numbers the steps ① to ③
     - Figure examples: training comments "Cool performance" and "She has a beautiful voice"; annotations on which words are likely to occur or to jointly occur (the words shown are "Beautiful" and "this", "cool" and "voice"); generated comments "This is a beautiful performance" and "Cool voice"

  13. Learning Phase
     - ML estimation of HMM parameters, with $K = 200$ states
     - Three kinds of parameters
       - Initial-state probability: $\{\pi_1, \ldots, \pi_K\}$
       - Transition probability: $\{A_{jk} \mid 1 \le j, k \le K\}$
       - Output probability: $\{\phi_1, \ldots, \phi_K\}$
     - E-step: calculate posterior probabilities of latent states
     - M-step: independently update output probabilities
     - Complete likelihood, with observations $o_t = \{a_t, w_t, d_t, l_t\}$:

       $$p(O, Z \mid \theta) = p(z_1 \mid \pi) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(o_t \mid z_t)$$

     - Objective (Q function), where $\gamma$ and $\xi$ are the posteriors computed in the E-step:

       $$Q(\theta; \theta_{\mathrm{old}}) = \sum_{Z} p(Z \mid O, \theta_{\mathrm{old}}) \log p(O, Z \mid \theta)$$

       $$= \sum_{k=1}^{K} \gamma(z_{1,k}) \log \pi_k + \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{t-1,j}, z_{t,k}) \log A_{jk} + \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma(z_{t,k}) \log p(o_t \mid \phi_k)$$

     - The output term splits into timbre, content, density and length:

       $$\log p(o_t \mid \phi_k) = \log p(a_t \mid \phi_{a,k}) + \log p(w_t \mid \phi_{w,k}) + \log p(d_t \mid \phi_{d,k}) + \log p(l_t \mid \phi_{l,k})$$
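
A compact sketch of the E-step for a single clip, using the standard scaled forward-backward recursion to obtain the posteriors written as γ and ξ on the slide, followed by the closed-form M-step updates of the initial-state and transition probabilities. The function names and the list-based layout are assumptions, and the output-probability updates (GMM, multinomial, Gaussians) are omitted.

```python
import math


def e_step(pi, A, B):
    """Scaled forward-backward pass.

    pi[k]: initial-state probability, A[j][k]: transition probability,
    B[t][k]: output probability p(o_t | z_t = k).
    Returns (gamma, xi, log_likelihood).
    """
    T, K = len(B), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass with per-frame normalization.
    alpha[0] = [pi[k] * B[0][k] for k in range(K)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        for k in range(K):
            alpha[t][k] = B[t][k] * sum(alpha[t - 1][j] * A[j][k]
                                        for j in range(K))
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass using the same scaling factors.
    for t in range(T - 2, -1, -1):
        for j in range(K):
            beta[t][j] = sum(A[j][k] * B[t + 1][k] * beta[t + 1][k]
                             for k in range(K)) / scale[t + 1]

    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    xi = [[[alpha[t - 1][j] * A[j][k] * B[t][k] * beta[t][k] / scale[t]
            for k in range(K)] for j in range(K)] for t in range(1, T)]
    log_likelihood = sum(math.log(s) for s in scale)
    return gamma, xi, log_likelihood


def m_step_pi_A(gamma, xi):
    """Closed-form updates of pi and A that maximize the Q function."""
    K = len(gamma[0])
    new_pi = list(gamma[0])
    new_A = []
    for j in range(K):
        row = [sum(x[j][k] for x in xi) for k in range(K)]
        total = sum(row)  # assumed non-zero in this sketch
        new_A.append([r / total for r in row])
    return new_pi, new_A
```

Training on many clips would accumulate the γ and ξ statistics over all clips before normalizing; the sketch handles one sequence only.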

  14. Commenting Phase
     - ML estimation of comment sentences
     - Assume a generative model of word sequences:

       $$\{\hat{c}, \hat{l}\} = \operatorname*{arg\,max}_{\{c, l\}} p(c, l) = \operatorname*{arg\,max}_{\{c, l\}} p(c \mid l)\, p(l)$$

     - $p(l)$: probability that the length is $l$ (Gaussian)
     - $p(c \mid l)$: probability that the sequence is $c$ when the length is $l$:

       $$p(c \mid l) = \left( p(w_1 \mid \mathrm{SilB}) \prod_{i=2}^{l} p(w_i \mid w_{i-2}, w_{i-1}) \; p(\mathrm{SilE} \mid w_{l-1}, w_l) \right)^{1/l}$$

     - Computed by the Viterbi algorithm using bi- and tri-grams
     - Figure: a word lattice from SilB through candidate words (○: word) to SilE
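
A simplified sketch of the search on this slide. To stay short it uses bi-grams only, whereas the slide combines bi- and tri-grams, so it is a stand-in rather than the authors' procedure. For each candidate length it runs a Viterbi search for the most probable word sequence between SilB and SilE, then combines the length-normalized log-probability with a Gaussian length prior. The function names, the `max_len` cap and the dictionary layout are assumptions.

```python
import math


def best_sentence_of_length(length, vocab, bigram):
    """Viterbi search for the most probable `length`-word sequence.

    bigram[(prev, cur)] is p(cur | prev); missing pairs have probability 0.
    Returns (words, log_probability) including SilB and SilE transitions.
    """
    def lp(prev, cur):
        p = bigram.get((prev, cur), 0.0)
        return math.log(p) if p > 0.0 else float("-inf")

    score = {w: lp("SilB", w) for w in vocab}
    backpointers = []
    for _ in range(1, length):
        new_score, ptr = {}, {}
        for w in vocab:
            prev = max(vocab, key=lambda v: score[v] + lp(v, w))
            new_score[w] = score[prev] + lp(prev, w)
            ptr[w] = prev
        score = new_score
        backpointers.append(ptr)

    last = max(vocab, key=lambda w: score[w] + lp(w, "SilE"))
    total = score[last] + lp(last, "SilE")
    words = [last]
    for ptr in reversed(backpointers):
        words.append(ptr[words[-1]])
    return words[::-1], total


def generate_comment(vocab, bigram, length_mean, length_var, max_len=6):
    """Maximize log p(c | l) + log p(l) over lengths 1..max_len."""
    best_score, best_words = float("-inf"), None
    for length in range(1, max_len + 1):
        words, logp = best_sentence_of_length(length, vocab, bigram)
        log_prior = -0.5 * (math.log(2.0 * math.pi * length_var)
                            + (length - length_mean) ** 2 / length_var)
        score = logp / length + log_prior  # the 1/l exponent becomes a division
        if score > best_score:
            best_score, best_words = score, words
    return best_words
```

The division by the length plays the role of the 1/l exponent in p(c | l), which keeps longer sentences from being penalized merely for containing more factors.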

  15. Outlining Stage
     - Determine content and positions of comments
     - Input acoustic and textual features
       - Input only acoustic features if there are no existing user comments in a target clip
     - Estimate an ML state sequence
       - Use the Viterbi algorithm
     - Calculate ML textual features at each frame
     - Limitation noted on the slide: this stage alone cannot generate sentences
     - Figure (graphical model): a state sequence in a target clip, $z_{t-1}^{(n)}, z_t^{(n)}, z_{t+1}^{(n)}$; acoustic features $a_t^{(n)}$ (MFCCs & energy, dynamic property; GMM) are mapped to ML textual features: bag-of-words $w_t^{(n)}$ (multinomial), density $d_t^{(n)}$ (Gaussian) and length $l_t^{(n)}$ (Gaussian)
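
A sketch of the outlining stage under simplifying assumptions: the state sequence is decoded with the Viterbi algorithm from per-frame log output probabilities (which would come from the acoustic stream alone when a clip has no comments), and the textual outline is then read off each decoded state as its Gaussian means and most probable words. The names `viterbi`, `outline`, `top_n` and the state dictionary layout are hypothetical.

```python
def viterbi(log_pi, log_A, log_B):
    """Most likely state path given log initial, transition and output scores.

    log_B[t][k] is log p(o_t | z_t = k) for frame t and state k.
    """
    T, K = len(log_B), len(log_pi)
    delta = [log_pi[k] + log_B[0][k] for k in range(K)]
    backpointers = []
    for t in range(1, T):
        ptr, new_delta = [], []
        for k in range(K):
            j = max(range(K), key=lambda i: delta[i] + log_A[i][k])
            ptr.append(j)
            new_delta.append(delta[j] + log_A[j][k] + log_B[t][k])
        delta = new_delta
        backpointers.append(ptr)

    path = [max(range(K), key=lambda k: delta[k])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def outline(path, states, top_n=2):
    """Per-frame textual outline taken from each decoded state.

    Density and length use the state's Gaussian means; content uses the
    top_n most probable words of the state's multinomial.
    """
    plan = []
    for k in path:
        state = states[k]
        words = sorted(state["word_probs"], key=state["word_probs"].get,
                       reverse=True)[:top_n]
        plan.append({"density": state["density_mean"],
                     "length": state["length_mean"],
                     "words": words})
    return plan
```

The outline only says how many comments of what length and with which content words belong at each frame; turning those words into sentences is left to the assembling stage.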

  16. Problems and Solutions
     - No probabilities of words required for sentences
       - Bag-of-words feature = reduced uni-gram
       - Verb conjugations are not taken into account
       - Auxiliary words are removed
     - No probabilities of word concatenations
       - Bi- and tri-grams are not taken into account
       - Which is more suitable? "This is a good performance" or "This performance is good"
     - Solution callouts on the slide: "All words required for composing sentences are contained"; "Use general bi- and tri-grams"; a fragment ending "be learned from all user comments"
     - Figure residue: example words "a", "and", "was", "Performance"
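
To illustrate what a general bi-gram model over user comments might look like, here is a minimal maximum-likelihood estimate without smoothing. The slides do not say how the n-grams are estimated, so the counting scheme and the function name are assumptions; SilB and SilE mark sentence boundaries as on the commenting-phase slide.

```python
from collections import Counter


def train_bigrams(sentences):
    """Estimate p(cur | prev) by relative frequency.

    sentences: list of word lists, one per comment. Returns a dict keyed
    by (prev, cur) pairs; unseen pairs are simply absent.
    """
    pair_counts, context_counts = Counter(), Counter()
    for words in sentences:
        padded = ["SilB"] + list(words) + ["SilE"]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {pair: n / context_counts[pair[0]]
            for pair, n in pair_counts.items()}
```

Unlike the reduced bag-of-words feature, such a model keeps auxiliary words and inflected forms, which is what makes it usable for ordering words into sentences.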
