An Algorithm for Vector Quantisation II
The average distortion $D_i$ in cell $C_i$ is given by
$$D_i = \frac{1}{N} \sum_{x \in C_i} d(x, z_i)$$
where
• $z_i$ is the centroid of cell $C_i$,
• $d(x, z_i) = (x - z_i)^T (x - z_i)$, and
• $N$ is the number of vectors in cell $C_i$.
The centroids finally obtained are stored in a codebook called the VQ codebook.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 22/133
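A minimal numpy sketch (not the lecturers' code) of the quantities on this slide: the squared-error distortion d(x, z_i), the per-cell average distortion D_i, and the centroid update used by a k-means-style codebook algorithm. The data and the two-vector codebook below are synthetic.

```python
import numpy as np

def assign_cells(X, Z):
    """Assign each vector in X (N x d) to the nearest codevector in Z (K x d)."""
    # d(x, z) = (x - z)^T (x - z) for every pair, via broadcasting
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def average_distortion(X, Z, labels):
    """D_i = (1/N_i) * sum_{x in C_i} d(x, z_i), computed for every cell i."""
    D = np.zeros(len(Z))
    for i in range(len(Z)):
        cell = X[labels == i]
        if len(cell):
            D[i] = ((cell - Z[i]) ** 2).sum(axis=1).mean()
    return D

def update_centroids(X, Z, labels):
    """Replace each codevector by the centroid of the vectors assigned to it."""
    Znew = Z.copy()
    for i in range(len(Z)):
        cell = X[labels == i]
        if len(cell):
            Znew[i] = cell.mean(axis=0)
    return Znew

# Toy example with synthetic 2-D data and a 2-vector codebook
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
Z = rng.normal(0, 1, (2, 2))
for _ in range(10):                      # a few VQ iterations
    labels = assign_cells(X, Z)
    Z = update_centroids(X, Z, labels)
print("codebook:", Z)
print("per-cell distortion:", average_distortion(X, Z, assign_cells(X, Z)))
```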
An Algorithm for Vector Quantisation III
[Figure: System based on VQ – a clustering algorithm applied to the training set of vectors $v_1, v_2, \ldots, v_L$ with distance measure d(·,·) yields a codebook of $M = 2^B$ codevectors; input speech vectors are then quantised against this codebook to produce codebook indices.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 23/133
An Algorithm for Vector Quantisation IV
[Figure: A two-cluster VQ – the cluster assignments at initialization and after iterations 1, 3, 5 and 8.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 24/133
Gaussian Mixture Model
• Representation of multimodal distributions: the Gaussian mixture model (GMM).
• A GMM is a linear superposition of multiple Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$
• For a d-dimensional feature vector representation of the data, the parameters of the k-th component of a GMM are:
  • the mixture coefficient $\pi_k$,
  • the d-dimensional mean vector $\mu_k$, and
  • the $d \times d$ covariance matrix $\Sigma_k$.
Maximum likelihood method for training a GMM: the Expectation-Maximization (EM) method.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 25/133
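A short sketch, for illustration only, of evaluating the mixture density p(x) = Σ_k π_k N(x | μ_k, Σ_k); the two-component parameters below are made up.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Hypothetical 2-component GMM in 2 dimensions
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```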
Gaussian Mixture Densities and their relation to VQ
• The VQ codebook is modelled as a family of Gaussian probability density functions (PDFs).
• Each cell is represented as a multi-dimensional PDF.
• These probability density functions can overlap, rather than partition the acoustic space into disjoint subspaces.
• Correlations between different elements of the feature vector can also be accounted for in this representation.
• The Expectation Maximization (EM) algorithm is used to estimate the density functions.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 26/133
Gaussian Mixture Models – Estimation of Parameters
Interpretation of Gaussian mixtures
• Use of discrete latent variables:
  • $z$ is a K-dimensional binary random variable in 1-of-K form.
  • $p(x, z) = p(x \mid z)\, p(z)$.
  • $p(z_k = 1) = \pi_k$, $1 \leq k \leq K$.
  • $\sum_{k=1}^{K} \pi_k = 1$
• Meaning: every point $x$ is associated with a latent variable $z$ indicating which component generated it.
• The responsibility of the k-th component in explaining the point $x$ is the posterior $p(z_k = 1 \mid x)$; the mixing coefficient $\pi_k$ is the corresponding prior probability.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 27/133
Expectation Maximisation Algorithm
Given a Gaussian mixture model, maximize the likelihood function with respect to the parameters:
1. Initialize the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$.
2. Evaluate the initial value of the log likelihood.
3. E step: Evaluate the responsibilities $p(z_k = 1 \mid x_n) = \gamma(z_{nk})$ using the current parameter values.
4. M step: Re-estimate the parameters $\mu_k$, $\Sigma_k$ and $\pi_k$ using the current responsibilities.
5. Evaluate the log likelihood with the re-estimated parameters.
6. Check for convergence of either the parameters or the log likelihood.
7. If the convergence criterion is not satisfied, return to step 3.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 28/133
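A compact numpy sketch of the procedure above: one function performing the E step (responsibilities) and the M step (re-estimation), iterated until the log likelihood stops changing. The data, initialisation and tolerance are hypothetical choices, not taken from the lecture.

```python
import numpy as np

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a GMM: E step (responsibilities) + M step (re-estimation)."""
    N, d = X.shape
    K = len(pis)
    # E step: gamma[n, k] proportional to pi_k N(x_n | mu_k, Sigma_k)
    gamma = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]
        inv = np.linalg.inv(Sigmas[k])
        norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigmas[k]))
        gamma[:, k] = pis[k] * norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))
    loglik = np.log(gamma.sum(axis=1)).sum()      # log likelihood under the current parameters
    gamma /= gamma.sum(axis=1, keepdims=True)     # normalise to get p(z_k = 1 | x_n)
    # M step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
    Nk = gamma.sum(axis=0)
    new_pis = Nk / N
    new_mus = [gamma[:, k] @ X / Nk[k] for k in range(K)]
    new_Sigmas = []
    for k in range(K):
        diff = X - new_mus[k]
        new_Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return new_pis, new_mus, new_Sigmas, loglik

# Hypothetical data and initialisation; iterate until the log likelihood stops improving
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (150, 2)), rng.normal(2, 1, (150, 2))])
pis, mus, Sigmas = [0.5, 0.5], [X[0].copy(), X[200].copy()], [np.eye(2), np.eye(2)]
prev = -np.inf
for it in range(100):
    pis, mus, Sigmas, loglik = em_step(X, pis, mus, Sigmas)
    if abs(loglik - prev) < 1e-6:        # convergence check on the log likelihood
        break
    prev = loglik
print(it, loglik, pis)
```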
Example (Adapted from Bishop) I
[Figure: A two-mixture GMM – EM estimates at initialization and after iterations 1, 17, 30 and 45.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 29/133
Example (Adapted from Bishop) II
[Figure: A three-cluster k-means and a three-mixture GMM – the raw data, the k-means clustering, and the GMM fit.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 30/133
GMM based Classifier
[Figure: GMM based classifier – the test feature vector x is scored by the class models GMM 1, ..., GMM M to obtain the likelihoods $p(x \mid \lambda_1), \ldots, p(x \mid \lambda_M)$; the decision logic assigns the class label.]
x is a feature vector obtained from a test sample.
$p(x \mid \lambda_m) = \sum_{k=1}^{K} \pi_{mk} \mathcal{N}(x \mid \mu_{mk}, \Sigma_{mk})$
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 31/133
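A sketch of the decision rule in the figure, assuming each class model λ_m is stored as a list of (π, μ, Σ) tuples (a hypothetical representation): sum the frame log likelihoods under each GMM and pick the class with the maximum score.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma))
                   + diff @ np.linalg.inv(Sigma) @ diff)

def class_log_likelihood(X, model):
    """Sum over frames of log p(x | lambda_m), with p(x | lambda_m) = sum_k pi_k N(x | mu_mk, Sigma_mk)."""
    total = 0.0
    for x in X:
        total += np.log(sum(pi * np.exp(log_gaussian(x, mu, S)) for pi, mu, S in model))
    return total

def classify(X, models):
    """Decision logic: choose the class whose GMM gives the maximum acoustic score."""
    scores = [class_log_likelihood(X, m) for m in models]
    return int(np.argmax(scores)), scores

# Hypothetical two-class example, one Gaussian per class for brevity
models = [
    [(1.0, np.array([0.0, 0.0]), np.eye(2))],   # lambda_1
    [(1.0, np.array([4.0, 4.0]), np.eye(2))],   # lambda_2
]
X_test = np.array([[3.8, 4.1], [4.2, 3.9]])
print(classify(X_test, models))
```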
UBM-GMM framework
The Universal Background Model based GMM (UBM-GMM) is another popular framework in the literature:
• Works well when there is an imbalance in the data for different classes.
• Normalisation of scores across different classes.
• Useful in the context of the verification paradigm for classification.
In the context of music:
• Reduces the search space.
• Verify whether a given song belongs to a specific melody.
Philosophy of UBM-GMM:
• Pool data from all classes.
• Build a single UBM-GMM to represent the pooled data.
• Adapt the UBM-GMM using the data from each specific class.
• Test against the adapted model.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 32/133
UBM-GMM framework – An example
[Figure: UBM-GMM adaptation example – the pooled UBM data with the UBM Gaussians, the new class-specific data, and the adapted Gaussians.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 33/133
GMM based system for rAga identification
• During training, a Gaussian Mixture Model (GMM) is built for each rAga using frame-level pitch features.
• During testing, the GMM models produce class-conditional acoustic likelihood scores for a test utterance.
• The GMM that gives the maximum acoustic score is chosen and the decision about the rAga is made (since each GMM corresponds to a rAga ).
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 34/133
GMM based Melody ( rAga ) identification I
• Objective: to identify a rAga from 30 second pieces.
• Pitch was extracted from these pieces.
• Normalized on the cent scale and collapsed onto a single octave.
• sampUrna and janya rAgas were experimented with.
• The number of mixtures was chosen based on the number of notes in a rAga . The figure shows the GMM for the rAga hamsadhwani .
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 35/133
GMM based Melody ( rAga ) identification II Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 36/133
GMM based Melody ( rAga ) identification III

RAGA    Ham     SuDha   SuSav   Kal     Kara    SriRan  Hari    Shank
Ham     0.3333  0       0       0       0.1111  0       0.4444  0.1111
SuDha   0.3846  0.3846  0.0769  0       0       0       0       0.1538
SuSav   0.2000  0       0.4000  0       0       0.2000  0.1000  0.1000
Kal     0.2000  0       0.1333  0.3333  0       0       0.2667  0.0667
Kara    0       0.2500  0.1000  0.1000  0.4000  0       0.0500  0.1000
SriRan  0       0       0       0.0952  0.0952  0.4762  0       0.3333
Hari    0.1000  0       0       0.1000  0.0500  0.0500  0.6500  0.0500
Shank   0.2000  0       0.3000  0.2000  0       0.1000  0       0.2000

Table: Confusion Matrix - GMM
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 37/133
GMM Tokenisation I GMM Tokenization: • Parallel set of GMM tokenizers. • A bank of tokenizer dependent interpolated motif models (unigram, bigram) • Tokeniser produces frame-by-frame indices of the highest scoring GMM component. • Likelihood of stream of symbols from each tokenizer evaluated by the language models for each rAga . • A backend classifier determines the rAga . Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 38/133
GMM Tokenisation II
[Figure: GMM tokenisation system – feature extraction feeds a sankarAbharana GMM tokenizer and a bhairavi GMM tokenizer; each token stream is scored by per-rAga motif models (sankarAbharana, nilAmbari, bhairavi); the averaged log likelihoods are passed to a classifier that outputs the hypothesised rAga.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 39/133
GMM Tokenisation III The rationale: Cognitively, do we try to identify rAgas using the models that we know: • A listener would state that s/he is not able to place a rAga but state that in parts it sounds like nilAmbari ( bhairavi ) • It might be a sankarAbharanam or yadukulakAmbOji ( mAnji ) Question is: What is a meaningful token? Our initial results using pitch as a feature were miserable! Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 40/133
5. String Matching Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 41/133
String Matching and Music Analysis
Music has a lot of structure:
• Notes correspond to specific frequencies.
• The sequence of notes that makes up a phrase identifies a melody.
Analysis of music:
• Transcription into notes.
• Identification of phrases.
Solution: Longest Common Subsequence match:
• Notes occur for a longer duration.
• A particular note may be missed, but the piece still belongs to the same melody.
• Different notes are of different duration – for example, the same song sung in a different metre.
Training:
• Transcribe training examples to a sequence of notes.
• Make a set of templates in terms of symbols.
Testing:
• Transcribe the test fragment to a sequence of notes.
• Compare with the trained templates using dynamic programming.
• The longest match identifies the melody.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 42/133
String Matching Example using Dynamic Programming
[Figure: Dynamic programming table for the longest common subsequence of the strings Q N K B Y X U T L C and X N K W B U Q; the longest common subsequence is S = N K B U.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 43/133
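A sketch of the longest-common-subsequence recurrence behind the table above, applied to the two symbol strings of the example; this is the textbook dynamic program, not necessarily the exact formulation used in the lecture.

```python
def lcs(a, b):
    """Longest common subsequence of sequences a and b by dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # trace back through the table to recover one LCS
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

print(lcs("QNKBYXUTLC", "XNKWBUQ"))   # -> 'NKBU'
```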
Dynamic Time Warping
Time normalisation constraints:
• endpoint alignment
• monotonicity constraints
• local continuity constraints
• global path constraints
• slope weighting
[Figure: DTW alignment of a test pattern of length $T_y$ against a reference pattern of length $T_x$, with the warping path restricted to a global constraint region.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 44/133
Dynamic Programming based Matching for Music I
Feature used: tonic-normalised pitch contour.
Distance measure: Euclidean distance between pitch values.
[Figure: The optimal DTW path – the reference pitch contour, the query pitch contour, the warping (mapping) path, and the query warped onto the reference.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 45/133
Dynamic Programming based Matching for Music II
[Figure: The reference (template) and query pitch contours used in the DTW example.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 46/133
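A minimal DTW sketch matching the choices stated above (Euclidean distance between pitch values, a monotonic path with endpoint alignment); the two short pitch contours are toy values, not the contours in the figures, and no global path constraint or slope weighting is applied.

```python
import numpy as np

def dtw(ref, query):
    """Dynamic time warping cost and path between two 1-D pitch contours."""
    n, m = len(ref), len(query)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (ref[i - 1] - query[j - 1]) ** 2          # local Euclidean distance
            # monotonic local continuity: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # trace back the optimal warping path from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

# Toy pitch contours (Hz): the query is a time-stretched version of the reference
ref = np.array([1000, 1000, 1100, 1200, 1150, 1000], dtype=float)
query = np.array([1000, 1100, 1100, 1200, 1200, 1150, 1000], dtype=float)
cost, path = dtw(ref, query)
print(cost, path)
```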
6. Overview of Hidden Markov Models (HMMs) Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 47/133
Drawbacks of the DTW based approach
• Endpoint constraints.
• Monotonicity and global path constraints.
• When the templates vary significantly, a large number of different templates is required.
• An alternative: Hidden Markov Models.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 48/133
The HMM Framework
• Provides a statistical model for characterising the properties of signal sequences.
• Is currently used in the industry.
• Needs large amounts of training data to get reliable models.
• Choosing the best structure is a difficult task.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 49/133
Observable Markov Model An Observable Markov Model is a finite automaton: • finite number of states • transitions Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 50/133
Hidden Markov Model
• Suppose now
  • the output associated with each state is probabilistic, and
  • the state in which the system is, is hidden — only the observation is revealed.
• Such a system is called a Hidden Markov Model.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 51/133
HMM example
Coin Toss experiment:
• One or more coins are tossed behind an opaque curtain.
• How many coins are there? Which coin is chosen? – unknown.
• The results of the experiments are known. A typical observation sequence could be: $O = (o_1, o_2, o_3, \ldots, o_T) = (H\,H\,T\,T\,T\,H\,T\,T\,H \ldots H)$
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 52/133
Two Coin Model
[Figure: Two-coin hidden Markov model – states 1 (Coin 1) and 2 (Coin 2) with self-transition probabilities $a_{11}$ and $a_{22}$ and cross transitions $1 - a_{11}$ and $1 - a_{22}$.]
O = H H T T H T H H T T H...
State Sequence = 2 1 1 2 2 2 1 2 2 1 2... or 1 1 1 2 1 1 2 1 1 2 1...
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 53/133
Three-coin Model
[Figure: Three-coin hidden Markov model – states 1 (Coin 1), 2 (Coin 2) and 3 (Coin 3), fully connected with transition probabilities $a_{ij}$, $i, j = 1, 2, 3$.]
O = H H T T H T H H T T H...
State Sequence = 3 1 2 3 3 1 1 2 3 1 3 ... or 1 2 3 1 1 2 2 3 1 2 3 ...
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 54/133
HMM (cont’d) • System can be in one of N distinct states • Change from one state to another occurs at discrete time instants. • This change is • probabilistic • depends only on R preceding states (usually R = 1) • a ij represents prob. of transition from state i at time t to state j at time t + 1 • Each state has M distinct observations associated with it. b j ( k ) is prob. of observing the k -th symbol in state j • Prob. that system is initially in the i -th state is π i Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 55/133
Three-coin Model
[Figure: Three-coin hidden Markov model – states 1 (Coin 1), 2 (Coin 2) and 3 (Coin 3), fully connected with transition probabilities $a_{ij}$, $i, j = 1, 2, 3$.]
O = H H T T H T H H T T H...
State Sequence = 3 1 2 3 3 1 1 2 3 1 3 ... or 1 2 3 1 1 2 2 3 1 2 3 ...
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 56/133
HMM (cont’d) • Observation sequence: O = ( o 1 o 2 . . . o T ) • State sequence: q = ( q 1 q 2 . . . q T ) • Model: λ = ( A , B , π ) Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 57/133
The Three Basic Problems 1. Testing: Given O = ( o 1 o 2 . . . o T ) and λ how do we efficiently compute P ( O | λ ) ? • Given a sequence of speech frames that are known to represent a digit, how do we recognize what digit it is? Since we have models for all the digits, choose that digit for which P ( O | λ ) is the maximum • Efficiency is crucial because the straightforward approach is computationally infeasible • Forward-Backward procedure Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 58/133
Evaluation: an example I
Consider the two-coin model and the observation sequence $O = \{o_1, o_2, o_3\} = \{H, T, H\}$.
Corresponding to the three observations we have three states $Q = \{q_1, q_2, q_3\}$.
The state sequence $(q_1, q_2, q_3)$ can be any of:
(1,1,1), (1,1,2), (1,2,1), (1,2,2), (2,1,1), (2,1,2), (2,2,1), (2,2,2)
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 59/133
Evaluation: an example II
Consider the state sequence $Q = \{1, 2, 2\}$:
$P(O \mid Q, \lambda) = P(HTH \mid Q = \{1, 2, 2\}, \lambda) = b_1(H)\, b_2(T)\, b_2(H)$
$P(Q \mid \lambda) = \pi_1 a_{12} a_{22}$
$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)$
To evaluate $P(O \mid \lambda)$, we marginalise over all the state sequences listed above.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 60/133
HMM Training 2. Training: How do we adjust the parameters of λ to maximize P ( O | λ ) ? • Given a number of utterances for the word “two” adjust the parameters of λ until P ( O | two model ) converges. • This procedure is called training • Expectation-Maximization (Baum-Welch) algorithm Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 61/133
Best State Sequence I 3. Best State Sequence: Given O = ( o 1 o 2 . . . o T ) and λ how do we choose a corresponding q that best “explains” the observation? • For the hypothesized digit model, which state sequence best “explains” the observations? • The answer is strongly influenced by the optimality criterion • Single best state sequence is the commonly chosen criterion—Viterbi algorithm • Example: “The eight frames of speech from t = 120 ms to 260 ms are best explained by the ‘two’ model’s state sequence s 2 - s 3 - s 3 - s 4 - s 4 - s 4 - s 5 - s 5 ” Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 62/133
Solution to Problem 1: Testing
• $P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda)$
• The direct method is computationally prohibitive – "all q" means all possible state sequences.
• Define variables for the probability of partial observation sequences:
  $\alpha_t(i) = P(o_1 o_2 \ldots o_t, q_t = i \mid \lambda)$
  $\beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda)$
• A very efficient inductive procedure exists for computing $P(O \mid \lambda)$ — the Forward-Backward algorithm.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 63/133
Computation of $\alpha_t(i)$
[Figure: Induction step of the forward procedure – $\alpha_{t+1}(j)$ is obtained from $\alpha_t(i)$, $i = 1, \ldots, N$, through the transitions $a_{1j}, a_{2j}, \ldots, a_{Nj}$ from states $S_1, \ldots, S_N$ into state $S_j$ between times t and t+1.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 64/133
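A sketch of the forward recursion pictured above: α_1(i) = π_i b_i(o_1), α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(o_{t+1}), and P(O | λ) = Σ_i α_T(i). The two-state model below is a hypothetical two-coin HMM with made-up probabilities.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns P(O | lambda) and the alpha trellis."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialisation: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):
        # induction: alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                     # termination: P(O | lambda) = sum_i alpha_T(i)

# Hypothetical two-coin HMM: states = coins, symbols 0 = H, 1 = T
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],     # coin 1 is biased towards heads
              [0.2, 0.8]])    # coin 2 is biased towards tails
pi = np.array([0.5, 0.5])
obs = [0, 1, 0]               # O = H T H
print(forward(A, B, pi, obs)[0])
```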
Solution to Problem 2: Training I
• For the given model λ we need to adjust A, B and π to satisfy the chosen optimization criterion.
• A closed-form solution does not exist.
• The training data is used in an iterative manner and the model parameters are re-estimated until convergence.
• The probabilities are estimated using relative frequencies of occurrence:
  $\bar{\pi}_i$ = expected number of times in state i at time t = 1
  $\bar{a}_{ij}$ = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
  $\bar{b}_j(k)$ = (expected number of times in state j observing the k-th symbol) / (expected number of times in state j)
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 65/133
Solution to Problem 3: Best State Sequence
• Let $\delta_t(i)$ be the probability of the most probable single path that accounts for the first t observations and ends in state i at time t.
• By induction: $\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] \cdot b_j(o_{t+1})$
• To actually retrieve the state sequence we need to keep track of the argument that maximized the above for each t and j.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 66/133
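A sketch of the Viterbi recursion, in the log domain, on the same hypothetical two-coin HMM; the back-pointers ψ keep track of the maximising arguments so the single best state sequence can be read back.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most probable state sequence (in the log domain) for the observation sequence obs."""
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)                 # back-pointers
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        # delta_{t+1}(j) = max_i [delta_t(i) + log a_ij] + log b_j(o_{t+1})
        scores = delta[t - 1][:, None] + logA
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    # backtrack the single best path
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 1, 0]))   # best state sequence for O = H T H
```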
Dynamic Time Warping
Time normalisation constraints:
• endpoint alignment
• monotonicity constraints
• local continuity constraints
• global path constraints
• slope weighting
[Figure: DTW alignment of a test pattern of length $T_y$ against a reference pattern of length $T_x$, with the warping path restricted to a global constraint region.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 67/133
Markov Model for Music
[Figure: A left-to-right HMM with four states, self transitions $a_{11}, \ldots, a_{44}$, forward transitions $a_{12}, a_{23}, a_{34}$ and skip transitions $a_{13}, a_{24}$; the observation sequence $o_1, \ldots, o_6$ is emitted with the outputs $b_1(o_1), b_1(o_2), b_2(o_3), b_3(o_4), b_3(o_5), b_4(o_6)$.]
Left-to-right constraints: $\pi_i = 1$ for $i = 1$, $\pi_i = 0$ for $i \neq 1$, and $a_{ij} = 0$ for $j < i$.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 68/133
HMM Training
[Figure: HMM training loop – model initialization, state-sequence segmentation of the training data (symbol sequences), re-estimation of the model parameters, and a test for model convergence.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 69/133
Isolated Motif Recognition System
[Figure: Isolated motif recognition – the music signal is converted by feature analysis (vector quantization) into an observation sequence O; the probability $P(O \mid \lambda_v)$ is computed for each motif HMM $\lambda_1, \ldots, \lambda_V$, and the index of the maximum gives the recognised motif.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 70/133
HMM for Phrase Identification
The figure shows a specific motif for the rAga kAmbOji as sung by three different musicians. The alignment in terms of the states is given below. Note: the HMM used here includes skip states to highlight the state transitions. Details can be found in Vignesh’s paper.
[Figure: State occupancy on Viterbi alignment – the pitch contour of each rendition of the motif, annotated with the HMM states occupied along it (e.g. 2, 3, 5, 8, 11, 12).]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 71/133
Key Phrase Spotting I
An e-HMM framework can be used for spotting key phrases in a musical piece.
• The garbage model: an HMM that represents everything in a piece that is not a typical phrase.
• The motif HMMs correspond to different motifs that can be used to identify the piece.
• Transitions between motifs can pass through the garbage model or occur directly.
• The transition probabilities need to be determined from training data.
• Initially all transitions can be made equiprobable.
• The parameters can be learned using an EM framework.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 72/133
Key Phrase Spotting II
[Figure: Key phrase spotting network – the HMMs for motifs 1-4 interconnected with each other and with the garbage model.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 73/133
7. Support Vector Machines for Pattern Classification Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 74/133
Key Aspects of Kernel Methods Kernel methods involve: • Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel • Detection of optimal linear solutions in the kernel feature space • Transformation to a higher dimensional space: Conversion of nonlinear relations into linear relations (Cover’s theorem) • Nonlinearly separable patterns to linearly separable patterns • Nonlinear regression to linear regression • Nonlinear separation of clusters to linear separation of clusters Key Feature: “Pattern analysis methods are implemented such that the kernel feature space representation is not explicitly required. They involve computation of pair-wise inner-products only.” The pair-wise inner-products are computed efficiently directly from the original representation of data using a kernel function (Kernel trick) Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 75/133
Illustration of Transformation
Transformation: $F(x) = \{x^2, y^2, \sqrt{2}\, xy\} = \{z_1, z_2, z_3\}$
[Figure: Mapping of two-dimensional points to the three-dimensional feature space Z.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 76/133
Optimal Separating Hyperplane for linearly Separable Classes I
[Figure: Optimal separating hyperplane in the $(x_1, x_2)$ plane – the hyperplane $w^t x + b = 0$, the margin hyperplanes $w^t x + b = +1$ and $w^t x + b = -1$, the support vectors, and the margin of width $\frac{1}{\|w\|}$.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 77/133
Optimal Separating Hyperplane for linearly Separable Classes II
• The training data set consists of L examples: $\{(x_i, y_i)\}_{i=1}^{L}$, $x_i \in R^d$ and $y_i \in \{+1, -1\}$, where $x_i$ is the i-th training example and $y_i$ is the corresponding class label.
• The hyperplane is given by $w^t x + b = 0$, where w is the parameter vector and b is the bias.
• A separating hyperplane satisfies the constraints: $y_i(w^t x_i + b) > 0$ for $i = 1, 2, \ldots, L$  (1)
• Canonical form (reduces the search space): $y_i(w^t x_i + b) \geq 1$ for $i = 1, 2, \ldots, L$  (2)
• The distance between the nearest example and the separating hyperplane (the margin) is $\frac{1}{\|w\|}$.
• Learning problem: the solution maximises $\frac{1}{\|w\|}$, or equivalently minimises $\|w\|$.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 78/133
Optimal Separating Hyperplane for linearly Separable Classes III
• Constrained optimisation: minimise $J(w) = \frac{1}{2}\|w\|^2$  (3)
• The Lagrangian objective function: $L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i \left[ y_i(w^t x_i + b) - 1 \right]$  (4)
  where the $\alpha_i$ are nonnegative and are called Lagrange multipliers.
• A saddle point of the Lagrangian objective function is a solution.
• The Lagrangian objective function is minimised with respect to w and b, and then maximised with respect to α.
• The conditions of optimality due to minimisation are:
  $\frac{\partial L_p(w, b, \alpha)}{\partial w} = 0$  (5)
  $\frac{\partial L_p(w, b, \alpha)}{\partial b} = 0$  (6)
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 79/133
Optimal Separating Hyperplane for linearly Separable Classes IV
• Application of the optimality conditions gives:
  $w = \sum_{i=1}^{L} \alpha_i y_i x_i$  (7)
  $\sum_{i=1}^{L} \alpha_i y_i = 0$  (8)
• Substituting w from (7) in (4) and using the condition in (8) gives the dual form:
  $L_d(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j x_i^t x_j$  (9)
• Maximising the objective function $L_d(\alpha)$ subject to the constraints
  $\sum_{i=1}^{L} \alpha_i y_i = 0$  (10)
  $\alpha_i \geq 0$ for $i = 1, 2, \ldots, L$  (11)
  gives the optimum values of the Lagrange multipliers $\{\alpha_j^*\}_{j=1}^{L_s}$.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 80/133
Optimal Separating Hyperplane for linearly Separable Classes V
• The optimum parameter vector $w^*$ is given by
  $w^* = \sum_{j=1}^{L_s} \alpha_j^* y_j x_j$  (12)
  where $L_s$ is the number of support vectors.
• The discriminant function of the optimal hyperplane (the $x_j$ are the support vectors) is
  $D(x) = w^{*t} x + b^* = \sum_{j=1}^{L_s} \alpha_j^* y_j x^t x_j + b^*$  (13)
  where $b^*$ is the optimum bias.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 81/133
Maximum Margin Hyperplane I
For linearly non-separable classes, the constraints in (2) are modified to include non-negative slack variables $\xi_i$ as follows:
  $y_i(w^t x_i + b) \geq 1 - \xi_i$ for $i = 1, 2, \ldots, L$  (14)
• The slack variable $\xi_i$ measures the deviation of a data point $x_i$ from the ideal condition of separability.
• For $0 \leq \xi_i \leq 1$, the data point falls on the correct side of the separating hyperplane.
• For $\xi_i > 1$, the data point falls on the wrong side of the separating hyperplane.
• Support vectors are the data points that satisfy the constraint in (14) with the equality sign.
• The discriminant function of the optimal hyperplane for an input vector x is given by
  $D(x) = w^{*t} x + b^* = \sum_{j=1}^{L_s} \alpha_j^* y_j x^t x_j + b^*$  (15)
  where $b^*$ is the optimum bias.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 82/133
Maximum Margin Hyperplane II
[Figure: Maximum margin hyperplane in the $(x_1, x_2)$ plane – the hyperplane $w^t x + b = 0$, the margin hyperplanes $w^t x + b = +1$ and $w^t x + b = -1$, the support vectors, and the margin of width $\frac{1}{\|w\|}$.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 83/133
Support Vector Machines
[Figure: SVM as a network – the input vector x feeds a hidden layer of $L_s$ kernel nodes $K(x, x_1), \ldots, K(x, x_{L_s})$; their outputs, weighted by $\alpha_1^* y_1, \ldots, \alpha_{L_s}^* y_{L_s}$, are summed together with the bias $b^*$ at a linear output node to give $D(x)$.]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 84/133
Kernel Functions • Mercer kernels: Kernel functions that satisfy Mercer’s theorem • Kernel gram matrix: Contains the value of the kernel function on all pairs of data points in the training data set • Kernel gram matrix properties: positive semi-definite, for convergence of the iterative method Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 85/133
Nonlinearly Separable Problems I
• Optimal hyperplane in the high-dimensional feature space $\Phi(x)$. The Lagrangian objective function takes the following form:
  $L_d(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \Phi(x_i)^t \Phi(x_j)$  (16)
  subject to the constraints:
  $\sum_{i=1}^{L} \alpha_i y_i = 0$  (17)
  and $0 \leq \alpha_i \leq C$ for $i = 1, 2, \ldots, L$  (18)
• The discriminant function for an input vector x is given by:
  $D(x) = w^{*t} \Phi(x) + b^* = \sum_{j=1}^{L_s} \alpha_j^* y_j \Phi(x)^t \Phi(x_j) + b^*$  (19)
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 86/133
Nonlinearly Separable Problems II
Some commonly used inner-product kernel functions are as follows:
  Polynomial kernel: $K(x_i, x_j) = (a\, x_i^t x_j + c)^p$
  Sigmoidal kernel: $K(x_i, x_j) = \tanh(a\, x_i^t x_j + c)$
  Gaussian kernel: $K(x_i, x_j) = \exp(-\delta \|x_i - x_j\|^2)$
The table shows the performance of SVM vs MVN on rAga recognition. In the table:
• PCD ET - 12-bin equal temperament,
• PCD ET2 - 24-bin equal temperament,
• PCD JI - 22-bin just intonation.

            SVM        MVN
PCD ET      56.8755%   44.54%
PCD ET2     63.9318%   58.58%
PCD JI      58.9097%   50.62%

Table: Accuracy in raga recognition
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 87/133
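The three kernels above written out in numpy, together with the discriminant of (19) evaluated with a kernel in place of the inner product; the constants a, c, p and δ, the support vectors and the multipliers are all placeholder values.

```python
import numpy as np

def polynomial_kernel(xi, xj, a=1.0, c=1.0, p=2):
    return (a * xi @ xj + c) ** p

def sigmoidal_kernel(xi, xj, a=0.01, c=0.0):
    return np.tanh(a * xi @ xj + c)

def gaussian_kernel(xi, xj, delta=0.5):
    return np.exp(-delta * np.sum((xi - xj) ** 2))

def discriminant(x, support_vectors, alphas, labels, bias, kernel):
    """D(x) = sum_j alpha_j* y_j K(x, x_j) + b*, as in (19)."""
    return sum(a * y * kernel(x, xj) for a, y, xj in zip(alphas, labels, support_vectors)) + bias

# Toy evaluation with two hypothetical support vectors
sv = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(discriminant(np.array([0.5, 0.5]), sv, [0.8, 0.8], [+1, -1], 0.0, gaussian_kernel))
```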
Summary of rAga recognition techniques
• Discriminative training helps.
• Multivariate unimodal distributions show comparable performance.
• A GMM whose number of mixtures is based on the number of notes in the rAga is a poor model.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 88/133
Support Vector Machines and Music Analysis Advantages: • Discriminative Learning. • Testing – simple inner product. • Gives identity for patterns (support vectors) that discriminate between a pair of classes (example: two allied rAgas ) Disadvantages: • Fixed length pattern. • Requires transformation of data to fixed length patterns – intermediate matching kernel. • Finding an appropriate kernel is a hard task. Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 89/133
8. Non-Negative Matrix Factorisation Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 90/133
Non-negative Matrix Factorisation (NMF) I Templates of sounds as building blocks: • Notes form music. • Phoneme-like structures combine to form speech. Sounds correspond to such building blocks • Building blocks are structures that appear repeatedly. • Basic building blocks do not distort each other – but add constructively. Goal: To learn these building blocks from data. Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 91/133
An Example: A Vocal Carnatic Music Performance
• We see and hear 4 or 5 distinct voices (or sources) – lead vocalist, drone, violin, mridanga, ghatam.
• We should discover 4 or 5 building blocks.
• A solution: Non-negative Matrix Factorisation (NMF).
• The spectrum V can be represented as a linear combination of a dictionary of spectral vectors: $V \approx WH$. Here W corresponds to the dictionary and H corresponds to the activations of the spectral vectors.
• Optimisation problem: estimate W and H such that the divergence between V and WH is minimised:
  $\underset{W, H}{\operatorname{argmin}}\ D(V \mid WH)$ subject to $W \geq 0$, $H \geq 0$.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 92/133
Popular Divergence Measures I
• Euclidean: $D_{EUC}(V \mid WH) = \sum_{f} \sum_{t} (V_{f,t} - [WH]_{f,t})^2$
• Generalised Kullback-Leibler (KL): $D_{KL}(V \mid WH) = \sum_{f} \sum_{t} \left( V_{f,t} \log \frac{V_{f,t}}{[WH]_{f,t}} - V_{f,t} + [WH]_{f,t} \right)$
• Itakura-Saito (IS): $D_{IS}(V \mid WH) = \sum_{f} \sum_{t} \left( \frac{V_{f,t}}{[WH]_{f,t}} - \log \frac{V_{f,t}}{[WH]_{f,t}} - 1 \right)$
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 93/133
Update Rules
Solution to the optimisation problem:
• Randomly initialise W and H.
• Iterate between the following multiplicative update rules (for the generalised KL divergence):
  $H_{k,t} \leftarrow H_{k,t} \dfrac{\sum_f W_{f,k}\, V_{f,t} / (WH)_{f,t}}{\sum_{f'} W_{f',k}}$
  $W_{f,k} \leftarrow W_{f,k} \dfrac{\sum_t H_{k,t}\, V_{f,t} / (WH)_{f,t}}{\sum_{t'} H_{k,t'}}$
Other divergence measures have similar update rules.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 94/133
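A direct transcription of the multiplicative updates above for the generalised KL divergence, with random non-negative initialisation; the matrix V here is a small random non-negative array standing in for a magnitude spectrogram.

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, eps=1e-9):
    """Factorise V (F x T) as W (F x K) times H (K x T) using KL multiplicative updates."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H_{k,t} <- H_{k,t} * (sum_f W_{f,k} V_{f,t}/WH_{f,t}) / (sum_f' W_{f',k})
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]
        WH = W @ H + eps
        # W_{f,k} <- W_{f,k} * (sum_t H_{k,t} V_{f,t}/WH_{f,t}) / (sum_t' H_{k,t'})
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]
    return W, H

# Stand-in for a magnitude spectrogram: 64 frequency bins, 100 frames, 4 components
V = np.random.default_rng(1).random((64, 100))
W, H = nmf_kl(V, K=4)
print(np.linalg.norm(V - W @ H))
```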
Example of Tanpura Extraction from a Vocal Concert Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 95/133
Pitch Extraction from Synthesised Tanpura
[Figure: Pitch contour of the mixture signal (top) and the pitch contour after NMF-based separation (bottom).]
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 96/133
NMF and its issues • Learning the dictionary requires good training examples. • The piece has to be first segmented to get separation of voices – refer to Ashwin Bellur’s presentation on the same. • The approach seems to be promising for source separation. Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 97/133
9. Bayesian Information Criterion Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 98/133
Bayesian Information Criterion for Change Point Detection I
A maximum likelihood approach to change point detection:
• Let $x = \{x_i \in R^d, i = 1, \ldots, N\}$ be a set of feature vectors extracted from audio.
• $x_i \sim \mathcal{N}(\mu_i, \Sigma_i)$, where $\mu_i$ is the mean vector and $\Sigma_i$ is the full covariance matrix.
• Change point detection can be posed as a hypothesis testing problem at time i:
  $H_0$: $x_1, \cdots, x_N \sim \mathcal{N}(\mu, \Sigma)$
  versus
  $H_1$: $x_1, \cdots, x_i \sim \mathcal{N}(\mu_1, \Sigma_1)$; $x_{i+1}, \cdots, x_N \sim \mathcal{N}(\mu_2, \Sigma_2)$
• The maximum likelihood ratio statistic is given by:
  $R(i) = N \log |\Sigma| - N_1 \log |\Sigma_1| - N_2 \log |\Sigma_2|$
  where N, $N_1$ and $N_2$ are the numbers of samples in the entire segment, the segment from 1 to i, and the segment from i+1 to N, respectively.
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s Slide 99/133
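A sketch of the likelihood ratio statistic R(i) above, computed for every admissible candidate change point in a synthetic feature stream; the margin that keeps both segments large enough to estimate a covariance is an arbitrary choice, not something specified in the slides.

```python
import numpy as np

def logdet_cov(X):
    """Log determinant of the sample covariance of the rows of X."""
    sign, logdet = np.linalg.slogdet(np.cov(X, rowvar=False))
    return logdet

def likelihood_ratio(X):
    """R(i) = N log|Sigma| - N1 log|Sigma1| - N2 log|Sigma2| for every candidate i."""
    N, d = X.shape
    R = np.full(N, -np.inf)
    full = N * logdet_cov(X)
    for i in range(2 * d, N - 2 * d):        # keep both segments large enough for a covariance estimate
        R[i] = full - i * logdet_cov(X[:i]) - (N - i) * logdet_cov(X[i:])
    return R

# Synthetic feature stream: the statistics change at frame 200
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(2.0, 0.5, (150, 5))])
R = likelihood_ratio(X)
print("estimated change point:", int(np.argmax(R)))
```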