Automatic Classification of Fricatives Using t-SNE
Yizhar Lavner (1) and Alex Frid (1,2)
(1) Department of Computer Science, Tel-Hai College, Israel
(2) Edmond J. Safra Brain Research Center for the Study of Learning Disabilities, University of Haifa, Israel
XXII Annual Pacific Voice Conference, Krakow, Poland, April 2014
Phoneme analysis
• The fricatives were analyzed, and various features (in the time and spectral domains) were computed.
Supervised Learning using t-SNE
• t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten & Hinton, 2008)
• A non-linear method for dimensionality reduction.
• t-SNE aims to preserve the local neighborhood structure of a set of data points in a high-dimensional space while mapping them into a 2- or 3-dimensional space.
• Global structure, such as clusters, can also be preserved.
t-SNE – cont.
• High-dimensional space: distances between data points are converted into pairwise conditional probabilities (similarities, affinities) according to a Gaussian pdf centered at $x_i$:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
• Setting:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
[Figure: Gaussian pdf centered at $x_i$, plotted over axes $x_1$ and $x_2$, with a neighbor $x_j$ marked]
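The two formulas for the high-dimensional affinities can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a single fixed bandwidth `sigma` is used for every point (full t-SNE tunes each $\sigma_i$ to match the chosen perplexity), the four toy points are made up, and the function names `conditional_probs` and `joint_probs` are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_probs(X, sigma=1.0):
    """Matrix of p_{j|i}: Gaussian affinities centered at each x_i.

    Assumption: one fixed sigma for all points; t-SNE itself searches
    for a per-point sigma_i that matches the target perplexity.
    """
    n = len(X)
    P = []
    for i in range(n):
        # The sum runs over k != i, so the self-affinity stays at zero.
        w = [0.0 if k == i else math.exp(-sq_dist(X[i], X[k]) / (2 * sigma ** 2))
             for k in range(n)]
        total = sum(w)
        P.append([wk / total for wk in w])
    return P

def joint_probs(P_cond):
    """Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

# Toy data (hypothetical): three nearby points and one distant point.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
P_cond = conditional_probs(X)
P = joint_probs(P_cond)
```

Each row of `P_cond` sums to 1, while the symmetrized matrix `P` sums to 1 over all pairs, which is what lets it be compared with the low-dimensional joint distribution.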
t-SNE – cont.
• Low-dimensional space: distances between map points are converted into pairwise joint probabilities using a Student-t distribution (1 df):
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
[Figure: plot of the Student-t distribution]
• Better optimization, aims at solving the crowding problem (heavy-tailed distribution).
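A minimal sketch of the low-dimensional affinities, again in plain Python. The map points `Y` are invented for illustration and `student_t_affinities` is a hypothetical name. The last two lines compare the unnormalized Gaussian and Student-t kernels at the same distance, to show what "heavy-tailed" means in practice.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def student_t_affinities(Y):
    """Matrix of q_ij using the Student-t kernel with one degree of freedom."""
    n = len(Y)
    # Unnormalized weights (1 + ||y_k - y_l||^2)^-1, zero on the diagonal.
    W = [[0.0 if k == l else 1.0 / (1.0 + sq_dist(Y[k], Y[l]))
          for l in range(n)] for k in range(n)]
    total = sum(sum(row) for row in W)  # sum over all pairs k != l
    return [[w / total for w in row] for row in W]

# Toy map points (hypothetical): two close points and one farther away.
Y = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
Q = student_t_affinities(Y)

# Kernel values at distance 4: the Student-t tail decays far more slowly.
gauss_tail = math.exp(-(4.0 ** 2) / 2.0)
t_tail = 1.0 / (1.0 + 4.0 ** 2)
```

Because `t_tail` is much larger than `gauss_tail`, moderately distant points in the high-dimensional space can be placed far apart in the map without a large penalty, which is how the heavy tail eases the crowding problem.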
t-SNE – cont.
• The map points (low-dimensional space) are found by minimizing the cost function (Kullback-Leibler divergence):
$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
• The gradient of KL:
$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$
• Optimization: gradient descent with a momentum term
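The cost, its gradient, and a momentum update can be sketched as follows. This is an illustrative toy and not the authors' implementation: the 3-point joint distribution `P`, the starting map `Y0`, and the step size, momentum, and iteration count are all assumptions chosen only to keep the example small.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def student_t(Y):
    """Return (Q, W): normalized q_ij and raw weights (1 + ||y_i - y_j||^2)^-1."""
    n = len(Y)
    W = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j]))
          for j in range(n)] for i in range(n)]
    Z = sum(sum(row) for row in W)
    return [[w / Z for w in row] for row in W], W

def kl_cost(P, Y):
    """C = sum over i != j of p_ij * log(p_ij / q_ij)."""
    Q, _ = student_t(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j and P[i][j] > 0)

def kl_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1."""
    Q, W = student_t(Y)
    n, d = len(Y), len(Y[0])
    G = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                for c in range(d):
                    G[i][c] += coeff * (Y[i][c] - Y[j][c])
    return G

def descend(P, Y, steps=100, lr=0.5, momentum=0.5):
    """Gradient descent with a momentum term (hyperparameters are assumptions)."""
    Y = [row[:] for row in Y]              # work on a copy
    V = [[0.0] * len(Y[0]) for _ in Y]     # velocity, one entry per coordinate
    for _ in range(steps):
        G = kl_gradient(P, Y)
        for i in range(len(Y)):
            for c in range(len(Y[0])):
                V[i][c] = momentum * V[i][c] - lr * G[i][c]
                Y[i][c] += V[i][c]
    return Y

# Toy symmetric joint distribution over 3 points (hypothetical), sums to 1.
P = [[0.0, 0.3, 0.1], [0.3, 0.0, 0.1], [0.1, 0.1, 0.0]]
Y0 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cost_before = kl_cost(P, Y0)
Y1 = descend(P, Y0)
cost_after = kl_cost(P, Y1)
```

The sign of each term shows the behavior: where $p_{ij} > q_{ij}$ the update pulls map points $i$ and $j$ together, and where $p_{ij} < q_{ij}$ it pushes them apart.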
t-SNE
• [3D images / movie: scatter plot of the embedded frames, with legend /s/, /∫/, /f/, /θ/]
Classification using t-SNE
Speech frames → Feature vectors (d=24) → t-SNE (d=(2,3,…), perplexity) → Mapped vectors (d=3) → kNN (k) / Majority vote → /s/, /∫/, /f/, /θ/
• 25,000 feature vectors (each from one frame of 8 msec).
• Parameter selection based on preliminary experiments (perplexity = 5-10, k = 7-9, d = 3).
• Cross validation: 100 runs, 80% train, 20% test.
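The last two stages of the pipeline above (kNN on the mapped vectors, then a majority vote over the frames of one fricative) and the repeated 80/20 split can be sketched as below. Everything in the data is an assumption made for illustration: the 3-D vectors are synthetic Gaussian clusters standing in for already-mapped t-SNE output, the labels `"s"`, `"sh"`, `"f"`, `"th"` stand for /s/, /∫/, /f/, /θ/, and only 5 runs are used instead of 100. The t-SNE mapping itself is not performed here.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def knn_predict(train_X, train_y, x, k=7):
    """Label of x by a vote among its k nearest training vectors."""
    nearest = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def classify_segment(train_X, train_y, frames, k=7):
    """Classify every frame, then take the majority vote over the segment."""
    frame_labels = [knn_predict(train_X, train_y, f, k) for f in frames]
    return Counter(frame_labels).most_common(1)[0][0], frame_labels

def repeated_holdout(X, y, runs=5, train_frac=0.8, k=7, seed=0):
    """Mean frame correct rate over repeated random train/test splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(runs):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        tX, ty = [X[i] for i in train], [y[i] for i in train]
        hits = sum(knn_predict(tX, ty, X[i], k) == y[i] for i in test)
        rates.append(hits / len(test))
    return sum(rates) / len(rates)

# Synthetic stand-in for mapped 3-D vectors (hypothetical cluster centers).
rng = random.Random(0)
centers = {"s": (0, 0, 0), "sh": (6, 0, 0), "f": (0, 6, 0), "th": (0, 0, 6)}
X, y = [], []
for label, center in centers.items():
    for _ in range(30):
        X.append([c + rng.gauss(0, 1) for c in center])
        y.append(label)

frame_rate = repeated_holdout(X, y)

# One test "fricative" made of 10 frames drawn around the "f" center.
segment = [[c + rng.gauss(0, 1) for c in centers["f"]] for _ in range(10)]
segment_label, frame_labels = classify_segment(X, y, segment)
```

The majority vote is what turns per-frame decisions into a single decision per fricative, so a few misclassified frames inside a segment do not change its final label.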
Results
• Before dimension reduction (kNN with d=24): frames correct rate = 76.8%.
• After mapping into 3-D with t-SNE, using kNN: frames correct rate = 73.6%.
(columns: actual fricative; rows: detected as)
Detected as \ Fricative    /s/      /∫/      /f/      /θ/
/s/                        83.8%    5.3%     11.0%    8.2%
/∫/                        2.6%     85.7%    1.8%     12.8%
/f/                        10.5%    3.0%     69.8%    35.6%
/θ/                        3.1%     6.0%     17.4%    43.4%
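For readers who want to reproduce this kind of table from their own frame decisions, here is a small sketch of how a confusion table in percent (each actual-fricative column summing to 100%) and the frames correct rate can be computed. The label lists are invented toy data and not the study's results, and the function names are hypothetical.

```python
from collections import Counter

CLASSES = ["s", "sh", "f", "th"]  # stand-ins for /s/, /∫/, /f/, /θ/

def confusion_percent(actual, detected, classes=CLASSES):
    """table[detected][actual] in percent; each actual column sums to 100."""
    counts = Counter(zip(detected, actual))
    totals = Counter(actual)
    return {d: {a: (100.0 * counts[(d, a)] / totals[a]) if totals[a] else 0.0
                for a in classes}
            for d in classes}

def correct_rate(actual, detected):
    """Percentage of frames whose detected label equals the actual label."""
    return 100.0 * sum(a == d for a, d in zip(actual, detected)) / len(actual)

# Toy frame labels (hypothetical).
actual   = ["s", "s", "s", "s", "f", "f", "f",  "f", "th", "th", "sh", "sh"]
detected = ["s", "s", "s", "f", "f", "f", "th", "s", "f",  "th", "sh", "sh"]

table = confusion_percent(actual, detected)
rate = correct_rate(actual, detected)
```

Note that the overall correct rate weights classes by their frame counts, so it cannot be recovered from the per-class percentages alone.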
Some More Results
• Before dimension reduction (SVM with d=12): frames correct rate = 86.9% (with majority vote).
• After mapping into 3-D with t-SNE, using SVM: frames correct rate = 89.4% (use of majority vote raises the results by 3%).
(rows: actual fricative; columns: detected as)
Fricative \ Detected as    /s/      /∫/      /f/      /θ/
/s/                        88.5%    2.2%     6.5%     2.8%
/∫/                        3.0%     90.9%    2.3%     3.8%
/f/                        4.4%     0.9%     85.8%    8.9%
/θ/                        0.5%     0.0%     7.0%     92.5%
Frid-Lavner (IWSSIP 2014)