Out-of-set i-vector selection for open-set language identification Hamid Behravan, Tomi Kinnunen, Ville Hautamäki School of Computing University of Eastern Finland Odyssey 2016 June 21-24 Bilbao
Closed-set: a test segment corresponds to one of the known target (in-set) languages. [Figure: a test utterance is presented to a system whose target languages are English, Spanish, Persian, Finnish and Swedish; the answer to "Which language?" is Spanish.]
Open-set: the language of a test segment might not be any of the in-set languages. [Figure: the target languages are English, Spanish and Persian; the non-target languages Finnish and Swedish, together with unknown languages, are represented by an out-of-set model. For a test utterance, the answer to "Which language?" is one of the target languages or out-of-set.]
One way to perform open-set LID is to train an out-of-set model. (LID: language identification)
What are good out-of-set candidates? Out-of-set candidates should come from different language families. Some out-of-set candidates should be close to the in-set languages, while others should be far away [Zhang and Hansen, 2014]. [Figure: in-set data for languages A and B, with good candidates for out-of-set data marked.] Q. Zhang and J. H. L. Hansen, “Training candidate selection for effective rejection in open-set language identification,” in Proc. of SLT, 2014, pp. 384–389.
Out-of-set candidate detection methods (1) One-class SVM. Idea: enclose the data with a hypersphere and classify new data as normal (+) if it falls within the hypersphere, and otherwise as out-of-set (-).
Out-of-set candidate detection methods (2) - K-nearest neighbour (kNN) [Figure: example with K = 3 and distances d1, d2, d3] - Distance to class mean
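The two baseline scores on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean metric, the averaging of the k nearest distances, and the function names knn_outlier_score and class_mean_outlier_score are all assumptions, since the slide only shows K = 3 and three distances.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_outlier_score(x, in_set, k=3):
    """Mean distance from x to its k nearest in-set vectors (larger = more outlying).

    Averaging the k distances is an assumption; other aggregations are possible.
    """
    nearest = sorted(euclidean(x, v) for v in in_set)[:k]
    return sum(nearest) / len(nearest)


def class_mean_outlier_score(x, in_set):
    """Distance from x to the mean of the in-set vectors."""
    dim = len(in_set[0])
    mean = [sum(v[d] for v in in_set) / len(in_set) for d in range(dim)]
    return euclidean(x, mean)


# Toy 2-D data standing in for i-vectors of one language.
in_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = (0.5, 0.5)
far = (5.0, 5.0)

print(knn_outlier_score(near, in_set), knn_outlier_score(far, in_set))
print(class_mean_outlier_score(near, in_set), class_mean_outlier_score(far, in_set))
```

With both scores, a vector far from the in-set data receives a larger value than one inside it, so thresholding the score gives an out-of-set decision.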
Proposed method: non-parametric Kolmogorov-Smirnov test. Idea: estimate whether two samples have the same underlying distribution by computing the maximum difference between their empirical cumulative distribution functions (ECDFs). [Figure: two ECDFs with their maximum difference (KS) marked.]
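As a small illustration of the idea, the two-sample KS statistic can be computed directly from the two ECDFs. The helper names ecdf and ks_statistic are chosen here for illustration only.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F, where F(t) is the fraction of sample values <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the ECDFs of the two samples."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    # ECDFs are step functions that only change at observed values,
    # so it is enough to compare them at those values.
    points = set(sample_a) | set(sample_b)
    return max(abs(f_a(t) - f_b(t)) for t in points)


print(ks_statistic([1, 2, 3], [1, 2, 3]))        # identical samples
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # partly overlapping samples
print(ks_statistic([1, 2, 3], [10, 11, 12]))     # disjoint samples
```

The statistic lies between 0 (identical ECDFs) and 1 (no overlap between the two samples).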
Adapting the Kolmogorov-Smirnov test to our open-set LID task. Goal: give each unlabeled i-vector an outlier score. [Figure: compute ECDFs, then take the average over all KS values.]
Computing the outlier score for an unlabeled i-vector [Figure: a Min operation applied over a series of values.]
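One possible reading of these steps is sketched below. It is a reconstruction from the slide labels (compute ECDFs, average the KS values, take the minimum), not the authors' code. The assumptions are: ECDFs are built from Euclidean distances to the i-vectors of one language, each in-set i-vector is excluded from its own distance list, and the minimum is taken over languages. The names kse and outlier_score are hypothetical.

```python
import math
from bisect import bisect_right


def distance(a, b):
    """Euclidean distance (an assumed metric; the slides do not name one)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the ECDFs of the two samples."""
    sa, sb = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(sa, t) / len(sa) - bisect_right(sb, t) / len(sb))
        for t in sa + sb
    )


def kse(unlabeled, language_vectors):
    """Average KS value between the unlabeled vector and one language.

    For every in-set vector, the ECDF of its distances to the other vectors
    of the language is compared with the ECDF of the unlabeled vector's
    distances to the same language.
    """
    d_unlabeled = [distance(unlabeled, v) for v in language_vectors]
    ks_values = []
    for i, w in enumerate(language_vectors):
        d_in_set = [distance(w, v) for j, v in enumerate(language_vectors) if j != i]
        ks_values.append(ks_statistic(d_in_set, d_unlabeled))
    return sum(ks_values) / len(ks_values)


def outlier_score(unlabeled, languages):
    """Minimum KSE over all in-set languages (assumed role of the Min step)."""
    return min(kse(unlabeled, vectors) for vectors in languages.values())


# Toy 2-D clusters standing in for the i-vectors of two in-set languages.
languages = {
    "A": [(0.5 * x, 0.5 * y) for x in range(3) for y in range(3)],
    "B": [(10 + 0.5 * x, 10 + 0.5 * y) for x in range(3) for y in range(3)],
}
near = (0.4, 0.6)       # lies inside cluster A
far = (100.0, -100.0)   # far from both clusters

print(outlier_score(near, languages), outlier_score(far, languages))
```

Under these assumptions, a vector that resembles at least one in-set language gets a score below 1, while a vector far from every language reaches the maximum score of 1.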
KSE values within each language are close to zero, whereas they tend toward 1 for out-of-set data. [Figure: distribution of in-set and OOS KSE values for two different languages, a) Dari and b) French.]
So far, four methods have been presented for out-of-set data selection.
NIST language i-vector challenge 2015 corpus. [Table: distribution of training, development and test sets from the NIST 2015 language i-vector machine learning challenge.] - 300 i-vectors for each of the 50 target languages - i-vectors are of dimensionality 400 - i-vectors are further post-processed by within-class covariance normalization (WCCN) and linear discriminant analysis (LDA)
Segmenting the training data into three portions for out-of-set evaluation. All portions are subsets of the original NIST 2015 LRE i-vector challenge training set.
Example of test utterance labeling for the evaluation of the out-of-set (OOS) data detection task, given multiple in-set languages
KSE outperforms kNN and one-class SVM, with relative EER reductions of 14% and 16%, respectively.
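For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the miss rate and the false alarm rate are equal. The sketch below approximates it by sweeping thresholds over the observed scores. The function name, the toy scores, and the convention that higher scores mean "more likely out-of-set" are assumptions for illustration; the slides do not describe the evaluation code.

```python
def equal_error_rate(oos_scores, in_set_scores):
    """Approximate EER for an OOS detector that flags a trial when score >= threshold.

    The threshold is swept over all observed scores, and the point where the
    miss rate and the false alarm rate are closest is reported.
    """
    best_gap, eer = None, None
    for t in sorted(set(oos_scores) | set(in_set_scores)):
        # OOS trials scored below the threshold are missed.
        miss = sum(s < t for s in oos_scores) / len(oos_scores)
        # In-set trials scored at or above the threshold are false alarms.
        false_alarm = sum(s >= t for s in in_set_scores) / len(in_set_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Toy outlier scores, not taken from the presented experiments.
separated = equal_error_rate([0.9, 0.95, 1.0], [0.1, 0.2, 0.3])
overlapping = equal_error_rate([0.2, 0.6, 0.8, 0.9], [0.1, 0.3, 0.4, 0.7])
print(separated, overlapping)
```

A relative EER reduction then compares two such values, for example (EER_baseline - EER_KSE) / EER_baseline.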
Fusion of KSE with the baseline OOS detection methods. Fusing KSE with one-class SVM yields the best performance.
The lowest identification cost is 26.61, a 33% relative improvement over the NIST baseline system.
Data selected for out-of-set modeling | Identification cost
Random (1067) | 32.11
Training (15000) | 32.61
Development (6431) | 31.23
Training + development (21431) | 31.74
Proposed selection method (1067) | 26.61
Closed-set (no OOS model) | 37.23
- Results are reported from the NIST evaluation online system
- Numbers in parentheses indicate the amount of data selected for OOS modeling
- The back-end is based on an SVM classifier
- NIST baseline result: 39.59
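The relative improvements can be checked from the reported costs with a short calculation. The function name relative_reduction is chosen here for illustration; the cost values are those listed on this slide.

```python
def relative_reduction(baseline_cost, proposed_cost):
    """Relative reduction of the identification cost, in percent."""
    return 100.0 * (baseline_cost - proposed_cost) / baseline_cost


proposed = 26.61
vs_nist_baseline = relative_reduction(39.59, proposed)
vs_closed_set = relative_reduction(37.23, proposed)
vs_development = relative_reduction(31.23, proposed)

print(f"vs NIST baseline:        {vs_nist_baseline:.1f}%")
print(f"vs closed-set (no OOS):  {vs_closed_set:.1f}%")
print(f"vs development set only: {vs_development:.1f}%")
```

The 33% figure corresponds to the comparison with the NIST baseline result of 39.59; measured against the closed-set system without an OOS model (37.23), the same calculation gives a smaller reduction.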
Open-set LID results for different out-of-set data selection methods. KSE outperforms the other methods. - The results are reported from the NIST evaluation online system. - Out of 1500 out-of-set samples, 1012 are correctly classified as out-of-set using KSE.
A simple and effective technique to find out-of-set data in the i-vector space. Open-set LID: 33% relative reduction in identification cost over the NIST baseline system.