
Deep complementary features for speaker identification in TV broadcast data



  1. Deep complementary features for speaker identification in TV broadcast data. Mateusz Budnik 1, Ali Khodabakhsh 2, Laurent Besacier 1, Cenk Demiroglu 2. 1 Univ. Grenoble-Alpes, 2 Ozyegin University

  2. Agenda ● Motivation ● Related work ● System overview ● Experimental setup and dataset ● Results ● Conclusion and perspectives

  3. Motivation ● To investigate the use of a Convolutional Neural Network (a typical image-processing approach) for the task of speaker identification ● To study its fusion with more traditional systems

  4. Related work ● In [1] a CNN is trained on spectrograms in order to identify disguised voices ● [2] uses 1D convolutions on filter banks; surrounding frames are taken into account and serve as context to reduce the impact of noise. References: 1. Lior Uzan and Lior Wolf, “I know that voice: Identifying the voice actor behind the voice,” in Biometrics (ICB), 2015 International Conference on, IEEE, 2015, pp. 46–51. 2. Pavel Matejka, Le Zhang, Tim Ng, HS Mallidi, Ondrej Glembek, Jeff Ma, and Bing Zhang, “Neural network bottleneck features for language identification,” Proc. IEEE Odyssey, pp. 299–304, 2014.

  5. System overview

  6. Approaches ● Convolutional Neural Network (CNN) ● TVS (total variability space, i-vectors) ● GMM-UBM ● PLDA

  7. The network structure

  8. CNN setup ● Trained for around 12 epochs ● ReLU and dropout (rate 0.5) after each fully connected (FC) layer ● No random cropping or rotation ● Average pooling instead of max pooling ● Scores over individual spectrograms are averaged to get the score for a given speech segment
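
A minimal PyTorch sketch of a CNN along these lines. The actual number and size of the layers come from the network-structure figure (slide 7) and are not stated in the text, so the convolutional and FC dimensions below are placeholders, not the authors' configuration; only the choices named on the slide (ReLU + dropout 0.5 after each FC layer, average pooling instead of max pooling, score averaging over spectrograms) are taken from the source.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Hypothetical CNN over 48x128 log-spectrogram patches.

    Layer counts and sizes are placeholders; only the stated design choices
    are reproduced: ReLU + dropout (0.5) after each FC layer and average
    pooling instead of max pooling.
    """
    def __init__(self, n_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                      # average pooling, not max
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 32, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, n_speakers),
        )

    def forward(self, x):                         # x: (batch, 1, 48, 128)
        return self.classifier(self.features(x))

def segment_score(model: SpectrogramCNN, spectrograms: torch.Tensor) -> torch.Tensor:
    """Average per-spectrogram scores to score one whole speech segment."""
    with torch.no_grad():
        logits = model(spectrograms)              # (n_spectrograms, n_speakers)
    return logits.softmax(dim=1).mean(dim=0)      # one score vector per segment
```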

  9. GMM-UBM, TVS and PLDA ● A UBM consisting of 1024 Gaussians is trained on the training data ● Segmentation outputs come from a conventional BIC criterion ● The i-vector dimension is 500 ● Length normalization is used
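
A hedged sketch of the two pieces that are fully specified on the slide, using scikit-learn as an assumed stand-in for the UBM training: the BIC segmentation, total variability matrix training, i-vector extraction and PLDA scoring are not reproduced here. Everything except the 1024 Gaussian components, the 500-dimensional i-vectors and the length normalization is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(train_frames: np.ndarray) -> GaussianMixture:
    """Train a 1024-component UBM on pooled training frames
    (rows = frames, columns = acoustic features)."""
    ubm = GaussianMixture(n_components=1024, covariance_type="diag",
                          max_iter=100)
    ubm.fit(train_frames)
    return ubm

def length_normalize(ivectors: np.ndarray) -> np.ndarray:
    """Length-normalize 500-dimensional i-vectors (rows) before PLDA."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / np.maximum(norms, 1e-12)
```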

  10. Fusion ● Fusion between TVS and CNN ● Late fusion ● Duration-based late fusion: ○ s = (1 − tanh(d)) · s_cnn + s_ivec, where d is the segment duration ● Early fusion with SVMs: ○ CNN’s last hidden layer with PCA (500) + i-vector (500) ○ Linear SVM
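
Both fusion rules are simple enough to sketch. The late-fusion formula is taken directly from the slide, with d assumed to be the segment duration in seconds; the early-fusion sketch assumes scikit-learn's PCA and LinearSVC as stand-ins for an implementation the slide does not specify.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def duration_based_late_fusion(s_cnn: np.ndarray, s_ivec: np.ndarray,
                               duration: float) -> np.ndarray:
    """Slide formula: s = (1 - tanh(d)) * s_cnn + s_ivec.

    The CNN contribution shrinks as the segment duration d grows, so the
    i-vector score dominates on long segments.
    """
    return (1.0 - np.tanh(duration)) * s_cnn + s_ivec

def early_fusion_svm(cnn_hidden: np.ndarray, ivectors: np.ndarray,
                     labels: np.ndarray):
    """Early fusion sketch: PCA-reduce the CNN's last hidden layer to 500
    dimensions, concatenate with the 500-dim i-vector, train a linear SVM."""
    pca = PCA(n_components=500).fit(cnn_hidden)
    fused = np.hstack([pca.transform(cnn_hidden), ivectors])
    clf = LinearSVC().fit(fused, labels)
    return pca, clf
```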

  11. Dataset ● The REPERE corpus ● French language, 7 types of videos (news, debates, etc.) ● Noisy and imbalanced ● Train set: ○ 821 speakers ○ 9377 speech segments from 148 videos (22 h of speech) ● Test set: ○ 113 speakers ○ 2410 segments from 57 videos (6 h of speech)

  12. Dataset. Total amount of speech per speaker for speakers present in both the train and test sets of the REPERE corpus. Speakers are sorted according to total speech duration in the training set.

  13. Experimental setup ● In the test set: ○ 24.8% of speech segments are shorter than 2 seconds ○ 70.4% are shorter than 10 seconds ● MFCC: ○ 19 coefficients are extracted every 10 ms with a window length of 20 ms ○ concatenated with delta and delta-delta coefficients ○ 59-dimensional feature vector after feature warping
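
A hedged sketch of a comparable MFCC front end. The library (librosa) and sampling rate are assumptions; feature warping, and whatever brings the vector from the 57 dimensions shown here to the 59 stated on the slide, is not reproduced.

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str) -> np.ndarray:
    """19 MFCCs with a 20 ms window and 10 ms shift, concatenated with
    delta and delta-delta coefficients (feature warping omitted)."""
    y, sr = librosa.load(wav_path, sr=16000)              # assumed sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                                win_length=int(0.020 * sr),  # 20 ms window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T              # (n_frames, 57)
```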

  14. Experimental setup ● Spectrograms: ○ 240 ms duration, extracted at a rate of 25 Hz ○ 200 ms overlap between neighboring spectrograms ○ For each spectrogram: ■ the audio segment was windowed every 5 ms with a window length of 20 ms ■ Hamming windowing ■ log-spectral amplitude extraction ■ final resolution: 48x128 pixels
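
A sketch of the spectrogram extraction as described, assuming a 16 kHz sampling rate; the final resize or crop to the 48x128 target resolution is not specified on the slide and is omitted.

```python
import numpy as np

def log_spectrogram(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """One 240 ms chunk: 20 ms Hamming window every 5 ms, log amplitudes."""
    win = int(0.020 * sr)                      # 20 ms analysis window
    hop = int(0.005 * sr)                      # 5 ms shift
    window = np.hamming(win)
    frames = []
    for start in range(0, len(segment) - win + 1, hop):
        frame = segment[start:start + win] * window
        mag = np.abs(np.fft.rfft(frame))
        frames.append(np.log(mag + 1e-10))     # log-spectral amplitude
    return np.array(frames).T                  # (freq bins, time frames)

def sliding_spectrograms(signal: np.ndarray, sr: int = 16000):
    """Cut a speech segment into 240 ms chunks with 200 ms overlap
    (40 ms step, i.e. 25 spectrograms per second)."""
    chunk, step = int(0.240 * sr), int(0.040 * sr)
    return [log_spectrogram(signal[s:s + chunk], sr)
            for s in range(0, len(signal) - chunk + 1, step)]
```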

  15. Results

  16. Results

  17. Conclusion and future work ● CNN + TVS fusion improves over the baseline ● More data may be needed for the CNN (and PLDA) ● Perspectives: ○ Multimodal CNN (including faces) ○ Vertical and horizontal CNNs for better insight
