Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

DCASE 2017. Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gómez, and Xavier Serra.


  1. Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
     DCASE 2017
     Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gómez, and Xavier Serra

  2. Outline
     ● Introduction
     ● Proposed System & Results
     ● Summary

  3. Introduction
     ● Acoustic Scene Classification (ASC)
       ⇀ 15 acoustic scenes
     [Diagram: recording → system → environment]

  6. Introduction
     ● Traditionally: feature engineering
       ⇀ feature extraction
       ⇀ classifier
     ● Nowadays: data-driven
       ⇀ learning representations
     How about combining both approaches for ASC?

  7. Proposed System
     [Block diagram]
     ● Input: 10 s segment
     ● GBM branch: splitting → Freesound Extractor → GBM → score aggregation
     ● CNN branch: pre-processing (mel-spectrogram) → splitting → CNN → score aggregation
     ● Both branches → late fusion → acoustic scene

  8. Gradient Boosting Machine
     [Diagram: splitting → n audio snippets → Freesound Extractor → n feature vectors → GBM → score aggregation (over n) → acoustic scene]
     ● Freesound Extractor (Essentia documentation):
       ⇀ http://essentia.upf.edu/documentation/freesound_extractor.html

  10. Gradient Boosting Machine
      ● Gradient Boosting Machine:
        ⇀ effective in Kaggle competitions
        ⇀ multiple weak learners (decision trees)
        ⇀ added iteratively
      ● Implementation:
        ⇀ LightGBM https://github.com/Microsoft/LightGBM
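The idea of weak learners added iteratively can be sketched in a few lines. This is a toy illustration, not the LightGBM implementation named on the slide: it boosts depth-1 regression stumps on squared error over a single feature. The names `fit_stump` and `boost`, the learning rate, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Return a depth-1 regression tree (one threshold) fitted to the residual."""
    best = None
    # Candidate thresholds; the largest value is skipped so the right side is never empty.
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def boost(x, y, n_rounds=20, lr=0.5):
    """Add stumps one at a time, each fitted to the current residual."""
    pred = np.full_like(y, y.mean(), dtype=float)
    losses = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)      # residual = negative gradient of squared loss
        pred = pred + lr * stump(x)         # shrink each weak learner's contribution
        losses.append(float(((y - pred) ** 2).mean()))
    return pred, losses

# Synthetic 1-D regression problem (hypothetical data).
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
pred, losses = boost(x, y)
```

Each round can only lower the training error here, because every stump is a least-squares fit to what the current ensemble still gets wrong.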

  11. Gradient Boosting Machine
      ● Score aggregation:
        ⇀ averaging scores across snippets
        ⇀ argmax
      ● Results:
        ⇀ development set
        ⇀ 4-fold cross-validation provided
        ⇀ Accuracy: 80.8%
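The score aggregation step (average across snippets, then argmax) might look like the following minimal sketch. The function name `aggregate_scores` and the example scores are hypothetical; the slides do not show the authors' code.

```python
import numpy as np

def aggregate_scores(snippet_scores):
    """Average per-snippet class scores and return (mean scores, predicted class index)."""
    snippet_scores = np.asarray(snippet_scores, dtype=float)  # shape: (n_snippets, n_classes)
    mean_scores = snippet_scores.mean(axis=0)                 # one score per class
    return mean_scores, int(np.argmax(mean_scores))

# Three snippets from one recording, three classes (made-up numbers).
scores = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.5, 0.4, 0.1]])
mean_scores, predicted = aggregate_scores(scores)
```

In this example the second snippet on its own favors class 1, but the average over all snippets favors class 0.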

  13. Convolutional Neural Network
      [Diagram: pre-processing → log-scaled mel-spectrogram → splitting → n T-F patches → CNN → score aggregation (over n) → acoustic scene]
      ● log-scaled mel-spectrogram
        ⇀ 128 bands
      ● Time splitting:
        ⇀ T-F patches of 1.5 s
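Time splitting of a 128-band spectrogram into 1.5 s patches can be sketched as below. The frame hop of 10 ms, the use of non-overlapping patches, and the dropping of the incomplete trailing patch are assumptions; the slides only state the band count and the patch duration. The random array stands in for a real log-scaled mel-spectrogram.

```python
import numpy as np

def split_into_patches(mel, patch_frames):
    """Cut a (n_bands, n_frames) spectrogram into non-overlapping patches along time."""
    n_bands, n_frames = mel.shape
    n_patches = n_frames // patch_frames             # incomplete trailing patch is dropped
    trimmed = mel[:, :n_patches * patch_frames]
    # Reshape to (n_bands, n_patches, patch_frames), then move the patch axis first.
    return trimmed.reshape(n_bands, n_patches, patch_frames).transpose(1, 0, 2)

hop_seconds = 0.01                                    # assumed frame hop, not from the slides
patch_frames = int(round(1.5 / hop_seconds))          # 1.5 s patch -> 150 frames
mel = np.random.default_rng(0).random((128, 1000))    # stand-in for a 10 s segment
patches = split_into_patches(mel, patch_frames)       # shape: (n_patches, 128, patch_frames)
```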

  16. Convolutional Neural Network
      ● Global time-domain pooling (Valenti, 2016)
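Global pooling over the time axis reduces each feature map to a summary that no longer depends on the number of time frames. The sketch below assumes max pooling and a `(n_filters, n_freq, n_time)` layout; the slide does not say which pooling function or tensor layout was used.

```python
import numpy as np

def global_time_pool(feature_maps):
    """Collapse the time axis (last axis) of (n_filters, n_freq, n_time) maps with a max."""
    return feature_maps.max(axis=-1)

rng = np.random.default_rng(0)
short_maps = rng.random((8, 4, 30))   # feature maps from a shorter input (hypothetical sizes)
long_maps = rng.random((8, 4, 75))    # feature maps from a longer input
pooled_short = global_time_pool(short_maps)
pooled_long = global_time_pool(long_maps)
```

Both pooled outputs have the same shape, so the layers that follow see a fixed-size input regardless of patch length.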

  18. Convolutional Neural Network
      ● Design of convolutional filters:
        ⇀ spectro-temporal patterns for ASC?
        ⇀ different rectangular filters (Pons, 2017) (Phan, 2016)
        ⇀ multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)
      [Figure: filter shapes for Q = 1]

  19. Convolutional Neural Network
      ● Design of convolutional filters:
        ⇀ spectro-temporal patterns for ASC?
        ⇀ different rectangular filters (Pons, 2017) (Phan, 2016)
        ⇀ multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)
      [Figure: filter shapes for Q = 4]

  22. Recap
      ● Feature engineering:
        ⇀ Freesound Extractor
        ⇀ GBM
        ⇀ Accuracy: 80.8%
      ● Data-driven:
        ⇀ log-scaled mel-spectrogram
        ⇀ CNN
        ⇀ Accuracy: 79.9%
      How differently do they behave?

  23. Models’ Comparison
      ● Confusion matrix by GBM − Confusion matrix by CNN

  24. Models’ Comparison
      ● Confusion matrix by GBM − Confusion matrix by CNN
      [Figure: difference matrix; legend: GBM performs better / CNN performs better]
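The comparison shown on this slide is an element-wise difference of two confusion matrices. A minimal sketch follows, with made-up labels and predictions for three classes; the helper `confusion_matrix` is written here for illustration and is not taken from the authors' code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each true class (rows) is predicted as each class (columns)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulates correctly for repeated index pairs
    return cm

# Hypothetical ground truth and predictions from the two models.
y_true = np.array([0, 0, 1, 1, 2, 2])
gbm_pred = np.array([0, 0, 1, 2, 2, 2])
cnn_pred = np.array([0, 1, 1, 1, 2, 0])

diff = confusion_matrix(y_true, gbm_pred, 3) - confusion_matrix(y_true, cnn_pred, 3)
```

On the diagonal of `diff`, a positive entry means the GBM classified that scene correctly more often and a negative entry means the CNN did. Off the diagonal the signs flip in meaning, since those cells count errors.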

  31. Late Fusion
      ● GBM:
        ⇀ prediction probabilities
      ● CNN:
        ⇀ softmax activation values
      ● Late fusion approach:
        ⇀ arithmetic mean + argmax
      ● System accuracy on development set:
        ⇀ 83.0%
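Late fusion by arithmetic mean and argmax can be sketched as follows. The function name `late_fusion` and the probability values are hypothetical, and equal weighting of the two models is what "arithmetic mean" implies rather than something the slide spells out.

```python
import numpy as np

def late_fusion(gbm_probs, cnn_probs):
    """Average the two models' class probabilities and pick the most likely class."""
    gbm_probs = np.asarray(gbm_probs, dtype=float)
    cnn_probs = np.asarray(cnn_probs, dtype=float)
    fused = (gbm_probs + cnn_probs) / 2.0
    return fused, int(np.argmax(fused))

# Recording-level outputs for three classes (made-up numbers).
gbm_probs = [0.50, 0.40, 0.10]   # GBM prediction probabilities
cnn_probs = [0.20, 0.70, 0.10]   # CNN softmax activation values
fused, predicted = late_fusion(gbm_probs, cnn_probs)
```

Here the GBM alone would pick class 0, but the CNN's stronger confidence in class 1 wins after averaging, which is how two models with different error patterns can correct each other.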

  34. Results
      ● residential area vs park
      ● tram vs train
      ● grocery store vs cafe/restaurant

  35. Challenge Ranking
      ● accuracy drop
      ● outperforming the baseline by 6.3% absolute

  36. Summary
      ● Ensemble of two models
      ● Simplicity of models:
        ⇀ GBM + out-of-box feature extractor
        ⇀ CNN using domain knowledge
        ⇀ providing complementary information
      ● Simple late fusion method
      ● Reasonable results, although with room for improvement
        ⇀ individual models
        ⇀ fusion approach

  37. Thank you!

  38. References
      ● H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio event recognition with 1-max pooling convolutional neural networks,” arXiv preprint arXiv:1604.06338, 2016.
      ● J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, “Timbre Analysis of Music Audio Signals with Convolutional Neural Networks,” in 25th European Signal Processing Conference (EUSIPCO 2017).
      ● M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016.
