  1. Speech segment classification on music radio shows using machine learning algorithms Tim Scarfe, Yuri Kalnishkan

  2. Introduction • This was my bachelor thesis; recently we reproduced the results and submitted a paper on it to DS2010. The paper, data, samples etc. are on my web site @ http://www.developer-x.com/papers/asot/DS2010_svm_voice_segment_r10.pdf • We were interested in predicting intervals of speech in electronic dance music radio shows • We ended up working on one show, “A State of Trance”, hosted by current world #1 DJ Armin van Buuren • We were specifically interested in Armin’s voice, not other people’s voices, singing, etc.

  3. Why? • It’s very useful to have temporal metadata for audio streams, e.g. when are the adverts? When is the traffic information? • Audio is slower to index/label than video. With video you can scrub through and ascertain the structure quickly. • Most ad hoc audio streams don’t have associated temporal metadata.

  4. Methodology • These radio shows are 2 hours long • We took an approach typical of machine learning: discretising the show into feature vectors (each representing 1 second) and training a learning machine on historical examples • For simplicity we worked with 5-minute segments (299 seconds from each) from 9 different shows • We labelled them, used one for training, and concatenated the remaining 8 for testing.

  5. Data (post-feature extraction) • Training Set – 299 examples • 28 speech • 271 non-speech • Test Set – 2392 examples • 291 speech • 2101 non-speech • 1 second : 1 example

  6. Audio Analysis (1) • Audio is in the time domain... • You can’t do much useful analysis in the time domain unless it’s just a sine wave!

  7. Audio Analysis (2) • The problem is, in the time domain everything gets mixed together. Here are 2 simple sine waves mixed up:

  8. Audio Analysis (3) • Now let’s look at some typical audio from one of these radio shows. • Here are about 230 samples of stereo audio. What a mess!

  9. Audio Analysis (4) • What if there was a way to transform the signal into the frequency domain, and discard all time information? • Enter Fourier analysis

  10. Fourier Analysis • Fourier analysis represents a function as a sum of trigonometric functions oscillating at integer multiples of a fundamental frequency.
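
For reference, the standard discrete Fourier transform (not shown in the transcript) of N samples x_0, …, x_{N−1} is:

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1
```

Each X_k measures how strongly the frequency of k cycles per N samples is present in the signal.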

  11. Temporal Feature Extraction Strategy • We could just window the audio at 44100-sample intervals and run a DFT on each window. • However this is too coarse; we want to capture the temporal “fabric” or “texture” of the audio. • Instead we capture lots of small features (DFTs and others; say 100 in a second) and then merge them back into one feature vector representing one second using means and variances. • Enter the STFT, or Short Time Fourier Transform

  12. Short Time Fourier Transform (STFT) • Because we want a high number of DFT windows per second, the number of samples in each gets low: 44100/100 == 441 samples per window. With so few samples we want to use overlapping windows and apply a windowing function to reduce spectral “leakage”.

  13. Short Time Fourier Transform (STFT) (2) • Hann window function • Rectangle window function • Main STFT function (DFT with windowing added)
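
The formulas on this slide did not survive the transcript; the standard textbook definitions they most likely correspond to are:

```latex
w_{\text{hann}}(n) = \tfrac{1}{2}\Bigl(1 - \cos\frac{2\pi n}{N-1}\Bigr), \qquad
w_{\text{rect}}(n) = 1, \qquad
X(m, k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-2\pi i k n / N}
```

Here N is the window length, H is the hop size between successive windows, and w is the chosen window function.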

  14. Visualising the STFT – the Spectrogram! • Panel labels: Speech / Music
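
A minimal NumPy sketch of the STFT pipeline as described on slides 11–12; the function and parameter names are illustrative, not the paper’s code:

```python
import numpy as np

def stft(x, n_win=441, hop=220):
    """Magnitude STFT with a Hann window and ~50% overlap.

    441 samples per window gives roughly 100 windows per second
    of 44.1 kHz audio, as described on slide 11.
    """
    w = np.hanning(n_win)                        # Hann window to reduce leakage
    n_frames = 1 + (len(x) - n_win) // hop
    frames = np.stack([x[i * hop : i * hop + n_win] * w
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # one magnitude spectrum per frame

# Usage: plotting spec as an image (time on x, frequency on y)
# gives the spectrogram shown on this slide.
x = np.random.randn(44100)  # stand-in for 1 second of mono audio
spec = stft(x)              # shape: (n_frames, n_win // 2 + 1)
```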

  15. Spectrogram of a violin playing • Human Hearing • Critical Bands... • Timbre and Musical Instruments

  16. Information Overload! • Due to the Nyquist theorem, we still have samplerate/2 == 22050 attributes in each feature vector. This is way too much. • The first thing we do is “discretise” these STFT vectors into x “bins” — 64, 32 or 8 in our experiments. In each bin we simply take the mean value. • So now we have the rich frequency information broken up into a manageable number of bins. • Another thing we do for some models (discussed later) is downsample the audio (to 22050 Hz) before processing it
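
A sketch of the binning step, assuming near-equal-width frequency bands (the paper may use a different band layout):

```python
import numpy as np

def bin_spectrum(mag, n_bins=64):
    """Discretise a magnitude spectrum into n_bins by mean-averaging.

    n_bins is 64, 32 or 8 in the experiments described on this slide.
    """
    chunks = np.array_split(mag, n_bins)         # near-equal frequency bands
    return np.array([c.mean() for c in chunks])  # mean value per band
```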

  17. Richer Feature Extraction • The binned STFT would work as a feature in itself (and we do use it), but we can extract even more from it! • We can write feature detectors that operate in the frequency domain.

  18. Frequency Domain Features (1)

  19. Frequency Domain Features (2) • Spectral Centroid
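
The slide’s formula is not reproduced in the transcript; the usual definition of the spectral centroid is the magnitude-weighted mean frequency:

```latex
C = \frac{\sum_k f_k \, |X_k|}{\sum_k |X_k|}
```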

  20. Frequency Domain Features (3) • Entropy • Bandwidth • Energy • Flatness/tonality • Rolloff
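
A sketch of these features using common textbook definitions; the paper’s exact formulas may differ:

```python
import numpy as np

def spectral_features(mag, freqs, eps=1e-12):
    """Common textbook versions of the slide's frequency-domain features.

    mag: magnitude spectrum; freqs: corresponding bin frequencies.
    """
    p = mag / (mag.sum() + eps)                      # normalised spectrum
    centroid = (freqs * p).sum()
    entropy = -(p * np.log2(p + eps)).sum()          # spectral entropy
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * p).sum())
    energy = (mag ** 2).sum()
    flatness = (np.exp(np.mean(np.log(mag + eps)))   # geometric mean over
                / (mag.mean() + eps))                # arithmetic mean
    cum = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cum, 0.85 * cum[-1])]  # 85% energy point
    return entropy, bandwidth, energy, flatness, rolloff
```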

  21. Means and Variances • We take the means and variances to combine the features back into “textural” feature vectors, each representing 1 second of underlying audio. Here is an image plot of our “ModelA”, which produced 221 features.
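
A minimal sketch of this mean/variance “texture” pooling; the exact composition of ModelA’s 221 features is described in the paper:

```python
import numpy as np

def texture_vector(frame_features):
    """Pool ~100 per-frame feature rows into one 1-second vector.

    frame_features: array of shape (n_frames, n_features).
    Concatenating means and variances doubles the feature count.
    """
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.var(axis=0)])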

  22. Final Feature Vectors

  23. Class distribution histograms • None of the features on their own provide good class separation – this is why we need powerful learning machines

  24. Model Descriptions • For comparative purposes we created 3 models, A, B and C, with different feature parameters.

  25. Learning Machines • We are going to try out 3 different classifiers on the 3 models to see which one does best.
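
A scikit-learn sketch of this comparison (scikit-learn was not necessarily what the paper used): SVC with an RBF kernel for the SVM, LogisticRegression as a rough stand-in for Bayesian logistic regression, and DecisionTreeClassifier (CART, not exactly C4.5) for the decision tree. The data arrays are random stand-ins for the real feature vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Dummy stand-ins for the real 221-feature vectors:
# 299 training examples from one show, 2392 test examples from 8 shows.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(299, 221)), rng.integers(0, 2, 299)
X_test, y_test = rng.normal(size=(2392, 221)), rng.integers(0, 2, 2392)

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))
```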

  26. Support Vector Machines w/RBF

  27. Bayesian Logistic Regression

  28. C4.5 • Basically just the ID3 algorithm with pruning added • Decision trees were popular in the 1980s • At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists. • The algorithm is based on Occam's razor, i.e. it prefers smaller decision trees (simple rules) over larger ones.
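
The “normalized information gain” here is C4.5’s gain ratio; the standard definitions (not from the slides) are:

```latex
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```

where H is the entropy of the class labels and SplitInfo is the entropy of the partition that attribute A induces on S.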

  29. Learning machine parameters

  30. Verbose Results

  31. Results Interpretation • F-measure on the speech class shown above • “Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness.” • SVM clearly out-performed the other two learning machines • BLR performed strongly on the verbose feature set, but SVM performed well regardless • Interestingly, C4.5 + BLR did better on C than B! Possibly a smaller feature set translated to better accuracy • A precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly), whereas a recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many other items were incorrectly also labelled as belonging to class C).
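
For reference, the standard definitions behind these scores, in terms of true/false positives and false negatives:

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
```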

  32. Where did it go wrong, and possible improvements • Many classification errors were border cases • Heuristics could be used to improve performance, e.g. assuming there are no gaps in speech • 2-second feature vectors...

  33. Any questions?! • tim@cs.rhul.ac.uk • www.developer-x.com/papers/asot • All data, audio samples, these slides, the associated paper etc. can be downloaded there!
