Multiple birdsong tracking Representing fine modulations Machine listening for birds: analysis techniques matched to the characteristics of bird vocalisations Dan Stowell and Mark D Plumbley Centre for Digital Music School of Elec Eng & Computer Science Queen Mary, University of London June 2013, Listening in the Wild dan.stowell@eecs.qmul.ac.uk Analysis techniques matched to bird vocalisations 1
Motivation

“Cocktail party” problems...

[Photo: Shutterstock / Romeo Mikulic]

We often have audio with multiple birds, and would like to perform automatic tasks (recognition, tracking, counting...). Existing computational methods don't quite fit the characteristics of bird vocalisations:

1. Multiple “speakers” and discontinuous utterances, which are problematic for methods adapted from speech recognition.
2. Birds often use very rapid modulations, yet typical signal representations (spectrograms, MFCCs, LPC) do not capture them.
Outline

1. Syllable-to-syllable tracking of multiple birds
2. Representing the fine detail of bird vocalisations
Multiple birdsong tracking

Chiffchaff (Phylloscopus collybita)
Automatic Speech Recognition

Hidden Markov Model:

[Figure: HMM graph with nodes y_1 ... y_4 and x_1 ... x_4 at times t_1 ... t_4 along a time axis]
Intermittent polyphonic sources
Modelling an intermittent source

Markov renewal process (“MRP”):

$$P\big(\tau_{n+1} \le t,\ X_{n+1} = j \mid (X_1, T_1), \ldots, (X_n = i, T_n)\big) = P\big(\tau_{n+1} \le t,\ X_{n+1} = j \mid X_n = i\big)$$

where $\tau_{n+1}$ is the time difference $T_{n+1} - T_n$.
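To make the MRP definition concrete, here is a minimal sampling sketch. It is an illustration under stated assumptions, not the model used in the talk: the state names, the transition table and the choice of exponentially distributed gaps are all hypothetical. The only property it is meant to show is that the next state and the gap to it depend on the current state alone.

```python
import random

def sample_mrp(transitions, start_state, n_events, seed=0):
    """Sample (state, time) pairs from a toy Markov renewal process.

    transitions[i] is a list of (next_state, probability, mean_gap) triples.
    Exponential gaps are an assumption made for this sketch only.
    """
    rng = random.Random(seed)
    state, t = start_state, 0.0
    events = [(state, t)]
    for _ in range(n_events - 1):
        options = transitions[state]
        weights = [p for _, p, _ in options]
        # The next state depends only on the current state ...
        next_state, _, mean_gap = rng.choices(options, weights=weights, k=1)[0]
        # ... and so does the gap tau = T_{n+1} - T_n.
        t += rng.expovariate(1.0 / mean_gap)
        state = next_state
        events.append((state, t))
    return events

# Hypothetical two-syllable source: "A" tends to repeat quickly, "B" is followed by a pause.
toy_transitions = {
    "A": [("A", 0.7, 0.2), ("B", 0.3, 1.0)],
    "B": [("A", 1.0, 0.5)],
}
events = sample_mrp(toy_transitions, "A", 50, seed=1)
```

Unlike the regularly sampled HMM of the previous slide, the event times here are irregular, which is what makes the model a candidate for intermittent sources.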
Multiple MRPs

Problem sketch: assume multiple MRPs, plus potential “clutter”. Given transition probabilities, find the most likely set of paths (at most one path per node).
Flow networks, and minimum cost flow

[Figure: flow network with source s, sink t and observation nodes V_1, V_2, V_3. Arcs are labelled a_b(X_k), a_d(X_k) and a_c(X_k) for each node, and a_t(X_1, X_2, T_2 - T_1), a_t(X_1, X_3, T_3 - T_1), a_t(X_2, X_3, T_3 - T_2) between nodes.]

Convert likelihood expression to flow “costs”:

$$a_b(X) = -\log p_b(X), \qquad a_d(X) = -\log p_d(X)$$
$$a_t(X, X', \tau) = -\log f_X(X', \tau), \qquad a_c(X) = \log p_c(X)$$
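The conversion from probabilities to arc costs can be checked numerically. The sketch below assumes that a path through two observations picks up one a_b, one a_c per node, one a_t and one a_d; that reading of the figure, and every numeric value, is a hypothetical illustration. It shows that summing the costs gives the negative log of a likelihood ratio, so a cheaper path is a more likely one.

```python
import math

def arc_costs(p_b, p_d, p_c, f_t):
    """Turn the slide's probabilities into additive flow costs."""
    return {
        "a_b": -math.log(p_b),
        "a_d": -math.log(p_d),
        "a_t": -math.log(f_t),
        "a_c": math.log(p_c),  # note the positive sign, as on the slide
    }

# Hypothetical values for a two-observation path X_1 -> X_2.
p_b, p_d, f_t = 0.1, 0.2, 0.6
p_c1, p_c2 = 0.05, 0.08

first = arc_costs(p_b, p_d, p_c1, f_t)
second = arc_costs(p_b, p_d, p_c2, f_t)

# Assumed path: s -> X_1 -> X_2 -> t, with a_c charged once per visited node.
path_cost = first["a_b"] + first["a_c"] + first["a_t"] + second["a_c"] + second["a_d"]

# The same quantity written as a ratio: signal likelihood over clutter likelihood.
likelihood_ratio = (p_b * f_t * p_d) / (p_c1 * p_c2)
```

Here `path_cost` equals `-math.log(likelihood_ratio)`, so under these assumptions a path with negative cost is one whose observations are better explained as signal than as clutter.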
Minimum cost flow

Minimum cost flow algorithms can therefore solve this problem:

- Optimal minimum-cost flow: Edmonds-Karp algorithm, asymptotic time complexity $O(|V||A|^2)$.
- Or use an inexact (greedy) algorithm: $O(|V||A|)$ or lower.
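As a rough sketch of how such a solver operates, the code below runs successive shortest paths with Bellman-Ford on a unit-capacity network. This is a swapped-in textbook method, not the Edmonds-Karp variant or the greedy algorithm named on the slide. The network layout (each observation split into an "in" and an "out" node so that at most one path can pass through it) and all arc costs are assumptions chosen for illustration.

```python
import math

def min_cost_paths(n_nodes, arcs, source, sink):
    """Add unit-flow paths while the cheapest remaining path has negative cost.

    arcs: list of (tail, head, cost), each with capacity 1.
    Returns (arcs carrying flow, total cost).
    """
    graph = [[] for _ in range(n_nodes)]
    edges = []  # edges[i] = [head, residual capacity, cost]; edges[i ^ 1] is its reverse
    for u, v, cost in arcs:
        graph[u].append(len(edges))
        edges.append([v, 1, cost])
        graph[v].append(len(edges))
        edges.append([u, 0, -cost])
    total = 0.0
    while True:
        # Bellman-Ford copes with the negative costs in the residual network.
        dist = [math.inf] * n_nodes
        prev_edge = [-1] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            changed = False
            for u in range(n_nodes):
                if dist[u] == math.inf:
                    continue
                for ei in graph[u]:
                    v, cap, cost = edges[ei]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev_edge[v] = ei
                        changed = True
            if not changed:
                break
        if dist[sink] >= 0:  # no path left that would lower the total cost
            break
        v = sink
        while v != source:  # push one unit of flow back along the path
            ei = prev_edge[v]
            edges[ei][1] -= 1
            edges[ei ^ 1][1] += 1
            v = edges[ei ^ 1][0]
        total += dist[sink]
    used = [(edges[i ^ 1][0], edges[i][0])
            for i in range(0, len(edges), 2) if edges[i][1] == 0]
    return used, total

# Hypothetical network: s=0, t=1, observation k has in-node 2+2k and out-node 3+2k.
S, T = 0, 1
toy_arcs = [(S, 2, 2), (S, 4, 2), (S, 6, 2),   # a_b: start a path at a node
            (2, 3, -5), (4, 5, -5), (6, 7, -1),  # a_c: pass through a node
            (3, T, 2), (5, T, 2), (7, T, 2),   # a_d: end a path at a node
            (3, 4, 1), (3, 6, 4), (5, 6, 4)]   # a_t: move to a later node
used, total = min_cost_paths(8, toy_arcs, S, T)
```

With these made-up costs the solver links the first two observations into one path and leaves the third unassigned, which is how clutter would be rejected in this formulation.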
Synthetic example

[Figure: synthetic examples from three generators (locked, coherent, segregated). Each generator has four panels: clean signal, signal in noise, inferred (coherent), inferred (segregated). Panels are annotated with likelihood ratios (LR): 6.33e+18, 1.45e+21, 1.42e+12, 4.55e+17, 3.11e+16.]
Birdsong experiment

25 European recordings of Chiffchaff (source: Xeno Canto)
Mixtures of 2–5 recordings, 5-fold crossvalidation
Can it cluster the “syllables” in the same way as the source audio?
Data preparation

Syllables detected by spectrogram cross-correlation.

[Figure: XC25760-dn.xcor. Two spectrograms with axes Time (s) and Freq (Hz): the template (time ticks 0.05 to 0.17 s, frequency ticks 3100 to 7400 Hz) and the recording (0 to 25 s, 0 to 10000 Hz).]
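One way to picture spectrogram cross-correlation is to slide a template along the time axis and score each offset by normalised correlation. The sketch below is a simplified illustration, not the detector used for the experiment: the function name, the threshold and the toy data are hypothetical, and the spectrograms are plain lists of frames so that the example has no dependencies.

```python
import math

def xcorr_detect(spec, template, threshold=0.8):
    """Slide `template` over `spec` in time and return (scores, peak offsets).

    spec, template: lists of frames; each frame is a list of bin magnitudes.
    """
    n_t = len(template)
    flat_t = [v for frame in template for v in frame]
    mean_t = sum(flat_t) / len(flat_t)
    dev_t = [v - mean_t for v in flat_t]
    norm_t = math.sqrt(sum(d * d for d in dev_t))
    scores = []
    for start in range(len(spec) - n_t + 1):
        flat_s = [v for frame in spec[start:start + n_t] for v in frame]
        mean_s = sum(flat_s) / len(flat_s)
        dev_s = [v - mean_s for v in flat_s]
        norm_s = math.sqrt(sum(d * d for d in dev_s))
        denom = norm_t * norm_s
        # Normalised correlation lies in [-1, 1]; flat windows score 0.
        score = sum(a * b for a, b in zip(dev_t, dev_s)) / denom if denom > 0 else 0.0
        scores.append(score)
    # Keep local maxima above the threshold as detected syllable onsets.
    hits = [i for i, s in enumerate(scores)
            if s >= threshold
            and (i == 0 or s >= scores[i - 1])
            and (i == len(scores) - 1 or s > scores[i + 1])]
    return scores, hits

# Toy data: a three-frame upward sweep placed at frames 3 and 12 of a silent recording.
toy_template = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
toy_spec = [[0, 0, 0, 0] for _ in range(20)]
for onset in (3, 12):
    for offset, frame in enumerate(toy_template):
        toy_spec[onset + offset] = list(frame)
scores, hits = xcorr_detect(toy_spec, toy_template)
```

Each detected onset, together with whatever features describe the syllable, would then play the role of an observation (X, T) in the tracking model.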
Results

[Figure: Ftrans (0.0 to 1.0) against number of signals in mixture (1 to 5). Curves: Ideal recovery, trained on test data; Ideal recovery; Ideal recovery plus synthetic noise; Recovery from audio; Recovery from audio (greedy); Recovery from audio (baseline).]

Means and standard errors are shown (5-fold crossvalidation).
Representing fine modulations

Many (song)birds use very rapid frequency modulation (FM):

- Songbirds can perceive fine detail of FM (Dooling et al. 2002, Lohr et al. 2006)
- FM detail can affect behavioural responses (Trillo et al. 2005, de Kort et al. 2009)

Yet standard representations assume local stationarity (i.e. signal parameters unchanging) at fine timescales:

- Fourier transform magnitudes (spectrograms, MFCCs)
- Linear prediction (LPC)

Detail at < 20 ms is likely to be smeared or discarded.
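The smearing effect can be illustrated with a single analysis frame. The sketch below compares a steady tone with a linear chirp that sweeps within the frame, using a Hann-windowed DFT written in plain Python. The frame length, the bin positions and the "energy in the three strongest bins" measure are all hypothetical choices for this illustration, not quantities from the talk.

```python
import cmath
import math

def hann_dft_power(frame):
    """Power spectrum (bins 0..N/2) of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        power.append(abs(acc) ** 2)
    return power

def top_bins_fraction(power, n_top=3):
    """Share of the frame's energy held by its n_top strongest bins."""
    return sum(sorted(power, reverse=True)[:n_top]) / sum(power)

N = 256  # one analysis frame

# Stationary case: a tone sitting exactly on bin 32 for the whole frame.
tone = [math.cos(2 * math.pi * 32 * i / N) for i in range(N)]

# Modulated case: frequency sweeps linearly from bin 16 to bin 48 within the frame.
chirp = [math.cos(2 * math.pi * (16 * i + 0.5 * 32 * i * i / N) / N)
         for i in range(N)]

tone_fraction = top_bins_fraction(hann_dft_power(tone))
chirp_fraction = top_bins_fraction(hann_dft_power(chirp))
```

The tone's energy stays in a few adjacent bins, while the chirp's energy spreads across the band it swept through. The magnitude spectrum alone then says little about how the frequency moved inside the frame, which is the kind of detail a stationarity-based representation loses.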