Analysis of Everyday Sounds
Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu

1. Personal and Consumer Audio
2. Segmenting & Clustering
3. Special-Purpose Detectors
4. Generic Concept Detectors
5. Challenges & Future

2007-07-24 - Analysis of Everyday Sounds - Ellis & Lee
LabROSA Overview
[Diagram labels: Information Extraction; Music, Environment, Speech; Recognition, Separation, Retrieval; Signal Processing; Machine Learning]
1. Personal Audio Archives
• Easy to record everything you hear
  - <2GB / week @ 64 kbps
• Hard to find anything
  - how to scan?
  - how to visualize?
  - how to index?
• Need automatic analysis
• Need minimal impact
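As a quick check on the storage figure, the sketch below converts a byte budget into recording time at a constant bitrate. It assumes 1 GB = 10^9 bytes and ignores container overhead; under those assumptions 2 GB at 64 kbps holds roughly 69 hours, or about 10 hours per day over a week. The function name is illustrative only.

```python
def recording_hours(budget_bytes, bitrate_bps):
    """Hours of constant-bitrate audio that fit in a byte budget."""
    bytes_per_second = bitrate_bps / 8
    return budget_bytes / bytes_per_second / 3600


# 2 GB is taken as 2 * 10**9 bytes (an assumption, not stated on the slide).
hours = recording_hours(2e9, 64_000)
print(f"{hours:.1f} h per week, {hours / 7:.1f} h per day")
```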
Personal Audio Applications
• Automatic appointment-book history
  - fills in when & where of movements
• “Life statistics”
  - how long did I spend in meetings this week?
  - most frequent conversations
  - favorite phrases?
• Retrieving details
  - what exactly did I promise?
  - privacy issues...
• Nostalgia
• ... or what?
Consumer Video
• Short video clips as the evolution of snapshots
  - 10-60 sec, one location, no editing
  - browsing?
• More information for indexing...
  - video + audio
  - foreground + background
Information in Audio
• Environmental recordings contain info on:
  - location – type (restaurant, street, ...) and specific
  - activity – talking, walking, typing
  - people – generic (2 males), specific (Chuck & John)
  - spoken content ... maybe
• but not:
  - what people and things “looked like”
  - day/night ...
  - ... except when correlated with audible features
A Brief History of Audio Processing
• Environmental sound classification draws on earlier sound classification work as well as source separation...
[Diagram labels: Speech Recognition; Speaker ID; GMM-HMMs; Source Separation (One channel, Multi-channel; Model-based, Cue-based); Music Audio Genre & Artist ID; Soundtrack & Environmental Recognition]
2. Segmentation & Clustering
• Top-level structure for long recordings: Where are the major boundaries?
  - e.g. for diary application
  - support for manual browsing
• Length of fundamental time-frame
  - 60s rather than 10ms?
  - background more important than foreground
  - average out uncharacteristic transients
• Perceptually-motivated features
  - ... so results have perceptual relevance
  - broad spectrum + some detail
MFCC Features
• Need “timbral” features: Mel-Frequency Cepstral Coeffs (MFCCs)
  - auditory-like frequency warping
  - log-domain
  - discrete cosine transform = orthogonalization
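The three steps listed above (frequency warping, log, DCT) can be sketched in plain NumPy. This is a minimal illustration, not the authors' front end: the mel formula, triangular filters, frame sizes, and all function names are common textbook choices assumed here.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sr / n_fft
    fb = np.zeros((n_filters, bin_hz.size))
    for j in range(n_filters):
        lo, mid, hi = edges_hz[j:j + 3]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[j] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def dct2_matrix(n_out, n_in):
    """Unnormalized DCT-II basis; rows are cosine basis functions."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)


def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_ceps=13):
    """Return an (n_frames, n_ceps) array of cepstral coefficients."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2              # short-time power spectrum
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sr).T  # auditory-like warping
    log_mel = np.log(mel_energy + 1e-10)                          # log domain
    return log_mel @ dct2_matrix(n_ceps, n_filters).T             # DCT decorrelates bands
```

For a one-second signal at 16 kHz, `mfcc(x, 16000)` returns 61 frames of 13 coefficients with these defaults.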
Long-Duration Features
[Figure: six panels over freq / bark (vertical) and time / min (horizontal): Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation, Average Spectral Entropy, Spectral Entropy Deviation; color scales in dB for the energy panels and bits for the entropy panels]
• Capture both average and variation
• Capture a little more detail in subbands...
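One way to compute the six panel quantities is to pool per-frame subband values over long blocks and keep both the mean and the deviation. The sketch below assumes per-frame band energies `E` and band entropies `H` are already available as (frames × bands) arrays; the function names, the dictionary keys, and the choice of σ/μ for the normalized deviations are assumptions for illustration.

```python
import numpy as np


def block_stats(F, frames_per_block):
    """Mean and standard deviation of each column of F over consecutive blocks."""
    n_blocks = F.shape[0] // frames_per_block
    blocks = F[:n_blocks * frames_per_block].reshape(
        n_blocks, frames_per_block, F.shape[1])
    return blocks.mean(axis=1), blocks.std(axis=1)


def long_duration_features(E, H, frames_per_block, eps=1e-12):
    """E: linear band energies, H: band entropies, both (n_frames, n_bands)."""
    mu_lin, sd_lin = block_stats(E, frames_per_block)
    mu_log, sd_log = block_stats(10.0 * np.log10(E + eps), frames_per_block)
    mu_h, sd_h = block_stats(H, frames_per_block)
    return {
        "avg_linear_energy_db": 10.0 * np.log10(mu_lin + eps),
        "norm_energy_dev": sd_lin / (mu_lin + eps),
        "avg_log_energy_db": mu_log,
        "log_energy_dev_db": sd_log,
        "avg_entropy": mu_h,
        "norm_entropy_dev": sd_h / (mu_h + eps),
    }
```

With 10 ms frames, `frames_per_block=6000` would give one feature vector per 60 s block.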
Spectral Entropy
• Auditory spectrum:
  $A[n,j] = \sum_{k=0}^{N_F} w_{jk} X[n,k]$
• Spectral entropy ≈ ‘peakiness’ of each band:
  $H[n,j] = -\sum_{k=0}^{N_F} \frac{w_{jk} X[n,k]}{A[n,j]} \cdot \log \frac{w_{jk} X[n,k]}{A[n,j]}$
[Figure: FFT spectral magnitude with Auditory Spectrum overlaid (energy / dB vs. freq / Hz, 0 to 8000 Hz), and per-band Spectral Entropies (rel. entropy / bits vs. freq / Hz)]
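The two formulas translate directly into array operations. In this sketch `X` holds non-negative spectral magnitudes (frames × bins) and `W` holds the band weights w_jk (bands × bins); the base-2 logarithm (so entropy comes out in bits) and the small `eps` guard are assumptions added here.

```python
import numpy as np


def spectral_entropy(X, W, eps=1e-12):
    """Return auditory spectrum A and per-band entropy H, each (n_frames, n_bands)."""
    weighted = X[:, None, :] * W[None, :, :]   # w_jk * X[n,k], shape (frames, bands, bins)
    A = weighted.sum(axis=2)                   # A[n,j]
    p = weighted / (A[:, :, None] + eps)       # share of each bin within its band
    H = -(p * np.log2(p + eps)).sum(axis=2)    # H[n,j] in bits
    return A, H
```

A band whose energy is spread evenly over 8 bins gives H = 3 bits, while a band dominated by a single bin gives H close to 0, which is the 'peakiness' reading of the measure.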
BIC Segmentation
• BIC (Bayesian Info. Crit.) compares models:
  $\log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M)$
[Figure: recording 2004-09-10-1023_AvgLEnergy, AvgLogAudSpec and BIC score vs. time / hr (13:30 to 16:00); annotations: "boundary passes BIC", "no boundary with shorter context"; spans L(X1;M1), L(X2;M2), and L(X;M0) between the last segmentation point, the candidate boundary, and the current context limit]
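The BIC test can be sketched for one candidate boundary as follows. The slide does not say which model family M is used; this example assumes a single full-covariance Gaussian per segment, so Δ#(M) = d + d(d+1)/2 extra parameters for a split. The function names and the small ridge added to the covariance are illustrative.

```python
import numpy as np


def gaussian_loglik(X):
    """Log-likelihood of the rows of X under their own ML full-covariance Gaussian."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True).reshape(d, d) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    # Closed form at the ML estimate (the ridge makes this very slightly approximate).
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)


def delta_bic(X, t, lam=1.0):
    """Positive value: splitting X at frame t is preferred over a single model."""
    n, d = X.shape
    gain = gaussian_loglik(X[:t]) + gaussian_loglik(X[t:]) - gaussian_loglik(X)
    extra_params = d + d * (d + 1) / 2          # one additional mean and covariance
    penalty = 0.5 * lam * np.log(n) * extra_params
    return gain - penalty
```

A boundary at `t` is accepted when `delta_bic(X, t) > 0`; raising `lam` makes the detector more conservative.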
BIC Segmentation Results
• Evaluate: 62 hr hand-marked dataset
  - 8 days, 139 segments, 16 categories
  - measure Correct Accept % @ False Accept = 2%:

  Feature                  Correct Accept
  μ_dB                     80.8%
  μ_H                      81.1%
  σ_H/μ_H                  81.6%
  μ_dB + σ_H/μ_H           84.0%
  μ_dB + σ_H/μ_H + μ_H     83.6%
  mfcc                     73.6%

[Figure: ROC curves, Sensitivity vs. 1 - Specificity (0 to 0.04), for μ_dB, μ_H, σ_H/μ_H, μ_dB + σ_H/μ_H, and μ_dB + μ_H + σ_H/μ_H]
Segment Clustering
• Daily activity has lots of repetition:
  - Automatically cluster similar segments
  - ‘affinity’ of segments as KL2 distances
[Figure: segment affinity matrix; class labels: supermkt, meeting, karaoke, barber, lecture, billiard, home, break, bowling, street, car/taxi, restaurant, library, campus]
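KL2 is the symmetrized Kullback-Leibler divergence, KL(p||q) + KL(q||p). The sketch below assumes each segment is summarized by a single full-covariance Gaussian over its feature frames, and maps distance to affinity with exp(-D/scale); both choices, and all function names, are assumptions for illustration rather than details from the slides.

```python
import numpy as np


def fit_gaussian(X):
    """ML mean and (lightly regularized) covariance of the rows of X."""
    d = X.shape[1]
    cov = np.cov(X, rowvar=False, bias=True).reshape(d, d) + 1e-6 * np.eye(d)
    return X.mean(axis=0), cov


def kl_gaussian(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) in nats."""
    d = m0.size
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + logdet1 - logdet0)


def kl2(seg_a, seg_b):
    """Symmetric KL distance between Gaussians fitted to two segments."""
    ga, gb = fit_gaussian(seg_a), fit_gaussian(seg_b)
    return kl_gaussian(*ga, *gb) + kl_gaussian(*gb, *ga)


def affinity_matrix(segments, scale=1.0):
    """Pairwise affinities exp(-KL2 / scale); `scale` is a free tuning parameter."""
    n = len(segments)
    D = np.array([[kl2(segments[i], segments[j]) for j in range(n)]
                  for i in range(n)])
    return np.exp(-D / scale)
```

Segments recorded in similar ambiences then get affinities near 1, and dissimilar ones fall toward 0.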
Spectral Clustering
• Eigenanalysis of affinity matrix: A = U•S•V′
  - SVD components: u_k•s_kk•v_k′
  - eigenvectors v_k give cluster memberships
[Figure: Affinity Matrix and its SVD components for k=1, k=2, k=3, k=4]
• Number of clusters?
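A minimal sketch of reading cluster memberships from the SVD: take the top k right singular vectors and assign each segment to the component where its entry has the largest magnitude. This argmax rule is a simplification assumed here; it works when the affinity matrix is close to block-diagonal with distinct block sizes, and practical systems often run k-means on the eigenvector coordinates instead. The number of clusters `k` is supplied by the caller.

```python
import numpy as np


def spectral_labels(A, k):
    """Assign each row of affinity matrix A to one of the top-k SVD components."""
    U, s, Vt = np.linalg.svd(A)          # A = U @ diag(s) @ Vt
    V = Vt[:k].T                         # columns are v_1 .. v_k, shape (n, k)
    labels = np.argmax(np.abs(V), axis=1)
    return labels, s


# Toy affinity: two groups (5 and 3 segments) with weak cross-group affinity.
A = np.full((8, 8), 0.01)
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
labels, singular_values = spectral_labels(A, 2)
print(labels)
```

The decay of `singular_values` gives a rough indication of how many components carry structure, which relates to the open "Number of clusters?" question.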
Clustering Results
• Clustering of automatic segments gives ‘anonymous classes’
  - BIC criterion to choose number of clusters
  - make best correspondence to 16 GT clusters
• Frame-level scoring gives ~70% correct
  - errors when same ‘place’ has multiple ambiences
Browsing Interface
• Browsing / Diary interface
  - links to other information (diary, email, photos)
  - synchronize with note taking? (Stifelman & Arons)
  - audio thumbnails
• Release Tools + “how to” for capture
[Figure: diary view with one column per day (2004-09-13 to 2004-09-17) and time of day from 08:00 to 17:00, showing segments labeled by class, e.g. office, outdoor, meeting, lecture, cafe, group]
3. Special-Purpose Detectors: Speech
• Speech emerges as most interesting content
• Just identifying speech would be useful
  - goal is speaker identification / labeling
• Lots of background noise
  - conventional Voice Activity Detection inadequate
• Insight: Listeners detect pitch track (melody)
  - look for voice-like periodicity in noise
[Figure: spectrogram of a coffeeshop excerpt, Frequency (0 to 4000) vs. Time (0 to 4.5)]
Voice Periodicity Enhancement
• Noise-robust subband autocorrelation
• Subtract local average
  - suppresses steady background, e.g. machine noise
  - 15 min test set; 88% acc (no suppression: 79%)
  - also for enhancing speech by harmonic filtering
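The idea of subtracting a local average can be illustrated with a simplified sketch. For brevity it uses one full-band normalized autocorrelation per frame rather than the subband version named on the slide, and every parameter (frame length, hop, pitch range, context width) is an assumed value. A steady periodic background produces nearly the same autocorrelation in every frame, so it cancels against its own local average, while a short voiced burst does not.

```python
import numpy as np


def frame_autocorr(frame, max_lag):
    """Normalized autocorrelation of one frame for lags 0..max_lag."""
    frame = frame - frame.mean()
    full = np.correlate(frame, frame, mode="full")
    ac = full[len(frame) - 1:len(frame) + max_lag]
    return ac / (ac[0] + 1e-12)


def periodicity_track(x, sr, frame_len=0.032, hop=0.016,
                      fmin=80.0, fmax=400.0, context=51):
    """Per-frame periodicity score after removing the local time-average."""
    n, h = int(frame_len * sr), int(hop * sr)
    lag_lo, lag_hi = int(sr / fmax), int(sr / fmin)   # lag range for voice pitch
    acs = np.stack([frame_autocorr(x[i:i + n], lag_hi)
                    for i in range(0, len(x) - n + 1, h)])
    half = context // 2
    # Local average over neighboring frames, separately at each lag.
    local_avg = np.stack([acs[max(0, t - half):t + half + 1].mean(axis=0)
                          for t in range(len(acs))])
    enhanced = acs - local_avg                        # steady background cancels
    return enhanced[:, lag_lo:lag_hi + 1].max(axis=1)
```

Thresholding the returned score gives a crude voiced-speech detector; the context should be long relative to typical voiced stretches, otherwise the voice itself is averaged away.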