EE E6820: Speech & Audio Processing & Recognition Lecture 8: Spatial sound



  1. EE E6820: Speech & Audio Processing & Recognition
     Lecture 8: Spatial sound
     1 Spatial acoustics
     2 Binaural perception
     3 Synthesizing spatial audio
     4 Extracting spatial sounds
     Dan Ellis <dpwe@ee.columbia.edu>
     http://www.ee.columbia.edu/~dpwe/e6820/
     E6820 SAPR - Dan Ellis L08 - Spatial sound 2002-04-01

  2. Spatial acoustics (section 1)
     • Received sound = source + channel
       - so far, only considered ideal source waveform
     • Sound carries information on its spatial origin
       - e.g. “ripples in the lake”
       - great evolutionary significance
     • The basis of scene analysis?
       - yes and no - try blocking an ear

  3. Ripples in the lake
     [Figure: source, listener, and an expanding wavefront travelling at c m/s; energy ∝ $1/r^2$]
     • Effect of relative position on sound
       - delay = $\Delta r / c$
       - energy decay ~ $1/r^2$
       - absorption ~ $G(f)^r$
       - direct energy plus reflections
     • Give cues for recovering source position
     • Describe wavefront by its normal
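The delay and energy-decay relations on slide 3 can be put into numbers with a minimal sketch. The speed of sound (343 m/s) and the function names are assumptions chosen for illustration; absorption and reflections are ignored.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def propagation_delay(distance_m, c=SPEED_OF_SOUND):
    """Time in seconds for a wavefront to travel distance_m."""
    return distance_m / c


def level_change_db(r_near_m, r_far_m):
    """Level change in dB when moving from r_near_m to r_far_m.

    Energy falls as 1/r^2, so the change is 10*log10((r_near/r_far)^2).
    """
    return 20.0 * math.log10(r_near_m / r_far_m)


if __name__ == "__main__":
    print("delay over 3.43 m:", propagation_delay(3.43), "s")
    print("level change from 1 m to 2 m:", level_change_db(1.0, 2.0), "dB")
```

Doubling the distance costs roughly 6 dB under the inverse-square assumption, which is the usual rule of thumb for a free field.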

  4. Recovering spatial information
     • Source direction as wavefront normal
       - moving plane found from timing at 3 points
       - $c \cdot \Delta t = \Delta s = AB \cdot \cos\theta$
       - need to solve correspondence
       [Figure: wavefront crossing points A, B, C at angle θ, with the pressure waveform at each point plotted against time]
     • Space: need 3 parameters
       - e.g. 2 angles and range
       [Figure: coordinates range r, elevation φ, azimuth θ]
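As a rough illustration of recovering direction from timing on slide 4, the sketch below inverts the path-difference relation (path difference = c times the arrival-time difference = AB·cos θ) for a single pair of points. The function name and the 343 m/s speed of sound are assumptions; a full solution needs a third point and a solved correspondence, which are not modelled here.

```python
import math


def arrival_angle_deg(delta_t_s, spacing_m, c=343.0):
    """Angle theta (degrees) between the baseline AB and the wavefront normal.

    delta_t_s : arrival-time difference between the two points, in seconds
    spacing_m : distance AB between the two points, in metres
    c         : assumed speed of sound in m/s
    """
    cos_theta = c * delta_t_s / spacing_m
    if abs(cos_theta) > 1.0 + 1e-9:
        raise ValueError("time difference too large for this spacing")
    # Clamp tiny floating-point overshoot before taking the arccosine.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Zero time difference: the wavefront arrives broadside to AB.
    print(arrival_angle_deg(0.0, 0.2))
```

A single pair of points only fixes the angle relative to the baseline, which is the same ambiguity that appears later as the cone of confusion.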

  5. The effect of the environment
     • Reflection causes additional wavefronts
       - + scattering, absorption
       - many paths → many echoes
       [Figure: reflection, diffraction & shadowing]
     • Reverberant effect
       - causal ‘smearing’ of signal energy
       [Figure: two spectrograms, “Dry speech airvib16” and “+ reverb from hlwy16”; time / sec (0 to 1.5) vs. freq / Hz (0 to 8000)]

  6. Reverberation impulse response
     • Exponential decay of reflections: $h_{room}(t) \sim e^{-t/T}$
       [Figure: $h_{room}(t)$ against t; spectrogram “hlwy16 - 128pt window”, time / s (0 to 0.7) vs. freq / Hz (0 to 8000), level scale -10 to -70]
     • Frequency-dependent
       - greater absorption at high frequencies → faster decay
     • Size-dependent
       - larger rooms → longer delays → slower decay
     • Sabine’s equation: $RT_{60} = \frac{0.049\,V}{S\,\alpha}$
     • Time constant varies with size, absorption
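Sabine’s equation from slide 6 is easy to evaluate directly. The sketch below is illustrative only: the function name and the `metric` flag are assumptions. The constant 0.049 applies when V is in cubic feet and S in square feet; the commonly quoted metric equivalent is about 0.161 (V in cubic metres, S in square metres).

```python
def sabine_rt60(volume, surface_area, mean_absorption, metric=False):
    """Reverberation time RT60 in seconds from Sabine's equation.

    volume          : room volume V (ft^3, or m^3 if metric=True)
    surface_area    : total surface area S (ft^2, or m^2 if metric=True)
    mean_absorption : average absorption coefficient alpha, 0 < alpha <= 1
    """
    if not 0.0 < mean_absorption <= 1.0:
        raise ValueError("absorption coefficient must be in (0, 1]")
    constant = 0.161 if metric else 0.049
    return constant * volume / (surface_area * mean_absorption)


if __name__ == "__main__":
    # Hypothetical 10 ft cube with fairly reflective walls.
    print(sabine_rt60(volume=1000.0, surface_area=600.0, mean_absorption=0.1))
```

Doubling every dimension of the room multiplies V by 8 and S by 4, so RT60 doubles, which matches the "larger rooms → slower decay" point.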

  7. Outline
     1 Spatial acoustics
     2 Binaural perception
       - The sound at the two ears
       - Available cues
       - Perceptual phenomena
     3 Synthesizing spatial audio
     4 Extracting spatial sounds

  8. Binaural perception (section 2)
     [Figure: source and head with ears L and R; path length difference between the ears; head shadow (high freq)]
     • What is the information in the 2 ear signals?
       - the sound of the source(s) (L+R)
       - the position of the source(s) (L-R)
     • Example waveforms (ShATR database)
       [Figure: “shatr78m3 waveform”, Left and Right channels, amplitude -0.1 to 0.1, time / s from 2.2 to 2.235]

  9. Main cues to spatial hearing
     • Interaural time difference (ITD)
       - from different path lengths around head
       - dominates in low frequency (< 1.5 kHz)
       - max ~ 750 µs → ambiguous for freqs > 600 Hz
     • Interaural intensity difference (IID)
       - from head shadowing of far ear
       - negligible for LF; increases with frequency
     • Spectral detail (from pinna reflections) useful for elevation & range
     • Direct-to-reverberant useful for range
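One plausible reading of the ITD ambiguity remark on slide 9 is that interaural phase stops being unique once half a period is shorter than the largest possible ITD. The sketch below works that out for the 750 µs figure; the half-period criterion, the function names, and the 343 m/s speed of sound are assumptions, not taken from the slides.

```python
def ambiguity_frequency_hz(max_itd_s):
    """Frequency whose half period equals the maximum ITD."""
    return 1.0 / (2.0 * max_itd_s)


def path_difference_m(itd_s, c=343.0):
    """Extra path length to the far ear implied by an ITD."""
    return c * itd_s


if __name__ == "__main__":
    max_itd = 750e-6  # seconds, the approximate maximum quoted on the slide
    print("ambiguity starts near", ambiguity_frequency_hz(max_itd), "Hz")
    print("path difference about", path_difference_m(max_itd), "m")
```

Under this criterion the limit comes out in the 600 to 700 Hz range, the same order as the slide's figure.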

  10. Head-Related Transfer Fns (HRTFs)
      • Capture source coupling as impulse responses $\{\, l_{\theta,\phi,R}(t),\ r_{\theta,\phi,R}(t) \,\}$
      • Collection: (http://phosphor.cipic.ucdavis.edu/)
        [Figure: HRIR_021 Left and Right at 0 el, shown as azimuth / deg (-45 LEFT to 45 RIGHT) vs. time / ms (0 to 1.5), plus the HRIR_021 Left and Right waveforms at 0 el, 0 az]
      • Highly individual!

  11. Cone of confusion
      [Figure: cone of confusion (approx equal ITD) at azimuth θ]
      • Interaural timing cue dominates (below 1 kHz)
        - from differing path lengths to two ears
      • But: only resolves to a cone
        - Up/down? Front/back?

  12. Further cues
      • Pinna causes elevation-dependent coloration
      • Monaural perception
        - separate coloration from source spectrum?
      • Head motion
        - synchronized spectral changes
        - also for ITD (front/back) etc.

  13. Combining multiple cues
      • Both ITD and ILD influence azimuth; what happens when they disagree?
        - identical signals to both ears → image is centered
        - delaying right channel moves image to left
        - attenuating left channel returns image to center
        - trading @ around 0.1 ms / dB
        [Figure: l(t) and r(t) waveforms for each of the three cases, with a 1 ms time scale marker]
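The time-intensity trading figure on slide 13 can be turned into a small calculation. This sketch treats the 0.1 ms/dB ratio as a fixed linear exchange rate, which is a simplifying assumption (the slide only says "around"); the function name is hypothetical.

```python
TRADING_RATIO_MS_PER_DB = 0.1  # approximate value quoted on the slide


def compensating_attenuation_db(delay_ms, ratio_ms_per_db=TRADING_RATIO_MS_PER_DB):
    """Attenuation (dB) of the leading channel that roughly re-centres the image.

    delay_ms is the delay applied to the other (lagging) channel.
    """
    return delay_ms / ratio_ms_per_db


if __name__ == "__main__":
    # Right channel delayed by 0.5 ms: attenuate the left by this many dB.
    print(compensating_attenuation_db(0.5))
```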

  14. Binaural position estimation
      • Imperfect results: (Arruda, Kistler & Wightman 1992)
        [Figure: scatter plot of Judged Azimuth (Deg) vs. Target Azimuth (Deg), both from -180 to 180]
        - listening to ‘wrong’ HRTFs → errors
        - front/back reversals stay on cone of confusion

  15. The Precedence Effect
      • Reflections give misleading spatial cues
        [Figure: l(t) and r(t) showing direct and reflected arrivals; reflected path R, delay R/c]
      • But: spatial impression based on 1st wavefront, then ‘switches off’ for ~50 ms
        - ... even if ‘reflections’ are louder
        - ... leads to impression of room

  16. Binaural Masking Release
      • Adding noise to reveal target
        - tone + noise to one ear: tone is masked
        - identical noise to other ear: tone is audible
        - why does this make sense?
        [Figure: waveforms of tone plus noise at one ear and of the noise alone at the other ear]
      • Binaural Masking Level Difference up to 12 dB
        - greatest for noise in phase, tone anti-phase

  17. Outline
      1 Spatial acoustics
      2 Binaural perception
      3 Synthesizing spatial audio
        - Position
        - Environment
      4 Extracting spatial sounds

  18. Synthesizing spatial audio (section 3)
      • Goal: recreate realistic soundfield
        - hi-fi experience
        - synthetic environments (VR)
      • Constraints
        - resources
        - information (individual HRTFs)
        - delivery mechanism (headphones)
      • Source material types
        - live recordings (actual soundfields)
        - synthetic (studio mixing, virtual environments)

  19. Classic stereo
      [Figure: listener facing loudspeakers L and R]
      • ‘Intensity panning’: no timing modifications, just vary level ±20 dB
        - works as long as listener is equidistant
      • Surround sound: extra channels in center, sides, ...
        - same basic effect
        - pan between pairs
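Intensity panning as described on slide 19 changes only the relative level of the two channels. The sketch below converts a left-minus-right level difference in dB into a pair of gains. Keeping the total power constant is an added assumption (a common mixing convention, not something the slide states), and the function names are hypothetical.

```python
import math


def pan_gains(level_diff_db):
    """Return (g_left, g_right) for a left-minus-right level difference in dB.

    Gains are normalised so that g_left**2 + g_right**2 == 1 (constant power).
    """
    ratio = 10.0 ** (level_diff_db / 20.0)  # amplitude ratio left/right
    g_right = 1.0 / math.sqrt(1.0 + ratio * ratio)
    g_left = ratio * g_right
    return g_left, g_right


def pan(samples, level_diff_db):
    """Apply the gains to a mono signal; no delay is introduced."""
    g_left, g_right = pan_gains(level_diff_db)
    return [g_left * s for s in samples], [g_right * s for s in samples]


if __name__ == "__main__":
    for diff in (-20.0, 0.0, 20.0):
        print(diff, pan_gains(diff))
```

At 0 dB both gains are equal and the image sits in the centre; at ±20 dB one channel carries almost all of the signal.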

  20. Simulating reverberation
      • Can characterize reverb by impulse response
        - spatial cues are important - record in stereo
        - IRs of ~ 1 sec → very long convolution
      • Image model: reflections as duplicate sources
        [Figure: source, listener, reflected path, and virtual (image) sources]
      • ‘Early echoes’ in room impulse response:
        [Figure: $h_{room}(t)$ against t, showing the direct path followed by early echoes]
      • Actual reflection may be $h_{ref}(t)$, not $\delta(t)$
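The image model on slide 20 replaces a reflection with a mirrored copy of the source. A minimal 2-D sketch for a single wall is given below; the geometry (one wall along the line x = wall_x), the 1/r pressure amplitude, the ideal δ-like reflection scaled by `reflection_gain`, and all names are assumptions made for illustration. Source and listener are assumed not to coincide.

```python
import math


def image_source_taps(source, listener, wall_x=0.0, c=343.0, reflection_gain=1.0):
    """Return [(delay_s, amplitude), ...] for the direct path and one wall echo.

    source, listener : (x, y) positions in metres, on the same side of the wall
    wall_x           : x coordinate of a wall running parallel to the y axis
    """
    sx, sy = source
    lx, ly = listener
    # Mirror the source across the wall to obtain the virtual (image) source.
    image_x, image_y = 2.0 * wall_x - sx, sy

    direct = math.hypot(lx - sx, ly - sy)
    reflected = math.hypot(lx - image_x, ly - image_y)

    return [
        (direct / c, 1.0 / direct),
        (reflected / c, reflection_gain / reflected),
    ]


if __name__ == "__main__":
    for delay, amp in image_source_taps((1.0, 0.0), (2.0, 0.0)):
        print(f"delay {delay * 1000:.2f} ms, amplitude {amp:.3f}")
```

Repeating the mirroring for every wall, and for images of images, would build up the early-echo pattern; a frequency-dependent wall would replace each tap with a short filter.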

  21. Artificial reverberation
      • Reproduce perceptually salient aspects
        - early echo pattern (→ room size impression)
        - overall decay tail (→ wall materials...)
        - interaural coherence (→ spaciousness)
      • Nested allpass filters (Gardner ’92)
        - allpass: $H(z) = \frac{z^{-k} - g}{1 - g\,z^{-k}}$
        - impulse response $h[n]$: $-g$ at $n = 0$, then $1-g^2$ at $k$, $g(1-g^2)$ at $2k$, $g^2(1-g^2)$ at $3k$, ...
        [Figure: allpass block diagram from x[n] to y[n] with delay $z^{-k}$ and gains g, -g; “Synthetic Reverb” nested + cascade allpass structure with sections AP0, AP1, AP2 (labels 50,0.5; 30,0.7; 20,0.3), taps a0, a1, a2, and a feedback path through gain g and an LPF]
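The single allpass section on slide 21 corresponds to the difference equation y[n] = -g·x[n] + x[n-k] + g·y[n-k]. The sketch below implements just that one section (not the nested or cascaded structure) so its impulse response can be compared with the values listed on the slide; the function name is an assumption.

```python
def allpass(x, k, g):
    """Filter the sequence x through H(z) = (z^-k - g) / (1 - g z^-k)."""
    if k < 1:
        raise ValueError("delay k must be at least 1 sample")
    y = []
    for n, xn in enumerate(x):
        x_delayed = x[n - k] if n >= k else 0.0
        y_delayed = y[n - k] if n >= k else 0.0
        y.append(-g * xn + x_delayed + g * y_delayed)
    return y


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 9
    # Expect -g at n=0, then (1-g^2), g(1-g^2), g^2(1-g^2) at n = k, 2k, 3k.
    print(allpass(impulse, k=3, g=0.5))
```

The echoes are spaced k samples apart and decay by a factor g each time, while the magnitude response stays flat, which is why these sections are attractive as reverberator building blocks.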

  22. Synthetic binaural audio
      • Source convolved with {L,R} HRTFs gives precise positioning
        - ...for headphone presentation
        - can combine multiple sources (by adding)
      • Where to get HRTFs?
        - measured set, but: specific to individual, discrete
        - interpolate by linear crossfade, PCA basis set
        - or: parametric model - delay, shadow, pinna (after Brown & Duda '97)
          Delay: $z^{-t_{DL}(\theta)}$ (left), $z^{-t_{DR}(\theta)}$ (right)
          Shadow: $\frac{1 - b_L(\theta)\,z^{-1}}{1 - a\,z^{-1}}$ (left), $\frac{1 - b_R(\theta)\,z^{-1}}{1 - a\,z^{-1}}$ (right)
          Pinna: $\sum_k p_{kL}(\theta,\phi)\, z^{-t_{PkL}(\theta,\phi)}$ (left), $\sum_k p_{kR}(\theta,\phi)\, z^{-t_{PkR}(\theta,\phi)}$ (right)
          Room echo: $K_E \cdot z^{-t_E}$, added to each channel
      • Head motion cues?
        - head tracking + fast updates
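The first bullet on slide 22 (convolve each source with a left/right HRIR pair, then add the sources) can be sketched in a few lines. The impulse responses used in the example are toy placeholders, not measured HRIRs, and the function names are assumptions. Direct-form convolution is used for clarity; real systems would use FFT-based convolution.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def render_binaural(sources):
    """Mix sources into (left, right) headphone signals.

    sources: list of (signal, hrir_left, hrir_right) tuples, where the HRIR
    pair encodes the desired position of that source.
    """
    length = max(len(s) + max(len(hl), len(hr)) - 1 for s, hl, hr in sources)
    left = [0.0] * length
    right = [0.0] * length
    for signal, hrir_left, hrir_right in sources:
        for out, hrir in ((left, hrir_left), (right, hrir_right)):
            for n, value in enumerate(convolve(signal, hrir)):
                out[n] += value
    return left, right


if __name__ == "__main__":
    # Toy "HRIRs": the right ear hears the source one sample later and quieter.
    left, right = render_binaural([([1.0, 2.0], [1.0], [0.0, 0.5])])
    print(left, right)
```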

  23. Transaural sound
      • Binaural signals without headphones?
      • Can cross-cancel wrap-around signals
        - speakers $S_{L,R}$, ears $E_{L,R}$, binaural signals $B_{L,R}$
        - $S_L = H_{LL}^{-1}\,(B_L - H_{RL}\,S_R)$
        - $S_R = H_{RR}^{-1}\,(B_R - H_{LR}\,S_L)$
        [Figure: binaural signals $B_L$, $B_R$ pass through a matrix M to speakers $S_L$, $S_R$, which reach ears $E_L$, $E_R$ via paths $H_{LL}$, $H_{LR}$, $H_{RL}$, $H_{RR}$]
      • Narrow ‘sweet spot’
        - head motion?
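The cross-cancellation on slide 23 amounts to solving a 2×2 system at each frequency. The sketch below assumes the signal model E_L = H_LL·S_L + H_RL·S_R and E_R = H_LR·S_L + H_RR·S_R, with H_XY read as the path from speaker X to ear Y, and asks for E = B. It handles a single frequency bin with complex gains; the function name and the example values are hypothetical.

```python
def crosstalk_cancel(b_l, b_r, h_ll, h_lr, h_rl, h_rr):
    """Speaker signals (s_l, s_r) that deliver b_l, b_r to the two ears.

    All arguments are complex gains at one frequency. h_xy is the assumed
    transfer function from speaker x to ear y.
    """
    det = h_ll * h_rr - h_rl * h_lr
    if det == 0:
        raise ValueError("speaker-to-ear matrix is singular at this frequency")
    # Cramer's rule for the 2x2 system [h_ll h_rl; h_lr h_rr] [s_l; s_r] = [b_l; b_r].
    s_l = (h_rr * b_l - h_rl * b_r) / det
    s_r = (h_ll * b_r - h_lr * b_l) / det
    return s_l, s_r


if __name__ == "__main__":
    # Hypothetical symmetric setup: crosstalk paths weaker and phase-shifted.
    h_same, h_cross = 1.0 + 0.0j, 0.4 - 0.3j
    s_l, s_r = crosstalk_cancel(1.0 + 0.0j, 0.0j, h_same, h_cross, h_cross, h_same)
    print("ear L:", h_same * s_l + h_cross * s_r)
    print("ear R:", h_cross * s_l + h_same * s_r)
```

Because the transfer functions depend on the exact head position, the solution is only valid near the position where they were measured, which is one way to see why the sweet spot is narrow.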
