EE E6820: Speech & Audio Processing & Recognition
Lecture 4: Auditory Perception
1 Motivation: Why & how
2 Auditory physiology
3 Psychophysics: detection & discrimination
4 Pitch perception
5 Auditory organization & scene analysis
6 Speech perception
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
E6820 SAPR - Dan Ellis L04 - Perception 2002-02-18 - 1

1 Why study perception?
• Perception is messy: can we avoid it? No!
• Audition provides the ‘ground truth’ in audio
  - what is relevant and irrelevant
  - subjective importance of distortion (coding etc.)
  - (there could be other information in sound...)
• Some sounds are ‘designed’ for audition
  - co-evolution of speech and hearing
• The auditory system is very successful
  - we would do extremely well to duplicate it
• We are now able to model complex systems
  - faster computers, bigger memories

How to study perception?
Three different approaches:
• Analyze the example: physiology
  - dissection & nerve recordings
• Black box input/output: psychophysics
  - fit simple models of simple functions
• Information processing models
  - investigate and model complex functions
  - e.g. scene analysis, speech perception

Outline
1 Motivation
2 Physiology
  - Outer, middle & inner ear
  - The Auditory Nerve and beyond
  - Models
3 Psychophysics
4 Pitch perception
5 Scene analysis
6 Speech perception

2 Physiology
• Processing chain from air to brain:
  Outer ear → Middle ear → Inner ear → Auditory nerve → Midbrain → Cortex
• Study via:
  - anatomy
  - nerve recordings
• Signals flow in both directions

Outer & middle ear
[Figure: outer and middle ear, labeled Pinna, Ear canal, Eardrum (tympanum), Middle ear bones]
• Pinna ‘horn’
  - complex reflections give spatial (elevation) cues
• Ear canal
  - acoustic tube
• Middle ear
  - bones provide impedance matching

Inner ear: Cochlea
[Figure: cochlea, labeled Oval window (from ME bones), Basilar Membrane (BM), Traveling wave; resonant frequency falls from 16 kHz at position 0 to 50 Hz at 35 mm]
• Mechanical input from middle ear starts a traveling wave moving down the Basilar Membrane
• Varying stiffness and mass of the BM results in a continuous variation of resonant frequency
• At resonance, traveling wave energy is dissipated in BM movement
→ Frequency (Fourier) analysis

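The position-to-frequency relationship in the figure can be sketched numerically. This is a minimal illustration, assuming a purely exponential (log-frequency) map between the two endpoints read off the figure axis (16 kHz at 0 mm, 50 Hz at 35 mm). The exponential form and the function names are assumptions made for illustration; measured cochlear maps deviate from a pure exponential, especially at the low-frequency end.

```python
import math

# Endpoints taken from the slide's figure axis.
F_BASE_HZ = 16000.0    # resonant frequency at position 0 mm
F_APEX_HZ = 50.0       # resonant frequency at position 35 mm
BM_LENGTH_MM = 35.0

def resonant_freq_hz(x_mm):
    """Assumed exponential map from BM position (mm) to resonant frequency (Hz)."""
    frac = x_mm / BM_LENGTH_MM
    return F_BASE_HZ * (F_APEX_HZ / F_BASE_HZ) ** frac

def position_for_freq_mm(freq_hz):
    """Inverse of the assumed map: BM position (mm) that resonates at freq_hz."""
    return BM_LENGTH_MM * math.log(freq_hz / F_BASE_HZ) / math.log(F_APEX_HZ / F_BASE_HZ)

# Under this assumption each octave occupies the same length of membrane.
octaves = math.log2(F_BASE_HZ / F_APEX_HZ)
mm_per_octave = BM_LENGTH_MM / octaves

if __name__ == "__main__":
    for f in (8000.0, 1000.0, 100.0):
        print(f"{f:7.0f} Hz resonates near {position_for_freq_mm(f):5.1f} mm")
    print(f"about {mm_per_octave:.1f} mm of membrane per octave")
```

Under this assumed map, equal distances along the membrane correspond to equal frequency ratios, which is one way to read the constant-Q behavior noted on later slides.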
Cochlea hair cells
• Ear converts sound into BM motion; each point on BM corresponds to a frequency
[Figure: cochlea cross-section, labeled Tectorial membrane, Basilar membrane, Auditory nerve, Inner Hair Cell (IHC), Outer Hair Cell (OHC)]
• Hair cells on BM convert motion into nerve impulses (firings)
• Inner Hair Cells detect motion
• Outer Hair Cells? Variable damping?
[Allen simulation]

Inner Hair Cells
• IHCs convert BM motion into nerve firings
• Human ear has ~3500 IHCs; each IHC has ~7 connections to the Auditory Nerve
• Each nerve fires (sometimes) near peak displacement:
  [Figure: local BM displacement and typical nerve signal (mV) vs. time / ms]
• Histogram to get firing probability:
  [Figure: firing count vs. cycle angle]

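The histogram step can be illustrated with a small simulation. This is a sketch under stated assumptions, not a physiological model: BM displacement is a pure sinusoid, the per-sample firing probability is proportional to the half-wave rectified displacement, and `peak_prob` is a hypothetical value chosen only so that the fiber fires on some cycles and not others.

```python
import numpy as np

rng = np.random.default_rng(0)

freq_hz = 500.0       # stimulus tone frequency (illustrative)
fs = 100_000          # simulation sample rate in Hz
duration_s = 2.0
peak_prob = 0.02      # hypothetical firing probability per sample at peak displacement

t = np.arange(int(fs * duration_s)) / fs
displacement = np.sin(2 * np.pi * freq_hz * t)

# Assumed transduction: fire only on positive displacement, most often near the peak.
fire_prob = peak_prob * np.maximum(displacement, 0.0)
spikes = rng.random(t.size) < fire_prob

# Cycle angle of each spike, then a histogram of firing counts over one cycle.
phase = (2 * np.pi * freq_hz * t[spikes]) % (2 * np.pi)
counts, edges = np.histogram(phase, bins=16, range=(0.0, 2 * np.pi))
centers = 0.5 * (edges[:-1] + edges[1:])
peak_bin_center = centers[np.argmax(counts)]

if __name__ == "__main__":
    for c, n in zip(centers, counts):
        print(f"{c:5.2f} rad  {'#' * (n // 10)}")
```

Dividing `counts` by the number of stimulus cycles turns the histogram into an estimate of firing probability as a function of cycle angle.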
Auditory nerve (AN) signals
• Single nerve measurements:
  - Tone burst histogram: spike count vs. time for a 100 ms tone burst
  - Frequency threshold: dB SPL vs. (log) frequency, 100 Hz to 10 kHz (approx. constant-Q)
  - Rate vs. intensity: spikes/sec (0 to 300) vs. intensity / dB SPL (0 to 100)
• One fiber: ~25 dB dynamic range; hearing dynamic range > 100 dB
• Hard to measure: probe living ANs

AN population response
• All the information the brain has about sound:
  - average rate & spike timings on 30,000 fibers
• Not unlike a (constant-Q) spectrogram?
  [Figure: freq / 8ve re 100 Hz (0 to 5) vs. time / ms (0 to 60)]

Beyond the auditory nerve
• Ascending and descending
• Tonotopic × ?
  - modulation
  - position
  - source??

Periphery models
[Block diagram: Sound → Outer/middle ear filtering → Cochlea filterbank → IHC (one per channel)]
• Modeled aspects:
  - outer/middle ear
  - cochlea filtering
  - hair cell transduction
  - efferent feedback?
• Result: ‘neurogram’ / ‘cochleagram’
  [Figure: SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218); channel (0 to 60) vs. time / s (0 to 0.5)]

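The filterbank-plus-hair-cell chain can be sketched in a few lines. This is a simplified stand-in, not the model that produced the slide's figure: it assumes a gammatone impulse response for each cochlear channel, the Glasberg and Moore ERB formula for bandwidth, half-wave rectification as a crude hair cell, and a short moving average as the firing-rate estimate. The channel spacing (12 per octave from 180 Hz) follows the figure caption; `n_chans`, `smooth_s`, and the function names are illustrative choices, and outer/middle ear filtering and efferent feedback are omitted.

```python
import numpy as np

def erb_hz(fc):
    """Equivalent rectangular bandwidth (Glasberg & Moore formula), in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, dur_s=0.03, order=4):
    """Unit-energy gammatone impulse response centered at fc (assumed filter shape)."""
    t = np.arange(int(dur_s * fs)) / fs
    b = 1.019 * erb_hz(fc)
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return ir / np.sqrt(np.sum(ir ** 2))

def cochleagram(x, fs, f_lo=180.0, chans_per_oct=12, n_chans=48, smooth_s=0.002):
    """Return (center frequencies, channels x time array of smoothed rectified outputs)."""
    fcs = f_lo * 2.0 ** (np.arange(n_chans) / chans_per_oct)
    n_smooth = max(1, int(smooth_s * fs))
    win = np.ones(n_smooth) / n_smooth
    rows = []
    for fc in fcs:
        y = np.convolve(x, gammatone_ir(fc, fs))[: x.size]   # cochlea filtering
        y = np.maximum(y, 0.0)                               # half-wave rectification
        rows.append(np.convolve(y, win)[: x.size])           # smoothed 'firing rate'
    return fcs, np.array(rows)

# Usage: a 1 kHz tone should excite channels near 1 kHz most strongly.
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)
fcs, cg = cochleagram(tone, fs)
best_fc = fcs[np.argmax(cg.mean(axis=1))]

if __name__ == "__main__":
    print(f"strongest channel is centered at {best_fc:.0f} Hz")
```

Plotting `cg` as an image with time on the horizontal axis and channel index on the vertical axis gives a picture in the same layout as the figure.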
Outline
1 Motivation
2 Physiology
3 Psychophysics
  - Detection theory modeling
  - Intensity perception
  - Masking
4 Pitch perception
5 Scene analysis
6 Speech perception

3 Psychophysics
• Physiology looks at the implementation; psychology looks at the function/behavior
• Analyze audition as signal detection:
  - psychological tests reflect internal decisions
  - assume optimal decision process
  - infer nature of internal representations, noise, ...
  → lower bounds on more complex functions
• Different aspects to measure
  - time, frequency, intensity
  - tones, complexes, noise
  - binaural
  - pitch, detuning

Basic psychophysics
• Relate physical and perceptual variables
  - e.g. intensity → loudness, frequency → pitch
• Methodology: subject tests
  - just noticeable difference (jnd)
  - magnitude scaling, e.g. ‘adjust to twice as loud’
• Results for loudness vs. intensity:
  - Weber’s law: $\Delta I \propto I \rightarrow \log(L) = k \cdot \log(I)$
  - Textbook figure: $L \propto I^{0.3}$
  - Power law fit: $L \propto I^{0.22}$
  - $\log_2(L) = 0.3 \log_2(I) = 0.3 \cdot \frac{\log_{10} I}{\log_{10} 2} = \frac{0.3}{\log_{10} 2} \cdot \frac{\mathrm{dB}}{10} \approx \frac{\mathrm{dB}}{10}$
  [Figure: Hartmann (1993) classroom loudness scaling data; Log(loudness rating) (1.4 to 2.6) vs. Sound level / dB (-20 to 10)]

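The arithmetic behind the two power-law exponents can be checked directly. The sketch below only evaluates $L \propto I^{p}$ for the slide's two exponents (0.3 and 0.22); the function names are illustrative.

```python
import math

def loudness_ratio(delta_db, exponent=0.3):
    """Loudness ratio produced by a level change of delta_db, if L is proportional to I**exponent."""
    intensity_ratio = 10.0 ** (delta_db / 10.0)
    return intensity_ratio ** exponent

def db_per_doubling(exponent=0.3):
    """Level increase (dB) needed to double loudness under the same power law."""
    return 10.0 * math.log10(2.0) / exponent

if __name__ == "__main__":
    print(f"+10 dB with exponent 0.3 multiplies loudness by {loudness_ratio(10.0):.3f}")
    print(f"exponent 0.30: {db_per_doubling(0.30):.1f} dB per loudness doubling")
    print(f"exponent 0.22: {db_per_doubling(0.22):.1f} dB per loudness doubling")
```

With the textbook exponent a 10 dB step is almost exactly a doubling of loudness, because $\log_{10} 2 \approx 0.301$; with the fitted exponent of 0.22 a doubling takes closer to 14 dB.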
Loudness as a function of frequency
• Fletcher-Munson equal-loudness curves:
  [Figure: Intensity / dB SPL (0 to 120) vs. freq / Hz (100 to 10,000)]
  [Figure: two panels, 100 Hz and 1 kHz; Equivalent loudness @ 1kHz vs. Intensity / dB; annotation: rapid loudness growth]
• Hearing impairment: exaggerates

Loudness as a function of bandwidth
• Same total energy, different distribution:
  [Figure: two spectra (mag vs. freq) with the same total energy I·B: level I0 over bandwidth B0, and level I1 over bandwidth B1]
  ... but wider perceived as louder
  [Figure: Loudness vs. Bandwidth B, with the ‘critical’ bandwidth marked]
  - e.g. 2 chans at -6 dB (not -10 dB)
• Critical bands: independent freq. channels
  - ~ 25 total (4-6 / octave)
[sndex]

Simultaneous masking
• A louder tone can ‘mask’ the perception of a second tone nearby in frequency:
  [Figure: Intensity / dB vs. log freq, showing the masking tone, the absolute threshold, and the masked threshold]
• Suggests an ‘internal noise’ model:
  [Figure: distributions $p(x \mid I)$ and $p(x \mid I + \Delta I)$ of the decision variable $x$, spread by internal noise $\sigma_n$]

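The internal noise picture can be made concrete with standard signal detection formulas. This is a sketch assuming the internal noise is Gaussian with equal variance for both stimuli and that the listener uses an unbiased optimal decision rule; the slide does not specify a distribution, so these assumptions and the function names are illustrative.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def d_prime(delta_i, sigma_n):
    """Separation of p(x|I) and p(x|I+delta_i) in units of the internal noise."""
    return delta_i / sigma_n

def p_correct_yes_no(dp):
    """Single-interval task with the criterion midway between the two means."""
    return phi(dp / 2.0)

def p_correct_2ifc(dp):
    """Two-interval forced choice: pick the interval with the larger decision variable."""
    return phi(dp / math.sqrt(2.0))

if __name__ == "__main__":
    for delta in (0.0, 0.5, 1.0, 2.0):
        dp = d_prime(delta, sigma_n=1.0)
        print(f"d'={dp:3.1f}  yes/no={p_correct_yes_no(dp):.3f}  2IFC={p_correct_2ifc(dp):.3f}")
```

Running the formulas in reverse is how a measured percent-correct score is turned into an estimate of the internal noise $\sigma_n$.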
Sequential masking
• Backward/forward in time:
  [Figure: Intensity / dB vs. time, showing the masker envelope and the masked threshold; simultaneous masking annotated ~10 dB, backward masking ~5 ms, forward masking ~100 ms]
  - suggests temporal envelope of decision var.
→ Time-frequency masking ‘skirt’:
  [Figure: masked threshold surrounding the masking tone, plotted as intensity over freq and time]

What we do and don’t hear
“two-interval forced-choice”: A, B, X in sequence over time; X = A or B?
• Timing: 2 ms attack resolution, 20 ms discrim
  - but: spectral splatter
• Tuning: ~ 1% discrimination
  - but: beats
• Spectrum: profile changes, formants
  - variable time-frequency resolution
• Harmonic phase
• Noisy signals & texture
• (Trace vs. categorical memory)

Outline
1 Motivation
2 Physiology
3 Psychophysics
4 Pitch perception
  - ‘Place’ models
  - ‘Time’ models
  - Multiple cues & competition
5 Scene analysis
6 Speech perception

4 Pitch perception: A classic argument in psychophysics
• Harmonic complexes are a pattern on AN
  [Figure: freq. chan. (10 to 70) vs. time/s (0 to 0.1)]
  - .. but give a fused percept (ecological)
• What determines the pitch percept?
  - not the fundamental
• How is it computed?
Two competing models: place and time

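The point that pitch is not simply the fundamental can be illustrated with a ‘missing fundamental’ signal. The sketch below builds a complex from harmonics 3, 4 and 5 of 200 Hz, confirms there is no spectral energy at 200 Hz, and then estimates the periodicity from the peak of the waveform autocorrelation. Autocorrelation is used here only as a simple stand-in for a ‘time’ style analysis; the signal parameters and the lag search range are illustrative assumptions.

```python
import numpy as np

fs = 16000
f0 = 200.0
t = np.arange(int(0.1 * fs)) / fs

# Harmonics 3, 4, 5 only: no component at the fundamental itself.
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in (3, 4, 5))

# Spectral view: the strongest component is a harmonic, and the 200 Hz bin is empty.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(x.size, 1.0 / fs)
strongest_hz = freqs[np.argmax(spectrum)]
f0_bin = int(round(f0 * x.size / fs))

# Temporal view: the autocorrelation peaks at the common period of all harmonics.
ac = np.correlate(x, x, mode="full")[x.size - 1:]
min_lag = int(fs / 500.0)   # search periods between 2 ms ...
max_lag = int(fs / 100.0)   # ... and 10 ms
best_lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
pitch_hz = fs / best_lag

if __name__ == "__main__":
    print(f"strongest spectral component: {strongest_hz:.0f} Hz")
    print(f"autocorrelation period estimate: {pitch_hz:.0f} Hz")
```

The autocorrelation peak falls at the 5 ms common period even though no single component has that period, which matches the 200 Hz pitch listeners typically report for such a complex.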