Auditory Scene Analysis: phenomena, theories and computational models
July 1998
Dan Ellis
International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu>

Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
4 Big issues

ASA - Dan Ellis 1998jul11 - 1

Auditory Scene Analysis
What does our sense of hearing do?
  - recover useful information
    ... about objects of interest
    ... in a wide range of circumstances
Measuring objects in an auditory scene:

Subjective analysis of auditory scenes
[Figure: time-frequency display of the “City” scene, f/Hz (200 to 4000) against a 0 to 9 time axis, annotated with the events reported by subjects S1 to S10]
Horn1 (10/10): S9−horn 2; S10−car horn; S4−horn1; S6−double horn; S2−first double horn; S7−horn; S7−horn2; S3−1st horn; S5−Honk; S8−car horns; S1−honk, honk
Crash (10/10): S7−gunshot; S8−large object crash; S6−slam; S9−door Slam?; S2−crash; S4−crash; S10−door slamming; S5−Trash can; S3−crash (not car); S1−slam
Horn2 (5/10): S9−horn 5; S8−car horns; S2−horn during crash; S6−doppler horn; S7−horn3
Truck (7/10): S8−truck engine; S2−truck accelerating; S5−Acceleration; S1−rev up/passing; S6−acceleration; S3−closeup car; S10−wheels on road
Horn3 (5/10): S7−horn4; S9−horn 3; S8−car horns; S3−2nd horn; S10−car horn
• Subjects identify structures in dense scenes with high agreement

Outline
1 The computational theory of ASA
  - ASA and CASA
  - The grouping paradigm
  - Marr’s three levels of explanation
2 Cues & grouping
3 Expectations & inference
4 Big issues

Auditory Scene Analysis (ASA)
“The organization of sound scenes according to their inferred sources”
• Real-world sounds rarely occur in isolation
  → a useful sense of hearing must be able to segregate mixtures
  - people (and ...) do this very well; unexpectedly difficult to model
  - depends on:
    subjective definition of relevant sources
    regularity/constraints of real-world sounds
• Studied via experimental psychology
  - characterize ‘rules’ for organizing simple pieces (tones, noise bursts, clicks), i.e. a ‘reductive’ approach

Computational Auditory Scene Analysis (CASA)
• Psychological ‘rules’ suggest computer implementation
  - .. but many practical problems arise!
• Motivations:
  Practical applications
  - real-world interactive systems
  - indexing of media databases
  - hearing prostheses
  Crossover opportunities
  - unknown signal/information processing principles?
  Benefits for theory
  - implementations are very revealing

The grouping paradigm
• Standard theory of ASA (Bregman, Darwin &c):
  - sound mixture is broken up into small elements, e.g. time-frequency ‘cells’
  - each element has a number of feature dimensions (amplitude, ITD, period)
  - elements are grouped together according to their features to form larger structures
  - resulting groups have overall attributes (pitch, location)
(from Darwin 1996)

Marr’s levels-of-explanation of information processing
• Three distinct aspects to info. processing
  Computational theory: ‘what’ and ‘why’; the overall goal (e.g. sound source organization)
  Algorithm: ‘how’; an approach to meeting the goal (e.g. auditory grouping)
  Implementation: practical realization of the process (e.g. feature calculation & binding)
Why bother?
  - to help organize understanding
  - avoid confusion/wasted effort
  → use as an analysis tool...

Level 1: Computational theory
• The underlying regularities that make the problem possible
  - i.e. the ‘ecological’ facts
• Implicit definition of “what is a source?”:
  Independence of attributes between sources
  Continuity of attributes for each source
  + other source-specific constraints

Level 2: Algorithm
• A particular approach to exploiting the constraints of the computational theory
  - both process & representation
• Audition: the “elements-then-grouping” approach
  - could have been otherwise, e.g. templates
• Often the focus of analysis
  - but: debate is muddled without a clear computational theory

Level 3: Implementation
• A specific realization of the algorithm
  - computer programs
  - neurons
  - ...
• Can be analyzed separately?
  - provided epiphenomena are correctly assigned
• Needs context of algorithm, computational theory
  “You cannot understand stereopsis simply by thinking about neurons”

The advantage of the appropriate level
• Computational theory
  - determines the purpose of the process; provides focus necessary for analysis
    e.g. biosonar: benefit of hyperresolution
• Algorithm
  - abstraction that is still specific, transferable
    e.g. autocorrelation for pitch
• Implementation
  - explain ‘epiphenomena’
    e.g. ‘subjective octave’ from refractory period

An example: Neural inhibition
  Computational theory: frequency-domain processing [Figure: X(f) plotted against f]
  Algorithm: discrete-time filtering (subtraction)
  Implementation: neurons with GABAergic inhibitions

Summary 1
• Acoustic scenes are very complex
• .. but the auditory system extracts useful information
• Grouping is the main focus of Auditory Scene Analysis
• .. but it fits into a larger Marrian framework

Outline
1 The computational theory of ASA
2 Cues & grouping
  - Cue analysis
  - Simple scenes
  - Models
  - Complications: interaction, ambiguity, time
3 Expectations & inference
4 Big issues

Cues to grouping
• Common onset/offset/modulation (“fate”)
  Computational theory: acoustic consequences tend to be synchronized
  Algorithm: group elements that start in a time range
  Implementation: onset detector cells; synchronized osc’s?
• Common periodicity (“pitch”)
  Computational theory: (nonlinear) cyclic processes are common
  Algorithm: ? place patterns; ? autocorrelation
  Implementation: ? delay-and-mult; ? modulation spect
• Spatial location (ITD, ILD, spectral cues)
• Sequential cues...
• Source-specific cues...

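The autocorrelation option listed for the periodicity cue can be sketched in a few lines. This is only an illustration, not the model from any cited work: the test signal (three harmonics of 200 Hz at an 8 kHz sample rate), the lag search range and the function names are all assumptions made for the example.

```python
import math

def autocorrelation(x, lag):
    """Unnormalized autocorrelation of sequence x at a single lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

def estimate_period(x, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(x, lag))

# Hypothetical test signal: three harmonics of 200 Hz sampled at 8 kHz.
SAMPLE_RATE = 8000
F0 = 200.0
signal = [
    sum(math.sin(2 * math.pi * F0 * k * n / SAMPLE_RATE) for k in (1, 2, 3))
    for n in range(800)
]

# Search the lags that correspond to pitches from 400 Hz down to 40 Hz.
period = estimate_period(signal, SAMPLE_RATE // 400, SAMPLE_RATE // 40)
pitch_hz = SAMPLE_RATE / period
```

Because the sum is unnormalized, longer lags use fewer samples, which biases the peak toward the shortest matching period rather than one of its multiples.
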
Simple grouping
• E.g. isolated tones
[Figure: isolated tones, freq against time]
  Computational theory:
  • common onset
  • common period (harmonicity)
  Algorithm:
  • locate elements (tracks)
  • group by shared features
  Implementation:
  ? exhaustive search
  • evolution in time

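As a rough sketch of “group by shared features” for the common-onset cue, the snippet below groups tone tracks whose onsets fall close together. The track representation, the 30 ms tolerance and the function name are assumptions for illustration, and a greedy single pass over sorted onsets stands in for the exhaustive search the slide queries.

```python
def group_by_onset(tracks, tolerance_s=0.03):
    """Greedily group (freq_hz, onset_s) tracks whose onsets fall within
    tolerance_s of the first onset in the current group."""
    groups = []
    for track in sorted(tracks, key=lambda t: t[1]):
        if groups and track[1] - groups[-1][0][1] <= tolerance_s:
            groups[-1].append(track)   # close enough: same group
        else:
            groups.append([track])     # otherwise start a new group
    return groups

# Hypothetical tracks: three harmonics starting near 0 s, two near 0.5 s.
tracks = [(200.0, 0.000), (400.0, 0.010), (600.0, 0.005),
          (330.0, 0.500), (660.0, 0.510)]
groups = group_by_onset(tracks)
```

A fuller version would also test each group for common period (harmonicity) and would have to update the grouping as the scene evolves in time.
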
Computer models of grouping
• “Bregman at face value” (e.g. Brown 1992):
[Block diagram: input mixture → Front end → signal features (maps) → Object formation → discrete objects → Grouping rules → source groups; the maps (onset, period, frq.mod) are drawn over freq and time]
  - feature maps
  - periodicity cue
  - common-onset boost
  - resynthesis

Grouping model results
• Able to extract voiced speech:
[Figure: two panels, brn1h.aif and brn1h.fi.aif, frq/Hz (100 to 3000) against time/s (0.2 to 1.0)]
• Periodicity is the primary cue
  - how to handle aperiodic energy?
• Limitations
  - resynthesis via filter-mask
  - only periodic targets
  - robustness of discrete objects

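The masking step behind “resynthesis via filter-mask” can be illustrated on a toy time-frequency grid. This is a sketch under stated assumptions: the grid values, the binary mask and the function name are hypothetical, and a real system would still have to pass the masked channels back through its filterbank to produce a waveform.

```python
def apply_mask(tf_cells, mask):
    """Element-wise product of a time-frequency grid with a same-shaped mask."""
    return [[cell * m for cell, m in zip(row, mask_row)]
            for row, mask_row in zip(tf_cells, mask)]

# Hypothetical 3-channel x 4-frame energy grid (rows = frequency channels).
mixture = [[0.9, 0.8, 0.1, 0.0],
           [0.2, 0.7, 0.6, 0.1],
           [0.0, 0.1, 0.5, 0.4]]

# Binary mask: 1 where a cell was assigned to the target group.
target_mask = [[1, 1, 0, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]]

target = apply_mask(mixture, target_mask)
```

The all-or-nothing mask shows one source of the limitations listed above: any cell shared between sources is either kept whole or discarded whole.
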
Complications for grouping: 1: Cues in conflict
• Mistuned harmonic (Moore, Darwin..):
[Figure: harmonic complex with one mistuned component, freq against time]
  - harmonic usually groups by onset & periodicity
  - can alter frequency and/or onset time
  - ‘degree of grouping’ from overall pitch match
• Gradual, various results:
[Figure: pitch shift against mistuning, with 3% marked]
  - heard as separate tone, still affects pitch

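For concreteness, mistuning can be expressed as the percentage deviation of a component from its nearest harmonic. The helper below is a hypothetical illustration of that arithmetic only; it does not model the perceptual results, and the 200 Hz fundamental is an assumed example value.

```python
def mistuning_percent(component_hz, f0_hz):
    """Signed deviation of a component from its nearest harmonic of f0_hz,
    as a percentage of that harmonic's frequency."""
    harmonic_number = max(1, round(component_hz / f0_hz))
    harmonic_hz = harmonic_number * f0_hz
    return 100.0 * (component_hz - harmonic_hz) / harmonic_hz

# Hypothetical case: the 4th harmonic of a 200 Hz complex moved from 800 to 824 Hz.
shift = mistuning_percent(824.0, 200.0)
```

Here `shift` comes out at 3 percent, the mistuning value marked in the figure on this slide.
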
Complications for grouping: 2: The effect of time
• Added harmonics:
[Figure: harmonics added to an ongoing sound, freq against time]
  - onset cue initially segregates; periodicity eventually fuses
• The effect of time
  - some cues take time to become apparent
  - onset cue becomes increasingly distant...
• What is the impetus for fission?
  - e.g. double vowels
  - depends on what you expect .. ?

Summary 2
• Known grouping cues make sense
• Simple examples are straightforward
• Models can be implemented directly
• .. but problematic situations abound

Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
  - “Old-plus-new”
  - Streaming
  - Restoration & illusions
  - Top-down models
4 Big issues

The effect of context
• Context can create an ‘expectation’, i.e. a bias towards a particular interpretation
• e.g. Bregman’s “old-plus-new” principle:
  A change in a signal will be interpreted as an added source whenever possible
[Figure: freq/kHz (0 to 2) against time/s (0.0 to 1.2)]
  - a different division of the same energy depending on what preceded it

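One simple way to picture the “old-plus-new” principle is to treat whatever part of the new spectrum the preceding sound can explain as its continuation, and attribute the remainder to an added source. The sketch below assumes that per-channel energies simply add, which is only an approximation; the function name and spectra are hypothetical.

```python
def old_plus_new(old_spectrum, new_spectrum):
    """Split new_spectrum into the part explained by the continuing 'old'
    sound and a non-negative residual attributed to an added source."""
    continued = [min(o, n) for o, n in zip(old_spectrum, new_spectrum)]
    added = [n - c for n, c in zip(new_spectrum, continued)]
    return continued, added

# Hypothetical per-channel energies before and after a change in the signal.
old = [1.0, 1.0, 0.0, 0.0]
new = [1.0, 2.0, 1.5, 0.0]
continued, added = old_plus_new(old, new)
```

The same `new` spectrum is divided differently when `old` changes, which is the context dependence the slide describes.
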