11-755 Machine Learning for Signal Processing
Sparse Overcomplete, Shift- and Transform-Invariant Representations
Class 15, 14 Oct 2009
Recap: Mixture-multinomial model
- The basic model: each frame in the magnitude spectrogram is a histogram drawn from a mixture of multinomials (urns)
- The probability distribution used to draw the spectrum for the t-th frame is:

    P_t(f) = Σ_z P_t(z) P(f|z)

  where the P(f|z) are the source-specific spectral bases and the P_t(z) are the frame- (time-) specific mixture weights
11-755 MLSP: Bhiksha Raj
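The generative model above can be sketched numerically. A minimal NumPy illustration (the sizes F, Z, N here are arbitrary illustrative choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

F, Z = 8, 3  # number of frequencies and of component multinomials (bases)

# Source-specific bases P(f|z): each column is one multinomial over frequencies
P_f_given_z = rng.dirichlet(np.ones(F), size=Z).T   # shape (F, Z)

# Frame-specific mixture weights P_t(z) for one frame t
P_t_z = rng.dirichlet(np.ones(Z))                   # shape (Z,)

# Frame distribution: P_t(f) = sum_z P_t(z) P(f|z)
P_t_f = P_f_given_z @ P_t_z

# The "histogram" view of the frame: draw N quanta of spectral
# energy from this distribution
N = 1000
frame_histogram = rng.multinomial(N, P_t_f)
```

Each draw from the multinomial plays the role of one "ball" drawn from the mixture of urns; the resulting histogram is the (quantized) spectral frame.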
Recap: Mixture-multinomial model
[Figure: several histograms, each drawn from the same set of component multinomials (urns)]
- The individual multinomials represent the "spectral bases" that compose all signals generated by the source
  - E.g., they may be the notes for an instrument
  - More generally, they may not have such semantic interpretation
Recap: Learning Bases
- Learn bases from example spectrograms
- Initialize bases P(f|z) for all z, for all f
- For each frame t, initialize P_t(z)
- Iterate:

    P_t(z|f) = P_t(z) P(f|z) / Σ_{z'} P_t(z') P(f|z')

    P_t(z) = Σ_f P_t(z|f) S_t(f) / Σ_{z'} Σ_f P_t(z'|f) S_t(f)

    P(f|z) = Σ_t P_t(z|f) S_t(f) / Σ_{f'} Σ_t P_t(z|f') S_t(f')

  where S_t(f) is the magnitude of frequency f in the t-th frame
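The iteration above can be written compactly with NumPy broadcasting. A PLCA-style sketch (the function name, array layout, and iteration count are implementation choices, not from the slides):

```python
import numpy as np

def learn_bases(S, Z, iters=200, seed=0):
    """Learn bases P(f|z) and weights P_t(z) from a magnitude
    spectrogram S of shape (F, T), via the EM updates on the slide."""
    rng = np.random.default_rng(seed)
    F, T = S.shape
    # Random initialization, normalized to valid distributions
    Pf_z = rng.random((F, Z)); Pf_z /= Pf_z.sum(0, keepdims=True)  # P(f|z)
    Pt_z = rng.random((Z, T)); Pt_z /= Pt_z.sum(0, keepdims=True)  # P_t(z)
    for _ in range(iters):
        # E-step: posterior P_t(z|f) ∝ P_t(z) P(f|z)
        joint = Pf_z[:, :, None] * Pt_z[None, :, :]        # (F, Z, T)
        joint /= joint.sum(1, keepdims=True) + 1e-12       # P_t(z|f)
        weighted = joint * S[:, None, :]                   # P_t(z|f) S_t(f)
        # M-step: reestimate weights and bases
        Pt_z = weighted.sum(0); Pt_z /= Pt_z.sum(0, keepdims=True) + 1e-12
        Pf_z = weighted.sum(2); Pf_z /= Pf_z.sum(0, keepdims=True) + 1e-12
    return Pf_z, Pt_z
```

On data actually generated from a small number of bases (fewer than the number of frequencies, per the indeterminacy discussion below), the learned model Pf_z @ Pt_z closely matches the column-normalized spectrogram.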
Bases represent meaningful spectral structures
[Figure: learned bases P(f|z), the corresponding basis-specific spectrograms, and the mixture weights P_t(z), shown for a speech signal and for Bach's Fugue in Gm; axes are Frequency (vertical) and Time (horizontal)]
How about non-speech data?
[Figure: 19x19 face images = 361-dimensional vectors]
- We can use the same model to represent other data
- Images:
  - Every face in a collection is a histogram
  - Each histogram is composed from a mixture of a fixed number of multinomials
  - All faces are composed from the same multinomials, but the manner in which the multinomials are selected differs from face to face
- Each component multinomial is also an image
  - And can be learned from a collection of faces
  - Component multinomials are observed to be parts of faces
How many bases can we learn?
- The number of bases that must be learned is a fundamental question
  - How do we know how many bases to learn?
  - How many bases can we actually learn computationally?
- A key computational problem in learning bases: the number of bases we can learn correctly is restricted by the dimension of the data
  - I.e., if the spectrum has F frequencies, we cannot estimate more than F-1 component multinomials reliably
- Why?
Indeterminacy in Learning Bases
[Figure: four 3-bin histograms, each expressed as a mixture of the same bases: c·B1+d·B2, e·B1+f·B2, g·B1+h·B2, i·B1+j·B2]
- Consider the four histograms to the right
- All of them are mixtures of the same K component multinomials
- For K < 3, a single global solution may exist
  - I.e., there may be a unique set of component multinomials that explains all the histograms
  - With error: the model will not be perfect
- For K = 3, a trivial solution exists
Indeterminacy
[Figure: the same histograms explained by many different mixtures of three bases B1, B2, B3, e.g. 0.5B1+0.33B2+0.17B3, 0.5B1+0.17B2+0.33B3, 0.33B1+0.5B2+0.17B3, 0.4B1+0.2B2+0.4B3; the trivial bases B1, B2, B3 each assign probability 1 to a single bin and 0 to the rest]
- Multiple solutions exist for K = 3
  - We cannot learn a non-trivial set of "optimal" bases from the histograms
  - The component multinomials we do learn tell us nothing about the data
- For K > 3, the problem only gets worse
  - An infinite set of solutions is possible
  - E.g., the trivial solution plus a random basis
Indeterminacy in signal representations
- Spectra:
  - If our spectra have D frequencies (the number of unique indices in the DFT), then we cannot learn D or more meaningful component multinomials to represent them
  - The trivial solution will give us D components, each of which has probability 1.0 for one frequency and 0 for all others
  - This does not capture the innate spectral structures of the source
- Images: it is not possible to learn more than P-1 meaningful component multinomials from a collection of P-pixel images
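The trivial solution is easy to demonstrate: with K equal to the data dimension D, the identity bases (one per frequency bin) reproduce any histogram with zero error, using mixture weights equal to the normalized histogram itself, so the "learned" bases say nothing about the source. A small sketch (the sizes and data here are arbitrary):

```python
import numpy as np

D = 5
trivial_bases = np.eye(D)  # basis z puts probability 1.0 on bin z, 0 elsewhere

rng = np.random.default_rng(0)
histogram = rng.integers(1, 20, size=D).astype(float)  # any data whatsoever

# Mixture weights are just the normalized data: no structure is learned
weights = histogram / histogram.sum()
reconstruction = trivial_bases @ weights
```

The reconstruction matches the normalized histogram exactly, for any input, which is precisely why such a decomposition carries no information about the source.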
Overcomplete Representations
- Representations with more bases than dimensions are called overcomplete
  - E.g., more multinomial components than dimensions
  - More L2 bases (e.g., Eigenvectors) than dimensions
  - More non-negative bases than dimensions
- Overcomplete representations are difficult to compute
  - Straightforward computation results in indeterminate solutions
- Overcomplete representations are required to represent the world adequately
  - The complexity of the world is not restricted by the dimensionality of our representations!
How many bases to represent sounds/images?
- In each case, the bases represent "typical unit structures"
  - Notes
  - Phonemes
  - Facial features…
- To model the data well, all of these must be represented
- How many notes in music?
  - Several octaves
  - Several instruments
  - The total number of notes required to represent all "typical" sounds in music is in the thousands
- The typical sounds in speech:
  - Many phonemes, many variations; they can number in the thousands
- Images:
  - Millions of units can compose an image: trees, dogs, walls, sky, etc., etc.
How many can we learn?
- Typical Fourier representation of sound: 513 (or fewer) unique frequencies
  - I.e., no more than 512 unique bases can be learned reliably
  - These 512 bases must represent everything
    - Including the units of music, speech, and the other sounds in the world around us
  - Depending on what we're attempting to model
- Typical "tiny" image: 100x100 pixels
  - 10000 pixels
  - I.e., no more than 9999 distinct bases can be learned reliably
  - But the number of unique entities that can be represented in a 100x100 image is countless!
- We need overcomplete representations to model these data well
Learning Overcomplete Representations
- Learning more multinomial components than dimensions (frequencies or pixels) in the data leads to indeterminate or useless solutions
- Additional criteria must be imposed in the learning process to learn more components than dimensions
  - Impose additional constraints that will enable us to obtain meaningful solutions
- We will require our solutions to be sparse
SPARSE Decompositions
[Figure: many urns (bases), of which only a few are used in any given frame]
- Allow an arbitrary number of bases (urns)
  - Overcomplete
- Specify that for any specific frame only a small number of bases may be used
  - Although there are many spectral structures, any given frame only has a few of them
- In other words, the mixture weights with which the bases are combined must be sparse
  - Have non-zero value for only a small number of bases
  - Alternately, be of a form in which only a small number of bases contribute significantly
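One simple way to see how a sparsity bias concentrates mixture weights is an annealing-style heuristic: raise the weights to a power greater than one and renormalize. This is a sketch for illustration only, an assumption on my part, not the lecture's actual sparse-learning update:

```python
import numpy as np

def sparsify(weights, alpha=1.5):
    """Heuristic sparsity step (illustrative assumption, not the course's
    method): raising weights to a power alpha > 1 and renormalizing
    concentrates probability mass on the largest components."""
    w = np.asarray(weights, dtype=float) ** alpha
    return w / w.sum()

# Applying the step repeatedly drives the weights toward a sparse solution
w = np.array([0.4, 0.3, 0.2, 0.1])
for _ in range(10):
    w = sparsify(w)
```

After repeated application, nearly all the mass sits on the single largest weight, i.e., only one basis "contributes significantly," which is the qualitative behavior a sparse decomposition asks of the mixture weights at every frame.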