Robust nonnegative matrix factorisation with the β-divergence and applications in imaging
Cédric Févotte, Institut de Recherche en Informatique de Toulouse
Imaging & Machine Learning, Institut Henri Poincaré, April 2019
Outline
Generalities: Matrix factorisation models; Nonnegative matrix factorisation (NMF)
Optimisation for NMF: Measures of fit; Majorisation-minimisation
Applications in imaging: Hyperspectral unmixing in remote sensing; Factor analysis in dynamic PET
Matrix factorisation models
Data often available in matrix form:
[Figure: a features × samples matrix of coefficients]
[Figure: a movies × users matrix of ratings]
[Figure: a words × text documents matrix of word counts]
[Figure: a frequencies × time matrix of Fourier coefficients (spectrogram)]
Matrix factorisation models
dictionary learning ≈ low-rank approximation ≈ factor analysis ≈ latent semantic analysis
data X ≈ dictionary W × activations H
Matrix factorisation models for dimensionality reduction (coding, low-dimensional embedding)
Matrix factorisation models for unmixing (source separation, latent topic discovery)
Matrix factorisation models for interpolation (collaborative filtering, image inpainting)
Nonnegative matrix factorisation
V ≈ WH, with V of size F × N (F features, N samples), W of size F × K (K patterns) and H of size K × N.
◮ data V and factors W, H have nonnegative entries.
◮ nonnegativity of W ensures interpretability of the dictionary, because patterns w_k and samples v_n belong to the same space.
◮ nonnegativity of H tends to produce part-based representations, because subtractive combinations are forbidden.
Early work by (Paatero and Tapper, 1994), landmark Nature paper by (Lee and Seung, 1999)
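A minimal NumPy sketch of the model and its shapes; the sizes F, N, K and the random factors below are purely illustrative, not fitted to any data.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 100, 500, 10        # features, samples, patterns (illustrative sizes)

W = rng.random((F, K))        # dictionary: K nonnegative patterns of length F
H = rng.random((K, N))        # activations: nonnegative coefficients
V = W @ H                     # noiseless data generated by the model, F x N

# Each sample v_n is an additive (nonnegative) combination of the K patterns.
assert np.allclose(V[:, 0], W @ H[:, 0]) and (V >= 0).all()
```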
NMF for latent semantic analysis (Lee and Seung, 1999; Hofmann, 1999)
[Figure reproduced from (Lee and Seung, 1999): the encyclopedia entry 'Constitution of the United States' is approximated as v_n ≈ W h_n; the learned patterns in W group semantically related words (court, government, council, senate, congress, ...; flowers, leaves, plant, perennial, ...; disease, glands, symptoms, skin, infection, ...) and h_n holds the entry's nonnegative loadings on each pattern: president (148), congress (124), power (120), united (104), constitution (81), amendment (71), government (57), law (49).]
NMF for audio spectral unmixing (Smaragdis and Brown, 2003)
[Figure reproduced from (Smaragdis, 2013): the spectrogram of an input music passage (frequency in Hz vs. time in seconds) is decomposed into a few spectral components and their activations in time.]
NMF for hyperspectral unmixing (Berry, Browne, Langville, Pauca, and Plemmons, 2007)
[Figure reproduced from (Bioucas-Dias et al., 2012)]
NMF as a constrained minimisation problem
Minimise a measure of fit between V and WH, subject to nonnegativity:
$$\min_{W, H \geq 0} D(V \mid WH) = \sum_{fn} d\big([V]_{fn} \mid [WH]_{fn}\big),$$
where d(x | y) is a scalar cost function, e.g.,
◮ squared Euclidean distance (Paatero and Tapper, 1994; Lee and Seung, 2001)
◮ Kullback-Leibler divergence (Lee and Seung, 1999; Finesso and Spreij, 2006)
◮ Itakura-Saito divergence (Févotte, Bertin, and Durrieu, 2009)
◮ α-divergence (Cichocki et al., 2008)
◮ β-divergence (Cichocki et al., 2006; Févotte and Idier, 2011)
◮ Bregman divergences (Dhillon and Sra, 2005)
◮ and more in (Yang and Oja, 2011)
Regularisation terms often added to D(V | WH) for sparsity, smoothness, dynamics, etc. Nonconvex problem.
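Several of these measures of fit are available in off-the-shelf solvers; for instance, scikit-learn's NMF exposes the Euclidean, KL and IS costs through its beta_loss parameter. A small sketch with arbitrary data and settings:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((100, 500))          # nonnegative data matrix (illustrative)

# beta_loss selects the measure of fit: 'frobenius' (squared Euclidean),
# 'kullback-leibler' (generalised KL) or 'itakura-saito'; the multiplicative
# update solver ('mu') is required for the non-Euclidean costs.
model = NMF(n_components=10, solver='mu', beta_loss='kullback-leibler',
            init='nndsvda', max_iter=500)
W = model.fit_transform(V)          # F x K dictionary
H = model.components_               # K x N activations
print(model.reconstruction_err_)    # divergence between V and WH after fitting
```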
Probabilistic models
◮ Let V ∼ p(V | WH) such that
  ◮ E[V | WH] = WH
  ◮ p(V | WH) = ∏_{fn} p(v_{fn} | [WH]_{fn})
◮ then the following correspondences apply, with D(V | WH) = −log p(V | WH) + cst:

| data support            | distribution/noise   | divergence        | examples                 |
| real-valued             | additive Gaussian    | squared Euclidean | many                     |
| integer                 | multinomial⋆         | weighted KL       | word counts              |
| integer                 | Poisson              | generalised KL    | photon counts            |
| nonnegative             | multiplicative Gamma | Itakura-Saito     | spectrogram              |
| nonnegative (generally) | Tweedie              | β-divergence      | generalises above models |

⋆ conditional independence over f does not apply
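As a quick sanity check of one of these correspondences (not from the slides): for Poisson observations, the negative log-likelihood differs from the generalised KL divergence only by a term that depends on the data, so minimising one over WH is equivalent to minimising the other.

```python
import numpy as np
from scipy.stats import poisson

def d_kl(x, y):
    # generalised KL divergence for x > 0, y > 0
    return x * np.log(x / y) - x + y

x = 7                                  # observed photon count (illustrative)
for y in [2.0, 5.0, 7.0, 12.0]:        # candidate values of [WH]_fn
    nll = -poisson.logpmf(x, y)
    print(y, nll - d_kl(x, y))         # same constant, log(x!) - x log x + x, for every y
```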
The β-divergence
A popular measure of fit in NMF (Basu et al., 1998; Cichocki and Amari, 2010):
$$d_\beta(x \mid y) \stackrel{\mathrm{def}}{=} \begin{cases} \dfrac{1}{\beta(\beta-1)}\left( x^\beta + (\beta-1)\, y^\beta - \beta\, x\, y^{\beta-1} \right) & \beta \in \mathbb{R} \setminus \{0, 1\} \\[2mm] x \log \dfrac{x}{y} + (y - x) & \beta = 1 \\[2mm] \dfrac{x}{y} - \log \dfrac{x}{y} - 1 & \beta = 0 \end{cases}$$
Special cases:
◮ squared Euclidean distance (β = 2)
◮ generalised Kullback-Leibler (KL) divergence (β = 1)
◮ Itakura-Saito (IS) divergence (β = 0)
Properties:
◮ Homogeneity: d_β(λx | λy) = λ^β d_β(x | y)
◮ d_β(x | y) is a convex function of y for 1 ≤ β ≤ 2
◮ Bregman divergence
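A direct elementwise transcription of this definition (illustrative, with no numerical safeguards); note that β = 2 yields ½(x − y)², i.e. the squared Euclidean cost up to a factor ½.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of elementwise beta-divergences d_beta(x|y), for x, y > 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 1:                                   # generalised KL
        return np.sum(x * np.log(x / y) + (y - x))
    if beta == 0:                                   # Itakura-Saito
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum((x**beta + (beta - 1) * y**beta
                   - beta * x * y**(beta - 1)) / (beta * (beta - 1)))

x, y = np.array([1.0, 2.0]), np.array([1.5, 1.0])
assert np.isclose(beta_divergence(x, y, 2), 0.5 * np.sum((x - y) ** 2))
```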
The β-divergence
[Figure: d_β(x = 1 | y) plotted as a function of y for β = 2 (Euc), β = 1 (KL), β = 0 (IS), β = −1 and β = 3; every curve is zero and minimal at y = x = 1.]
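The curves summarised above can be reproduced with a few lines of matplotlib (a sketch reusing an elementwise version of the definition):

```python
import numpy as np
import matplotlib.pyplot as plt

def d_beta(x, y, beta):
    # elementwise beta-divergence, x and y > 0
    if beta == 1:
        return x * np.log(x / y) + (y - x)
    if beta == 0:
        return x / y - np.log(x / y) - 1
    return (x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)) / (beta * (beta - 1))

y = np.linspace(0.05, 5, 500)
for beta, label in [(2, 'β = 2 (Euc)'), (1, 'β = 1 (KL)'), (0, 'β = 0 (IS)'),
                    (-1, 'β = -1'), (3, 'β = 3')]:
    plt.plot(y, d_beta(1.0, y, beta), label=label)
plt.ylim(0, 1)
plt.xlabel('y')
plt.title('d(x=1|y)')
plt.legend()
plt.show()
```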
Common NMF algorithm design
◮ Block-coordinate update of H given W^(i−1) and W given H^(i).
◮ Updates of W and H equivalent by transposition: V ≈ WH ⇔ V^T ≈ H^T W^T
◮ Objective function separable in the columns of H (or the rows of W):
$$D(V \mid WH) = \sum_n D(v_n \mid W h_n)$$
◮ Essentially left with nonnegative linear regression:
$$\min_{h \geq 0} C(h) \stackrel{\mathrm{def}}{=} D(v \mid Wh)$$
Numerous references in the image restoration literature, e.g., (Richardson, 1972; Lucy, 1974; Daube-Witherspoon and Muehllehner, 1986; De Pierro, 1993)
Block-descent algorithm, nonconvex problem, initialisation is an issue.
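For the squared Euclidean cost, each column subproblem is an ordinary nonnegative least-squares fit, so the block-coordinate scheme can be sketched with SciPy's NNLS solver (a naive, purely illustrative implementation; dedicated NMF solvers are far more efficient):

```python
import numpy as np
from scipy.optimize import nnls

def anls_nmf(V, K, n_iter=50, seed=0):
    """Alternating nonnegative least squares for the squared Euclidean cost."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W, H = rng.random((F, K)), rng.random((K, N))
    for _ in range(n_iter):
        for n in range(N):                 # update H given W, one column at a time:
            H[:, n], _ = nnls(W, V[:, n])  # min_{h >= 0} ||v_n - W h||^2
        for f in range(F):                 # update W given H, via transposition:
            W[f, :], _ = nnls(H.T, V[f, :])
    return W, H
```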
Majorisation-minimisation (MM)
Build G(h | h̃) such that G(h | h̃) ≥ C(h) for all h and G(h̃ | h̃) = C(h̃). Optimise (iteratively) G(h | h̃) instead of C(h).
[Figure: at each iteration, the auxiliary function G(h | h^(i)) lies above the objective C(h) and touches it at the current iterate h^(i); minimising it yields h^(i+1), so the iterates h^(0), h^(1), h^(2), h^(3), ... monotonically decrease C(h) and approach a (local) minimiser h*.]
Majorisation-minimisation (MM)
◮ Finding a good & workable local majorisation is the crucial point.
◮ Treating convex and concave terms separately with Jensen and tangent inequalities usually works. E.g., for the Itakura-Saito cost,
$$C_{\mathrm{IS}}(h) = \sum_f \frac{v_f}{\sum_k w_{fk} h_k} + \sum_f \log\Big( \sum_k w_{fk} h_k \Big) + \mathrm{cst},$$
the convex first term is majorised with Jensen's inequality and the concave second term with its tangent.
◮ In most cases, this leads to nonnegativity-preserving multiplicative algorithms:
$$h_k = \tilde{h}_k \left( \frac{\nabla^{-}_{h_k} C(\tilde{h})}{\nabla^{+}_{h_k} C(\tilde{h})} \right)^{\gamma}$$
◮ ∇_{h_k} C(h) = ∇^{+}_{h_k} C(h) − ∇^{−}_{h_k} C(h), where the two summands are nonnegative.
◮ if ∇_{h_k} C(h̃) > 0, the ratio of the summands is < 1 and h_k decreases (moves left).
◮ γ is a divergence-specific scalar exponent.
◮ Details in (Févotte and Idier, 2011; Yang and Oja, 2011; Zhao and Tan, 2018)
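As a concrete instance, here is a compact sketch of the resulting multiplicative updates for β-NMF, with the exponent γ(β) of (Févotte and Idier, 2011); this is illustrative code with no convergence checks or numerical safeguards.

```python
import numpy as np

def gamma_exponent(beta):
    # Exponent ensuring monotone descent of the beta-divergence (Fevotte & Idier, 2011).
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta > 2:
        return 1.0 / (beta - 1.0)
    return 1.0

def mu_beta_nmf(V, K, beta=1.0, n_iter=200, seed=0):
    """Multiplicative-update NMF for the beta-divergence (minimal sketch)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    g = gamma_exponent(beta)
    for _ in range(n_iter):
        WH = W @ H
        H *= ((W.T @ (V * WH ** (beta - 2))) / (W.T @ WH ** (beta - 1))) ** g
        WH = W @ H
        W *= (((V * WH ** (beta - 2)) @ H.T) / (WH ** (beta - 1) @ H.T)) ** g
    return W, H
```

With β = 2 (where γ = 1) these expressions reduce to the classical Lee-Seung multiplicative updates for the squared Euclidean cost.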