  1. Learning with Low Rank Approximations, or how to use near separability to extract content from structured data
     Jeremy E. Cohen, IRISA, INRIA, CNRS, University of Rennes, France
     30 April 2019

  2. 1 Introduction: separability and matrix/tensor rank
     2 Semi-supervised learning: dictionary-based matrix and tensor factorization
     3 Complete dictionary learning for blind source separation
     4 Joint factorization models: some facts, and the linearly coupled case

  3. Separability: a fundamental property
     Definition: Separability. Let f : R^{m_1} × R^{m_2} × R^{m_3} → R, m_i ∈ N. The map f is said to be separable if there exist real maps f_1, f_2, f_3 such that f(x, y, z) = f_1(x) f_2(y) f_3(z). Of course, any order (i.e. number of variables) is fine.
     Examples: (xyz)^n = x^n y^n z^n,   e^{x+y} = e^x e^y,   ∫_x ∫_y h(x) g(y) dx dy = (∫_x h(x) dx)(∫_y g(y) dy).
     Some usual functions are not separable, but can be written as a sum of a few separable ones!
     • cos(a + b) = cos(a) cos(b) − sin(a) sin(b)
     • log(xy) = log(x) · 1_{y ∈ R_+} + 1_{x ∈ R_+} · log(y)

  4. Some tricks on separability
     A fun case: the exponential can be seen as separable for any given order.
     Let y_1(x), y_2(x), ..., y_n(x) be such that x = ∑_{i=1}^n y_i(x) for all x ∈ R. Then
     e^x = ∏_{i=1}^n e^{y_i(x)}.
     Indeed, for any x, setting y_1, ..., y_n as new variables,
     e^x = e^{y_1 + y_2 + ... + y_n} := f(y_1, ..., y_n).
     Then f is not a separable function of ∑_i y_i, but it is a separable function of the y_i:
     f(y_1, y_2, ..., y_n) = e^{y_1} e^{y_2} ... e^{y_n} = f_1(y_1) f_2(y_2) ... f_n(y_n).
     Conclusion: the description of the inputs matters!

  5. Separability and matrix rank
     Now what about discrete spaces? (x, y, z) → {(x_i, y_j, z_k)}_{i ∈ I, j ∈ J, k ∈ K}
     → The values of f are contained in a tensor T_{ijk} = f(x_i, y_j, z_k).
     Example: e^{x_i} is a vector of size I. Let us set x_i = i for i ∈ {0, 1, 2, 3}:
     [e^0, e^1, e^2, e^3]^T = [e^0, e^2]^T ⊗_K [e^0, e^1]^T
     where ⊗_K is the Kronecker product. Here, this means that a matricized vector of exponentials is a rank-one matrix:
     [ e^0  e^1 ]   [ e^0 ]
     [ e^2  e^3 ] = [ e^2 ] [ e^0  e^1 ]
     Setting i = j·2^1 + k·2^0, f(j, k) = e^{2j + k} is separable in (j, k).
     Conclusion: a rank-one matrix can be seen as a separable function on a grid.
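
A quick numerical check of this statement (an illustrative numpy snippet, not part of the slides): discretize e^x on the grid {0, 1, 2, 3}, reshape with the index split i = 2j + k, and verify that the result is the rank-one outer product of [e^0, e^2] and [e^0, e^1].

    import numpy as np

    # Discretize f(x) = exp(x) on the grid x_i = i, i = 0..3.
    v = np.exp(np.arange(4.0))          # [e^0, e^1, e^2, e^3]

    # Index split i = 2*j + k maps the vector to a 2x2 matrix M[j, k] = e^(2j + k).
    M = v.reshape(2, 2)

    # Separability on the grid: M is the outer product of [e^0, e^2] and [e^0, e^1].
    outer = np.outer(np.exp([0.0, 2.0]), np.exp([0.0, 1.0]))

    print(np.allclose(M, outer))        # True
    print(np.linalg.matrix_rank(M))     # 1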

  6. Tensor rank??
     We can also introduce a third-order tensor here:
     [e^0, e^1, e^2, e^3, e^4, e^5, e^6, e^7]^T = [e^0, e^4]^T ⊗_K [e^0, e^2]^T ⊗_K [e^0, e^1]^T
     By “analogy” with matrices, we say that a tensor is rank-one if it is the discretization of a separable function.
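
Continuing the same numerical illustration (again my own sketch, not from the slides): the vector [e^0, ..., e^7] reshaped into a 2×2×2 array coincides with the three-way outer product of [e^0, e^4], [e^0, e^2] and [e^0, e^1], computed here with np.einsum.

    import numpy as np

    v = np.exp(np.arange(8.0))                     # [e^0, ..., e^7]
    T = v.reshape(2, 2, 2)                         # T[i, j, k] = e^(4i + 2j + k)

    a, b, c = np.exp([0.0, 4.0]), np.exp([0.0, 2.0]), np.exp([0.0, 1.0])
    rank_one = np.einsum('i,j,k->ijk', a, b, c)    # the rank-one tensor a ⊗ b ⊗ c

    print(np.allclose(T, rank_one))                # True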

  7. From separability to matrix/tensor rank
     From now on, we identify a function f(x_i, y_j, z_k) with a three-way array T_{i,j,k}.
     Definition: rank-one tensor. A tensor T ∈ R^{I × J × K} is said to be a [decomposable] [separable] [simple] [rank-one] tensor iff there exist a ∈ R^I, b ∈ R^J, c ∈ R^K such that T_{i,j,k} = a_i b_j c_k, or equivalently T = a ⊗ b ⊗ c, where ⊗ is a multiway equivalent of the exterior product a ⊗ b = ab^T.
     What matters in practice may be to find the right description of the inputs (i.e. how to build the tensor)!
     [Diagram: a data tensor T = f(x, y, z, t, ...) written as a rank-one term a ⊗ b ⊗ c.]

  8. ALL tensor decomposition models are based on separability
     CPD:  T = ∑_{q=1}^r a_q ⊗ b_q ⊗ c_q = a_1 ⊗ b_1 ⊗ c_1 + ... + a_r ⊗ b_r ⊗ c_r
     Tucker:  T = ∑_{q_1, q_2, q_3 = 1}^{r_1, r_2, r_3} g_{q_1 q_2 q_3} a_{q_1} ⊗ b_{q_2} ⊗ c_{q_3}
     Hierarchical decompositions: for another talk, sorry :(
     Definition: tensor [CP] rank (also applies for other decompositions)
     rank(T) = min { r | T = ∑_{q=1}^r a_q ⊗ b_q ⊗ c_q }
     Tensor CP rank coincides with the “usual” matrix rank! (on board)
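
To make the two models concrete, here is a small numpy sketch (dimensions, ranks, and random factors are arbitrary choices of mine) that assembles a CP tensor and a Tucker tensor from their factors.

    import numpy as np

    I, J, K, r = 5, 6, 7, 3
    A, B, C = np.random.rand(I, r), np.random.rand(J, r), np.random.rand(K, r)

    # CPD: T = sum_q a_q ⊗ b_q ⊗ c_q, a sum of r rank-one terms.
    T_cp = np.einsum('iq,jq,kq->ijk', A, B, C)

    # Tucker: T = sum_{q1,q2,q3} g_{q1 q2 q3} a_{q1} ⊗ b_{q2} ⊗ c_{q3}, with a core G.
    r1, r2, r3 = 3, 2, 4
    A1, B2, C3 = np.random.rand(I, r1), np.random.rand(J, r2), np.random.rand(K, r3)
    G = np.random.rand(r1, r2, r3)
    T_tucker = np.einsum('pqs,ip,jq,ks->ijk', G, A1, B2, C3)

    print(T_cp.shape, T_tucker.shape)   # (5, 6, 7) (5, 6, 7)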

  9. If I were in the audience, I would be wondering:
     • Why should I care?? → I will tell you now.
     • Even if I cared, I have no idea how to know whether my data is somehow separable or a low-rank tensor!
       → I don’t know; this is the difficult part, but at least you may think about separability in the future.
       → It will probably not be exactly low rank, but it may be approximately low rank!

  10. Making use of low-rank representations
     Let A = [a_1, a_2, ..., a_r], with B and C built similarly.
     Uniqueness of the CPD: under mild conditions,
     krank(A) + krank(B) + krank(C) − 2 ≥ 2r,   (1)
     the CPD of T is essentially unique, i.e. the rank-one terms are unique. This means we can interpret the rank-one terms a_q, b_q, c_q → Source Separation!
     Compression (also true for other models): the CPD involves r(I + J + K − 2) parameters, while T contains IJK entries. If the rank is small, this means huge compression/dimensionality reduction!
     • missing value completion, denoising
     • function approximation
     • imposing sparse structure to solve other problems (PDEs, neural networks, dictionary learning...)
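
To put numbers on the compression claim (an illustrative example, not from the slides): for a tensor with I = J = K = 100 and rank r = 10, the CPD stores r(I + J + K − 2) = 10 × 298 = 2,980 parameters, versus IJK = 1,000,000 entries for the full tensor, i.e. a reduction by a factor of roughly 335.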

  11. Approximate CPD
     • Often, T ≈ ∑_{q=1}^r a_q ⊗ b_q ⊗ c_q for small r.
     • However, the generic rank (i.e. the rank of a random tensor) is very large.
     • Therefore, if T = ∑_{q=1}^r a_q ⊗ b_q ⊗ c_q + N with N some small Gaussian noise, it is approximately of rank at most r, but its exact rank is large.
     Best low-rank approximate CPD: for a given rank r, the cost function
     η(A, B, C) = || T − ∑_{q=1}^r a_q ⊗ b_q ⊗ c_q ||_F^2
     has the following properties:
     • it is infinitely differentiable;
     • it is non-convex in (A, B, C), but quadratic in A, in B and in C separately;
     • its minimum may not be attained (ill-posed problem).
     My favorite class of algorithms to solve aCPD: block-coordinate descent!
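
The slide does not spell out an algorithm, so here is a minimal illustrative sketch of such a block-coordinate descent, namely alternating least squares (ALS) on the three factors, written for a dense numpy array. The unfolding and Khatri-Rao helpers, the iteration count, and the random initialization are choices of mine, not the speaker's.

    import numpy as np

    def khatri_rao(B, C):
        # Column-wise Kronecker product of B (J x r) and C (K x r), giving a (J*K x r) matrix.
        r = B.shape[1]
        return np.einsum('jq,kq->jkq', B, C).reshape(-1, r)

    def cpd_als(T, r, n_iter=100, seed=0):
        # Rank-r approximate CPD of a 3-way array T by block-coordinate descent (ALS):
        # each block update is a linear least squares problem in one factor.
        I, J, K = T.shape
        rng = np.random.default_rng(seed)
        A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))
        T1 = T.reshape(I, J * K)                     # mode-1 unfolding
        T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)  # mode-2 unfolding
        T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)  # mode-3 unfolding
        for _ in range(n_iter):
            A = T1 @ np.linalg.pinv(khatri_rao(B, C)).T
            B = T2 @ np.linalg.pinv(khatri_rao(A, C)).T
            C = T3 @ np.linalg.pinv(khatri_rao(A, B)).T
        return A, B, C

In practice one would also monitor the relative reconstruction error || T − ∑_q a_q ⊗ b_q ⊗ c_q ||_F / || T ||_F as a stopping criterion; constraints such as nonnegativity or a dictionary (later slides) only change the individual block updates.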

  12. Example: Spectral unmixing for hyperspectral image processing
     1 Pixels can contain several materials → unmixing!
     2 Spectra and abundances are nonnegative!
     3 Few materials, many wavelengths

  13. Spectral unmixing, separability and nonnegative matrix factorization
     One material q has separable intensity: I_q(x, y, λ) = w_q(λ) h_q(x, y), where w_q is a spectrum characteristic of material q, and h_q is its abundance map.
     Therefore, for an image with r materials,
     I(x, y, λ) = ∑_{q=1}^r w_q(λ) h_q(x, y).
     This means the measurement matrix M_{i,j} = Ĩ(pixel i, λ_j) is low rank!
     Nonnegative matrix factorization:
     argmin_{W ≥ 0, H ≥ 0} || M − ∑_{q=1}^r w_q h_q^T ||_F^2
     where M_{i,j} = M([x ⊗_K y]_i, λ_j) is the vectorized hyperspectral image.
     Conclusion: I have tensor data, but a matrix model! Tensor data ≠ tensor model.
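
For concreteness, here is a minimal sketch of one standard way to attack this problem, Lee-Seung multiplicative updates for the Frobenius loss. The row/column convention (pixels × spectral bands), the variable names, and the iteration count are assumptions of mine, not taken from the talk.

    import numpy as np

    def nmf_mu(M, r, n_iter=500, eps=1e-12, seed=0):
        # NMF of a nonnegative matrix M (pixels x bands) as M ≈ H @ W.T,
        # with H >= 0 the abundance maps (pixels x r) and W >= 0 the spectra (bands x r).
        n_pix, n_bands = M.shape
        rng = np.random.default_rng(seed)
        H, W = rng.random((n_pix, r)), rng.random((n_bands, r))
        for _ in range(n_iter):
            H *= (M @ W) / (H @ (W.T @ W) + eps)     # multiplicative update keeps H >= 0
            W *= (M.T @ H) / (W @ (H.T @ H) + eps)   # multiplicative update keeps W >= 0
        return W, H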

  14. ProblemS
     1 How to deal with semi-supervised settings?
       • Dictionary-based CPD [C., Gillis 2017]
       • Multiple dictionaries [C., Gillis 2018]
     2 Blind is hard! E.g., NMF is often not identifiable.
       • Identifiability of complete dictionary learning [C., Gillis 2019]
       • Algorithms with sparse NMF [C., Gillis 2019]
     3 What about dealing with several data sets (hyper/multispectral, time data)?
       • Coupled decompositions with flexible couplings. (Maybe in further discussions)

  15. Semi-supervised Learning with LRA

  16. A boom in available resources
     Nowadays, source separation may not need to be blind!
     Hyperspectral images:
     • Toy data with ground truth: Urban, Indian Pines...
     • Massive amounts of data: AVIRIS NextGen
     • Free spectral libraries: ECOSTRESS
     How to use the power of blind methods for supervised learning?
     This talk: pre-trained dictionaries are available.
     Many other problems (TODO):
     • Test and training joint factorization.
     • Mixing matrix pre-training with domain adaptation.
     • Learning with low-rank operators.

  17. Using dictionaries guarantees interpretability
     [Figure: NMF of a hyperspectral image, showing extracted spectra plotted against spectral band index and the corresponding abundance maps in (X, Y).]
     With a dictionary D of d atoms, d ≫ r.
     Idea: impose A ≈ D(:, K) with #K = r, so that M = D(:, K) B.

  18. Sparse coding and 1-sparse coding
     1st order model (sparse coding):
     m = ∑_{q=1}^r λ_q d_{s_q} = D(:, K) λ
     for m ∈ R^m, s_q ∈ [1, d], λ_q ∈ R and d_{s_q} ∈ D, with K = {s_q, q ∈ [1, r]}.
     2nd order model (collaborative sparse coding):
     M = ∑_{q=1}^r d_{s_q} ⊗ b_q = D(:, K) B
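
The referenced papers propose dedicated algorithms; the snippet below is only a simple illustrative heuristic for the collaborative model M ≈ D(:, K) B, not the method of [C., Gillis 2017]: match each of the r leading left singular vectors of M to its most correlated dictionary atom, then refit the right factor by least squares on the selected atoms.

    import numpy as np

    def collaborative_one_sparse(M, D, r):
        # Heuristic fit of M ≈ D[:, K] @ B with #K = r, given a dictionary D (m x d).
        # Normalize atoms so that correlations are comparable.
        Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
        # Leading r left singular vectors of the data matrix.
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        U = U[:, :r]
        # Match each singular vector to its most correlated atom (no repetition enforced here).
        K = [int(np.argmax(np.abs(Dn.T @ U[:, q]))) for q in range(r)]
        # Refit the right factor by least squares on the selected atoms.
        B, *_ = np.linalg.lstsq(D[:, K], M, rcond=None)
        return K, B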

  19. Tensor sparse coding
     Tensor 1-sparse coding [C., Gillis 17, 18]:
     T = ∑_{q=1}^r d_{s_q} ⊗ b_q ⊗ c_q
     • Generalizes easily to any order.
     • Alternating algorithms can be adapted easily. Low memory requirement.
     • Can be adapted for multiple atom selection (future work).
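
As a hedged illustration of the claim that alternating algorithms adapt easily, one can take the cpd_als sketch given earlier and replace the A-update by an atom-selection step, snapping each column of the unconstrained update to its most correlated dictionary atom. This is a simplification for illustration, not the exact algorithm of [C., Gillis 17, 18]; it reuses the khatri_rao helper and unfoldings from the ALS sketch above.

    import numpy as np

    def dictionary_cpd_als(T, D, r, n_iter=100, seed=0):
        # Rank-r CPD of T with the first factor constrained to columns of the dictionary D (I x d).
        # Requires khatri_rao as defined in the earlier ALS sketch.
        I, J, K = T.shape
        Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
        rng = np.random.default_rng(seed)
        B, C = rng.random((J, r)), rng.random((K, r))
        T1 = T.reshape(I, J * K)
        T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)
        T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)
        A = D[:, :r].copy()                                  # initial guess, overwritten below
        for _ in range(n_iter):
            # Unconstrained least squares update for A, then snap each column to an atom;
            # the lost column scalings are absorbed by the B and C updates.
            A_free = T1 @ np.linalg.pinv(khatri_rao(B, C)).T
            idx = np.argmax(np.abs(Dn.T @ A_free), axis=0)   # best atom per column
            A = D[:, idx]
            B = T2 @ np.linalg.pinv(khatri_rao(A, C)).T
            C = T3 @ np.linalg.pinv(khatri_rao(A, B)).T
        return idx, A, B, C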
