Learning with Low Rank Approximations
or how to use near separability to extract content from structured data
Jeremy E. Cohen, IRISA, INRIA, CNRS, University of Rennes, France
30 April 2019
1. Introduction: separability and matrix/tensor rank
2. Semi-supervised learning: dictionary-based matrix and tensor factorization
3. Complete dictionary learning for blind source separation
4. Joint factorization models: some facts, and the linearly coupled case
Separability: a fundamental property

Definition: Separability
Let f : R^{m_1} × R^{m_2} × R^{m_3} → R, m_i ∈ N. The map f is said to be separable if there exist real maps f_1, f_2, f_3 such that
f(x, y, z) = f_1(x) f_2(y) f_3(z).
Of course, any order (i.e. number of variables) is fine.

Examples:
(xyz)^n = x^n y^n z^n,    e^{x+y} = e^x e^y,    ∫_x ∫_y h(x) g(y) dx dy = (∫_x h(x) dx) (∫_y g(y) dy)

Some usual functions are not separable, but can be written as a sum of a few separable ones!
• cos(a + b) = cos(a) cos(b) − sin(a) sin(b)
• log(xy) = log(x)·1_{y ∈ R_+} + 1_{x ∈ R_+}·log(y)
Some tricks on separability

A fun case: the exponential can be seen as separable for any given order n.
Let y_1(x), y_2(x), ..., y_n(x) be such that x = Σ_{i=1}^n y_i(x) for all x ∈ R. Then
e^x = Π_{i=1}^n e^{y_i(x)}.
Indeed, for any x, setting y_1, ..., y_n as new variables,
e^x = e^{y_1 + y_2 + ... + y_n} := f(y_1, ..., y_n).
Then f is not a separable function of Σ_i y_i, but it is a separable function of (y_1, ..., y_n):
f(y_1, y_2, ..., y_n) = e^{y_1} e^{y_2} ... e^{y_n} = f_1(y_1) f_2(y_2) ... f_n(y_n).

Conclusion: the description of the inputs matters!
Separability and matrix rank

Now what about discrete spaces? (x, y, z) → {(x_i, y_j, z_k)}_{i ∈ I, j ∈ J, k ∈ K}
→ The values of f are contained in a tensor T_{ijk} = f(x_i, y_j, z_k).

Example: (e^{x_i})_i is a vector of size I. Let us set x_i = i for i ∈ {0, 1, 2, 3}. Then
(e^0, e^2)^T ⊗_K (e^0, e^1)^T = (e^0 e^0, e^0 e^1, e^2 e^0, e^2 e^1)^T = (e^0, e^1, e^2, e^3)^T.
Here, this means that the matricized vector of exponentials is a rank-one matrix:

[ e^0  e^1 ]   [ e^0 ]
[ e^2  e^3 ] = [ e^2 ] [ e^0  e^1 ]

Setting i = j·2^1 + k·2^0, f(j, k) = e^{2j+k} is separable in (j, k).

Conclusion: a rank-one matrix can be seen as a separable function on a grid.
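A quick numpy check of this slide's example (a minimal sketch; variable names are mine):

```python
import numpy as np

# Values of f(i) = e^i on the grid i = 0, ..., 3, reshaped with i = 2*j + k.
v = np.exp(np.arange(4))                 # [e^0, e^1, e^2, e^3]
M = v.reshape(2, 2)                      # rows indexed by j, columns by k

# e^{2j + k} = e^{2j} * e^{k}: the matricized vector is an outer product, hence rank one.
outer = np.outer(np.exp(2 * np.arange(2)), np.exp(np.arange(2)))
print(np.allclose(M, outer))             # True
print(np.linalg.matrix_rank(M))          # 1
```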
Tensor rank??

We can also introduce a third-order tensor here:
(e^0, e^4)^T ⊗_K (e^0, e^2)^T ⊗_K (e^0, e^1)^T
  = (e^0 e^0 e^0, e^0 e^0 e^1, e^0 e^2 e^0, e^0 e^2 e^1, e^4 e^0 e^0, e^4 e^0 e^1, e^4 e^2 e^0, e^4 e^2 e^1)^T
  = (e^0, e^1, e^2, e^3, e^4, e^5, e^6, e^7)^T

By “analogy” with matrices, we say that a tensor is rank-one if it is the discretization of a separable function.
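The same verification for the third-order construction, again a minimal sketch with names of my choosing:

```python
import numpy as np

# Kronecker product of the three 2-vectors recovers (e^0, ..., e^7):
# the index i = 4*i1 + 2*i2 + i3 is split into its binary digits.
v4, v2, v1 = np.exp([0, 4]), np.exp([0, 2]), np.exp([0, 1])
print(np.allclose(np.kron(np.kron(v4, v2), v1), np.exp(np.arange(8))))  # True
```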
From separability to matrix/tensor rank

From now on, we identify a function f(x_i, y_j, z_k) with a three-way array T_{i,j,k}.

Definition: rank-one tensor
A tensor T ∈ R^{I×J×K} is said to be a [decomposable] [separable] [simple] [rank-one] tensor iff there exist a ∈ R^I, b ∈ R^J, c ∈ R^K such that
T_{i,j,k} = a_i b_j c_k,   or equivalently   T = a ⊗ b ⊗ c,
where ⊗ is a multiway equivalent of the outer product a ⊗ b = ab^T.

What matters in practice may be to find the right description of the inputs (i.e. how to build the tensor)!
ALL tensor decomposition models are based on separability

CPD:
T = Σ_{q=1}^r a_q ⊗ b_q ⊗ c_q = a_1 ⊗ b_1 ⊗ c_1 + ... + a_r ⊗ b_r ⊗ c_r

Tucker:
T = Σ_{q_1, q_2, q_3 = 1}^{r_1, r_2, r_3} g_{q_1 q_2 q_3} a_{q_1} ⊗ b_{q_2} ⊗ c_{q_3}

Hierarchical decompositions: for another talk, sorry :(

Definition: tensor [CP] rank (also applies to other decompositions)
rank(T) = min{ r | T = Σ_{q=1}^r a_q ⊗ b_q ⊗ c_q }

Tensor CP rank coincides with the “usual” matrix rank! (on board)
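To make the two models concrete, here is how a CPD and a Tucker tensor can be assembled from their factors in numpy. This is a minimal sketch with made-up sizes; the last line checks that a CPD is a Tucker model with a superdiagonal core:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, r = 4, 5, 6, 3
A = rng.standard_normal((I, r))
B = rng.standard_normal((J, r))
C = rng.standard_normal((K, r))

# CPD: T = sum_q a_q ⊗ b_q ⊗ c_q, a sum of r rank-one (separable) terms.
T_cpd = np.einsum('iq,jq,kq->ijk', A, B, C)

# Tucker: T = sum over (q1, q2, q3) of g[q1, q2, q3] a_{q1} ⊗ b_{q2} ⊗ c_{q3}.
G = rng.standard_normal((r, r, r))
T_tucker = np.einsum('pqs,ip,jq,ks->ijk', G, A, B, C)

# A CPD is a Tucker model whose core is superdiagonal.
G_diag = np.zeros((r, r, r))
G_diag[np.arange(r), np.arange(r), np.arange(r)] = 1.0
print(np.allclose(T_cpd, np.einsum('pqs,ip,jq,ks->ijk', G_diag, A, B, C)))  # True
```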
If I were in the audience, I would be wondering:
• Why should I care?
  → I will tell you now.
• Even if I cared, I have no idea how to know whether my data is somehow separable or a low-rank tensor!
  → I don't know; this is the difficult part, but at least you may think about separability in the future.
  → It will probably not be exactly low rank, but it may be approximately low rank!
Making use of low-rank representations

Let A = [a_1, a_2, ..., a_r], with B and C built similarly.

Uniqueness of the CPD
Under mild conditions,
krank(A) + krank(B) + krank(C) − 2 ≥ 2r,   (1)
the CPD of T is essentially unique, i.e. the rank-one terms are unique.
This means we can interpret the rank-one terms a_q, b_q, c_q → Source Separation!

Compression (also true for other models)
The CPD involves r(I + J + K − 2) parameters, while T contains IJK entries.
If the rank is small, this means huge compression/dimensionality reduction!
• missing value completion, denoising
• function approximation
• imposing sparse structure to solve other problems (PDEs, neural networks, dictionary learning...)
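A back-of-the-envelope illustration of the compression claim, with sizes picked arbitrarily:

```python
# Storage for a rank-r CPD of an I x J x K tensor vs. the dense tensor.
# (The -2 accounts for the scaling indeterminacies of each rank-one term.)
I, J, K, r = 100, 100, 100, 5
cpd_params = r * (I + J + K - 2)        # 1490
dense_entries = I * J * K               # 1_000_000
print(dense_entries / cpd_params)       # ~671x fewer parameters
```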
Approximate CPD

• Often, T ≈ Σ_{q=1}^r a_q ⊗ b_q ⊗ c_q for small r.
• However, the generic rank (i.e. the rank of a random tensor) is very large.
• Therefore, if T = Σ_{q=1}^r a_q ⊗ b_q ⊗ c_q + N with N some small Gaussian noise, T has approximate rank at most r but its exact rank is large.

Best low-rank approximate CPD
For a given rank r, the cost function
η(A, B, C) = ‖ T − Σ_{q=1}^r a_q ⊗ b_q ⊗ c_q ‖_F^2
has the following properties:
• it is infinitely differentiable;
• it is non-convex in (A, B, C), but quadratic in each of A, B and C separately;
• its minimum may not be attained (ill-posed problem).

My favorite class of algorithms to solve aCPD: block-coordinate descent!
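As an illustration of block-coordinate descent on this cost, here is a bare-bones alternating least squares (ALS) sketch in numpy. It is only didactic (no normalization, no stopping criterion, random initialization) and not the algorithms discussed in the talk; all names and sizes are mine:

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; rows are indexed by (row of U, row of V) pairs."""
    r = U.shape[1]
    return np.einsum('uq,vq->uvq', U, V).reshape(-1, r)

def cpd_als(T, r, n_iter=100, seed=0):
    """Plain alternating least squares for an approximate rank-r CPD of a 3-way array."""
    I, J, K = T.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, r)) for n in (I, J, K))
    for _ in range(n_iter):
        # Each block update is a linear least squares problem: the cost is quadratic in each factor.
        A = np.linalg.lstsq(khatri_rao(B, C), T.reshape(I, -1).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), np.moveaxis(T, 1, 0).reshape(J, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), np.moveaxis(T, 2, 0).reshape(K, -1).T, rcond=None)[0].T
    return A, B, C

# Sanity check on a noiseless rank-3 tensor (sizes are arbitrary).
rng = np.random.default_rng(1)
A0, B0, C0 = rng.standard_normal((6, 3)), rng.standard_normal((7, 3)), rng.standard_normal((8, 3))
T = np.einsum('iq,jq,kq->ijk', A0, B0, C0)
A, B, C = cpd_als(T, r=3)
print(np.linalg.norm(T - np.einsum('iq,jq,kq->ijk', A, B, C)) / np.linalg.norm(T))  # typically tiny
```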
Example: spectral unmixing for hyperspectral image processing
1. Pixels can contain several materials → unmixing!
2. Spectra and abundances are nonnegative!
3. Few materials, many wavelengths.
Spectral unmixing, separability and nonnegative matrix factorization

One material q has a separable intensity:
I_q(x, y, λ) = w_q(λ) h_q(x, y),
where w_q is the spectrum characteristic of material q, and h_q is its abundance map.
Therefore, for an image with r materials,
I(x, y, λ) = Σ_{q=1}^r w_q(λ) h_q(x, y).
This means the measurement matrix M_{i,j} = Ĩ(λ_i, pixel_j) is low rank!

Nonnegative matrix factorization
argmin_{W ≥ 0, H ≥ 0} ‖ M − Σ_{q=1}^r w_q h_q^T ‖_F^2
where M_{i,j} = Ĩ(λ_i, [x ⊗_K y]_j) is the matrix containing the vectorized hyperspectral image.

Conclusion: I have tensor data, but a matrix model! Tensor data ≠ tensor model.
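For concreteness, a standard multiplicative-update NMF baseline (Lee and Seung) on synthetic data. This is not the algorithm advocated in the talk, just a way to see the optimization problem run; M is oriented wavelengths × pixels and all names and sizes are made up:

```python
import numpy as np

def nmf_mu(M, r, n_iter=500, eps=1e-12, seed=0):
    """Multiplicative-update NMF (Lee & Seung) for min ||M - W H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic "hyperspectral" data: 200 wavelengths x 1000 pixels mixing 3 materials.
rng = np.random.default_rng(1)
W_true = rng.random((200, 3))      # nonnegative spectra
H_true = rng.random((3, 1000))     # nonnegative, vectorized abundance maps
M = W_true @ H_true
W, H = nmf_mu(M, r=3)
print(np.linalg.norm(M - W @ H) / np.linalg.norm(M))  # typically small
```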
ProblemS

1. How to deal with semi-supervised settings?
   • Dictionary-based CPD [C., Gillis 2017]
   • Multiple dictionaries [C., Gillis 2018]
2. Blind is hard! E.g., NMF is often not identifiable.
   • Identifiability of complete dictionary learning [C., Gillis 2019]
   • Algorithms with sparse NMF [C., Gillis 2019]
3. What about dealing with several data sets (hyperspectral/multispectral, time data)?
   • Coupled decompositions with flexible couplings. (Maybe in further discussions.)
Semi-supervised Learning with LRA
A boom in available resources

Nowadays, source separation may not need to be blind!

Hyperspectral images:
• Toy data with ground truth: Urban, Indian Pines...
• Massive amount of data: AVIRIS NextGen
• Free spectral libraries: ECOSTRESS

How to use the power of blind methods for supervised learning?

This talk: pre-trained dictionaries are available.

Many other problems (TODO):
• Test and training joint factorization.
• Mixing matrix pre-training with domain adaptation.
• Learning with low-rank operators.
Using dictionaries guarantees interpretability

[Figure: NMF of a hyperspectral image, showing estimated spectra against the spectral band index and abundance maps over pixel coordinates (X, Y).]

Dictionary D has r_d ≫ r atoms?
Idea: impose A ≈ D(:, K), with #K = r, so that M = D(:, K) B.
Sparse coding and 1-sparse coding

1st-order model (sparse coding):
m = Σ_{q=1}^r λ_q d_{s_q} = D(:, K) λ = D λ̃
for m ∈ R^m, s_q ∈ [1, d], λ_q ∈ R and d_{s_q} ∈ D, with K = {s_q, q ∈ [1, r]}
(λ̃ is λ zero-padded to the full dictionary size).

2nd-order model (collaborative sparse coding):
M = Σ_{q=1}^r d_{s_q} ⊗ b_q = D(:, K) B = D B̃
(B̃ is B zero-padded to the full dictionary size).
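A naive alternating scheme illustrating the second-order model, written in numpy. It is only a sketch of the idea (it ignores nonnegativity, may select duplicate atoms, and is not the algorithm of [C., Gillis 2017, 2018]); all names and sizes are made up:

```python
import numpy as np

def dictionary_factorization(M, D, r, n_iter=20, seed=0):
    """Naive alternating fit of M ≈ D[:, K] @ B with #K = r (each component uses one atom).
    Only an illustration of the model, not the algorithm from the cited papers."""
    rng = np.random.default_rng(seed)
    K = rng.choice(D.shape[1], size=r, replace=False)        # initial atom selection
    B = np.linalg.lstsq(D[:, K], M, rcond=None)[0]
    for _ in range(n_iter):
        for q in range(r):
            # Residual once all other rank-one terms d_{s_p} b_p^T are removed.
            R = M - np.delete(D[:, K], q, axis=1) @ np.delete(B, q, axis=0)
            # 1-sparse coding step: the best single atom maximizes ||R^T d||^2 / ||d||^2.
            scores = np.sum((R.T @ D) ** 2, axis=0) / np.sum(D ** 2, axis=0)
            K[q] = int(np.argmax(scores))
            B[q] = (R.T @ D[:, K[q]]) / np.sum(D[:, K[q]] ** 2)
    return K, B

# Toy check: data built from 3 atoms of a random 50-atom dictionary.
rng = np.random.default_rng(1)
D = rng.standard_normal((100, 50))
K_true = [3, 17, 42]
M = D[:, K_true] @ rng.standard_normal((3, 200))
K_est, B = dictionary_factorization(M, D, r=3)
print(np.sort(K_est))                                           # typically [ 3 17 42]
print(np.linalg.norm(M - D[:, K_est] @ B) / np.linalg.norm(M))  # typically near zero
```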
Tensor sparse coding

Tensor 1-sparse coding [C., Gillis 17, 18]:
T = Σ_{q=1}^r d_{s_q} ⊗ b_q ⊗ c_q
• Generalizes easily to any order.
• Alternating algorithms can be adapted easily; low memory requirement.
• Can be adapted for multiple atom selection (future work).