Machine learning meets super-resolution
H. N. Mhaskar, Claremont Graduate University, Claremont
Inverse Problems and Machine Learning, February 10, 2018
Goals
The problem of super-resolution is dual to the problem of machine learning, viewed as function approximation.
◮ How to measure the accuracy
◮ How to ensure lower bounds
◮ Common tools
We will illustrate the ideas on the (hyper-)sphere $S^q \subset \mathbb{R}^{q+1}$.
1. Machine learning
Machine learning on $S^q$
Given data (training data) of the form $D = \{(x_j, y_j)\}_{j=1}^M$, where $x_j \in S^q$, $y_j \in \mathbb{R}$, find a function
$$x \mapsto \sum_{k=1}^N a_k G(x \cdot z_k)$$
◮ that models the data well;
◮ in particular, $\sum_{k=1}^N a_k G(x_j \cdot z_k) \approx y_j$.
Tacit assumption: there exists an underlying function $f$ such that $y_j = f(x_j) + \text{noise}$.
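A minimal sketch (my own illustration, not from the talk) of fitting such a network by least squares on synthetic data; the Gaussian-type zonal kernel G(t) = exp(-5(1 - t)), the randomly chosen centers z_k, and the test function f are all assumptions made only for this example.

    import numpy as np

    rng = np.random.default_rng(0)
    q = 2                                     # work on S^2 in R^3
    M, N = 500, 60                            # number of data points, network size

    def random_sphere(n, d):
        """n points drawn uniformly on the unit sphere in R^d."""
        v = rng.standard_normal((n, d))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    X = random_sphere(M, q + 1)               # training inputs x_j
    Z = random_sphere(N, q + 1)               # centers z_k (here simply random)
    f = lambda x: np.exp(x[:, 0] + x[:, 1] + x[:, 2])      # underlying function
    y = f(X) + 0.01 * rng.standard_normal(M)                # y_j = f(x_j) + noise

    G = lambda t: np.exp(-5.0 * (1.0 - t))    # assumed zonal kernel G(x . z)
    A = G(X @ Z.T)                            # design matrix A[j, k] = G(x_j . z_k)
    a, *_ = np.linalg.lstsq(A, y, rcond=None) # coefficients a_k

    X_test = random_sphere(1000, q + 1)
    err = np.max(np.abs(G(X_test @ Z.T) @ a - f(X_test)))
    print(f"max test error of the fitted network: {err:.3e}")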
ReLU networks
A ReLU network is a function of the form
$$x \mapsto \sum_{k=1}^N a_k\, |w_k \cdot x + b_k|$$
(recall $|t| = \mathrm{ReLU}(t) + \mathrm{ReLU}(-t)$, so networks with this activation and ReLU networks span the same spaces). Writing
$$w_k \cdot x + b_k = (w_k, b_k) \cdot (x, 1) = \sqrt{(|w_k|^2 + b_k^2)(|x|^2 + 1)}\;\; \frac{(w_k, b_k)}{\sqrt{|w_k|^2 + b_k^2}} \cdot \frac{(x, 1)}{\sqrt{|x|^2 + 1}},$$
the arguments become inner products of points on the sphere:
approximation on Euclidean space $\Longleftrightarrow$ approximation on the sphere.
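A tiny numerical check (my own, not from the slides) of this lifting identity in $\mathbb{R}^3$:

    import numpy as np

    rng = np.random.default_rng(1)
    w, x = rng.standard_normal(3), rng.standard_normal(3)
    b = rng.standard_normal()

    lhs = w @ x + b                           # Euclidean affine argument

    u = np.append(w, b)                       # lift (w, b) to R^4
    v = np.append(x, 1.0)                     # lift (x, 1) to R^4
    u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)   # points on S^3

    rhs = np.linalg.norm(u) * np.linalg.norm(v) * (u_hat @ v_hat)
    print(abs(lhs - rhs))                     # ~1e-16: the two expressions agree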
Notation on the sphere
$S^q := \{x = (x_1, \ldots, x_{q+1}) : \sum_{k=1}^{q+1} x_k^2 = 1\}$,
$\omega_q$ = Riemannian volume of $S^q$,
$\rho(x, y)$ = geodesic distance between $x$ and $y$.
$\Pi_n^q$ = class of all spherical polynomials of degree at most $n$.
$H_\ell^q$ = class of all homogeneous harmonic polynomials of degree $\ell$,
$d_\ell^q$ = the dimension of $H_\ell^q$,
$\{Y_{\ell,k}\}$ = orthonormal basis for $H_\ell^q$.
$\Delta$ = negative Laplace–Beltrami operator; $\Delta Y_{\ell,k} = \ell(\ell + q - 1)\, Y_{\ell,k} = \lambda_\ell^2\, Y_{\ell,k}$.
Notation on the sphere
With $p_\ell = p_\ell^{(q/2-1,\, q/2-1)}$ (Jacobi polynomial),
$$\sum_{k=1}^{d_\ell^q} Y_{\ell,k}(x)\, Y_{\ell,k}(y) = \omega_{q-1}^{-1}\, p_\ell(1)\, p_\ell(x \cdot y).$$
If $G : [-1, 1] \to \mathbb{R}$,
$$G(x \cdot y) = \sum_{\ell=0}^{\infty} \hat{G}(\ell) \sum_{k=1}^{d_\ell^q} Y_{\ell,k}(x)\, Y_{\ell,k}(y).$$
For a measure $\mu$ on $S^q$,
$$\hat{\mu}(\ell, k) = \int_{S^q} Y_{\ell,k}(y)\, d\mu(y).$$
Notation on the sphere
$$\Phi_n(t) = \omega_{q-1}^{-1} \sum_{\ell=0}^{n} h\!\left(\frac{\lambda_\ell}{n}\right) p_\ell(1)\, p_\ell(t).$$
$$\sigma_n(\mu)(x) = \int_{S^q} \Phi_n(x \cdot y)\, d\mu(y) = \sum_{\ell=0}^{n} h\!\left(\frac{\lambda_\ell}{n}\right) \sum_{k=1}^{d_\ell^q} \hat{\mu}(\ell, k)\, Y_{\ell,k}(x).$$
[Figure: graph of the low-pass filter $h$ on $[0, 1]$.]
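A sketch (my own, not from the talk) of $\Phi_n$ in the special case $q = 2$, where the addition formula reduces to $\sum_k Y_{\ell,k}(x) Y_{\ell,k}(y) = \frac{2\ell+1}{4\pi} P_\ell(x \cdot y)$ with $P_\ell$ the Legendre polynomial; the particular smooth cutoff $h$ (equal to 1 on [0, 1/2] and 0 on [1, ∞)) is just one admissible choice.

    import numpy as np
    from scipy.special import eval_legendre

    def psi(s):
        """Smooth function vanishing for s <= 0."""
        s = np.asarray(s, dtype=float)
        return np.where(s > 0.0, np.exp(-1.0 / np.maximum(s, 1e-12)), 0.0)

    def h(t):
        """Smooth cutoff: 1 on [0, 1/2], 0 on [1, infinity)."""
        s = 2.0 * np.asarray(t, dtype=float) - 1.0
        return psi(1.0 - s) / (psi(1.0 - s) + psi(s))

    def Phi(n, t):
        """Localized kernel Phi_n(t) on S^2, evaluated at t = x . y in [-1, 1]."""
        ells = np.arange(n + 1)
        lam = np.sqrt(ells * (ells + 1.0))              # lambda_ell
        coeff = h(lam / n) * (2 * ells + 1) / (4 * np.pi)
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return sum(c * eval_legendre(int(l), t) for l, c in zip(ells, coeff))

    # The kernel concentrates near t = 1 (i.e., near x = y) as n grows:
    for n in (8, 16, 32):
        print(n, Phi(n, np.array([1.0, 0.0])))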
Notation on the sphere
Localization (Mh., 2004): if $S > q$ and $h$ is sufficiently smooth,
$$|\Phi_n(x \cdot y)| \le c(h, S)\, \frac{n^q}{\max\bigl(1, (n\,\rho(x, y))^S\bigr)}.$$
Polynomial approximation (Mh., 2004)
$$E_n(f) = \min_{P \in \Pi_n^q} \|f - P\|_\infty, \qquad W_r = \{f \in C(S^q) : E_n(f) = O(n^{-r})\}.$$
Theorem. The following are equivalent:
1. $f \in W_r$
2. $\|f - \sigma_n(f)\| = O(n^{-r})$
3. $\|\sigma_{2^n}(f) - \sigma_{2^{n-1}}(f)\| = O(2^{-nr})$ (Littlewood–Paley type expansion)
Data-based approximation
For $\mathcal{C} = \{x_j\} \subset S^q$, $D = \{(x_j, y_j)\}_{j=1}^M$:
1. Find $N$ and $w_j \in \mathbb{R}$ such that
$$\sum_{j=1}^M w_j P(x_j) = \int_{S^q} P(x)\, dx, \qquad P \in \Pi_{2N}^q,$$
and
$$\sum_{j=1}^M |w_j P(x_j)| \le c \int_{S^q} |P(x)|\, dx, \qquad P \in \Pi_{2N}^q.$$
This is done by least squares or least residual solutions, to ensure a good condition number.
2. Set
$$S_N(D)(x) = \sum_{j=1}^M w_j\, y_j\, \Phi_N(x \cdot x_j).$$
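A rough sketch (my own, not the paper's implementation) of step 1 for $q = 2$: impose the moment conditions against scipy's complex spherical harmonics of degree at most $2N$ and take the minimum-norm least-squares solution. The random point set and the choice $N = 8$ are only for illustration; step 2 would then combine these weights with the kernel $\Phi_N$ from the earlier sketch.

    import numpy as np
    from scipy.special import sph_harm      # sph_harm(m, l, azimuth, colatitude)

    rng = np.random.default_rng(2)
    M, N = 600, 8                            # data points, target degree

    X = rng.standard_normal((M, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # random points on S^2
    theta = np.arctan2(X[:, 1], X[:, 0])               # azimuth
    phi = np.arccos(X[:, 2])                           # colatitude

    # Moment conditions: sum_j w_j Y_{l,m}(x_j) = int_{S^2} Y_{l,m} dx,
    # which is sqrt(4 pi) for (l, m) = (0, 0) and 0 otherwise, for all l <= 2N.
    rows, rhs = [], []
    for l in range(2 * N + 1):
        for m in range(-l, l + 1):
            rows.append(sph_harm(m, l, theta, phi))
            rhs.append(np.sqrt(4 * np.pi) if l == 0 else 0.0)
    A, b = np.array(rows), np.array(rhs, dtype=complex)

    # Minimum-norm solution of the (typically underdetermined) linear system.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    w = w.real                               # imaginary parts are numerically negligible

    # Sanity check: the rule integrates a nonconstant harmonic to (nearly) zero.
    print(abs(np.sum(w * sph_harm(2, 3, theta, phi))))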
Data-based approximation (Le Gia, Mh., 2008)
If $\{x_j\}_{j=1}^M$ are drawn from the uniform measure $\mu_q$ and $f \in W_r$, then with high probability,
$$\|f - S_N(D)\|_\infty \lesssim M^{-r/(2r+q)}.$$
If $f$ is only locally in $W_r$, then the result holds locally as well; i.e., the accuracy of approximation adapts to the local smoothness.
Examples
$$f(x, y, z) = [0.01 - (x^2 + y^2 + (z - 1)^2)]_+ + \exp(x + y + z)$$
[Figure: percentages of error less than $10^x$; least squares, $\sigma_{63}(h_1)$, $\sigma_{63}(h_5)$.]
Examples
$$f(x, y, z) = (x - 0.9)_+^{3/4} + (z - 0.9)_+^{3/4}$$
[Figure: percentages of error less than $10^x$; least squares, $\sigma_{63}(h_1)$, $\sigma_{63}(h_5)$.]
Examples
East–west component of the Earth's magnetic field. Original data on left (courtesy Dr. Thorsten Maier), reconstruction with $\sigma_{46}(h_7)$ on right.
ZF networks
Let $\hat{G}(\ell) \sim \ell^{-\beta}$, $\beta > q$, and let $\{\mathcal{C}_m\}$ be a nested sequence of finite subsets of $S^q$ with
$$\delta(\mathcal{C}_m) = \max_{x \in S^q} \min_{z \in \mathcal{C}_m} \rho(x, z) \sim \eta(\mathcal{C}_m) = \min_{z_1 \ne z_2 \in \mathcal{C}_m} \rho(z_1, z_2) \ge 1/m.$$
$$G(\mathcal{C}_m) = \mathrm{span}\{G(\circ \cdot z) : z \in \mathcal{C}_m\}.$$
ZF networks (Mh., 2010)
Theorem. Let $0 < r < \beta - q$. Then $f \in W_r$ if and only if
$$\mathrm{dist}(f, G(\mathcal{C}_m)) = O(m^{-r}).$$
Remark. Because this is an equivalence, the theorem yields lower bounds on the approximation error for individual functions.
One problem
The $x_j$'s may not be distributed according to $\mu_q$; their distribution is unknown.
Drusen classification
◮ AMD (age-related macular degeneration) is the most common cause of blindness among the elderly in the western world.
◮ AMD → RPE (retinal pigment epithelium) → drusen accumulation of different kinds.
Problem: automated quantitative prediction of disease progression, based on drusen classification.
Drusen classification (Ehler, Filbir, Mh., 2012)
We used 24 images (400 × 400 pixels each) of each patient, taken at different frequencies. By preprocessing these images at each pixel, we obtained a data set of 160,000 points on a sphere in a 5-dimensional Euclidean space. We used about 1,600 of these as a training set and classified the drusen into 4 classes. While current practice is based on spatial appearance, our method is based on multi-spectral information.
Drusen classification
2. Super-resolution
Problem statement
Given observations of the form
$$\sum_{m=1}^{L} a_m \exp(-i j x_m) + \text{noise}, \qquad |j| \le N,$$
determine $L$, the $a_m$'s, and the $x_m$'s.
Related problems:
Hidden periodicities (Lanczos)
Direction finding (Krim, Pillai, ...)
Singularity detection (Eckhoff, Gelb, Tadmor, Tanner, Mh., Prestin, Batenkov, ...)
Parameter estimation (Potts, Tasche, Filbir, Mh., Prestin, ...)
Blind source signal separation (Flandrin, Daubechies, Wu, Chui, Mh., ...)
A simple observation
If $\Phi_N$ is a highly localized kernel (Mh.–Prestin, 1998), then
$$\sum_{m=1}^{L} a_m \Phi_N(x - x_m) \approx \sum_{m=1}^{L} a_m \delta_{x_m}.$$
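A minimal sketch of this observation on the circle (my own illustration; the cosine-squared cutoff $h$, the atoms, and the noise level are assumptions made for the example): from the noisy trigonometric moments one forms $\sum_{|j| \le N} h(|j|/N)\, \hat\mu(j)\, e^{ijx}$, a sum of localized bumps whose peaks sit near the atoms $x_m$.

    import numpy as np

    rng = np.random.default_rng(3)
    x_true = np.array([1.0, 1.7, 4.0])        # atom locations in [0, 2*pi)
    a_true = np.array([1.0, 0.6, -0.8])       # atom weights
    N = 64                                     # highest available moment

    # noisy moments mu_hat(j) = sum_m a_m exp(-i j x_m) + noise, |j| <= N
    j = np.arange(-N, N + 1)
    mu_hat = (a_true[None, :] * np.exp(-1j * np.outer(j, x_true))).sum(axis=1)
    mu_hat += 0.01 * (rng.standard_normal(j.size) + 1j * rng.standard_normal(j.size))

    def h(t):                                  # low-pass filter: 1 on [0, 1/2], taper to 0 at 1
        t = np.abs(t)
        return np.where(t <= 0.5, 1.0,
               np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))

    # sigma_N(mu)(x) = sum_j h(|j|/N) mu_hat(j) e^{i j x}: localized bumps at the x_m
    x = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
    sigma = ((h(j / N) * mu_hat)[:, None] * np.exp(1j * np.outer(j, x))).sum(axis=0)
    sigma = sigma.real                         # mu is real, so the imaginary part is just noise

    # locations of the three largest local maxima of |sigma|: close to the true x_m
    mag = np.abs(sigma)
    peaks = np.where((mag >= np.roll(mag, 1)) & (mag >= np.roll(mag, -1)))[0]
    top = peaks[np.argsort(mag[peaks])[-3:]]
    print(np.sort(x[top]))                     # roughly [1.0, 1.7, 4.0]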
A simple observation
Original signal:
$$f(t) = \cos(2\pi t) + \cos(2\pi(0.96)t) + \cos(2\pi(0.92)t) + \cos(2\pi(0.9)t) + \text{noise}$$
A simple observation
Original signal:
$$f(t) = \cos(2\pi t) + \cos(2\pi(0.96)t) + \cos(2\pi(0.92)t) + \cos(2\pi(0.9)t) + \text{noise}$$
[Figure: frequencies obtained by our method (Chui, Mh., van der Walt, 2015).]
Super-resolution
Question: How large should $N$ be?
Answer: With $\eta = \min_{j \ne k} |x_j - x_k|$, we need $N \ge c\,\eta^{-1}$.
Super-resolution (Donoho, Candès, Fernández-Granda): how can we solve this problem with $N \ll \eta^{-1}$?
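A small numerical illustration (mine, on the circle, reusing the setup of the earlier sketch) of this resolution limit: two unit atoms at separation $\eta = 0.05$ merge into a single bump of the filtered kernel sum when $N \ll \eta^{-1}$, and separate once $N$ is well above $\eta^{-1}$.

    import numpy as np

    eta = 0.05                                             # minimal separation
    x_true = np.array([3.0, 3.0 + eta])                    # two close atoms
    a_true = np.array([1.0, 1.0])

    def h(t):                                              # low-pass filter
        t = np.abs(t)
        return np.where(t <= 0.5, 1.0,
               np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))

    x = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
    for N in (20, 200):
        j = np.arange(-N, N + 1)
        mu_hat = (a_true[None, :] * np.exp(-1j * np.outer(j, x_true))).sum(axis=1)
        s = ((h(j / N) * mu_hat)[:, None] * np.exp(1j * np.outer(j, x))).sum(axis=0).real
        big = (s > np.roll(s, 1)) & (s > np.roll(s, -1)) & (s > 0.5 * s.max())
        # expected: one merged bump at N = 20, two separate peaks at N = 200
        print(N, np.round(x[big], 3))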
Spherical variant
Given
$$\sum_{m=1}^{L} a_m Y_{\ell,k}(x_m) + \text{noise}, \qquad k = 1, \ldots, d_\ell^q, \quad 0 \le \ell \le N,$$
determine $L$, the $a_m$'s, and the $x_m$'s.
Observation: with $\mu^* = \sum_{m=1}^{L} a_m \delta_{x_m}$,
$$\hat{\mu}^*(\ell, k) = \sum_{m=1}^{L} a_m Y_{\ell,k}(x_m).$$
Super-duper-resolution
Given
$$\hat{\mu}^*(\ell, k) + \text{noise}, \qquad k = 1, \ldots, d_\ell^q, \quad \ell \le N,$$
determine $\mu^*$.
Remark. The minimal separation is now 0, so any solution based on a finite amount of information goes beyond super-resolution.
Duality
Define
$$d\mu_N(x) = \sigma_N(\mu^*)(x)\, dx = \left(\int_{S^q} \Phi_N(x \cdot y)\, d\mu^*(y)\right) dx.$$
For $f \in C(S^q)$, by Fubini's theorem,
$$\int_{S^q} f(x)\, d\mu_N(x) = \int_{S^q} \sigma_N(f)(x)\, d\mu^*(x).$$
So
$$\left| \int_{S^q} f(x)\, d(\mu_N - \mu^*)(x) \right| \le |\mu^*|_{TV}\, E_{N/2}(f).$$
Thus $\mu_N \to \mu^*$ (weak-*). Also,
$$\int_{S^q} P(x)\, d\mu_N(x) = \int_{S^q} P(x)\, d\mu^*(x), \qquad P \in \Pi_{N/2}^q.$$
Examples (Courtesy: D. Batenkov)
Original measure (left), Fourier projection (middle), $\sigma_{64}$ (below left), thresholded $|\sigma_{64}|$ (below right).
Examples (Courtesy: D. Batenkov)
Original measure (left), Fourier projection (middle), $\sigma_{64}$ (below).
Examples (Courtesy: D. Batenkov)
Original measure (left), Fourier projection (middle), $\sigma_{64}$ (below).
3. Distance between measures
Erdős–Turán discrepancy
Erdős, Turán, 1940: if $\nu$ is a signed measure on $\mathbb{T}$,
$$(\ast) \qquad D[\nu] = \sup_{[a,b] \subset \mathbb{T}} |\nu([a, b])|.$$
Analogues of $(\ast)$ are hard to work with on manifolds, even on the sphere. Equivalently, with
$$G(x) = \sum_{k \in \mathbb{Z} \setminus \{0\}} \frac{e^{ikx}}{ik},$$
$$(\ast\ast) \qquad D[\nu] = \sup_{x \in \mathbb{T}} \left| \int_{\mathbb{T}} G(x - y)\, d\nu(y) \right|.$$
Generalization to the multivariate case: Dick, Pillichshammer, 2010.
Wasserstein metric
$$\sup_f \left\{ \left| \int_{S^q} f\, d\nu \right| : \max_{x, y \in S^q} |f(x) - f(y)| \le 1 \right\}.$$
Replace $\max_{x, y \in S^q} |f(x) - f(y)| \le 1$ by $\|\Delta(f)\| \le 1$. This yields the equivalent metric
$$\left\| \int_{S^q} G(\circ \cdot y)\, d\nu(y) \right\|_1,$$
where $G$ is the Green kernel for $\Delta$.
Measuring weak-* convergence
Let $G : [-1, 1] \to \mathbb{R}$, $\hat{G}(\ell) > 0$ for all $\ell$, $\hat{G}(\ell) \sim \ell^{-\beta}$, $\beta > q$. Define
$$D_G[\nu] = \left\| \int_{S^q} G(\circ \cdot y)\, d\nu(y) \right\|_1.$$
Theorem. $D_G[\mu_N - \mu^*] \le c\, N^{-\beta}\, |\mu^*|_{TV}$.
Remark. The approximating measure is constructed from $O(N^q)$ pieces of information $\hat{\mu}^*(\ell, k)$. In terms of the amount of information $M$, the rate is $O(M^{-\beta/q})$.