Spectral regularization methods for statistical inverse learning problems




1. Spectral regularization methods for statistical inverse learning problems
G. Blanchard, Universität Potsdam
van Dantzig seminar, 23/06/2016
Joint work with N. Mücke (U. Potsdam)

2. Outline
◮ 1. General regularization and kernel methods
◮ 2. Inverse learning/regression and relation to kernels
◮ 3. Rates for linear spectral regularization methods
◮ 4. Beyond the regular spectrum case

3. Outline repeated (section page introducing Part 1: General regularization and kernel methods)

4. INTRODUCTION: RANDOM DESIGN REGRESSION
◮ Consider the familiar regression setting on a random design,
  $Y_i = f^*(X_i) + \varepsilon_i$,
  where $(X_i, Y_i)_{1 \le i \le n}$ is an i.i.d. sample from $P_{XY}$ on the space $\mathcal{X} \times \mathbb{R}$,
◮ with $\mathbb{E}[\varepsilon_i \mid X_i] = 0$.
◮ For an estimator $\hat f$ we consider the prediction error
  $\|\hat f - f^*\|_{2,X}^2 = \mathbb{E}\big[(\hat f(X) - f^*(X))^2\big]$,
  which we want to be as small as possible (in expectation or with high probability).
◮ We can also be interested in the squared reconstruction error $\|\hat f - f^*\|_{\mathcal{H}}^2$, where $\mathcal{H}$ carries a certain Hilbert norm of interest to the user.
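As a concrete illustration of the prediction error above, here is a minimal numpy sketch estimating $\mathbb{E}[(\hat f(X) - f^*(X))^2]$ by Monte Carlo; the choice of $f^*$, the fixed "estimator" $\hat f$ and the uniform design distribution are arbitrary assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target and (crude) estimator on X = [0, 1]; both are assumptions,
# not taken from the slides.
f_star = lambda x: np.sin(2 * np.pi * x)
f_hat = lambda x: 2 * x - 1                      # some fitted estimator, fixed here

# Monte Carlo estimate of the prediction error E[(f_hat(X) - f_star(X))^2]
# under the design distribution P_X, here taken uniform.
X_test = rng.uniform(0, 1, size=100_000)
pred_error = np.mean((f_hat(X_test) - f_star(X_test)) ** 2)
print(f"estimated prediction error: {pred_error:.4f}")
```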

5. LINEAR CASE
◮ Very classical is the linear case: $\mathcal{X} = \mathbb{R}^p$, $f^*(x) = \langle x, \beta^* \rangle$, and in the usual matrix form (the $X_i^t$ form the rows of the design matrix $X$):
  $Y = X\beta^* + \varepsilon$.
◮ The ordinary least squares solution is $\hat\beta_{OLS} = (X^t X)^{\dagger} X^t Y$.
◮ The prediction error corresponds to $\mathbb{E}\big[\langle \beta^* - \hat\beta, X \rangle^2\big]$.
◮ The reconstruction error corresponds to $\|\beta^* - \hat\beta\|^2$.
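A minimal numpy sketch of the ordinary least squares formula above, together with empirical versions of the two error measures; the dimensions, noise level and Gaussian design are arbitrary assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear model Y = X beta* + eps (sizes chosen arbitrarily for illustration).
n, p = 200, 5
beta_star = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ beta_star + 0.1 * rng.normal(size=n)

# Ordinary least squares via the pseudo-inverse: beta_OLS = (X^t X)^+ X^t Y.
beta_ols = np.linalg.pinv(X.T @ X) @ (X.T @ Y)

# Reconstruction error ||beta* - beta_OLS||^2, and an empirical proxy for the
# prediction error E[<beta* - beta_OLS, X>^2] on fresh design points.
rec_error = np.sum((beta_star - beta_ols) ** 2)
X_new = rng.normal(size=(10_000, p))
pred_error = np.mean((X_new @ (beta_star - beta_ols)) ** 2)
print(rec_error, pred_error)
```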

6. EXTENDING THE SCOPE OF LINEAR REGRESSION
◮ A common strategy to model more complex functions: map the input variable $x \in \mathcal{X}$ to a so-called "feature space" through $\tilde x = \Phi(x)$.
◮ Typical examples (say with $\mathcal{X} = [0, 1]$) are
  $\tilde x = \Phi(x) = (1, x, x^2, \ldots, x^p) \in \mathbb{R}^{p+1}$;
  $\tilde x = \Phi(x) = (1, \cos(2\pi x), \sin(2\pi x), \cos(3\pi x), \sin(3\pi x), \ldots) \in \mathbb{R}^{2p+1}$.
◮ Problem: the large number of parameters to estimate requires regularization to avoid overfitting.
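A sketch of the two feature maps listed above; the exact frequency convention of the trigonometric map is hard to read off the extracted slide, so the version below simply follows the frequencies as listed ($2\pi x$, $3\pi x$, ...) and should be treated as an assumption.

```python
import numpy as np

def poly_features(x, p):
    """Map x in [0, 1] to (1, x, x^2, ..., x^p), dimension p + 1."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** k for k in range(p + 1)], axis=-1)

def trig_features(x, p):
    """Map x in [0, 1] to (1, cos(2*pi*x), sin(2*pi*x), cos(3*pi*x), sin(3*pi*x), ...),
    dimension 2p + 1, following the frequencies listed on the slide."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    for k in range(2, p + 2):                    # frequencies 2, 3, ..., p + 1
        cols += [np.cos(k * np.pi * x), np.sin(k * np.pi * x)]
    return np.stack(cols, axis=-1)

print(poly_features([0.1, 0.5], p=3).shape)      # (2, 4)
print(trig_features([0.1, 0.5], p=3).shape)      # (2, 7)
```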

7. REGULARIZATION METHODS
◮ The main idea of regularization is to replace $(X^t X)^{\dagger}$ by an approximate inverse, for instance:
◮ Ridge regression / Tikhonov: $\hat\beta_{Ridge}(\lambda) = (X^t X + \lambda I_p)^{-1} X^t Y$
◮ PCA projection / spectral cut-off: restrict $X^t X$ to its $k$ first eigenvectors, $\hat\beta_{PCA}(k) = (X^t X)^{\dagger}_{|k}\, X^t Y$
◮ Gradient descent / Landweber iteration / $L^2$-boosting:
  $\hat\beta_{LW}(k) = \hat\beta_{LW}(k-1) + X^t\big(Y - X\hat\beta_{LW}(k-1)\big) = \sum_{i=0}^{k} (I - X^t X)^i\, X^t Y$
  (assuming $\|X^t X\|_{op} \le 1$).
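The three regularization methods above admit a short numpy sketch (a simplified illustration, not the talk's code); the Landweber variant below adds a step size so that the rescaled $X^t X$ has operator norm at most 1, in line with the assumption on the slide.

```python
import numpy as np

def ridge(X, Y, lam):
    """Tikhonov / ridge: (X^t X + lam I)^{-1} X^t Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def spectral_cutoff(X, Y, k):
    """PCA projection / spectral cut-off: invert X^t X only on its k leading eigenvectors
    (assumes the k leading eigenvalues are nonzero)."""
    evals, evecs = np.linalg.eigh(X.T @ X)        # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]    # reorder to descending
    inv = np.zeros_like(evals)
    inv[:k] = 1.0 / evals[:k]
    return evecs @ (inv * (evecs.T @ (X.T @ Y)))

def landweber(X, Y, k, step=None):
    """Gradient descent / Landweber iteration on the least-squares objective.
    A step size is used so that step * X^t X has operator norm <= 1."""
    XtX, XtY = X.T @ X, X.T @ Y
    if step is None:
        step = 1.0 / np.linalg.norm(XtX, 2)       # 1 / largest eigenvalue
    beta = np.zeros(X.shape[1])
    for _ in range(k):
        beta = beta + step * (XtY - XtX @ beta)
    return beta
```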

8. GENERAL FORM: SPECTRAL REGULARIZATION
◮ General form of a regularization method: $\hat\beta_{Spec}(\zeta, \lambda) = \zeta_\lambda(X^t X)\, X^t Y$ for some well-chosen function $\zeta_\lambda: \mathbb{R}^+ \to \mathbb{R}^+$ acting on the spectrum and "approximating" the function $x \mapsto 1/x$.
◮ $\lambda > 0$: regularization parameter; $\lambda \to 0$ ⇔ less regularization.
◮ Notation of functional calculus, i.e. if $X^t X = Q^T \mathrm{diag}(\lambda_1, \ldots, \lambda_p)\, Q$ then $\zeta(X^t X) := Q^T \mathrm{diag}(\zeta(\lambda_1), \ldots, \zeta(\lambda_p))\, Q$.
◮ Many are well known from the inverse problems literature.
◮ Examples:
  ◮ Tikhonov: $\zeta_\lambda(t) = (t + \lambda)^{-1}$
  ◮ Spectral cut-off: $\zeta_\lambda(t) = t^{-1}\, \mathbf{1}\{t \ge \lambda\}$
  ◮ Landweber iteration: $\zeta_k(t) = \sum_{i=0}^{k} (1 - t)^i$
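A sketch of this general spectral form: the three filter functions from the slide and a small helper applying $\zeta_\lambda(X^t X)\, X^t Y$ through an eigendecomposition. The helper name, the random data and the final sanity check are illustrative assumptions.

```python
import numpy as np

# Filter functions zeta_lambda approximating t -> 1/t, as on the slide.
def zeta_tikhonov(t, lam):
    return 1.0 / (t + lam)

def zeta_cutoff(t, lam):
    return np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

def zeta_landweber(t, k):
    # sum_{i=0}^{k} (1 - t)^i, a truncated Neumann series for 1/t
    return sum((1.0 - t) ** i for i in range(k + 1))

def spectral_estimator(X, Y, zeta):
    """beta = zeta(X^t X) X^t Y via functional calculus (numpy's eigh convention
    X^t X = Q diag(evals) Q^T)."""
    evals, Q = np.linalg.eigh(X.T @ X)
    return Q @ (zeta(evals) * (Q.T @ (X.T @ Y)))

# Sanity check: the Tikhonov filter reproduces ridge regression.
rng = np.random.default_rng(5)
X, Y, lam = rng.normal(size=(100, 5)), rng.normal(size=100), 0.1
beta_spec = spectral_estimator(X, Y, lambda t: zeta_tikhonov(t, lam))
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)
print(np.allclose(beta_spec, beta_ridge))        # True
```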

9. COEFFICIENT EXPANSION
◮ A useful trick of functional calculus is the "shift rule": $\zeta(X^t X)\, X^t = X^t\, \zeta(X X^t)$.
◮ Interpretation:
  $\hat\beta_{Spec}(\zeta, \lambda) = \zeta(X^t X) X^t Y = X^t \zeta(X X^t) Y = \sum_{i=1}^{n} \alpha_i X_i$,
  with $\alpha = \zeta(G) Y$, where $G = X X^t$ is the $(n, n)$ Gram matrix of $(X_1, \ldots, X_n)$.
◮ This representation is more economical if $p \gg n$.
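The shift rule can be checked numerically; the sketch below compares the primal form $\zeta(X^t X) X^t Y$ with the coefficient form $X^t \zeta(X X^t) Y$ for the Tikhonov filter, in a regime with $p \gg n$ (all sizes and the data are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 500, 0.1                         # p >> n: the coefficient form is cheaper
X = rng.normal(size=(n, p)) / np.sqrt(p)
Y = rng.normal(size=n)

# Primal form: zeta(X^t X) X^t Y with the Tikhonov filter zeta_lam(t) = (t + lam)^{-1}.
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Coefficient form via the shift rule: X^t zeta(X X^t) Y = X^t (G + lam I)^{-1} Y,
# where G = X X^t is the (n, n) Gram matrix.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), Y)
beta_dual = X.T @ alpha

print(np.allclose(beta_primal, beta_dual))       # True: both forms agree
```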

10. THE "KERNELIZATION" ANSATZ
◮ Let $\Phi$ be a feature mapping into a (possibly infinite dimensional) Hilbert feature space $\mathcal{H}$.
◮ Representing $\tilde x = \Phi(x) \in \mathcal{H}$ explicitly is cumbersome/impossible in practice, but if we can compute quickly the kernel
  $K(x, x') := \langle \tilde x, \tilde x' \rangle = \langle \Phi(x), \Phi(x') \rangle$,
  then the kernel Gram matrix $\tilde G_{ij} = \langle \tilde x_i, \tilde x_j \rangle = K(x_i, x_j)$ is accessible.
◮ We can hence directly "kernelize" any classical regularization technique using the implicit representation
  $\hat\beta_{Spec}(\zeta, \lambda) = \sum_{i=1}^{n} \alpha_i \tilde X_i$, with $\alpha = \zeta(\tilde G)\, Y$;
◮ the value of $\hat f(x) = \langle \hat\beta, \tilde x \rangle$ can then be computed for any $x$:
  $\hat f(x) = \sum_{i=1}^{n} \alpha_i K(X_i, x)$.
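A minimal kernelized example in the spirit of this slide: Tikhonov filtering of the Gram matrix of a Gaussian kernel. The kernel choice, bandwidth, data and the absence of any $1/n$ normalization convention are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.2):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for 1-d inputs (illustrative choice)."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
n, lam = 100, 1e-2
X = rng.uniform(0, 1, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=n)

# Kernelized Tikhonov: alpha = zeta_lam(G) Y = (G + lam I)^{-1} Y.
G = gaussian_kernel(X, X)
alpha = np.linalg.solve(G + lam * np.eye(n), Y)

# Predict at new points via f(x) = sum_i alpha_i K(X_i, x).
x_new = np.linspace(0, 1, 5)
f_new = gaussian_kernel(x_new, X) @ alpha
print(f_new)
```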

11. REPRODUCING KERNEL METHODS
◮ If $\mathcal{H}$ is a Hilbert feature space, it is useful to identify it as a space of real functions on $\mathcal{X}$ of the form $f(x) = \langle w, \Phi(x) \rangle$. The canonical feature mapping is then $\Phi(x) = K(x, \cdot)$ and the "reproducing kernel" property reads $f(x) = \langle f, \Phi(x) \rangle = \langle f, K(x, \cdot) \rangle$.
◮ Classical kernels on $\mathbb{R}^d$ include:
  ◮ Gaussian kernel: $K(x, y) = \exp\big(-\|x - y\|^2 / 2\sigma^2\big)$
  ◮ Polynomial kernel: $K(x, y) = (1 + \langle x, y \rangle)^p$
  ◮ Spline kernels, Matérn kernel, inverse quadratic kernel...
◮ The success of reproducing kernel methods since the early 2000s is due to their versatility and ease of use: beyond vector spaces, kernels have been constructed on various non-Euclidean data (text, genome, graphs, probability distributions...).
◮ One of the tenets of "learning theory" is a distribution-free point of view; in particular, the sampling distribution (of the $X_i$'s) is unknown to the user and can be very general.

12. Outline repeated (section page introducing Part 2: Inverse learning/regression and relation to kernels)

13. SETTING: THE "INVERSE LEARNING" PROBLEM
◮ We use the term "inverse learning" (or inverse regression) for an inverse problem where we have noisy observations at random design points:
  $(X_i, Y_i)_{i=1,\ldots,n}$ i.i.d.:  $Y_i = (A f^*)(X_i) + \varepsilon_i$.  (ILP)
◮ The goal is to recover $f^* \in \mathcal{H}_1$.
◮ Early works on closely related subjects come from the splines literature in the '80s (e.g. O'Sullivan '90).
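To make the model (ILP) concrete, here is a small simulation sketch in which $A$ is taken to be the integration operator on $[0, 1]$ and $f^*$ a sine function; this specific choice of $A$, $f^*$, grid and noise level is an assumption for illustration, not an example from the talk.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed instance of the inverse learning model: H1 = functions on [0, 1]
# discretized on a grid, A = integration operator, so (A f)(x) = int_0^x f(t) dt,
# observed at random design points with additive noise.
grid = np.linspace(0, 1, 1000)
f_star = np.sin(2 * np.pi * grid)                # unknown element of H1 to recover

def Af(x):
    """(A f*)(x) = int_0^x f*(t) dt, evaluated by the trapezoidal rule on the grid."""
    cumint = np.concatenate([[0.0],
                             np.cumsum(np.diff(grid) * (f_star[1:] + f_star[:-1]) / 2)])
    return np.interp(x, grid, cumint)

n, sigma = 200, 0.05
X = rng.uniform(0, 1, size=n)                    # random design points
Y = Af(X) + sigma * rng.normal(size=n)           # noisy indirect observations of f*
```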

14. MAIN ASSUMPTION FOR INVERSE LEARNING
Model: $Y_i = (A f^*)(X_i) + \varepsilon_i$, $i = 1, \ldots, n$, where $A: \mathcal{H}_1 \to \mathcal{H}_2$.  (ILP)
Observe:
◮ $\mathcal{H}_2$ should be a space of real-valued functions on $\mathcal{X}$.
◮ The geometrical structure of the "measurement errors" will be dictated by the statistical properties of the sampling scheme; there is no need to assume or consider any a priori Hilbert structure on $\mathcal{H}_2$.
◮ The crucial structural assumption is the following:
Assumption. The family of evaluation functionals $(S_x)$, $x \in \mathcal{X}$, defined by
  $S_x: \mathcal{H}_1 \to \mathbb{R}$, $f \mapsto S_x(f) := (A f)(x)$,
is uniformly bounded, i.e., there exists $\kappa < \infty$ such that for any $x \in \mathcal{X}$ and all $f \in \mathcal{H}_1$,
  $|S_x(f)| \le \kappa\, \|f\|_{\mathcal{H}_1}$.

15. GEOMETRY OF INVERSE LEARNING
◮ Inverse learning under the previous assumption was essentially considered by Caponnetto et al. (2006).
◮ Riesz's theorem implies the existence, for any $x \in \mathcal{X}$, of $F_x \in \mathcal{H}_1$ such that
  $\forall f \in \mathcal{H}_1: (A f)(x) = \langle f, F_x \rangle$.
◮ $K(x, y) := \langle F_x, F_y \rangle$ defines a positive semidefinite kernel on $\mathcal{X}$, with associated reproducing kernel Hilbert space (RKHS) denoted $\mathcal{H}_K$.
◮ As a pure function space, $\mathcal{H}_K$ coincides with $\mathrm{Im}(A)$.
◮ Assuming $A$ injective, $A$ is in fact an isometric isomorphism between $\mathcal{H}_1$ and $\mathcal{H}_K$.
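A finite-dimensional illustration of the induced kernel: if $\mathcal{H}_1 = \mathbb{R}^p$ and $(A\beta)(x) = \langle \varphi(x), \beta \rangle$ for some feature map $\varphi$ (an assumed toy setting, not the slides' general one), then the Riesz representer of evaluation at $x$ is $F_x = \varphi(x)$ and $K(x, y) = \langle \varphi(x), \varphi(y) \rangle$, which is positive semidefinite as the sketch checks numerically.

```python
import numpy as np

# Assumed toy setting: H1 = R^p, (A beta)(x) = <phi(x), beta> with a monomial feature map,
# so F_x = phi(x) and K(x, y) = <phi(x), phi(y)>.
def phi(x, p=6):
    return np.stack([np.asarray(x, dtype=float) ** k for k in range(p)], axis=-1)

def K(x, y):
    return phi(x) @ phi(y).T                     # K(x, y) = <phi(x), phi(y)>

x = np.linspace(0, 1, 50)
Gram = K(x, x)

# The induced kernel is positive semidefinite: the smallest eigenvalue of the
# Gram matrix is >= 0 up to round-off.
print(np.linalg.eigvalsh(Gram).min())
```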
