Model Selection and Fast Rates for Regularized Least-Squares

Andrea Caponnetto

DISI, Università di Genova — CBCL, Massachusetts Institute of Technology

Genova, October 30, 2004
Plan

• Regularized least-squares (RLS) in statistical learning
• Bounds on the expected risk and model selection
• Evaluating the approximation and the sample errors
• Fast rates of convergence of the risk to its minimum
Training sets

• The sample space $Z = X \times Y$, with the input space $X$ a compact subset of $\mathbb{R}^n$ and the output space $Y$ a compact subset of $\mathbb{R}$.
• The probability measure $\rho$ on the space $Z$.
• The training set $\mathbf{z} = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$, a sequence of $\ell$ independent identically distributed elements of $Z$.
Regression using RLS

The estimator $f^{\lambda}_{\mathbf{z}}$ is defined as the unique hypothesis minimizing the sum of empirical loss and complexity

$$ f^{\lambda}_{\mathbf{z}} = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{\ell} \sum_{i=1}^{\ell} \left( f(x_i) - y_i \right)^2 + \lambda \| f \|_{\mathcal{H}}^{2} \right\}, $$

where
• the hypothesis space $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) with kernel $K : X \times X \to \mathbb{R}$,
• the parameter $\lambda$ tunes the balance between the two terms.
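The slide does not spell out how the minimizer is computed; by the representer theorem it has a finite-dimensional closed form. A minimal sketch (the Gaussian kernel at the end is just an illustrative choice, not prescribed by the talk):

```python
import numpy as np

def rls_fit(X, y, kernel, lam):
    """Regularized least-squares in an RKHS.

    By the representer theorem, the minimizer of
        (1/l) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2
    is f(x) = sum_i c_i K(x, x_i) with c = (K + lam * l * I)^(-1) y,
    where K is the l x l Gram matrix K_ij = kernel(x_i, x_j).
    """
    l = len(y)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    c = np.linalg.solve(K + lam * l * np.eye(l), np.asarray(y, dtype=float))
    return lambda x: float(sum(ci * kernel(x, xi) for ci, xi in zip(c, X)))

# Illustrative usage with a Gaussian kernel (any bounded positive definite
# kernel on the compact input space X would do).
gauss = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))
f = rls_fit(X=[[0.0], [1.0], [2.0]], y=[0.0, 1.0, 0.0], kernel=gauss, lam=0.1)
```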
A criterion for model selection

In the context of RLS, a criterion for model selection is a rule for choosing $\lambda$ in order to achieve high performance. The performance of the estimator $f^{\lambda}_{\mathbf{z}}$ is measured by the expected risk

$$ I[f^{\lambda}_{\mathbf{z}}] = \int_{X \times Y} \left( f^{\lambda}_{\mathbf{z}}(x) - y \right)^2 d\rho(x, y). $$

• It is a random variable (through its dependence on the training set $\mathbf{z}$),
• it depends on the unknown distribution $\rho$.
A criterion for model selection (cont.)

The best we can do is to determine a function $\mathcal{B}(\lambda, \eta, \ell)$ which bounds the expected risk $I[f^{\lambda}_{\mathbf{z}}]$ with confidence level $1 - \eta$, that is

$$ \operatorname{Prob}_{\mathbf{z} \in Z^{\ell}} \left[ \, I[f^{\lambda}_{\mathbf{z}}] \le \inf_{f \in \mathcal{H}} I[f] + \mathcal{B}(\lambda, \eta, \ell) \, \right] \ge 1 - \eta. $$

Then, a natural criterion for model selection is to choose the regularization parameter minimizing this bound,

$$ \lambda_{0}(\eta, \ell) = \operatorname*{argmin}_{\lambda > 0} \, \mathcal{B}(\lambda, \eta, \ell). $$
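Read as pseudocode, the rule is just a grid minimization of the bound. A minimal sketch, where the callable `bound` and the grid are placeholders (an explicit form of B is assembled on the following slides):

```python
import numpy as np

def select_lambda(bound, lambda_grid):
    """Model selection rule: return the lambda on the grid minimizing B(lambda, eta, l).

    `bound` is any callable lambda -> B(lambda, eta, l), with eta and l held fixed.
    """
    return min(lambda_grid, key=bound)

# A logarithmic grid is the usual practical choice for a regularization parameter.
lambda_grid = np.logspace(-8, 0, num=100)
```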
Main contributions in the literature

• Model selection performed by bounds using covering numbers as a measure of the capacity of a compact hypothesis space [F. Cucker, S. Smale, 2001, 2002]
• Use of the stability of the estimator and concentration inequalities as tools to bound the risk [O. Bousquet, A. Elisseeff, 2000]
• Direct estimates of integral operators by concentration inequalities, with no need for covering numbers [E. De Vito et al., 2004]
• Use of a Bernstein form of McDiarmid's concentration inequality to improve the rates [S. Smale, D. Zhou, 2004]
A concentration inequality (McDiarmid, 1989)

• Let $\xi$ be a random variable, $\xi : Z^{\ell} \to \mathbb{R}$,
• let $\mathbf{z}^{i}$ be the training set with the $i$-th example replaced by $(x_i', y_i')$,
• assume that there is a constant $C$ such that
$$ |\xi(\mathbf{z}) - \xi(\mathbf{z}^{i})| \le C \quad \text{for all } \mathbf{z}, \mathbf{z}^{i}, i; $$

then McDiarmid's inequality tells us that

$$ \operatorname{Prob}_{\mathbf{z} \in Z^{\ell}} \left( |\xi(\mathbf{z}) - \mathbb{E}_{\mathbf{z}}(\xi)| \ge \epsilon \right) \le 2\exp\left( -\frac{2\epsilon^{2}}{\ell C^{2}} \right). $$
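A quick sanity check, not on the slide: applied to the empirical mean of outputs taking values in $[0, 1]$, the bounded-difference constant is $C = 1/\ell$ and McDiarmid's inequality reduces to Hoeffding's bound:

$$ \xi(\mathbf{z}) = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i, \qquad |\xi(\mathbf{z}) - \xi(\mathbf{z}^{i})| = \frac{|y_i - y_i'|}{\ell} \le \frac{1}{\ell} = C, $$

$$ \operatorname{Prob}_{\mathbf{z} \in Z^{\ell}} \left( |\xi(\mathbf{z}) - \mathbb{E}_{\mathbf{z}}(\xi)| \ge \epsilon \right) \le 2\exp\left( -\frac{2\epsilon^{2}}{\ell C^{2}} \right) = 2\exp\left( -2\ell\epsilon^{2} \right). $$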
A Bernstein form of McDiarmid's inequality (Y. Ying, 2004)

• Bounding both the variations
$$ |\xi(\mathbf{z}) - \mathbb{E}_{i}\,\xi(\mathbf{z}^{i})| \le C \quad \text{for all } \mathbf{z}, i, $$
• and the variances
$$ \mathbb{E}_{i} \left( \xi(\mathbf{z}) - \mathbb{E}_{i}\,\xi(\mathbf{z}^{i}) \right)^{2} \le \sigma^{2} \quad \text{for all } \mathbf{z}, i, $$

it holds that

$$ \operatorname{Prob}_{\mathbf{z} \in Z^{\ell}} \left( |\xi(\mathbf{z}) - \mathbb{E}_{\mathbf{z}}(\xi)| \ge \epsilon \right) \le 2\exp\left( -\frac{\epsilon^{2}}{2\left( C\epsilon/3 + \ell\sigma^{2} \right)} \right). $$
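To see why controlling the variance pays off (a rough inversion, not worked out on the slide): setting the right-hand side equal to $\eta$ and solving the resulting quadratic in $\epsilon$ gives, with probability at least $1 - \eta$,

$$ |\xi(\mathbf{z}) - \mathbb{E}_{\mathbf{z}}(\xi)| \;\le\; \sqrt{2\ell\sigma^{2}\log\frac{2}{\eta}} \;+\; \frac{2C}{3}\log\frac{2}{\eta}, $$

which is smaller than the McDiarmid deviation $C\sqrt{(\ell/2)\log(2/\eta)}$ as soon as the variance bound $\sigma^{2}$ is small compared with $C^{2}$ (and $\ell$ is large); this is the mechanism behind the faster rates below.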
Structure of the bound

$$ I[f^{\lambda}_{\mathbf{z}}] \;\le\; \underbrace{\inf_{f \in \mathcal{H}} I[f]}_{\text{irreducible err.}} \;+\; \Big( \underbrace{A(\lambda)}_{\text{approximation err.}} + \underbrace{S(\lambda, \eta, \ell)}_{\text{sample err.}} \Big)^{2}. $$

• The irreducible error is a measure of the intrinsic randomness of the outputs $y$ for a drawn input $x$.
• The approximation error $A(\lambda)$ is a measure of the increase of risk due to the regularization.
• The sample error $S(\lambda, \eta, \ell)$ is a measure of the increase of risk due to finite sampling.
The bound on the sample error

It can be proved that, given $0 < \eta < 1$ and $\lambda > 0$, with probability at least $1 - \eta$ the sample error is bounded by

$$ S(\lambda, \eta, \ell) = \kappa M C_{\eta}\, (\ell\lambda)^{-\frac{1}{2}} \left( 1 + \kappa\, (\ell\lambda)^{-\frac{1}{2}} \right) \left( 1 + \kappa^{2} C_{\eta}\, \ell^{-\frac{1}{2}} \lambda^{-1} \right), $$

where the constants $M$, $\kappa$ and $C_{\eta}$ are defined by

$$ Y \subset [-M, M], \qquad \kappa^{2} \ge K(x, x) \ \text{for all } x, \qquad C_{\eta} = \frac{4}{3}\log\frac{1}{\eta} + \sqrt{8\log\frac{1}{\eta}}. $$
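For numerical experiments the bound can be evaluated directly; a minimal sketch transcribing the expression above, assuming $\kappa$ and $M$ are known for the chosen kernel and output range:

```python
import numpy as np

def c_eta(eta):
    """Confidence-dependent constant C_eta appearing in the sample error bound."""
    t = np.log(1.0 / eta)
    return (4.0 / 3.0) * t + np.sqrt(8.0 * t)

def sample_error_bound(lam, eta, l, kappa, M):
    """Evaluate S(lambda, eta, l); kappa**2 >= sup_x K(x, x) and Y in [-M, M]."""
    ce = c_eta(eta)
    return (kappa * M * ce * (l * lam) ** -0.5
            * (1.0 + kappa * (l * lam) ** -0.5)
            * (1.0 + kappa ** 2 * ce * l ** -0.5 / lam))
```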
The approximation error

It can be proved that

$$ A(\lambda) = \left\| f^{\lambda} - f_{\rho} \right\|_{L^{2}(X,\nu)}, $$

where
• $f_{\rho}(x) = \int_{Y} y \, d\rho(y \mid x)$ is the regression function,
• $f^{\lambda}$ is the RLS estimator in the limit case of infinite sampling, that is
$$ f^{\lambda} = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ I[f] + \lambda \| f \|_{\mathcal{H}}^{2} \right\}, $$
• $\nu$ is the marginal distribution of $\rho$ on the input space $X$.
Bounding the approximation error

It is well known that bounding the approximation error requires some assumption on the distribution $\rho$.

• Let us denote by $L_K$ the integral operator on $L^{2}(X, \nu)$ defined by
$$ (L_K f)(s) = \int_{X} K(s, x) f(x) \, d\nu(x). $$
• Assuming that the regression function $f_{\rho}$ belongs to the range of the operator $(L_K)^{r}$ (for some $r \in (0, 1]$), then
$$ A(\lambda) \le C_r \lambda^{r}. $$
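The slide states the bound without proof; a short spectral-calculus sketch (standard, but not spelled out in the talk) shows where it comes from. Writing $f_{\rho} = (L_K)^{r} g$ with $g \in L^{2}(X,\nu)$ and using $f^{\lambda} = (L_K + \lambda)^{-1} L_K f_{\rho}$,

$$ f^{\lambda} - f_{\rho} = -\lambda (L_K + \lambda)^{-1} f_{\rho} = -\lambda (L_K + \lambda)^{-1} (L_K)^{r} g, $$

and since $\lambda\sigma^{r}/(\sigma + \lambda) \le \lambda^{r}$ for every $\sigma \ge 0$ when $r \in (0, 1]$,

$$ A(\lambda) \;\le\; \sup_{\sigma \ge 0} \frac{\lambda\,\sigma^{r}}{\sigma + \lambda}\, \|g\|_{L^{2}(X,\nu)} \;\le\; \lambda^{r} \|g\|_{L^{2}(X,\nu)} \;=:\; C_r \lambda^{r}. $$

Larger $r$ means a smoother regression function relative to the kernel, and hence a smaller approximation error.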
Rates of convergence

Given the explicit form of the bound on the expected risk, the associated optimal choice of $\lambda$ can be computed directly. It turns out that $\lambda_{0}(\ell) = O(\ell^{-\alpha})$, where

$$ \alpha = \begin{cases} \dfrac{2}{2r+3} & \text{for } 0 < r \le \dfrac{1}{2}, \\[2mm] \dfrac{1}{2r+1} & \text{for } \dfrac{1}{2} < r \le 1. \end{cases} $$

This choice implies the following convergence rate of the risk to its minimum: $I[f^{\lambda}_{\mathbf{z}}] - \inf_{f \in \mathcal{H}} I[f] \le O(\ell^{-\beta})$, where

$$ \beta = \begin{cases} \dfrac{4r}{2r+3} & \text{for } 0 < r \le \dfrac{1}{2}, \\[2mm] \dfrac{2r}{2r+1} & \text{for } \dfrac{1}{2} < r \le 1. \end{cases} $$
Fast rates

Under the maximum regularity assumption $r = 1$ ($f_{\rho}$ belonging to the range of $L_K$), these results give the optimal rate

$$ I[f^{\lambda}_{\mathbf{z}}] - \inf_{f \in \mathcal{H}} I[f] \le O\!\left( \ell^{-\frac{2}{3}} \log\frac{1}{\eta} \right). $$

This improves
• the rate in [T. Zhang, 2003] in its dependency on the confidence level $\eta$, from $O(\eta^{-1})$ to logarithmic,
• and the rate in [S. Smale, D. Zhou, 2004] from an $O(\ell^{-1/2})$ to an $O(\ell^{-2/3})$ dependency on $\ell$.
The degree of ill-posedness of $L_K$

We will assume the following decay condition on the eigenvalues $\sigma_i^{2}$ of the integral operator $L_K$, for some $p \ge 1$:

$$ \sigma_i^{2} \le C_p \, i^{-p}. $$

• The parameter $p$ is known as the degree of ill-posedness of the operator $L_K$.
• This condition can be related to the smoothness properties of the kernel $K$ and of the marginal probability density.
Improved bound on the sample error

Define the function

$$ \Theta(\lambda, \eta, \ell) = \kappa C_{\eta}\, (\ell\lambda)^{-\frac{1}{2}} + \kappa\, (\ell\lambda)^{-\frac{1}{2}} \left( \frac{C_p}{(p-1)\lambda} \right)^{\frac{1}{2p}}. $$

Given $\lambda$, $\eta$ and $\ell$ such that $\Theta(\lambda, \eta, \ell) \le 1$, then with probability at least $1 - \eta$ the sample error is bounded by

$$ S(\lambda, \eta, \ell) = 2\kappa C_r C_{\eta}\, \lambda^{r-\frac{1}{2}} \ell^{-\frac{1}{2}} \left( 1 + \kappa\, (\ell\lambda)^{-\frac{1}{2}} \right) (1-\Theta)^{-1} \left[ 1 + \tfrac{1}{2}\left( 1 + M\kappa^{-1}\lambda^{-\frac{1}{2}} \right) \Theta\, (1-\Theta)^{-1} \right]. $$
Improved rates of convergence

The new bound can be used to obtain improved rates of convergence when $\frac{1}{2} < r \le 1$; in fact, in this case

$$ \lambda_{0}(\ell) = O(\ell^{-\alpha}) \quad \text{with} \quad \alpha = \frac{p}{2rp+1}, $$

and correspondingly

$$ I[f^{\lambda}_{\mathbf{z}}] - \inf_{f \in \mathcal{H}} I[f] \le O(\ell^{-\beta}) \quad \text{with} \quad \beta = \frac{2rp}{2rp+1}. $$

For large $p$ the resulting convergence rate approaches $O(\ell^{-1})$.
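A quick numerical check of the improvement, worked out here rather than on the slide: for $r = 1$ the exponent is $\beta = 2p/(2p+1)$, so

$$ p = 1 \;\Rightarrow\; \beta = \tfrac{2}{3}, \qquad p = 4 \;\Rightarrow\; \beta = \tfrac{8}{9}, \qquad p \to \infty \;\Rightarrow\; \beta \to 1; $$

$p = 1$ recovers the earlier rate $O(\ell^{-2/3})$, and faster eigenvalue decay pushes the rate towards $O(\ell^{-1})$.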
Conclusions

• The estimate of the sample error $S(\lambda, \eta, \ell)$ does not require covering numbers as a capacity measure of the hypothesis space;
• under the assumption of exponential decay of the eigenvalues of $L_K$, rates arbitrarily close to $O(\ell^{-1})$ can be achieved;
• thanks to the logarithmic dependence on the confidence level in the bounds, the convergence results hold almost surely and not just in probability.