Thresholding and Learning Theory

Dominique Picard
Laboratoire Probabilités et Modèles Aléatoires, Université Paris VII

Joint work with G. Kerkyacharian (LPMA)

Columbia, SC, May 2008
http://www.proba.jussieu.fr/mathdoc/preprints/index.html
Bounded regression / learning problem: the model

1. $Y_i = f_\rho(X_i) + \epsilon_i$, $i = 1, \dots, n$
2. The $\epsilon_i$'s are i.i.d. bounded random variables.
3. The $X_i$'s are i.i.d. random variables on a set $\mathcal{X}$, a compact domain of $\mathbb{R}^d$. Let $\rho$ be the common (unknown) law of the vector $Z = (X, Y)$.
4. $f_\rho$ is a bounded unknown function.
5. Two kinds of hypotheses:
   (a) $f_\rho(X_i)$ orthogonal to $\epsilon_i$ (learning)
   (b) $X_i \perp\!\!\!\perp \epsilon_i$ (bounded regression theory)

Cucker and Smale; Poggio and Smale; ...
Aim of the game

1. Minimize, among 'estimators' $\hat f = \hat f(x, (X,Y)_1^n)$, the risk
$$\mathcal{E}(\hat f) := E_\rho(\hat f) := \int_{\mathcal{X} \times \mathbb{R}} (\hat f(x) - y)^2 \, d\rho(x,y)$$
2. $f_\rho(x) = \int y \, d\rho(y \mid x)$
3. $\mathcal{E}(\hat f) = \|\hat f - f_\rho\|^2_{\rho_X} + \mathrm{err}(f_\rho)$
4. $\mathcal{E}(\hat f) = \int_{\mathcal{X}} (\hat f(x) - f_\rho(x))^2 \, d\rho_X(x) + \int_{\mathcal{X} \times \mathbb{R}} (f_\rho(x) - y)^2 \, d\rho(x,y)$
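The decomposition in items 3 and 4 follows from a one-line expansion, spelled out here for completeness: writing $\hat f(x) - y = (\hat f(x) - f_\rho(x)) + (f_\rho(x) - y)$,
$$\mathcal{E}(\hat f) = \int_{\mathcal{X}} (\hat f(x) - f_\rho(x))^2 \, d\rho_X(x) + 2 \int (\hat f(x) - f_\rho(x))(f_\rho(x) - y) \, d\rho(x,y) + \mathcal{E}(f_\rho),$$
and the cross term vanishes because, conditionally on $x$, $\int (f_\rho(x) - y) \, d\rho(y \mid x) = 0$ by the definition of $f_\rho$ in item 2.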
Measuring the risk

1. Mean square error: $E_{\rho^{\otimes n}} \| \hat f((X,Y)_1^n) - f_\rho \|_{\rho_X}$
2. Probability bounds: $P_{\rho^{\otimes n}} \{\, \| \hat f((X,Y)_1^n) - f_\rho \|_{\rho_X} > \eta \,\}$
Mean square errors and probability bounds

– Assume $f_\rho$ belongs to a set $\Theta$, $\rho \in \mathcal{M}(\Theta)$, and consider the Accuracy Confidence Function:
$$AC_n(\Theta, \eta) := \inf_{\hat f} \sup_{\rho \in \mathcal{M}(\Theta)} P_{\rho^{\otimes n}} \{\, \| f_\rho - \hat f \|_{\rho_X} > \eta \,\}$$
– $$AC_n(\Theta, \eta) \ge C \begin{cases} e^{-c n \eta^2}, & \eta \ge \eta_n, \\ 1, & \eta \le \eta_n. \end{cases}$$

DeVore, Kerkyacharian, P., Temlyakov
• $$AC_n(\Theta, \eta) \ge C \begin{cases} e^{-c n \eta^2}, & \eta \ge \eta_n, \\ 1, & \eta \le \eta_n, \end{cases}$$
• $\ln \bar N(\Theta, \eta_n) \sim c_2 n \eta_n^2$
• $\bar N(\Theta, \delta) := \sup \{ N : \exists f_0, f_1, \dots, f_N \in \Theta \text{ with } c_0 \delta \le \| f_i - f_j \|_{L_2(\rho_X)} \le c_1 \delta, \ \forall i \ne j \}$.
– $$\inf_{\hat f} \sup_{\rho \in \mathcal{M}(\Theta)} P_{\rho^{\otimes n}} \{\, \| f_\rho - \hat f \| > \eta \,\} \ge C \begin{cases} e^{-c n \eta^2}, & \eta \ge \eta_n, \\ 1, & \eta \le \eta_n, \end{cases}$$
– $\eta_n = n^{-\frac{s}{2s+d}}$ for the Besov space $B^s_q(L_\infty(\mathbb{R}^d))$
– In statistics, minimax results:
$$\inf_{\hat f} \sup_{\rho \in \mathcal{M}'(B^s_q(L_\infty(\mathbb{R}^d)))} E \| f_\rho - \hat f \|_{L_2(dx)} \ge c\, n^{-\frac{s}{2s+d}}$$

Ibraguimov, Hasminski, Stone 80-82
Mean square estimates

$$\hat f = \mathrm{Argmin} \left\{ \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2, \ f \in \mathcal{H}_n \right\}$$

1. Two important problems:
   (a) not always easy to implement;
   (b) the choice of $\mathcal{H}_n$ depends on $\Theta$: search for 'universal' estimates, i.e. estimates working for a whole class of spaces $\Theta$.
Oracle case

$$(P): \quad \frac{1}{n} \sum_{i=1}^n K_k(X_i) K_l(X_i) = \delta_{kl}$$
(the $(K_k)$ form an orthonormal basis for the empirical measure on the $X_i$'s)

1. $\mathcal{H}_n^{(1)} = \{ f = \sum_{j=1}^p \alpha_j K_j \}$ (linear)
2. $\mathcal{H}_n^{(2)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \sum |\alpha_j| \le \kappa \}$ ($l_1$ constraint)
3. $\mathcal{H}_n^{(3)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \#\{ |\alpha_j| \ne 0 \} \le \kappa \}$ (sparsity)
$$\hat\alpha_k = \frac{1}{n} \sum_{i=1}^n K_k(X_i) Y_i, \qquad
\hat\alpha_k^{(1)} = \mathrm{sign}(\hat\alpha_k) \, (|\hat\alpha_k| - \lambda)_+, \qquad
\hat\alpha_k^{(2)} = \hat\alpha_k \, I\{ |\hat\alpha_k| \ge \lambda \}$$

1. $\mathcal{H}_n^{(1)} = \{ f = \sum_{j=1}^p \alpha_j K_j \}$ : $\hat f = \sum_{j=1}^p \hat\alpha_j K_j$.
2. $\mathcal{H}_n^{(2)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \sum |\alpha_j| \le \kappa \}$ : $\hat f^{(1)} = \sum_{j=1}^p \hat\alpha_j^{(1)} K_j$.
3. $\mathcal{H}_n^{(3)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \#\{ |\alpha_j| \ne 0 \} \le \kappa \}$ : $\hat f^{(2)} = \sum_{j=1}^p \hat\alpha_j^{(2)} K_j$.
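As an illustration only (not from the slides), here is a minimal NumPy sketch of the three oracle estimators above; the names `K_design` (basis functions evaluated at the design points) and `lam` are hypothetical, and condition (P) is simply assumed to hold.

```python
import numpy as np

def oracle_coefficients(K_design, Y, lam):
    """Oracle estimators under condition (P).

    K_design : (p, n) array with K_design[k, i] = K_k(X_i), assumed empirically orthonormal
    Y        : (n,) response vector
    lam      : threshold level lambda
    """
    n = Y.shape[0]
    alpha = K_design @ Y / n                                             # alpha_k = (1/n) sum_i K_k(X_i) Y_i
    alpha_soft = np.sign(alpha) * np.maximum(np.abs(alpha) - lam, 0.0)   # alpha^(1): soft thresholding
    alpha_hard = alpha * (np.abs(alpha) >= lam)                          # alpha^(2): hard thresholding
    return alpha, alpha_soft, alpha_hard

def estimate(coeffs, basis, x):
    """Evaluate f_hat(x) = sum_j coeffs[j] K_j(x) for a list of basis callables."""
    return sum(c * K(x) for c, K in zip(coeffs, basis))
```

The plain coefficients give the linear estimator, the soft-thresholded ones correspond to the $l_1$-constrained class, and the hard-thresholded ones to the sparsity-constrained class.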
Universality properties

$$\hat\alpha_k = \frac{1}{n} \sum_{i=1}^n K_k(X_i) Y_i, \qquad
\hat\alpha_k^{(1)} = \mathrm{sign}(\hat\alpha_k) \, (|\hat\alpha_k| - \lambda)_+, \qquad
\hat\alpha_k^{(2)} = \hat\alpha_k \, I\{ |\hat\alpha_k| \ge \lambda \}$$
$$\hat f^{(1)} = \sum_{j=1}^p \hat\alpha_j^{(1)} K_j, \qquad \hat f^{(2)} = \sum_{j=1}^p \hat\alpha_j^{(2)} K_j$$
How to mimic the oracle?

1. Condition $(P)$: $\frac{1}{n} \sum_{i=1}^n K_r(X_i) K_l(X_i) = \delta_{rl}$ is not realistic.
2. How can we replace $(P)$ by a condition $P(\delta)$ which is '$\delta$-close' to $(P)$?
Consider for instance the sparsity penalty

We want to minimize:
$$C(\alpha) := \frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^p \alpha_j K_j(X_i) \Big)^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
$$= \frac{1}{n} \| Y - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
$$= \frac{1}{n} \| Y - \mathrm{proj}_V(Y) \|_2^2 + \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
where $V = \{ (\sum_{j=1}^p b_j K_j(X_i))_{i=1}^n, \ b_j \in \mathbb{R} \}$ and $K$ is the $p \times n$ matrix with entries $K_{ji} = K_j(X_i)$.
Case $\lambda = 0$

$$C(\alpha) = \frac{1}{n} \| Y - \mathrm{proj}_V(Y) \|_2^2 + \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2.$$
$$K^t \hat\alpha = \mathrm{proj}_V(Y), \qquad K^t \hat\alpha = K^t (K K^t)^{-1} K Y, \qquad \hat\alpha = (K K^t)^{-1} K Y$$

Regression textbooks.
Case $\lambda \ne 0$

$$C(\alpha) = \frac{1}{n} \| Y - \mathrm{proj}_V(Y) \|_2^2 + \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
Minimizing $C(\alpha)$ is equivalent to minimizing $D(\alpha)$:
$$D(\alpha) = \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}
= (\alpha - \hat\alpha)^t \, \tfrac{1}{n} K K^t \, (\alpha - \hat\alpha) + \lambda \, \#\{ \alpha_j \ne 0 \}$$
Under condition $(P)$: $\frac{1}{n} \sum_{i=1}^n K_r(X_i) K_l(X_i) = \delta_{rl}$,

• the $p \times p$ matrix $M_{np} = \frac{1}{n} K K^t = \mathrm{Id}$, where
$$(M_{np})_{kl} = \frac{1}{n} \sum_{i=1}^n K_l(X_i) K_k(X_i);$$
• $D(\alpha) = \sum_{j=1}^p (\alpha_j - \hat\alpha_j)^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$ has $\hat\alpha_k^{(2)} = \hat\alpha_k \, I\{ |\hat\alpha_k| \ge c\lambda \}$ as a solution;
• simplicity of calculation: $\hat\alpha = (K K^t)^{-1} K Y = \frac{1}{n} K Y$, i.e.
$$\hat\alpha_j = \frac{1}{n} \sum_{i=1}^n K_j(X_i) Y_i.$$
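For completeness, the coordinatewise minimization behind the second bullet (implicit in the slide): for each $j$, $\min_{\alpha_j} \big[ (\alpha_j - \hat\alpha_j)^2 + \lambda \, I\{ \alpha_j \ne 0 \} \big]$ is attained either at $\alpha_j = \hat\alpha_j$ (cost $\lambda$) or at $\alpha_j = 0$ (cost $\hat\alpha_j^2$); keeping the coefficient wins exactly when $\hat\alpha_j^2 \ge \lambda$, i.e. $|\hat\alpha_j| \ge \sqrt{\lambda}$, which is a hard-thresholding rule of the type stated above.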
$\delta$-Near Identity property

$$M_{np} = \frac{1}{n} K K^t$$
$$(1 - \delta) \sum_{j=1}^p x_j^2 \ \le\ x^t M_{np} x \ \le\ (1 + \delta) \sum_{j=1}^p x_j^2$$
$$(1 - \delta) \sup_{j=1,\dots,p} |x_j| \ \le\ \sup_{j=1,\dots,p} |(M_{np} x)_j| \ \le\ (1 + \delta) \sup_{j=1,\dots,p} |x_j|$$
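A small numerical sanity check (not part of the talk, names illustrative) of how far a given design is from the identity: `delta_spectral` controls the quadratic-form bounds through the extreme eigenvalues of $M_{np}$, and `delta_sup` controls the sup-norm bounds through the $\ell_\infty$ operator norm of $M_{np} - \mathrm{Id}$.

```python
import numpy as np

def near_identity_delta(K_design):
    """Estimate how far M_np = (1/n) K K^t is from the identity.

    K_design : (p, n) array with K_design[j, i] = K_j(X_i).
    Returns (delta_spectral, delta_sup):
      delta_spectral : max |eigenvalue(M_np) - 1|
      delta_sup      : max row-sum of |M_np - Id|
    """
    p, n = K_design.shape
    M = K_design @ K_design.T / n
    eigvals = np.linalg.eigvalsh(M)                      # M is symmetric
    delta_spectral = np.max(np.abs(eigvals - 1.0))
    delta_sup = np.max(np.abs(M - np.eye(p)).sum(axis=1))
    return delta_spectral, delta_sup
```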
Estimation procedure

$$\lambda_n = T \sqrt{t_n}, \qquad t_n = \frac{\log n}{n}, \qquad p = \left[ \frac{n}{\log n} \right]^{1/2}$$
$$z = (z_1, \dots, z_p)^t = (K K^t)^{-1} K Y, \qquad \tilde z_l = z_l \, I\{ |z_l| \ge \lambda_n \}$$
$$\hat f = \sum_{l=1}^p \tilde z_l \, K_l(\cdot)$$
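A minimal sketch of this procedure, assuming the (warped) basis functions are available as a list of callables `basis` and using the choices of $\lambda_n$, $t_n$ and $p$ displayed above; the constant `T` and the pseudoinverse (used for numerical stability) are implementation choices, not prescribed by the slides.

```python
import numpy as np

def threshold_estimator(X, Y, basis, T=1.0):
    """Thresholded least-squares estimator f_hat(.) = sum_l z_tilde_l K_l(.).

    X, Y  : (n,) arrays of design points and responses
    basis : list of callables K_1, ..., K_p (at least p of them)
    T     : threshold constant (tuning parameter)
    """
    n = len(Y)
    t_n = np.log(n) / n
    lam_n = T * np.sqrt(t_n)                                     # lambda_n = T sqrt(log n / n)
    p = int(np.sqrt(n / np.log(n)))                              # p = [n / log n]^{1/2}
    K = np.array([[K_l(x) for x in X] for K_l in basis[:p]])     # (p, n) matrix K_{li} = K_l(X_i)
    z = np.linalg.pinv(K @ K.T) @ (K @ Y)                        # z = (K K^t)^{-1} K Y
    z_tilde = z * (np.abs(z) >= lam_n)                           # hard thresholding at lambda_n
    def f_hat(x):
        return sum(z_tilde[l] * basis[l](x) for l in range(p))
    return f_hat, z_tilde
```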
Results

1. If $f_\rho$ is sparse, i.e. there exists $0 < q < 2$ such that for all $p$ there exist $(\alpha_1, \dots, \alpha_p)$ with
   (a) $\| f_\rho - \sum_{j=1}^p \alpha_j K_j \|_\infty \le C p^{-1}$
   (b) $\forall \lambda > 0$, $\#\{ |\alpha_l| \ge \lambda \} \le C \lambda^{-q}$,
then, with $\eta_n = \left[ \frac{\log n}{n} \right]^{\frac{2-q}{4}}$,
$$P_{\rho^{\otimes n}} \{\, \| f_\rho - \hat f \|_{\rho_X} > (1 - \delta)^{-1} \eta \,\} \le T \begin{cases} e^{-c n p^{-1} \eta^2} \wedge n^{-\gamma}, & \eta \ge D \eta_n, \\ 1, & \eta \le D \eta_n. \end{cases}$$

Quasi-optimality.
1. Our conditions depend on the family of functions $\{ K_j, j \ge 1 \}$.
2. If the $K_j$'s are taken, for instance, to be tensor products of wavelet bases, then for $s := \frac{d}{q} - \frac{d}{2}$, $f \in B^s_r(L_\infty(\mathbb{R}^d))$ implies the conditions above, and $\eta_n = n^{-\frac{s}{2s+d}}$.
Near Identity property: how to make it work? ($d = 1$)

1. Take $\{ \phi_k, k \ge 1 \}$ a smooth orthonormal basis of $L_2([0,1], dx)$.
2. Take $H$ with $H(X_i) = \frac{i}{n}$.
3. Change the time scale: $K_k = \phi_k(H)$.
4. Then $P_n(k,l) = \frac{1}{n} \sum_{i=1}^n K_k(X_i) K_l(X_i) = \frac{1}{n} \sum_{i=1}^n \phi_k(\tfrac{i}{n}) \phi_l(\tfrac{i}{n}) \sim \delta_{kl}$.
[Figure 1: Ordering by arrival times]
[Figure 2: Sorting]
Choosing $H$

• Order the $X_i$'s: $(X_1, \dots, X_n) \to (X_{(1)} \le \dots \le X_{(n)})$
• Consider $\hat G_n(x) = \frac{1}{n} \sum_{i=1}^n I\{ X_i \le x \}$
• $\hat G_n(X_{(i)}) = \frac{i}{n}$
• $H = \hat G_n$ is stable (i.e. close to $G(x) = \rho(X \le x)$)
• $\phi_l(\hat G_n) \sim \phi_l(G)$
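As a sketch only, assuming $d = 1$ and a trigonometric basis for the $\phi_k$ (one admissible smooth orthonormal choice), this illustrates the warping $K_k = \phi_k(\hat G_n)$: after composing with the empirical c.d.f., the design points become $i/n$ and the empirical Gram matrix is close to the identity. All names and the simulated law are illustrative.

```python
import numpy as np

def empirical_cdf(X):
    """Return G_hat_n as a callable: G_hat_n(x) = (1/n) #{i : X_i <= x}."""
    Xs = np.sort(X)
    n = len(Xs)
    return lambda x: np.searchsorted(Xs, x, side="right") / n

def phi(k, u):
    """Smooth orthonormal basis of L2([0,1], du): 1, sqrt(2)cos(pi u), sqrt(2)cos(2 pi u), ..."""
    return np.ones_like(u) if k == 0 else np.sqrt(2.0) * np.cos(np.pi * k * u)

def warped_design(X, p):
    """Warped design matrix: K_k(X_i) = phi_k(G_hat_n(X_i)) = phi_k(rank(X_i)/n)."""
    G = empirical_cdf(X)
    u = G(np.asarray(X))                          # equals i/n at the i-th order statistic
    return np.array([phi(k, u) for k in range(p)])   # (p, n)

# For i.i.d. X_i with a continuous law, (1/n) K K^t is close to Id_p (only the ranks matter)
rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=2000)
K = warped_design(X, p=10)
M = K @ K.T / len(X)
print(np.round(M, 2))                             # approximately the 10 x 10 identity
```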
Near Identity property, $d \ge 2$

Finding $H$ such that $H(X_i) = (\frac{i_1}{n}, \dots, \frac{i_d}{n})$, for instance in a 'stable' way, is a difficult problem.
Near Identity property

$K_1, \dots, K_p$ satisfy the NIP if there exist a measure $\mu$ and cells $C_1, \dots, C_N$ such that:
$$\left| \int K_l(x) K_r(x) \, d\mu(x) - \delta_{lr} \right| \le \delta_1(l, r)$$
$$\left| \frac{1}{N} \sum_{i=1}^N K_l(\xi_i) K_r(\xi_i) - \int K_l(x) K_r(x) \, d\mu(x) \right| \le \delta_2(l, r), \qquad \forall\, \xi_1 \in C_1, \dots, \xi_N \in C_N$$
$$\sum_{r=1}^p [\delta_1(l, r) + \delta_2(l, r)] \le \delta$$
Examples: tensor products of bases, uniform cells

1. $d = 1$, $\mu$ the Lebesgue measure on $[0,1]$, $K_1, \dots, K_p$ a smooth orthonormal basis (Fourier, wavelet, ...): $\delta_1 = 0$, $\delta_2(l, r) = \frac{p}{N}$.
   • $\sum_{r=1}^p \delta_2(l, r) \le \frac{p^2}{N} \le \frac{c}{\log N} =: \delta$ for $p = \left[ \frac{N}{\log N} \right]^{1/2}$ ($p \le \sqrt{\delta N}$ is enough).
2. $d > 1$, $\mu$ the Lebesgue measure on $[0,1]^d$, $K_1, \dots, K_p$ tensor products of the previous basis, $N = m^d$, $p = \Gamma^d$: $\delta_1 = 0$, $\delta_2(l, r) = \left[ \frac{p}{N} \right]^{\frac{\sup(1, H(l,r))}{d}}$, where for $l = (l_1, \dots, l_d)$, $r = (r_1, \dots, r_d)$, $H(l, r) = \sum_{i \le d} I\{ l_i \ne r_i \}$.
   • $\sum_{r=1}^p \delta_2(l, r) \le \left[ \frac{p^2}{N} \right]^{1/d} \le \frac{c}{\log N} =: \delta$ for $p \sim \left[ \frac{N}{(\log N)^d} \right]^{1/2}$ ($p \le \sqrt{\delta^d N}$ is enough).
How do we relate these assumptions to the near identity condition?

What we have here:
$$\frac{1}{N} \sum_{i=1}^N K_l(\xi_i) K_r(\xi_i), \quad \xi_1 \in C_1, \dots, \xi_N \in C_N, \qquad \text{'not too far from' } \delta_{lr}.$$
What we want:
$$\frac{1}{n} \sum_{i=1}^n K_l(X_i) K_r(X_i) \qquad \text{'not too far from' } \delta_{lr}.$$
[Figure 3: Typical situation]
Procedure

1. We choose cells $C_l$ such that there is at least one observation point $X_i$ in each cell.
2. We keep only one data point in each cell (reducing the set of observations: $(X_1, Y_1), \dots, (X_n, Y_n) \to (X_1, Y_1), \dots, (X_N, Y_N)$).
3. $n \to N$, $\delta \sim \frac{1}{\log N}$: near identity property.
4. If $\rho_X$ is absolutely continuous with respect to $\mu$, with density bounded above and below, then $N \sim \frac{n}{\log n}$ with overwhelming probability.
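A hedged sketch of the reduction step above, for $d = 2$ with uniform dyadic-style cells on $[0,1]^2$: keep one observation per nonempty cell. The grid choice and all names are illustrative, not prescribed by the slides.

```python
import numpy as np

def reduce_to_cells(X, Y, m):
    """Keep (at most) one observation per cell of the uniform m x m grid on [0,1]^2.

    X : (n, 2) array of design points in [0,1]^2
    Y : (n,) array of responses
    Returns the reduced sample (one representative per occupied cell) and its size N.
    """
    # Index of the cell containing each point (clip so that points at 1.0 fall in the last cell)
    cell_idx = np.minimum((X * m).astype(int), m - 1)
    flat_idx = cell_idx[:, 0] * m + cell_idx[:, 1]
    # np.unique returns the index of the first occurrence in each occupied cell
    _, keep = np.unique(flat_idx, return_index=True)
    return X[keep], Y[keep], len(keep)

# With n points from a density bounded above and below and m^2 ~ n / log n cells,
# essentially every cell is occupied, so N ~ n / log n with overwhelming probability.
rng = np.random.default_rng(1)
n = 20000
X = rng.uniform(size=(n, 2))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=n)
m = int(np.sqrt(n / np.log(n)))
X_red, Y_red, N = reduce_to_cells(X, Y, m)
print(N, m * m)                                   # N close to the total number of cells m^2
```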