Thresholding and Learning Theory

Dominique Picard
Laboratoire Probabilités et Modèles Aléatoires, Université Paris VII

Joint work with G. Kerkyacharian (LPMA)

Columbia, SC, May 2008
http://www.proba.jussieu.fr/mathdoc/preprints/index.html
Bounded regression / learning problem: the model

1. $Y_i = f_\rho(X_i) + \epsilon_i$, $i = 1, \dots, n$
2. The $\epsilon_i$'s are i.i.d. bounded random variables.
3. The $X_i$'s are i.i.d. random variables on a set $\mathcal{X}$, a compact domain of $\mathbb{R}^d$. Let $\rho$ be the common (unknown) law of the vector $Z = (X, Y)$.
4. $f_\rho$ is a bounded unknown function.
5. Two kinds of hypotheses:
   (a) $f_\rho(X_i)$ orthogonal to $\epsilon_i$ (learning)
   (b) $X_i \perp\!\!\!\perp \epsilon_i$ (bounded regression theory)

Cucker and Smale; Poggio and Smale; ...
Aim of the game

1. Minimize, among 'estimators' $\hat f = \hat f(x, (X,Y)_1^n)$, the risk
$$\mathcal{E}(\hat f) := E_\rho(\hat f) := \int_{\mathcal{X} \times \mathbb{R}} (\hat f(x) - y)^2 \, d\rho(x,y)$$
2. $f_\rho(x) = \int y \, d\rho(y \mid x)$
3. $\mathcal{E}(\hat f) = \|\hat f - f_\rho\|^2_{\rho_X} + \mathrm{err}(f_\rho)$
4. $\mathcal{E}(\hat f) = \int_{\mathcal{X}} (\hat f(x) - f_\rho(x))^2 \, d\rho_X(x) + \int_{\mathcal{X} \times \mathbb{R}} (f_\rho(x) - y)^2 \, d\rho(x,y)$
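The decomposition in items 3 and 4 follows from a one-line expansion, spelled out here for completeness: writing $\hat f(x) - y = (\hat f(x) - f_\rho(x)) + (f_\rho(x) - y)$,
$$\mathcal{E}(\hat f) = \int_{\mathcal{X}} (\hat f(x) - f_\rho(x))^2 \, d\rho_X(x) + 2 \int (\hat f(x) - f_\rho(x))(f_\rho(x) - y) \, d\rho(x,y) + \mathcal{E}(f_\rho),$$
and the cross term vanishes because, conditionally on $x$, $\int (f_\rho(x) - y) \, d\rho(y \mid x) = 0$ by the definition of $f_\rho$ in item 2.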
Measuring the risk

1. Mean square error: $E_{\rho^{\otimes n}} \| \hat f((X,Y)_1^n) - f_\rho \|_{\rho_X}$
2. Probability bounds: $P_{\rho^{\otimes n}} \{\, \| \hat f((X,Y)_1^n) - f_\rho \|_{\rho_X} > \eta \,\}$
Mean square errors and probability bounds

– Assume $f_\rho$ belongs to a set $\Theta$, $\rho \in \mathcal{M}(\Theta)$, and consider the Accuracy Confidence Function:
$$AC_n(\Theta, \eta) := \inf_{\hat f} \sup_{\rho \in \mathcal{M}(\Theta)} P_{\rho^{\otimes n}} \{\, \| f_\rho - \hat f \|_{\rho_X} > \eta \,\}$$
– $$AC_n(\Theta, \eta) \ge C \begin{cases} e^{-c n \eta^2}, & \eta \ge \eta_n, \\ 1, & \eta \le \eta_n. \end{cases}$$

DeVore, Kerkyacharian, P., Temlyakov
• $$AC_n(\Theta, \eta) \ge C \begin{cases} e^{-c n \eta^2}, & \eta \ge \eta_n, \\ 1, & \eta \le \eta_n, \end{cases}$$
• $\ln \bar N(\Theta, \eta_n) \sim c_2 n \eta_n^2$
• $\bar N(\Theta, \delta) := \sup \{ N : \exists f_0, f_1, \dots, f_N \in \Theta \text{ with } c_0 \delta \le \| f_i - f_j \|_{L_2(\rho_X)} \le c_1 \delta, \ \forall i \ne j \}$.
– $$\inf_{\hat f} \sup_{\rho \in \mathcal{M}(\Theta)} P_{\rho^{\otimes n}} \{\, \| f_\rho - \hat f \| > \eta \,\} \ge C \begin{cases} e^{-c n \eta^2}, & \eta \ge \eta_n, \\ 1, & \eta \le \eta_n, \end{cases}$$
– $\eta_n = n^{-\frac{s}{2s+d}}$ for the Besov space $B^s_q(L_\infty(\mathbb{R}^d))$
– In statistics, minimax results:
$$\inf_{\hat f} \sup_{\rho \in \mathcal{M}'(B^s_q(L_\infty(\mathbb{R}^d)))} E \| f_\rho - \hat f \|_{L_2(dx)} \ge c\, n^{-\frac{s}{2s+d}}$$

Ibraguimov, Hasminski, Stone 80-82
Mean square estimates

$$\hat f = \mathrm{Argmin} \left\{ \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2, \ f \in \mathcal{H}_n \right\}$$

1. Two important problems:
   (a) not always easy to implement;
   (b) the choice of $\mathcal{H}_n$ depends on $\Theta$: search for 'universal' estimates, i.e. estimates working for a whole class of spaces $\Theta$.
Oracle case

$$(P): \quad \frac{1}{n} \sum_{i=1}^n K_k(X_i) K_l(X_i) = \delta_{kl}$$
(the $(K_k)$ form an orthonormal basis for the empirical measure on the $X_i$'s)

1. $\mathcal{H}_n^{(1)} = \{ f = \sum_{j=1}^p \alpha_j K_j \}$ (linear)
2. $\mathcal{H}_n^{(2)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \sum |\alpha_j| \le \kappa \}$ ($l_1$ constraint)
3. $\mathcal{H}_n^{(3)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \#\{ |\alpha_j| \ne 0 \} \le \kappa \}$ (sparsity)
$$\hat\alpha_k = \frac{1}{n} \sum_{i=1}^n K_k(X_i) Y_i, \qquad
\hat\alpha_k^{(1)} = \mathrm{sign}(\hat\alpha_k) \, (|\hat\alpha_k| - \lambda)_+, \qquad
\hat\alpha_k^{(2)} = \hat\alpha_k \, I\{ |\hat\alpha_k| \ge \lambda \}$$

1. $\mathcal{H}_n^{(1)} = \{ f = \sum_{j=1}^p \alpha_j K_j \}$ : $\hat f = \sum_{j=1}^p \hat\alpha_j K_j$.
2. $\mathcal{H}_n^{(2)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \sum |\alpha_j| \le \kappa \}$ : $\hat f^{(1)} = \sum_{j=1}^p \hat\alpha_j^{(1)} K_j$.
3. $\mathcal{H}_n^{(3)} = \{ f = \sum_{j=1}^p \alpha_j K_j, \ \#\{ |\alpha_j| \ne 0 \} \le \kappa \}$ : $\hat f^{(2)} = \sum_{j=1}^p \hat\alpha_j^{(2)} K_j$.
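As an illustration only (not from the slides), here is a minimal NumPy sketch of the three oracle estimators above; the names `K_design` (basis functions evaluated at the design points) and `lam` are hypothetical, and condition (P) is simply assumed to hold.

```python
import numpy as np

def oracle_coefficients(K_design, Y, lam):
    """Oracle estimators under condition (P).

    K_design : (p, n) array with K_design[k, i] = K_k(X_i), assumed empirically orthonormal
    Y        : (n,) response vector
    lam      : threshold level lambda
    """
    n = Y.shape[0]
    alpha = K_design @ Y / n                                             # alpha_k = (1/n) sum_i K_k(X_i) Y_i
    alpha_soft = np.sign(alpha) * np.maximum(np.abs(alpha) - lam, 0.0)   # alpha^(1): soft thresholding
    alpha_hard = alpha * (np.abs(alpha) >= lam)                          # alpha^(2): hard thresholding
    return alpha, alpha_soft, alpha_hard

def estimate(coeffs, basis, x):
    """Evaluate f_hat(x) = sum_j coeffs[j] K_j(x) for a list of basis callables."""
    return sum(c * K(x) for c, K in zip(coeffs, basis))
```

The plain coefficients give the linear estimator, the soft-thresholded ones correspond to the $l_1$-constrained class, and the hard-thresholded ones to the sparsity-constrained class.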
Universality properties

$$\hat\alpha_k = \frac{1}{n} \sum_{i=1}^n K_k(X_i) Y_i, \qquad
\hat\alpha_k^{(1)} = \mathrm{sign}(\hat\alpha_k) \, (|\hat\alpha_k| - \lambda)_+, \qquad
\hat\alpha_k^{(2)} = \hat\alpha_k \, I\{ |\hat\alpha_k| \ge \lambda \}$$
$$\hat f^{(1)} = \sum_{j=1}^p \hat\alpha_j^{(1)} K_j, \qquad \hat f^{(2)} = \sum_{j=1}^p \hat\alpha_j^{(2)} K_j$$
How to mimic the oracle?

1. Condition $(P)$: $\frac{1}{n} \sum_{i=1}^n K_r(X_i) K_l(X_i) = \delta_{rl}$ is not realistic.
2. How can we replace $(P)$ by a condition $P(\delta)$ which is '$\delta$-close' to $(P)$?
Consider for instance the sparsity penalty

We want to minimize:
$$C(\alpha) := \frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^p \alpha_j K_j(X_i) \Big)^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
$$= \frac{1}{n} \| Y - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
$$= \frac{1}{n} \| Y - \mathrm{proj}_V(Y) \|_2^2 + \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
where $V = \{ (\sum_{j=1}^p b_j K_j(X_i))_{i=1}^n, \ b_j \in \mathbb{R} \}$ and $K$ is the $p \times n$ matrix with entries $K_{ji} = K_j(X_i)$.
Case $\lambda = 0$

$$C(\alpha) = \frac{1}{n} \| Y - \mathrm{proj}_V(Y) \|_2^2 + \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2.$$
$$K^t \hat\alpha = \mathrm{proj}_V(Y), \qquad K^t \hat\alpha = K^t (K K^t)^{-1} K Y, \qquad \hat\alpha = (K K^t)^{-1} K Y$$

Regression textbooks.
Case $\lambda \ne 0$

$$C(\alpha) = \frac{1}{n} \| Y - \mathrm{proj}_V(Y) \|_2^2 + \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$$
Minimizing $C(\alpha)$ is equivalent to minimizing $D(\alpha)$:
$$D(\alpha) = \frac{1}{n} \| \mathrm{proj}_V(Y) - K^t \alpha \|_2^2 + \lambda \, \#\{ \alpha_j \ne 0 \}
= (\alpha - \hat\alpha)^t \, \tfrac{1}{n} K K^t \, (\alpha - \hat\alpha) + \lambda \, \#\{ \alpha_j \ne 0 \}$$
Under condition $(P)$: $\frac{1}{n} \sum_{i=1}^n K_r(X_i) K_l(X_i) = \delta_{rl}$,

• the $p \times p$ matrix $M_{np} = \frac{1}{n} K K^t = \mathrm{Id}$, where
$$(M_{np})_{kl} = \frac{1}{n} \sum_{i=1}^n K_l(X_i) K_k(X_i);$$
• $D(\alpha) = \sum_{j=1}^p (\alpha_j - \hat\alpha_j)^2 + \lambda \, \#\{ \alpha_j \ne 0 \}$ has $\hat\alpha_k^{(2)} = \hat\alpha_k \, I\{ |\hat\alpha_k| \ge c\lambda \}$ as a solution;
• simplicity of calculation: $\hat\alpha = (K K^t)^{-1} K Y = \frac{1}{n} K Y$, i.e.
$$\hat\alpha_j = \frac{1}{n} \sum_{i=1}^n K_j(X_i) Y_i.$$
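For completeness, the coordinatewise minimization behind the second bullet (implicit in the slide): for each $j$, $\min_{\alpha_j} \big[ (\alpha_j - \hat\alpha_j)^2 + \lambda \, I\{ \alpha_j \ne 0 \} \big]$ is attained either at $\alpha_j = \hat\alpha_j$ (cost $\lambda$) or at $\alpha_j = 0$ (cost $\hat\alpha_j^2$); keeping the coefficient wins exactly when $\hat\alpha_j^2 \ge \lambda$, i.e. $|\hat\alpha_j| \ge \sqrt{\lambda}$, which is a hard-thresholding rule of the type stated above.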
$\delta$-Near Identity property

$$M_{np} = \frac{1}{n} K K^t$$
$$(1 - \delta) \sum_{j=1}^p x_j^2 \ \le\ x^t M_{np} x \ \le\ (1 + \delta) \sum_{j=1}^p x_j^2$$
$$(1 - \delta) \sup_{j=1,\dots,p} |x_j| \ \le\ \sup_{j=1,\dots,p} |(M_{np} x)_j| \ \le\ (1 + \delta) \sup_{j=1,\dots,p} |x_j|$$
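A small numerical sanity check (not part of the talk, names illustrative) of how far a given design is from the identity: `delta_spectral` controls the quadratic-form bounds through the extreme eigenvalues of $M_{np}$, and `delta_sup` controls the sup-norm bounds through the $\ell_\infty$ operator norm of $M_{np} - \mathrm{Id}$.

```python
import numpy as np

def near_identity_delta(K_design):
    """Estimate how far M_np = (1/n) K K^t is from the identity.

    K_design : (p, n) array with K_design[j, i] = K_j(X_i).
    Returns (delta_spectral, delta_sup):
      delta_spectral : max |eigenvalue(M_np) - 1|
      delta_sup      : max row-sum of |M_np - Id|
    """
    p, n = K_design.shape
    M = K_design @ K_design.T / n
    eigvals = np.linalg.eigvalsh(M)                      # M is symmetric
    delta_spectral = np.max(np.abs(eigvals - 1.0))
    delta_sup = np.max(np.abs(M - np.eye(p)).sum(axis=1))
    return delta_spectral, delta_sup
```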
Estimation procedure

$$\lambda_n = T \sqrt{t_n}, \qquad t_n = \frac{\log n}{n}, \qquad p = \left[ \frac{n}{\log n} \right]^{1/2}$$
$$z = (z_1, \dots, z_p)^t = (K K^t)^{-1} K Y, \qquad \tilde z_l = z_l \, I\{ |z_l| \ge \lambda_n \}$$
$$\hat f = \sum_{l=1}^p \tilde z_l \, K_l(\cdot)$$
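A minimal sketch of this procedure, assuming the (warped) basis functions are available as a list of callables `basis` and using the choices of $\lambda_n$, $t_n$ and $p$ displayed above; the constant `T` and the pseudoinverse (used for numerical stability) are implementation choices, not prescribed by the slides.

```python
import numpy as np

def threshold_estimator(X, Y, basis, T=1.0):
    """Thresholded least-squares estimator f_hat(.) = sum_l z_tilde_l K_l(.).

    X, Y  : (n,) arrays of design points and responses
    basis : list of callables K_1, ..., K_p (at least p of them)
    T     : threshold constant (tuning parameter)
    """
    n = len(Y)
    t_n = np.log(n) / n
    lam_n = T * np.sqrt(t_n)                                     # lambda_n = T sqrt(log n / n)
    p = int(np.sqrt(n / np.log(n)))                              # p = [n / log n]^{1/2}
    K = np.array([[K_l(x) for x in X] for K_l in basis[:p]])     # (p, n) matrix K_{li} = K_l(X_i)
    z = np.linalg.pinv(K @ K.T) @ (K @ Y)                        # z = (K K^t)^{-1} K Y
    z_tilde = z * (np.abs(z) >= lam_n)                           # hard thresholding at lambda_n
    def f_hat(x):
        return sum(z_tilde[l] * basis[l](x) for l in range(p))
    return f_hat, z_tilde
```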
Results

1. If $f_\rho$ is sparse, i.e. there exists $0 < q < 2$ such that for all $p$ there exist $(\alpha_1, \dots, \alpha_p)$ with
   (a) $\| f_\rho - \sum_{j=1}^p \alpha_j K_j \|_\infty \le C p^{-1}$
   (b) $\forall \lambda > 0$, $\#\{ |\alpha_l| \ge \lambda \} \le C \lambda^{-q}$,
then, with $\eta_n = \left[ \frac{\log n}{n} \right]^{\frac{2-q}{4}}$,
$$P_{\rho^{\otimes n}} \{\, \| f_\rho - \hat f \|_{\rho_X} > (1 - \delta)^{-1} \eta \,\} \le T \begin{cases} e^{-c n p^{-1} \eta^2} \wedge n^{-\gamma}, & \eta \ge D \eta_n, \\ 1, & \eta \le D \eta_n. \end{cases}$$

Quasi-optimality.
1. Our conditions depend on the family of functions $\{ K_j, j \ge 1 \}$.
2. If the $K_j$'s are taken, for instance, to be tensor products of wavelet bases, then for $s := \frac{d}{q} - \frac{d}{2}$, $f \in B^s_r(L_\infty(\mathbb{R}^d))$ implies the conditions above, and $\eta_n = n^{-\frac{s}{2s+d}}$.
Near Identity property: how to make it work? ($d = 1$)

1. Take $\{ \phi_k, k \ge 1 \}$ a smooth orthonormal basis of $L_2([0,1], dx)$.
2. Take $H$ with $H(X_i) = \frac{i}{n}$.
3. Change the time scale: $K_k = \phi_k(H)$.
4. Then $P_n(k,l) = \frac{1}{n} \sum_{i=1}^n K_k(X_i) K_l(X_i) = \frac{1}{n} \sum_{i=1}^n \phi_k(\tfrac{i}{n}) \phi_l(\tfrac{i}{n}) \sim \delta_{kl}$.
[Figure 1: Ordering by arrival times]
[Figure 2: Sorting]
Choosing $H$

• Order the $X_i$'s: $(X_1, \dots, X_n) \to (X_{(1)} \le \dots \le X_{(n)})$
• Consider $\hat G_n(x) = \frac{1}{n} \sum_{i=1}^n I\{ X_i \le x \}$
• $\hat G_n(X_{(i)}) = \frac{i}{n}$
• $H = \hat G_n$ is stable (i.e. close to $G(x) = \rho(X \le x)$)
• $\phi_l(\hat G_n) \sim \phi_l(G)$
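As a sketch only, assuming $d = 1$ and a trigonometric basis for the $\phi_k$ (one admissible smooth orthonormal choice), this illustrates the warping $K_k = \phi_k(\hat G_n)$: after composing with the empirical c.d.f., the design points become $i/n$ and the empirical Gram matrix is close to the identity. All names and the simulated law are illustrative.

```python
import numpy as np

def empirical_cdf(X):
    """Return G_hat_n as a callable: G_hat_n(x) = (1/n) #{i : X_i <= x}."""
    Xs = np.sort(X)
    n = len(Xs)
    return lambda x: np.searchsorted(Xs, x, side="right") / n

def phi(k, u):
    """Smooth orthonormal basis of L2([0,1], du): 1, sqrt(2)cos(pi u), sqrt(2)cos(2 pi u), ..."""
    return np.ones_like(u) if k == 0 else np.sqrt(2.0) * np.cos(np.pi * k * u)

def warped_design(X, p):
    """Warped design matrix: K_k(X_i) = phi_k(G_hat_n(X_i)) = phi_k(rank(X_i)/n)."""
    G = empirical_cdf(X)
    u = G(np.asarray(X))                          # equals i/n at the i-th order statistic
    return np.array([phi(k, u) for k in range(p)])   # (p, n)

# For i.i.d. X_i with a continuous law, (1/n) K K^t is close to Id_p (only the ranks matter)
rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=2000)
K = warped_design(X, p=10)
M = K @ K.T / len(X)
print(np.round(M, 2))                             # approximately the 10 x 10 identity
```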
Near Identity property, $d \ge 2$

Finding $H$ such that $H(X_i) = (\frac{i_1}{n}, \dots, \frac{i_d}{n})$, for instance in a 'stable' way, is a difficult problem.
Near Identity property

$K_1, \dots, K_p$ satisfy the NIP if there exist a measure $\mu$ and cells $C_1, \dots, C_N$ such that:
$$\left| \int K_l(x) K_r(x) \, d\mu(x) - \delta_{lr} \right| \le \delta_1(l, r)$$
$$\left| \frac{1}{N} \sum_{i=1}^N K_l(\xi_i) K_r(\xi_i) - \int K_l(x) K_r(x) \, d\mu(x) \right| \le \delta_2(l, r), \qquad \forall\, \xi_1 \in C_1, \dots, \xi_N \in C_N$$
$$\sum_{r=1}^p [\delta_1(l, r) + \delta_2(l, r)] \le \delta$$
Examples: tensor products of bases, uniform cells

1. $d = 1$, $\mu$ the Lebesgue measure on $[0,1]$, $K_1, \dots, K_p$ a smooth orthonormal basis (Fourier, wavelet, ...): $\delta_1 = 0$, $\delta_2(l, r) = \frac{p}{N}$.
   • $\sum_{r=1}^p \delta_2(l, r) \le \frac{p^2}{N} \le \frac{c}{\log N} =: \delta$ for $p = \left[ \frac{N}{\log N} \right]^{1/2}$ ($p \le \sqrt{\delta N}$ is enough).
2. $d > 1$, $\mu$ the Lebesgue measure on $[0,1]^d$, $K_1, \dots, K_p$ tensor products of the previous basis, $N = m^d$, $p = \Gamma^d$: $\delta_1 = 0$, $\delta_2(l, r) = \left[ \frac{p}{N} \right]^{\frac{\sup(1, H(l,r))}{d}}$, where for $l = (l_1, \dots, l_d)$, $r = (r_1, \dots, r_d)$, $H(l, r) = \sum_{i \le d} I\{ l_i \ne r_i \}$.
   • $\sum_{r=1}^p \delta_2(l, r) \le \left[ \frac{p^2}{N} \right]^{1/d} \le \frac{c}{\log N} =: \delta$ for $p \sim \left[ \frac{N}{(\log N)^d} \right]^{1/2}$ ($p \le \sqrt{\delta^d N}$ is enough).
How do we relate these assumptions to the near identity condition?

What we have here:
$$\frac{1}{N} \sum_{i=1}^N K_l(\xi_i) K_r(\xi_i), \quad \xi_1 \in C_1, \dots, \xi_N \in C_N, \qquad \text{'not too far from' } \delta_{lr}.$$
What we want:
$$\frac{1}{n} \sum_{i=1}^n K_l(X_i) K_r(X_i) \qquad \text{'not too far from' } \delta_{lr}.$$
[Figure 3: Typical situation]
Procedure

1. We choose cells $C_l$ such that there is at least one observation point $X_i$ in each cell.
2. We keep only one data point in each cell (reducing the set of observations: $(X_1, Y_1), \dots, (X_n, Y_n) \to (X_1, Y_1), \dots, (X_N, Y_N)$).
3. $n \to N$, $\delta \sim \frac{1}{\log N}$: near identity property.
4. If $\rho_X$ is absolutely continuous with respect to $\mu$, with density bounded above and below, then $N \sim \frac{n}{\log n}$ with overwhelming probability.
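A hedged sketch of the reduction step above, for $d = 2$ with uniform dyadic-style cells on $[0,1]^2$: keep one observation per nonempty cell. The grid choice and all names are illustrative, not prescribed by the slides.

```python
import numpy as np

def reduce_to_cells(X, Y, m):
    """Keep (at most) one observation per cell of the uniform m x m grid on [0,1]^2.

    X : (n, 2) array of design points in [0,1]^2
    Y : (n,) array of responses
    Returns the reduced sample (one representative per occupied cell) and its size N.
    """
    # Index of the cell containing each point (clip so that points at 1.0 fall in the last cell)
    cell_idx = np.minimum((X * m).astype(int), m - 1)
    flat_idx = cell_idx[:, 0] * m + cell_idx[:, 1]
    # np.unique returns the index of the first occurrence in each occupied cell
    _, keep = np.unique(flat_idx, return_index=True)
    return X[keep], Y[keep], len(keep)

# With n points from a density bounded above and below and m^2 ~ n / log n cells,
# essentially every cell is occupied, so N ~ n / log n with overwhelming probability.
rng = np.random.default_rng(1)
n = 20000
X = rng.uniform(size=(n, 2))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=n)
m = int(np.sqrt(n / np.log(n)))
X_red, Y_red, N = reduce_to_cells(X, Y, m)
print(N, m * m)                                   # N close to the total number of cells m^2
```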