On adaptation for the posterior distribution under local and sup-norm


  1. On adaptation for the posterior distribution under local and sup-norm
     Judith Rousseau, Marc Hoffmann and Johannes Schmidt-Hieber
     ENSAE-CREST et CEREMADE, Université Paris-Dauphine
     Brown

  2. Outline
     1. Bayesian nonparametrics: posterior concentration (Generalities; Adaptation; Idea of the proof)
     2. Why adaptation is easy: the white noise model
     3. What about $f(x_0)$? Or $\|f - f_0\|_\infty$?
     4. A series of negative results


  3. Generalities
     - Model: $Y_1^n \mid \theta \sim p_\theta^n$ (density w.r.t. $\mu$), $\theta \in \Theta$. A priori, $\theta \sim \Pi$, the prior distribution, which leads to the posterior distribution
       $$d\Pi(\theta \mid Y_1^n) = \frac{p_\theta^n(Y_1^n)\, d\Pi(\theta)}{m(Y_1^n)}, \qquad Y_1^n = (Y_1, \dots, Y_n).$$
     - Posterior concentration: for a loss $d(\cdot,\cdot)$ on $\Theta$ and the true $\theta_0 \in \Theta$,
       $$E_{\theta_0}\big(\Pi[U_{\epsilon_n} \mid Y_1^n]\big) = 1 + o(1), \qquad U_{\epsilon_n} = \{\theta : d(\theta, \theta_0) \le \epsilon_n\}, \quad \epsilon_n \downarrow 0.$$
     - Minimax concentration rates over a class $\Theta_\alpha(L)$:
       $$\sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\Big[\Pi\big(U_{M\epsilon_n(\alpha)}^c \mid Y_1^n\big)\Big] = o(1),$$
       where $\epsilon_n(\alpha)$ is the minimax rate under $d(\cdot,\cdot)$ over $\Theta_\alpha(L)$.
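
To make the concentration statement concrete, here is a minimal numerical sketch in the simplest conjugate setting; the model, prior and constants are illustrative choices of ours, not the talk's:

```python
import numpy as np
from scipy.stats import norm

# Illustrative conjugate example (not from the slides): Y_i ~ N(theta, 1)
# i.i.d. with prior theta ~ N(0, 1), so the posterior is
# N(sum(Y) / (n + 1), 1 / (n + 1)) and posterior ball masses are explicit.
rng = np.random.default_rng(0)
theta0, M = 0.7, 3.0

for n in [10, 100, 1000, 10000]:
    y = theta0 + rng.standard_normal(n)
    post_mean = y.sum() / (n + 1)
    post_sd = 1.0 / np.sqrt(n + 1)
    eps_n = M / np.sqrt(n)  # ball radius at the parametric rate
    mass = (norm.cdf(theta0 + eps_n, post_mean, post_sd)
            - norm.cdf(theta0 - eps_n, post_mean, post_sd))
    print(f"n={n:6d}   Pi(|theta - theta0| <= eps_n | Y) = {mass:.4f}")
```

The printed posterior masses approach 1 as $n$ grows, which is exactly the statement $E_{\theta_0}\big(\Pi[U_{\epsilon_n} \mid Y_1^n]\big) = 1 + o(1)$ in this toy case.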

  4. Examples of models and losses for which nice results exist
     - Density estimation: $Y_i \sim p_\theta$ i.i.d., with
       $$d(p_\theta, p_{\theta'})^2 = \int \big(\sqrt{p_\theta} - \sqrt{p_{\theta'}}\big)^2(x)\,dx \quad \text{or} \quad d(p_\theta, p_{\theta'}) = \int |p_\theta - p_{\theta'}|(x)\,dx.$$
     - Regression: $Y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, $\theta = (f, \sigma)$, with
       $$d(p_\theta, p_{\theta'}) = \|f - f'\|_2 \quad \text{or} \quad d(p_\theta, p_{\theta'}) = n^{-1} \sum_{i=1}^n H^2\big(p_\theta(y \mid X_i), p_{\theta'}(y \mid X_i)\big),$$
       where $H$ is the Hellinger distance.
     - White noise: $dY(t) = f(t)\,dt + n^{-1/2}\,dW(t)$, equivalently $Y_i = \theta_i + n^{-1/2}\epsilon_i$, $i \in \mathbb{N}$, with $d(p_\theta, p_{\theta'}) = \|f - f'\|_2$.
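
As a quick numerical illustration of the two density losses (our own toy check, not part of the talk):

```python
import numpy as np
from scipy.stats import norm

# Squared Hellinger and L1 distances between N(0,1) and N(1,1),
# computed by a Riemann sum on a fine grid.
x = np.linspace(-10.0, 11.0, 400001)
dx = x[1] - x[0]
p, q = norm.pdf(x, 0.0, 1.0), norm.pdf(x, 1.0, 1.0)

h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx  # d(p, q)^2 (Hellinger)
l1 = np.sum(np.abs(p - q)) * dx                   # d(p, q)   (L1)
print(f"squared Hellinger = {h2:.4f}, L1 = {l1:.4f}")
# closed-form check for equal variances: h2 = 2 * (1 - exp(-(mu1 - mu2)^2 / 8))
print(f"closed form       = {2 * (1 - np.exp(-1 / 8)):.4f}")
```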

  5. Examples: functional classes
     Take $\Theta_\alpha(L)$ = Hölder ball $H(\alpha, L)$; $\epsilon_n(\alpha) = n^{-\alpha/(2\alpha+1)}$ is the minimax rate over $H(\alpha, L)$.
     - Density example, Hellinger loss, prior = Dirichlet process mixture (DPM):
       $$f(x) = f_{P,\sigma}(x) = \int \phi_\sigma(x - \mu)\,dP(\mu), \qquad P \sim DP(A, G_0), \quad \sigma \sim I\Gamma(a, b).$$
       Then, with $U_\epsilon(f_0) = \{f : h(f_0, f) \le \epsilon\}$,
       $$\sup_{f_0 \in \Theta_\alpha(L)} E_{f_0}\Big[\Pi\big(U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}}(f_0) \mid Y_1^n\big)\Big] = o(1)$$
       [is the $\log n$ term necessary?], which implies, for $\hat f(x) = E^\pi[f(x) \mid Y_1^n]$,
       $$E_{f_0}\big[h(\hat f, f_0)^2\big] \lesssim (n/\log n)^{-2\alpha/(2\alpha+1)}.$$
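
For intuition about the DPM prior, a rough computational stand-in is scikit-learn's truncated variational Dirichlet-process Gaussian mixture; this is a variational approximation, not the exact DPM posterior of the slide, and all settings below are our own:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# toy sample from a smooth two-bump density
y = np.concatenate([rng.normal(-1.0, 0.5, 500), rng.normal(1.5, 0.7, 500)])

dpm = BayesianGaussianMixture(
    n_components=20,  # truncation level of the stick-breaking representation
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,  # plays the role of the DP mass A
    max_iter=500,
    random_state=0,
).fit(y.reshape(-1, 1))

grid = np.linspace(-4.0, 4.0, 9).reshape(-1, 1)
f_hat = np.exp(dpm.score_samples(grid))  # fitted density, a proxy for E[f | Y]
print(np.round(f_hat, 3))
```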


  6. Adaptation
     For such $d(\cdot,\cdot)$, adaptation is easy: with a prior that does not depend on $\alpha$,
     $$\sup_{\alpha_1 \le \alpha \le \alpha_2}\ \sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\Big[\Pi\big(U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}} \mid Y_1^n\big)\Big] = o(1).$$
     - Why?


  7. Idea of the proof
     Write $\bar\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$, $U_n = U_{M\bar\epsilon_n}$ and $\ell_n(\theta) = \log p_\theta^n(Y_1^n)$. Then
     $$\Pi[U_n^c \mid Y_1^n] = \frac{\int_{U_n^c} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta)}{\int_\Theta e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta)} := \frac{N_n}{D_n},$$
     and for tests $\phi_n = \phi_n(Y_1^n) \in [0, 1]$,
     $$P_{\theta_0}\Big(\Pi[U_n^c \mid Y_1^n] > e^{-\tau n \bar\epsilon_n^2}\Big) \le E_{\theta_0}[\phi_n] + P_{\theta_0}\Big(D_n < e^{-c n \bar\epsilon_n^2}\Big) + e^{(c+\tau) n \bar\epsilon_n^2} \int_{U_n^c} E_\theta[1 - \phi_n]\,d\Pi(\theta).$$
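
The three terms come from a standard splitting, filled in here for readability. First,
$$P_{\theta_0}\Big(\frac{N_n}{D_n} > e^{-\tau n \bar\epsilon_n^2}\Big) \le P_{\theta_0}\Big(D_n < e^{-c n \bar\epsilon_n^2}\Big) + P_{\theta_0}\Big(N_n > e^{-(c+\tau) n \bar\epsilon_n^2}\Big);$$
then, bounding $1\{N_n > t\} \le \phi_n + (1 - \phi_n) N_n / t$ and applying Markov's inequality,
$$P_{\theta_0}\Big(N_n > e^{-(c+\tau) n \bar\epsilon_n^2}\Big) \le E_{\theta_0}[\phi_n] + e^{(c+\tau) n \bar\epsilon_n^2}\, E_{\theta_0}\big[(1 - \phi_n) N_n\big];$$
finally, Fubini with the change of measure $E_{\theta_0}\big[(1 - \phi_n)\, e^{\ell_n(\theta) - \ell_n(\theta_0)}\big] = E_\theta[1 - \phi_n]$ gives $E_{\theta_0}[(1 - \phi_n) N_n] = \int_{U_n^c} E_\theta[1 - \phi_n]\,d\Pi(\theta)$.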

  8. Constraints
     We need:
     - $E_{\theta_0}[\phi_n] = o(1)$ and $\sup_{d(\theta, \theta_0) > M\bar\epsilon_n} E_\theta[1 - \phi_n] = o\big(e^{-c n \bar\epsilon_n^2}\big)$ → links with $d(\cdot,\cdot)$;
     - $P_{\theta_0}\big(D_n < e^{-c n \bar\epsilon_n^2}\big) = o(1)$, using
       $$D_n \ge \int_{S_n} e^{\ell_n(\theta) - \ell_n(\theta_0)}\,d\Pi(\theta) \ge e^{-2 n \bar\epsilon_n^2}\,\Pi\Big(S_n \cap \big\{\ell_n(\theta) - \ell_n(\theta_0) > -2 n \bar\epsilon_n^2\big\}\Big).$$
       OK if $S_n = \big\{\theta : KL(p^n_{\theta_0}, p^n_\theta) \le n \bar\epsilon_n^2;\ V(p^n_{\theta_0}, p^n_\theta) \le n \bar\epsilon_n^2\big\}$ and $\Pi(S_n) \ge e^{-c n \bar\epsilon_n^2}$ → links $d(\cdot,\cdot)$ with $KL(\cdot,\cdot)$.
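
Why this choice of $S_n$ works (a standard step, spelled out here): for fixed $\theta \in S_n$, $E_{\theta_0}[\ell_n(\theta_0) - \ell_n(\theta)] = KL(p^n_{\theta_0}, p^n_\theta) \le n \bar\epsilon_n^2$, so Chebyshev's inequality with the variance bound $V$ gives
$$P_{\theta_0}\big(\ell_n(\theta) - \ell_n(\theta_0) \le -2 n \bar\epsilon_n^2\big) \le \frac{V(p^n_{\theta_0}, p^n_\theta)}{(n \bar\epsilon_n^2)^2} \le \frac{1}{n \bar\epsilon_n^2} = o(1),$$
and Fubini plus Markov turn this pointwise bound into $\Pi\big(S_n \cap \{\ell_n(\theta) - \ell_n(\theta_0) > -2 n \bar\epsilon_n^2\}\big) \ge \Pi(S_n)/2$ with $P_{\theta_0}$-probability tending to one, so the prior-mass condition $\Pi(S_n) \ge e^{-c n \bar\epsilon_n^2}$ is all that is required.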

  9. Example: white noise model + $L_2$ loss
     $$Y_{ik} = \theta_{ik} + n^{-1/2} \epsilon_{ik}, \quad i \in \mathbb{N},\ k \le 2^{i-1}, \quad \epsilon_{ik} \sim N(0, 1) \qquad \big(dY(t) = f(t)\,dt + n^{-1/2}\,dW(t)\big)$$
     - Hölder class ($\alpha$): $\theta_0 \in \{\theta : |\theta_{ik}| \le L\,2^{-i(\alpha + 1/2)},\ \forall i, k\}$
     - Prior: spike and slab, $\theta_{ik} \sim (1 - p_n)\delta_0 + p_n g$, e.g. $g = N(0, v)$, $p_n = 1/n$
     - Concentration: $S_n \approx \{\|\theta - \theta_0\|^2 \le (n/\log n)^{-2\alpha/(2\alpha+1)}\}$ → $\theta_{jk} = 0$ for all $j \ge J_{n,\alpha}$, $k \le 2^j$, where
       $$2^{J_{n,\alpha}} = (n/\log n)^{1/(2\alpha+1)} := R_n, \qquad \Pi(S_n) \gtrsim e^{-C R_n \log n} := e^{-C n \epsilon_n^2}.$$
     - Tests: $E_{\theta_0}[\phi_n] = o(1)$ and $\sup_{\theta \in \Theta_n : \|\theta - \theta_0\| \gtrsim \epsilon_n} E_\theta[1 - \phi_n] \le e^{-c n \epsilon_n^2}$.
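
The spike-and-slab posterior is explicit coordinate by coordinate, which makes the $L_2$ behaviour easy to simulate. A minimal sketch, with all numerical choices (levels, $n$, $\alpha$, the particular Hölder-type truth) ours rather than the talk's:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, alpha, L, v = 10_000, 1.0, 1.0, 1.0
p_n, sigma = 1.0 / n, n ** -0.5
I = 14  # levels i = 0, ..., I-1

# A Hölder-type truth: |theta0_ik| = L * 2^{-i(alpha + 1/2)} at every (i, k).
theta0 = np.concatenate(
    [L * 2.0 ** (-i * (alpha + 0.5)) * np.ones(2 ** i) for i in range(I)]
)
Y = theta0 + sigma * rng.standard_normal(theta0.size)

# Exact coordinatewise posterior: a mixture of delta_0 (weight 1 - w)
# and N(m, s2) (weight w), by conjugacy of the slab g = N(0, v).
m = v / (v + sigma ** 2) * Y
s2 = v * sigma ** 2 / (v + sigma ** 2)
slab = p_n * norm.pdf(Y, 0.0, np.sqrt(v + sigma ** 2))
spike = (1 - p_n) * norm.pdf(Y, 0.0, sigma)
w = slab / (slab + spike)

# One posterior draw and its squared L2 loss.
keep = rng.random(Y.size) < w
draw = np.where(keep, m + np.sqrt(s2) * rng.standard_normal(Y.size), 0.0)
l2 = np.sum((draw - theta0) ** 2)
rate = (n / np.log(n)) ** (-2 * alpha / (2 * alpha + 1))
print(f"squared L2 loss of one draw: {l2:.5f}   target rate: {rate:.5f}")
```

The draw's squared $L_2$ loss should come out of the same order as $(n/\log n)^{-2\alpha/(2\alpha+1)}$: the posterior zeroes out essentially all coefficients below roughly $\sigma\sqrt{2\log n}$, i.e. all levels above $J_{n,\alpha}$.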

  10. What about $f(x_0)$? Or $\|f - f_0\|_\infty$?
     $$Y_{ik} = \theta_{ik} + n^{-1/2} \epsilon_{ik}, \quad \epsilon_{ik} \sim N(0, 1), \qquad \theta_0 \in \{\theta : |\theta_{ik}| \le L\,2^{-i(\alpha+1/2)},\ \forall i, k\}$$
     - Prior: spike and slab, $\theta_{ik} \sim (1 - p_n)\delta_0 + p_n g$, $p_n = 1/n$
     - Losses:
       $$\ell(\theta, \theta_0) = \Big(\sum_{ik} (\theta_{ik} - \theta^o_{ik})\,\psi_{ik}(x_0)\Big)^2 \quad \text{(local; } |\psi_{ik}(x_0)| \asymp 2^{i/2}\text{)},$$
       $$\ell(\theta, \theta_0) = \|\theta - \theta_0\|_\infty = \sum_i 2^{i/2} \max_k |\theta_{ik} - \theta^o_{ik}| \quad \text{(sup)}.$$
     - Bayesian concentration: for every $\alpha > 0$ there exists $\theta_0 \in \Theta_\alpha(L)$ such that
       $$E_{\theta_0}\Big[\Pi\big(\ell(\theta, \theta_0) \le n^{-(\alpha - 1/2)/(2\alpha+1)} (\log n)^q \mid Y_1^n\big)\Big] = o(1).$$
       This is sub-optimal: take $\theta^o_{ik} = 0$ for $i \le I_n$ and $\theta^o_{i k_0} = \rho_n 2^{-i/2}$ (one coefficient per level) for $i > I_n$; then for all $J > 0$,
       $$\sum_{i > J} \sum_k (\theta^o_{ik})^2 \le n^{-2\alpha/(2\alpha+1)}, \qquad \sum_{i > J} \max_k |\theta^o_{ik}| > n^{-(\alpha - 1/2)/(2\alpha+1)} (\log n)^q.$$
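
A gloss on the mechanism (our reading of the counterexample, not a statement from the slide): the spike-and-slab posterior essentially keeps $\theta_{ik}$ only when $|Y_{ik}|$ exceeds a threshold of order $\sigma\sqrt{2\log(1/p_n)} \asymp \sqrt{\log n / n}$. A truth with one coefficient of size $\rho_n 2^{-i/2}$ per level $i > I_n$ sits below this threshold at every such level and is therefore zeroed out; in squared $\ell_2$ the damage $\sum_i \rho_n^2 2^{-i}$ stays summable, but the local and sup-norm losses weight level $i$ by $2^{i/2}$, so each level contributes $\rho_n$ and the contributions accumulate across the $O(\log n)$ active levels.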

  11. Risk?
     $$Y_{ik} = \theta_{ik} + n^{-1/2} \epsilon_{ik}, \quad \epsilon_{ik} \sim N(0, 1), \qquad \theta_0 \in \{\theta : |\theta_{ik}| \le L\,2^{-i(\alpha+1/2)},\ \forall i, k\}$$
     - Prior: spike and slab, $\theta_{ik} \sim (1 - p_n)\delta_0 + p_n g$, $p_n = 1/n$
     - Sub-optimal concentration, BUT for the posterior mean $\hat\theta = E^\pi[\theta \mid Y^n]$ and $\ell$ as on the previous slide,
       $$\limsup_n\ \sup_{\alpha_1 \le \alpha \le \alpha_2}\ \sup_{\theta_0 \in \Theta_\alpha} (n/\log n)^{2\alpha/(2\alpha+1)}\, E^n_{\theta_0}\big[\ell(\hat\theta, \theta_0)\big] < +\infty.$$
     Questions:
     - Question 1: how general is this (negative) result?
     - Question 2: what does it tell us about posterior concentration?
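
The posterior mean is also explicit here ($\hat\theta_{ik} = w_{ik} m_{ik}$ coordinatewise), so the risk statement can be eyeballed numerically. A sketch under the same illustrative settings as before, using one wavelet per level at $x_0$ as a crude proxy for the local loss:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, alpha, L, v = 10_000, 1.0, 1.0, 1.0
p_n, sigma = 1.0 / n, n ** -0.5
I = 14

losses = []
for _ in range(50):  # Monte Carlo estimate of the local risk of the posterior mean
    loc = 0.0
    for i in range(I):
        t0 = L * 2.0 ** (-i * (alpha + 0.5)) * np.ones(2 ** i)
        Yi = t0 + sigma * rng.standard_normal(2 ** i)
        slab = p_n * norm.pdf(Yi, 0.0, np.sqrt(v + sigma ** 2))
        w = slab / (slab + (1 - p_n) * norm.pdf(Yi, 0.0, sigma))
        t_hat = w * (v / (v + sigma ** 2)) * Yi      # posterior mean, closed form
        loc += (t_hat[0] - t0[0]) * 2.0 ** (i / 2.0)  # one wavelet per level at x0
    losses.append(loc ** 2)

rate = (n / np.log(n)) ** (-2 * alpha / (2 * alpha + 1))
print(f"Monte Carlo local risk: {np.mean(losses):.5f}   rate: {rate:.5f}")
```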

  12. A first general result
     Let $H(\alpha_1, L) \cup H(\alpha_2, L) \subset \Theta$, $\alpha_1 < \alpha_2$.
     - Local loss $\ell(\theta, \theta_0) = (\theta(x) - \theta_0(x))^2$.
       Result: there exists no prior leading to adaptive minimax concentration over any collection of Hölder balls: for every prior $\pi$ on $\Theta$ and every $M > 0$,
       $$\max_j\ \sup_{\theta_0 \in H(\alpha_j, L)} E_{\theta_0}\Big[\Pi\big(\ell(\theta, \theta_0) > M n^{-2\alpha_j/(2\alpha_j+1)} \mid Y^n\big)\Big] = 1 + o(1).$$
     - What do we lose?
     - $L_\infty$ and local loss: if there exists $\theta_0 \in \Theta$ such that
       $$P_{\theta_0}\Big(\Pi\big[\ell(\theta, \theta_0) > M n^{-2\alpha_2/(2\alpha_2+1)} \mid Y^n\big] > e^{-n^\tau}\Big) = o(1), \qquad \tau > 0,$$
       then, worse,
       $$\max_j\ \sup_{\theta_0 \in H(\alpha_j, L)} E_{\theta_0}\Big[\Pi\big(\ell(\theta, \theta_0) > n^{-(2\alpha_j - \tau)/(2\alpha_j+1)} \mid Y^n\big)\Big] = 1 + o(1).$$

  13. Still not completely satisfying
     - For the local loss: if we could find a prior that loses only a $\log n$ factor, then who cares!
     - For the $L_\infty$ loss: something smaller than $e^{-n^\tau}$ is to be expected because of the tests. Can we be more precise? Slightly.

  14. Another negative result
     Let $H(\alpha_1, L) \cup H(\alpha_2, L) \subset \Theta$, $\alpha_1 < \alpha_2$, $\epsilon_n(\alpha) = (n/\log n)^{-\alpha/(2\alpha+1)}$ and $2^{J_{n,\alpha_2}} = (n/\log n)^{1/(2\alpha_2+1)}$.
     If there exists $\theta_0 \in H(\alpha_2, L)$ such that
     $$\pi\big(\|\theta - \theta_0\|_2 \le c\,\epsilon_n(\alpha_2)\big) \gtrsim e^{-n \epsilon_n^2(\alpha_2)} \qquad \text{and} \qquad \pi\Big(\sum_{j \ge J_{n,\alpha_2}} \sum_k \theta_{jk}^2 > A\,\epsilon_n(\alpha_2)^2\Big) \le e^{-B n \epsilon_n^2(\alpha_2)},$$
     then there exists $\theta_1 \in H(\alpha_1, L)$ with
     $$E_{\theta_1}\big(\Pi[\ell(\theta, \theta_1) \gg \epsilon_n(\alpha_1) \mid Y^n]\big) = 1 + o(1).$$
