

  1. Machine learning theory: Regression. Hamid Beigy, Sharif University of Technology, June 1, 2020

  2. Table of contents
  1. Introduction
  2. Generalization bounds
  3. Pseudo-dimension bounds
  4. Regression algorithms
  5. Summary

  3. Introduction

  4. The problem of regression
  ◮ Let $X$ denote the input space, $Y$ a measurable subset of $\mathbb{R}$, and $D$ a distribution over $X \times Y$.
  ◮ The learner receives a sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (X \times Y)^m$ drawn i.i.d. according to $D$.
  ◮ Let $L : Y \times Y \to \mathbb{R}_+$ be the loss function used to measure the magnitude of error.
  ◮ The most commonly used loss function is $L_2$, defined by $L(y, y') = |y' - y|^2$ for all $y, y' \in Y$, or more generally $L_p$, defined by $L(y, y') = |y' - y|^p$ for all $p \ge 1$ and $y, y' \in Y$.

  Definition (Regression problem)
  Given a hypothesis set $H$ of functions mapping $X$ to $Y$, the regression problem consists of using the labeled sample $S$ to find a hypothesis $h \in H$ with small generalization error $R(h)$ with respect to the target $f$:
  $$R(h) = \mathbb{E}_{(x,y) \sim D}\left[L(h(x), y)\right].$$
  The empirical loss or error of $h \in H$ is denoted by
  $$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$

  ◮ If $L(y, y') \le M$ for all $y, y' \in Y$, the problem is called a bounded regression problem.
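  As a concrete illustration (not part of the slides), here is a minimal Python sketch of the empirical $L_p$ loss $\hat{R}(h)$ defined above; the hypothesis and the toy sample are arbitrary placeholders.

```python
import numpy as np

def empirical_loss(h, xs, ys, p=2):
    """Empirical L_p loss: (1/m) * sum_i |h(x_i) - y_i|^p over the sample."""
    preds = np.array([h(x) for x in xs])
    return float(np.mean(np.abs(preds - np.asarray(ys)) ** p))

# Toy usage: a linear hypothesis on a three-point sample.
h = lambda x: 2.0 * x + 1.0
print(empirical_loss(h, xs=[0.0, 0.5, 1.0], ys=[1.1, 1.9, 3.2], p=2))
```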

  5. Generalization bounds

  6. Finite hypothesis sets

  Theorem (Generalization bounds for finite hypothesis sets)
  Let $L \le M$ be a bounded loss function and let the hypothesis set $H$ be finite. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds for all $h \in H$:
  $$R(h) \le \hat{R}(h) + M \sqrt{\frac{\log|H| + \log\frac{1}{\delta}}{2m}}.$$

  Proof (Generalization bounds for finite hypothesis sets).
  By Hoeffding's inequality, since $L \in [0, M]$, for any $h \in H$ the following holds:
  $$\mathbb{P}\left[R(h) - \hat{R}(h) > \epsilon\right] \le \exp\left(\frac{-2m\epsilon^2}{M^2}\right).$$
  Thus, by the union bound, we can write
  $$\mathbb{P}\left[\exists h \in H : R(h) - \hat{R}(h) > \epsilon\right] \le \sum_{h \in H} \mathbb{P}\left[R(h) - \hat{R}(h) > \epsilon\right] \le |H| \exp\left(\frac{-2m\epsilon^2}{M^2}\right).$$
  Setting the right-hand side equal to $\delta$ and solving for $\epsilon$ proves the theorem.
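  A small helper (illustrative, not from the slides) that evaluates this bound; the numbers in the usage line are made up.

```python
import math

def finite_class_bound(emp_risk, M, H_size, m, delta):
    """R(h) <= R_hat(h) + M * sqrt((log|H| + log(1/delta)) / (2m))."""
    return emp_risk + M * math.sqrt(
        (math.log(H_size) + math.log(1.0 / delta)) / (2.0 * m))

# e.g. |H| = 1000 hypotheses, loss bounded by M = 1, m = 5000 samples:
print(finite_class_bound(emp_risk=0.10, M=1.0, H_size=1000, m=5000, delta=0.05))
```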

  7. Rademacher complexity bounds

  Theorem (Rademacher complexity of µ-Lipschitz loss functions)
  Let $L \le M$ be a bounded loss function such that, for any fixed $y' \in Y$, $L(\cdot, y')$ is $\mu$-Lipschitz for some $\mu > 0$. Then, for any sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the empirical Rademacher complexity of the family $G = \{(x, y) \mapsto L(h(x), y) \mid h \in H\}$ is bounded as
  $$\hat{\mathfrak{R}}(G) \le \mu \hat{\mathfrak{R}}(H).$$

  Proof (Rademacher complexity of µ-Lipschitz loss functions).
  Since for any fixed $y_i$, $L(\cdot, y_i)$ is $\mu$-Lipschitz for some $\mu > 0$, by Talagrand's lemma we can write
  $$\hat{\mathfrak{R}}(G) = \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i L(h(x_i), y_i)\right] \le \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \mu\, h(x_i)\right] = \mu \hat{\mathfrak{R}}(H).$$
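  For a finite class, the empirical Rademacher complexity can be estimated by sampling $\sigma$. The Monte Carlo sketch below (synthetic data; the absolute loss $L(y, y') = |y - y'|$, which is $1$-Lipschitz in its first argument, so $\mu = 1$) numerically illustrates $\hat{\mathfrak{R}}(G) \le \mu \hat{\mathfrak{R}}(H)$; the estimates are noisy, so the comparison is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_estimate(values, n_trials=2000):
    """Estimate (1/m) E_sigma[ sup_f sum_i sigma_i f(x_i) ] for a finite
    class whose values on the sample are the rows of `values`."""
    n_funcs, m = values.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_trials, m))
    sups = (values @ sigmas.T).max(axis=0)  # sup over the finite class
    return float(sups.mean() / m)

H_vals = rng.uniform(-1.0, 1.0, size=(8, 20))   # h(x_i) for 8 hypotheses
y = rng.uniform(-1.0, 1.0, size=20)             # labels y_i
G_vals = np.abs(H_vals - y)                     # L(h(x_i), y_i)
print(rademacher_estimate(G_vals), "<~", rademacher_estimate(H_vals))
```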

  8. Rademacher complexity bounds

  Theorem (Rademacher complexity of L_p loss functions)
  Let $p \ge 1$, let $G = \{x \mapsto |h(x) - f(x)|^p \mid h \in H\}$, and assume $|h(x) - f(x)| \le M$ for all $x \in X$ and $h \in H$. Then, for any sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the following inequality holds:
  $$\hat{\mathfrak{R}}(G) \le p M^{p-1} \hat{\mathfrak{R}}(H).$$

  Proof (Rademacher complexity of L_p loss functions).
  Let $\phi_p : x \mapsto |x|^p$; then $G = \{\phi_p \circ h \mid h \in H'\}$, where $H' = \{x \mapsto h(x) - f(x) \mid h \in H\}$. Since $\phi_p$ is $p M^{p-1}$-Lipschitz over $[-M, M]$, we can apply Talagrand's lemma:
  $$\hat{\mathfrak{R}}(G) \le p M^{p-1} \hat{\mathfrak{R}}(H').$$
  Now, $\hat{\mathfrak{R}}(H')$ can be expressed as
  $$\hat{\mathfrak{R}}(H') = \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{h \in H} \sum_{i=1}^{m} \left(\sigma_i h(x_i) - \sigma_i f(x_i)\right)\right] = \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\right] - \frac{1}{m} \mathbb{E}_\sigma\left[\sum_{i=1}^{m} \sigma_i f(x_i)\right] = \hat{\mathfrak{R}}(H),$$
  since $\mathbb{E}_\sigma\left[\sum_{i=1}^{m} \sigma_i f(x_i)\right] = \sum_{i=1}^{m} \mathbb{E}_\sigma[\sigma_i]\, f(x_i) = 0$.
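  A quick numeric sanity check, not a proof, that $\phi_p(x) = |x|^p$ has Lipschitz constant $p M^{p-1}$ on $[-M, M]$: no secant slope on a fine grid should exceed that constant. The choices $p = 3$ and $M = 2$ are arbitrary.

```python
import numpy as np

p, M = 3, 2.0
xs = np.linspace(-M, M, 2001)
phi = np.abs(xs) ** p
slopes = np.abs(np.diff(phi) / np.diff(xs))   # empirical secant slopes
print(slopes.max(), "<=", p * M ** (p - 1))   # ~12.0 <= 12.0
```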

  9. Rademacher complexity regression bounds

  Theorem (Rademacher complexity regression bounds)
  Let $0 \le L \le M$ be a bounded loss function such that, for any fixed $y' \in Y$, $L(\cdot, y')$ is $\mu$-Lipschitz for some $\mu > 0$. Then, for any $\delta > 0$, each of the following holds with probability at least $1 - \delta$ for all $h \in H$:
  $$\mathbb{E}_{(x,y) \sim D}\left[L(h(x), y)\right] \le \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i) + 2\mu\, \mathfrak{R}_m(H) + M \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
  $$\mathbb{E}_{(x,y) \sim D}\left[L(h(x), y)\right] \le \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i) + 2\mu\, \hat{\mathfrak{R}}(H) + 3M \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

  Proof (Rademacher complexity regression bounds).
  Since for any fixed $y_i$, $L(\cdot, y_i)$ is $\mu$-Lipschitz for some $\mu > 0$, Talagrand's lemma gives $\hat{\mathfrak{R}}(G) \le \mu \hat{\mathfrak{R}}(H)$, exactly as in the previous theorem. Combining this inequality with the general Rademacher complexity learning bound completes the proof.
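  A sketch (illustrative, not from the slides) that evaluates the second bound from user-supplied quantities, such as a Monte Carlo estimate of $\hat{\mathfrak{R}}(H)$; all numbers in the usage line are placeholders.

```python
import math

def regression_bound(emp_loss, mu, rad_emp, M, m, delta):
    """emp_loss + 2*mu*R_hat(H) + 3*M*sqrt(log(2/delta)/(2m))."""
    return emp_loss + 2.0 * mu * rad_emp + 3.0 * M * math.sqrt(
        math.log(2.0 / delta) / (2.0 * m))

# e.g. empirical loss 0.12, mu = 1, estimated R_hat(H) = 0.05:
print(regression_bound(0.12, mu=1.0, rad_emp=0.05, M=1.0, m=2000, delta=0.05))
```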

  10. Pseudo-dimension bounds

  11. Shattering
  ◮ The VC dimension is a measure of the complexity of a hypothesis set.
  ◮ We now define shattering for families of real-valued functions.
  ◮ Let $G$ be a family of loss functions associated to some hypothesis set $H$:
  $$G = \{z = (x, y) \mapsto L(h(x), y) \mid h \in H\}.$$

  Definition (Shattering)
  Let $G$ be a family of functions from a set $Z$ to $\mathbb{R}$. A set $\{z_1, \ldots, z_m\} \subseteq Z$ is said to be shattered by $G$ if there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that
  $$\left|\left\{ \big(\mathrm{sgn}(g(z_1) - t_1), \ldots, \mathrm{sgn}(g(z_m) - t_m)\big) \mid g \in G \right\}\right| = 2^m.$$
  When they exist, the threshold values $t_1, \ldots, t_m$ are said to witness the shattering.

  ◮ In other words, $S$ is shattered by $G$ if there are real numbers $t_1, \ldots, t_m$ such that for every $b \in \{0, 1\}^m$ there is a function $g_b \in G$ with $\mathrm{sgn}(g_b(z_i) - t_i) = b_i$ for all $1 \le i \le m$.
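  For a finite class given by its value matrix on the points, the shattering condition for fixed witnesses can be checked directly. The sketch below (a toy example, not from the slides) uses the convention $\mathrm{sgn}(u) = 1$ if $u \ge 0$ and $0$ otherwise.

```python
import numpy as np

def is_shattered(values, t):
    """values: shape (|G|, m) with g(z_i) in each row; t: candidate
    witnesses. Collect the realized sign patterns and test for all 2^m."""
    m = values.shape[1]
    patterns = {tuple((row >= t).astype(int)) for row in values}
    return len(patterns) == 2 ** m

# Two points, witnesses (0, 0), four functions realizing all four patterns:
G_vals = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(is_shattered(G_vals, t=np.array([0.0, 0.0])))  # True
```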

  12. Shattering
  ◮ Thus, $\{z_1, \ldots, z_m\}$ is shattered if, for some witnesses $t_1, \ldots, t_m$, the family of functions $G$ is rich enough to contain a function going
  1. above a subset $A$ of the set of points $J = \{(z_i, t_i) \mid 1 \le i \le m\}$ and
  2. below the others, $J \setminus A$,
  for any choice of the subset $A$.

  [Figure: two points $z_1, z_2$ with witness thresholds $t_1, t_2$.]

  ◮ For any $g \in G$, let $B_g$ be the indicator function of the region below or on the graph of $g$, that is, $B_g(x, y) = \mathrm{sgn}(g(x) - y)$.
  ◮ Let $B_G = \{B_g \mid g \in G\}$.

  13. Pseudo-dimension
  ◮ The notion of shattering naturally leads to the definition of the pseudo-dimension.

  Definition (Pseudo-dimension)
  Let $G$ be a family of functions from $Z$ to $\mathbb{R}$. The pseudo-dimension of $G$, denoted $\mathrm{Pdim}(G)$, is the size of the largest set shattered by $G$. If no such maximum exists, then $\mathrm{Pdim}(G) = \infty$.

  ◮ $\mathrm{Pdim}(G)$ coincides with the VC dimension of the corresponding thresholded functions, which map $Z \times \mathbb{R}$ to $\{0, 1\}$:
  $$\mathrm{Pdim}(G) = \mathrm{VCdim}\left(\left\{(x, t) \mapsto \mathbb{1}\left[(g(x) - t) > 0\right] \mid g \in G\right\}\right).$$

  [Figure: a loss curve $L(h(x), y)$ thresholded at level $t$; the marked region is where $L(h(x), y) > t$.]

  ◮ Thus $\mathrm{Pdim}(G) = d$ if there are real numbers $t_1, \ldots, t_d$ and $2^d$ functions $g_b$ that achieve all possible below/above combinations with respect to the $t_i$.
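  For finite classes on finitely many points, the pseudo-dimension can be computed by brute force: the sign pattern of $(g(z_i) \ge t)$ only changes as $t$ crosses an achieved value, so it suffices to search witnesses among the values the functions actually take at each point. The example class below (a small grid of affine functions) is an arbitrary choice, not from the slides.

```python
import numpy as np
from itertools import combinations, product

def shatters(sub, t):
    m = sub.shape[1]
    return len({tuple((row >= t).astype(int)) for row in sub}) == 2 ** m

def pdim_finite(values):
    """Brute-force pseudo-dimension of a finite class on a finite set of
    points; values[j, i] = g_j(z_i)."""
    best, n = 0, values.shape[1]
    for d in range(1, n + 1):
        found = False
        for idx in combinations(range(n), d):
            sub = values[:, idx]
            cands = [np.unique(sub[:, i]) for i in range(d)]
            if any(shatters(sub, np.array(t)) for t in product(*cands)):
                best, found = d, True
                break
        if not found:
            break  # no shattered set of size d, hence none larger either
    return best

# Affine functions z -> a*z + b with a, b in {-1, 0, 1}, on points {0, 1}:
zs = [0.0, 1.0]
vals = np.array([[a * z + b for z in zs]
                 for a in (-1, 0, 1) for b in (-1, 0, 1)])
print(pdim_finite(vals))  # 2: both points are shattered
```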

  14. Properties of the pseudo-dimension

  Theorem (Composition with a non-decreasing function)
  Suppose $G$ is a class of real-valued functions and $\sigma : \mathbb{R} \to \mathbb{R}$ is a non-decreasing function. Let $\sigma(G)$ denote the class $\{\sigma \circ g \mid g \in G\}$. Then $\mathrm{Pdim}(\sigma(G)) \le \mathrm{Pdim}(G)$.

  Proof (Composition with a non-decreasing function).
  1. Let $d = \mathrm{Pdim}(\sigma(G))$ and suppose $\{\sigma \circ g_b \mid b \in \{0, 1\}^d\} \subseteq \sigma(G)$ shatters a set $\{x_1, \ldots, x_d\} \subseteq X$, witnessed by $(t_1, \ldots, t_d)$.
  2. By suitably relabeling the $g_b$, for all $b \in \{0, 1\}^d$ and $1 \le i \le d$ we have $\mathrm{sgn}(\sigma(g_b(x_i)) - t_i) = b_i$.
  3. For all $1 \le i \le d$, take
  $$y_i = \min\left\{ g_b(x_i) \mid \sigma(g_b(x_i)) \ge t_i,\ b \in \{0, 1\}^d \right\}.$$
  4. Since $\sigma$ is non-decreasing, it is straightforward to verify that $\mathrm{sgn}(g_b(x_i) - y_i) = b_i$ for all $b \in \{0, 1\}^d$ and $1 \le i \le d$, so $G$ shatters $\{x_1, \ldots, x_d\}$ with witnesses $(y_1, \ldots, y_d)$ and $\mathrm{Pdim}(G) \ge d$.
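  A standalone numeric check of the witness construction in step 3 (synthetic values; $\sigma = \tanh$ is an arbitrary non-decreasing choice): the new witnesses $y_i$ make the original functions reproduce exactly the sign patterns that $\sigma \circ g_b$ realizes with respect to the $t_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
g_vals = rng.normal(size=(2 ** d, d))      # g_b(x_i), one row per index b
t = np.tanh(np.median(g_vals, axis=0))     # witnesses for the class tanh(G)

sig_vals = np.tanh(g_vals)                 # sigma = tanh, non-decreasing
patterns = (sig_vals >= t).astype(int)     # patterns realized w.r.t. t
# Step 3's witnesses: per point, the smallest g value whose image clears t_i.
y = np.array([g_vals[sig_vals[:, i] >= t[i], i].min() for i in range(d)])
print(np.array_equal((g_vals >= y).astype(int), patterns))  # True
```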

  15. Pseudo-dimension of vector spaces
  ◮ A class $G$ of real-valued functions is a vector space if for all $g_1, g_2 \in G$ and any numbers $\lambda, \mu \in \mathbb{R}$, we have $\lambda g_1 + \mu g_2 \in G$.

  Theorem (Pseudo-dimension of vector spaces)
  If $G$ is a vector space of real-valued functions, then $\mathrm{Pdim}(G) = \dim(G)$.

  Proof (Pseudo-dimension of vector spaces).
  1. Let $B_G$ be the class of below-the-graph indicator functions; we have $\mathrm{Pdim}(G) = \mathrm{VCdim}(B_G)$.
  2. But $B_G = \{(x, y) \mapsto \mathrm{sgn}(g(x) - y) \mid g \in G\}$.
  3. Hence, the functions in $B_G$ are of the form $\mathrm{sgn}(g_1 + g_2)$, where $g_1 = g$ is a function from the vector space and $g_2$ is the fixed function $g_2(x, y) = -y$.
  4. The standard result on the VC dimension of sign classes of vector spaces (Dudley's theorem, applied to $\{\mathrm{sgn}(g_1 + g_2) \mid g_1 \in G\}$ with $g_2$ fixed) then shows that $\mathrm{Pdim}(G) = \dim(G)$.

  ◮ Functions that map into a bounded range do not form a vector space.

  Corollary
  If $G$ is a subset of a vector space $G'$ of real-valued functions, then $\mathrm{Pdim}(G) \le \dim(G')$.
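  As a quick worked example (not on the slide): the class of affine functions $G = \{x \mapsto \langle w, x \rangle + b \mid w \in \mathbb{R}^d,\ b \in \mathbb{R}\}$ on $X = \mathbb{R}^d$ is a vector space spanned by the $d$ coordinate functions together with the constant function $1$, so the theorem gives $\mathrm{Pdim}(G) = d + 1$; similarly, polynomials of degree at most $k$ on $\mathbb{R}$ have pseudo-dimension $k + 1$.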
