
COMPLETE STATISTICAL THEORY OF LEARNING: LEARNING USING STATISTICAL INVARIANTS (PowerPoint presentation transcript)



  1. COMPLETE STATISTICAL THEORY OF LEARNING: LEARNING USING STATISTICAL INVARIANTS. Vladimir Vapnik.

  2. PART I: VC THEORY OF GENERALIZATION

  3. THE MAIN QUESTION OF LEARNING THEORY
QUESTION: When, in a set of functions {f(x)}, can we minimize the functional
$$ R(f) = \int L(y, f(x))\, dP(x, y), \qquad f(x) \in \{f(x)\}, $$
if the measure P(x, y) is unknown but we are given ℓ i.i.d. pairs (x_1, y_1), ..., (x_ℓ, y_ℓ)?
ANSWER: We can minimize the functional R(f) using the data if and only if the VC dimension h of the set {f(x)} is finite.

  4. DEFINITION OF VC DIMENSION
Let {θ(f(x))} be a set of indicator functions (here θ(u) = 1 if u ≥ 0 and θ(u) = 0 if u < 0).
• The VC dimension of a set of indicator functions {θ(f(x))} is equal to h if h is the maximal number of vectors x_1, ..., x_h that can be shattered (separated into all 2^h possible subsets) using indicator functions from {θ(f(x))}. If such vectors exist for any number h, the VC dimension of the set is infinite.
• The VC dimension of a set of real-valued functions {f(x)} is the VC dimension of the set of indicator functions {θ(f(x) + b)}.
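A minimal illustration of the shattering notion just defined (not from the slides): for a small point set in R^2 it enumerates all 2^h labelings and checks, by random search over (w, b), whether each labeling is realized by a linear indicator function θ(w^T x + b). The random search and the sample point sets are assumptions made for the demo; failure of the search is only evidence, not a proof, that a labeling is unrealizable.

```python
import itertools
import numpy as np

def can_shatter(points, n_trials=20000, seed=0):
    """Check (by random search over w, b) whether linear indicator
    functions theta(w^T x + b) realize every labeling of the points."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)          # shape (h, n)
    h, n = X.shape
    for labels in itertools.product([0, 1], repeat=h):
        target = np.array(labels)
        realized = False
        for _ in range(n_trials):
            w = rng.normal(size=n)
            b = rng.normal()
            pred = (X @ w + b >= 0).astype(int)  # theta(u) = 1 if u >= 0
            if np.array_equal(pred, target):
                realized = True
                break
        if not realized:
            return False                          # this labeling was not separated
    return True

# Three points in general position in R^2 can be shattered by lines,
# four points in an XOR-like configuration cannot.
print(can_shatter([[0, 0], [1, 0], [0, 1]]))          # True
print(can_shatter([[0, 0], [1, 1], [1, 0], [0, 1]]))  # False
```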

  5. TWO THEOREMS OF VC THEORY
Theorem 1. If a set {f(x)} has VC dimension h, then with probability 1 − η, for all functions f(x) the bound
$$ R(f) \le R_{emp}^{\ell}(f) + \sqrt{e^2 + 4\, e\, R_{emp}^{\ell}(f)} $$
holds true, where
$$ R_{emp}^{\ell}(f) = \frac{1}{\ell}\sum_{i=1}^{\ell} L(y_i, f(x_i)), \qquad e = O\!\left(\frac{h - \ln\eta}{\ell}\right). $$
Theorem 2. Let x, w ∈ R^n. The VC dimension h of the set of linear indicator functions {θ(x^T w): ||x||² ≤ 1, ||w||² ≤ C} satisfies h ≤ min(C, n) + 1.
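A small sketch of how the Theorem 1 bound behaves, under the assumption that the O(·) factor in e is simply an adjustable constant c (the theorem only fixes the order, not the constant):

```python
import numpy as np

def vc_bound(r_emp, h, ell, eta=0.05, c=1.0):
    """Evaluate the Theorem 1 bound R(f) <= R_emp(f) + sqrt(e^2 + 4 e R_emp(f)),
    with e = c * (h - ln(eta)) / ell; the constant c stands in for the O(.) factor."""
    e = c * (h - np.log(eta)) / ell
    return r_emp + np.sqrt(e**2 + 4 * e * r_emp)

# The guaranteed risk tightens as the sample size grows relative to the VC dimension.
for ell in (100, 1_000, 10_000):
    print(ell, round(vc_bound(r_emp=0.10, h=50, ell=ell), 3))
```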

  6. STRUCTURAL RISK MINIMIZATION PRINCIPLE
To find the desired approximation f_ℓ(x) in a set {f(x)}:
FIRST, introduce a structure on the set of functions {f(x)}:
$$ \{f(x)\}_1 \subset \{f(x)\}_2 \subset \cdots \subset \{f(x)\}_m \subset \{f(x)\} $$
with corresponding VC dimensions h_k:
$$ h_1 \le h_2 \le \cdots \le h_m \le \infty. $$
SECOND, choose the function f_ℓ(x) that minimizes the bound
$$ R(f) \le R_{emp}^{\ell}(f) + \sqrt{e^2 + 4\, e\, R_{emp}^{\ell}(f)}, \qquad e = O\!\left(\frac{h_k - \ln\eta}{\ell}\right), $$
1. over the elements {f(x)}_k (with VC dimension h_k), and
2. over the functions in {f(x)}_k, taking the function f_ℓ(x) with the smallest empirical loss R_emp^ℓ(f).
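A schematic sketch of the SRM selection rule: each element of a hypothetical structure reports its VC dimension h_k and the smallest empirical loss found inside it, and the element with the smallest guaranteed bound is chosen. The numbers and the constant c are invented for illustration only.

```python
import numpy as np

def srm_select(candidates, ell, eta=0.05, c=1.0):
    """Structural risk minimization over a nested structure.

    `candidates` is a list of (h_k, r_emp_k) pairs: the VC dimension of element
    {f(x)}_k and the smallest empirical loss found inside it.  Returns the index
    of the element whose guaranteed risk bound is smallest, plus all bounds."""
    bounds = []
    for h_k, r_emp in candidates:
        e = c * (h_k - np.log(eta)) / ell          # e = O((h_k - ln eta) / ell)
        bounds.append(r_emp + np.sqrt(e**2 + 4 * e * r_emp))
    return int(np.argmin(bounds)), bounds

# A richer element fits the data better (smaller empirical loss) but pays a
# larger capacity term; SRM trades the two off.
structure = [(5, 0.20), (20, 0.12), (100, 0.05), (500, 0.04)]
best, bounds = srm_select(structure, ell=2_000)
print(best, [round(b, 3) for b in bounds])
```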

  7. FOUR QUESTIONS TO COMPLETE LEARNING THEORY
1. How to choose the loss function L(y, f) in the functional R(f)?
2. How to select an admissible set of functions {f(x)}?
3. How to construct a structure on the admissible set?
4. How to minimize the functional on the constructed structure?
The talk answers these questions for the pattern recognition problem.

  8. PART II: TARGET FUNCTIONAL FOR MINIMIZATION

  9. SETTING OF THE PROBLEM: GOD PLAYS DICE
[Diagram: Nature draws x_i according to P(x), the Object returns y_i according to P(y|x), and the Learning Machine, given the pairs (x_1, y_1), ..., (x_ℓ, y_ℓ), selects a function f(x, α), α ∈ Λ.]
Given ℓ i.i.d. observations (x_1, y_1), ..., (x_ℓ, y_ℓ), x ∈ X, y ∈ {0, 1}, generated by an unknown P(x, y) = P(y|x) P(x), find the rule r(x) = θ(f_0(x)) which minimizes, in a set {f(x)}, the probability of misclassification
$$ R_\theta(f) = \int |y - \theta(f(x))|\, dP(x, y). $$

  10. STANDARD REPLACEMENT OF THE BASIC SETTING
Using the data (x_1, y_1), ..., (x_ℓ, y_ℓ), x ∈ X, y ∈ {0, 1}, minimize in the set of functions {f(x)} the functional
$$ R(f) = \int (y - f(x))^2\, dP(x, y) $$
(instead of the functional $R_\theta(f) = \int |y - \theta(f(x))|\, dP(x, y)$).
The minimizer f_0(x) of R(f) estimates the conditional probability function f_0(x) = P(y = 1|x). Use the classification rule
$$ r(x) = \theta(f_0(x) - 0.5) = \theta(P(y = 1|x) - 0.5). $$
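A minimal sketch of this standard setting on synthetic 1-D data (the data-generating model and the cubic polynomial hypothesis class are assumptions for the demo): fit f by least squares and classify with r(x) = θ(f(x) − 0.5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: the true conditional probability P(y=1|x) rises with x.
ell = 500
x = rng.uniform(-1, 1, size=ell)
p = 1 / (1 + np.exp(-4 * x))                  # true conditional probability
y = (rng.uniform(size=ell) < p).astype(float)

# Standard setting: minimize the empirical squared loss over cubic polynomials
# f(x) = w0 + w1 x + w2 x^2 + w3 x^3 (ordinary least squares).
Phi = np.vander(x, N=4, increasing=True)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
f = Phi @ w                                    # estimate of P(y=1|x) at the sample points

# Classification rule r(x) = theta(f(x) - 0.5).
r = (f >= 0.5).astype(float)
print("training error:", np.mean(r != y))
```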

  11. PROBLEM WITH THE STANDARD REPLACEMENT
Minimization of the functional R(f) in the set {f(x)} is equivalent to minimization of the expression
$$ R(f) = \int (y - f(x))^2\, dP(x, y) = \int \big[(y - f_0(x)) + (f_0(x) - f(x))\big]^2\, dP(x, y), $$
where f_0(x) minimizes R(f). This is equivalent to minimization of
$$ R(f) = \int (y - f_0(x))^2\, dP(x, y) + \int (f_0(x) - f(x))^2\, dP(x) + 2\int (y - f_0(x))(f_0(x) - f(x))\, dP(x, y). $$
THE ACTUAL GOAL IS: USING ℓ OBSERVATIONS, TO MINIMIZE THE SECOND INTEGRAL, NOT THE SUM OF THE LAST TWO INTEGRALS. (The cross term vanishes in expectation, since f_0(x) = E[y|x], but its empirical counterpart on a finite sample does not.)

  12. DIRECT ESTIMATION OF THE CONDITIONAL PROBABILITY
1. When y ∈ {0, 1}, the conditional probability P(y = 1|x) is defined by some real-valued function 0 ≤ f(x) ≤ 1.
2. From the Bayes formula P(y = 1|x) p(x) = p(y = 1, x) it follows that any function G(x − x′) ∈ L_2 defines the equation
$$ \int G(x - x')\, f(x')\, dP(x') = \int G(x - x')\, dP(y = 1, x') \qquad (*) $$
whose solution is the conditional probability f(x) = P(y = 1|x).
3. To estimate the conditional probability means to solve equation (*) when P(x) and P(y = 1, x) are unknown but data (x_1, y_1), ..., (x_ℓ, y_ℓ), generated according to P(x, y), are given.
4. Solving equation (*) is an ill-posed problem.

  13. MAIN INDUCTIVE STEP IN STATISTICS
Replace the unknown Cumulative Distribution Function (CDF) P(x), x = (x^1, ..., x^n)^T ∈ R^n, with its estimate P_ℓ(x): the Empirical Cumulative Distribution Function (ECDF)
$$ P_\ell(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\{x - x_i\}, \qquad \theta\{x - x_i\} = \prod_{k=1}^{n} \theta\{x^k - x_i^k\}, $$
obtained from the data x_1, ..., x_ℓ, where x_i = (x_i^1, ..., x_i^n)^T.
The main theorem of statistics claims that the ECDF converges to the actual CDF uniformly, with a fast rate of convergence. The following inequality holds true:
$$ P\Big\{\sup_x |P(x) - P_\ell(x)| > \varepsilon\Big\} < 2\exp\{-2\varepsilon^2 \ell\}, \qquad \forall \varepsilon. $$
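A short sketch of the ECDF built from the product-of-indicators definition above, with a 1-D numerical check of the uniform-deviation inequality against a known CDF (the uniform distribution on [0, 1] is an assumption made for the demo).

```python
import numpy as np

def ecdf(data, x):
    """Empirical CDF P_ell(x) = (1/ell) * sum_i prod_k theta(x^k - x_i^k)."""
    data = np.atleast_2d(data)                    # shape (ell, n)
    x = np.atleast_1d(x)
    return np.mean(np.all(data <= x, axis=1))

# 1-D check against a known CDF; the inequality on the slide bounds
# P{ sup_x |P(x) - P_ell(x)| > eps } by 2 exp(-2 eps^2 ell).
rng = np.random.default_rng(0)
ell, eps = 1_000, 0.05
sample = rng.uniform(size=(ell, 1))               # true CDF is P(x) = x on [0, 1]
grid = np.linspace(0, 1, 201)
sup_dev = max(abs(ecdf(sample, np.array([t])) - t) for t in grid)
print("sup deviation:", round(sup_dev, 4),
      " bound on tail probability:", round(2 * np.exp(-2 * eps**2 * ell), 4))
```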

  14. TWO CONSTRUCTIVE SETTINGS OF THE CLASSIFICATION PROBLEM
1. Standard constructive setting: minimization of the functional
$$ R_{emp}(f) = \int (y - f(x))^2\, dP_\ell(x, y) $$
in a set {f(x)} using the data (x_1, y_1), ..., (x_ℓ, y_ℓ) leads to
$$ R_{emp}(f) = \frac{1}{\ell}\sum_{i=1}^{\ell} (y_i - f(x_i))^2, \qquad f(x) \in \{f(x)\}. $$
2. New constructive setting: solution of the equation
$$ \int G(x - x')\, f(x')\, dP_\ell(x') = \int G(x - x')\, dP_\ell(y = 1, x') $$
using the data leads to solving in {f(x)} the equation
$$ \frac{1}{\ell}\sum_{i=1}^{\ell} G(x - x_i)\, f(x_i) = \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\, G(x - x_j), \qquad f(x) \in \{f(x)\}. $$
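A sketch contrasting the two constructive settings on the same sample: the empirical squared loss of the standard setting, and the residual between the two sides of the empirical equation of the new setting. The Gaussian kernel G, its width delta, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(x, xp, delta=2.0):
    """G(x - x') = exp(-0.5 * delta^2 * (x - x')^2); any L2 kernel would do."""
    return np.exp(-0.5 * delta**2 * (x - xp) ** 2)

def setting_residual(f_vals, x_train, y_train, x_eval, delta=2.0):
    """Residual of the new constructive setting at the points x_eval:
    (1/ell) sum_i G(x - x_i) f(x_i)  -  (1/ell) sum_j y_j G(x - x_j)."""
    G = gaussian_kernel(x_eval[:, None], x_train[None, :], delta)   # shape (m, ell)
    lhs = G @ f_vals / len(x_train)
    rhs = G @ y_train / len(x_train)
    return lhs - rhs

def empirical_squared_loss(f_vals, y_train):
    """Standard constructive setting: (1/ell) sum_i (y_i - f(x_i))^2."""
    return np.mean((y_train - f_vals) ** 2)

rng = np.random.default_rng(0)
x_tr = rng.uniform(-1, 1, 200)
p_tr = 1 / (1 + np.exp(-4 * x_tr))                 # candidate f evaluated at x_i
y_tr = (rng.uniform(size=200) < p_tr).astype(float)
res = setting_residual(p_tr, x_tr, y_tr, x_eval=np.linspace(-1, 1, 50))
print("squared loss:", round(empirical_squared_loss(p_tr, y_tr), 4),
      " mean squared residual:", round(float(np.mean(res ** 2)), 6))
```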

  15. NADARAYA-WATSON ESTIMATOR OF THE CONDITIONAL PROBABILITY
The well-known Nadaraya-Watson estimator of P(y = 1|x) is
$$ f(x) = \frac{\sum_{i=1}^{\ell} y_i\, G(x - x_i)}{\sum_{i=1}^{\ell} G(x - x_i)}, $$
where special kernels G(x − x_i) (say, Gaussian) are used. This estimator is the solution of the "corrupted" equation
$$ \frac{1}{\ell}\sum_{i=1}^{\ell} G(x - x_i)\, f(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i\, G(x - x_i) $$
(which uses a special kernel) rather than of the obtained equation
$$ \frac{1}{\ell}\sum_{i=1}^{\ell} G(x - x_i)\, f(x_i) = \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\, G(x - x_j) $$
(which is defined for any kernel G(x − x′) from L_2).
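A direct implementation sketch of the Nadaraya-Watson ratio above with a Gaussian kernel; the bandwidth delta and the synthetic data are assumptions.

```python
import numpy as np

def nadaraya_watson(x_eval, x_train, y_train, delta=2.0):
    """Nadaraya-Watson estimate of P(y=1|x):
    f(x) = sum_i y_i G(x - x_i) / sum_i G(x - x_i), with Gaussian G."""
    G = np.exp(-0.5 * delta**2 * (x_eval[:, None] - x_train[None, :]) ** 2)
    return (G @ y_train) / G.sum(axis=1)

rng = np.random.default_rng(0)
x_tr = rng.uniform(-1, 1, 500)
y_tr = (rng.uniform(size=500) < 1 / (1 + np.exp(-4 * x_tr))).astype(float)
x_ev = np.linspace(-1, 1, 5)
print(np.round(nadaraya_watson(x_ev, x_tr, y_tr), 3))   # rough estimates of P(y=1|x)
```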

  16. WHAT IT MEANS TO SOLVE THE EQUATION
To solve the equation
$$ \frac{1}{\ell}\sum_{i=1}^{\ell} G(x - x_i)\, f(x_i) = \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\, G(x - x_j) $$
means to find the function in {f(x)} minimizing the L_2-distance
$$ R(f) = \int \Bigg[\sum_{i=1}^{\ell} G(x - x_i)\, f(x_i) - \sum_{j=1}^{\ell} y_j\, G(x - x_j)\Bigg]^2 d\mu(x). $$
Simple algebra leads to the expression
$$ R_V(f) = \sum_{i,j=1}^{\ell} (y_i - f(x_i))(y_j - f(x_j))\, v(x_i, x_j), $$
where the values
$$ v(x_i, x_j) = \int G(x - x_i)\, G(x - x_j)\, d\mu(x), \qquad i, j = 1, ..., \ell, $$
form the V-matrix.
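A tiny sketch of the V-quadratic functional R_V(f) in the form just derived; with V = I it reduces to the ordinary least-squares sum, which the check at the end confirms.

```python
import numpy as np

def r_v(f_vals, y, V):
    """Least V-quadratic functional
    R_V(f) = sum_{i,j} (y_i - f(x_i)) (y_j - f(x_j)) v(x_i, x_j)
           = (Y - F)^T V (Y - F)."""
    resid = y - f_vals
    return resid @ V @ resid

# With V = I the functional reduces to the ordinary least-squares sum.
y = np.array([1.0, 0.0, 1.0])
f_vals = np.array([0.8, 0.3, 0.6])
print(np.isclose(r_v(f_vals, y, np.eye(3)), np.sum((y - f_vals) ** 2)))  # True
```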

  17. THE V-MATRIX ESTIMATE
1. For μ(x) = P(x), the elements v(x_i, x_j) of the V-matrix are
$$ v(x_i, x_j) = \int G(x - x_i)\, G(x - x_j)\, dP(x). $$
Using the empirical estimate P_ℓ(x) instead of P(x), we obtain the following estimates of the elements of the V-matrix:
$$ v(x_i, x_j) = \frac{1}{\ell}\sum_{s=1}^{\ell} G(x_s - x_i)\, G(x_s - x_j). $$
2. For μ(x) = x, x ∈ (−1, 1), and G(x − x′) = exp{−0.5 δ²(x − x′)²},
$$ v(x_i, x_j) = \exp\{-\delta^2 (x_i - x_j)^2\}\,\big\{\mathrm{erf}[\delta(1 + 0.5(x_i + x_j))] + \mathrm{erf}[\delta(1 - 0.5(x_i + x_j))]\big\}. $$
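A sketch of the empirical V-matrix estimate from point 1 above, for a Gaussian kernel with an assumed width delta (the closed-form erf expression of point 2 is not reproduced here).

```python
import numpy as np

def v_matrix_empirical(x_train, delta=2.0):
    """Empirical V-matrix for mu(x) = P(x):
    v(x_i, x_j) = (1/ell) sum_s G(x_s - x_i) G(x_s - x_j), with Gaussian G."""
    G = np.exp(-0.5 * delta**2 * (x_train[:, None] - x_train[None, :]) ** 2)  # G[s, i]
    return G.T @ G / len(x_train)

rng = np.random.default_rng(0)
x_tr = rng.uniform(-1, 1, 200)
V = v_matrix_empirical(x_tr)
print(V.shape, np.allclose(V, V.T))   # (200, 200) True -- V is symmetric
```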

  18. LEAST V-QUADRATIC FORM METHOD AND LEAST SQUARES METHOD
Let (x_1, y_1), ..., (x_ℓ, y_ℓ) be the training data. Using the notations
$$ Y = (y_1, ..., y_\ell)^T, \qquad F(f) = (f(x_1), ..., f(x_\ell))^T, \qquad V = \|v(x_i, x_j)\|, $$
we can rewrite the functional
$$ R_V(f) = \sum_{i,j=1}^{\ell} (y_i - f(x_i))(y_j - f(x_j))\, v(x_i, x_j) $$
in matrix form:
$$ R_V(f) = (Y - F(f))^T V\, (Y - F(f)). $$
We call this functional the least V-quadratic functional. The identity matrix I instead of V gives the least squares functional
$$ R_I(f) = (Y - F(f))^T (Y - F(f)). $$
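A sketch of minimizing the least V-quadratic functional over a linear-in-parameters family f(x) = Φ(x) w (the cubic polynomial basis and the synthetic data are assumptions): the minimizer is the generalized least-squares solution w = (Φ^T V Φ)^{-1} Φ^T V Y, which coincides with ordinary least squares when V = I.

```python
import numpy as np

def fit_least_v(Phi, y, V):
    """Minimize (Y - Phi w)^T V (Y - Phi w); closed-form generalized LS solution."""
    A = Phi.T @ V @ Phi
    b = Phi.T @ V @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-4 * x))).astype(float)
Phi = np.vander(x, N=4, increasing=True)             # cubic polynomial basis

# V built with the empirical estimate from the previous sketch (Gaussian kernel).
G = np.exp(-0.5 * 2.0**2 * (x[:, None] - x[None, :]) ** 2)
V = G.T @ G / len(x)

w_v  = fit_least_v(Phi, y, V)                         # least V-quadratic form
w_ls = fit_least_v(Phi, y, np.eye(len(x)))            # ordinary least squares
print(np.round(w_v, 3), np.round(w_ls, 3))
```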

  19. PART III: SELECTION OF THE ADMISSIBLE SET OF FUNCTIONS
