1. Nonparametric prediction
László Györfi
Budapest University of Technology and Economics
Department of Computer Science and Information Theory
e-mail: gyorfi@szit.bme.hu
www.szit.bme.hu/~gyorfi

2. Universal prediction: squared loss
$y_i$ is real valued, $x_i$ is vector valued. At time instant $i$ the predictor is asked to guess $y_i$ with knowledge of the past $(x_1^i, y_1^{i-1}) = (x_1, \dots, x_i, y_1, \dots, y_{i-1})$. The predictor is a sequence of functions $g = \{g_i\}_{i=1}^{\infty}$, where $g_i(x_1^i, y_1^{i-1})$ is the estimate of $y_i$. After $n$ time instants the empirical squared error for the sequence $x_1^n, y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^n \bigl( g_i(x_1^i, y_1^{i-1}) - y_i \bigr)^2 .$$
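For concreteness, here is a minimal Python sketch (not part of the slides) of how the empirical squared error $L_n(g)$ is accumulated online; the naive "repeat the last $y$" strategy and the simulated data are purely illustrative.

```python
import numpy as np

def empirical_squared_loss(predict, x, y):
    """L_n(g) = (1/n) * sum_i (g_i(x_1^i, y_1^{i-1}) - y_i)^2.

    predict(x_past, y_past) sees x_1..x_i and y_1..y_{i-1} and guesses y_i.
    """
    n = len(y)
    losses = []
    for i in range(n):
        guess = predict(x[: i + 1], y[:i])   # only past side information is available
        losses.append((guess - y[i]) ** 2)
    return float(np.mean(losses))

# Hypothetical toy strategy: predict the most recent observed y (0 at the start).
last_value = lambda x_past, y_past: y_past[-1] if len(y_past) else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                 # vector-valued side information
y = 0.1 * np.cumsum(rng.normal(size=200))     # a dependent real-valued sequence
print("L_n =", empirical_squared_loss(last_value, x, y))
```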

3. Regression function estimation
$Y$ is real valued, $X$ is the observation vector. Regression problem:
$$\min_f \mathbf{E}\{ (Y - f(X))^2 \}.$$
Regression function: $m(x) = \mathbf{E}\{ Y \mid X = x \}$. For each function $f$ one has
$$\mathbf{E}\{ (f(X) - Y)^2 \} = \mathbf{E}\{ (m(X) - Y)^2 \} + \mathbf{E}\{ (m(X) - f(X))^2 \}.$$

4. Data: $D_n = \{ (X_1, Y_1), \dots, (X_n, Y_n) \}$
Regression function estimate: $m_n(x) = m_n(x, D_n)$.
Usual consistency conditions:
- $m(x)$ is smooth
- $X$ has a density
- $Y$ is bounded
Nonparametric features:
- construction of the estimate
- consistency

5. Universal consistency
Definition 1. The estimator $m_n$ is called (weakly) universally consistent if
$$\mathbf{E}\{ (m(X) - m_n(X))^2 \} \to 0$$
for all distributions of $(X, Y)$ with $\mathbf{E}\{ Y^2 \} < \infty$.

6. Local averaging estimates (Stone, 1977)
$$m_n(x) = \sum_{i=1}^n W_{ni}(x; X_1, \dots, X_n) \, Y_i .$$

7. $k$-nearest neighbor estimate
$W_{ni}$ is $1/k$ if $X_i$ is one of the $k$ nearest neighbors of $x$ among $X_1, \dots, X_n$, and $W_{ni}$ is $0$ otherwise.
Theorem 1. If $k_n \to \infty$ and $k_n / n \to 0$, then the $k$-nearest neighbor estimate is weakly universally consistent.
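A minimal sketch of the $k$-NN estimate above, assuming Euclidean distance and the illustrative choice $k_n = \lfloor \sqrt{n} \rfloor$ (which satisfies $k_n \to \infty$, $k_n/n \to 0$); the data are synthetic.

```python
import numpy as np

def knn_estimate(x, X, Y, k):
    """m_n(x): average of the Y_i over the k nearest X_i (weights 1/k)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors of x
    return Y[nearest].mean()

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)   # toy regression data
k_n = int(np.sqrt(n))                                # k_n -> inf, k_n / n -> 0
print(knn_estimate(np.array([0.2, -0.4]), X, Y, k_n))
```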

8. Partitioning estimate
Partition $\mathcal{P}_n = \{ A_{n,1}, A_{n,2}, \dots \}$,
$$m_n(x) = \frac{\sum_{i=1}^n Y_i K_n(x, X_i)}{\sum_{i=1}^n K_n(x, X_i)},$$
where
$$K_n(x, u) = \sum_j I_{[x \in A_{n,j},\, u \in A_{n,j}]} .$$

9. Theorem 2. If for every sphere $S$ centered at the origin
$$\lim_{n \to \infty} \sup_{j:\, A_{n,j} \cap S \neq \emptyset} \mathrm{diam}(A_{n,j}) = 0$$
and
$$\lim_{n \to \infty} \frac{|\{ j :\, A_{n,j} \cap S \neq \emptyset \}|}{n} = 0,$$
then the partitioning estimate is weakly universally consistent.
Example: $A_{n,j}$ are cubes with volume $h_n^d$, $h_n \to 0$, $n h_n^d \to \infty$.
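A minimal sketch of the partitioning estimate for the cubic partition of the example, with the usual convention $0/0 := 0$ on empty cells; the bandwidth rule and the data are only illustrative.

```python
import numpy as np

def partitioning_estimate(x, X, Y, h):
    """m_n(x): average of Y_i over the cube cell A_{n,j} of side h that contains x."""
    cell_of_x = np.floor(x / h)                      # index of the cube containing x
    cells = np.floor(X / h)
    in_cell = np.all(cells == cell_of_x, axis=1)     # K_n(x, X_i) = 1 iff same cell
    if not in_cell.any():
        return 0.0                                   # empty-cell convention 0/0 := 0
    return Y[in_cell].mean()

rng = np.random.default_rng(2)
n, d = 2000, 2
X = rng.uniform(-1, 1, size=(n, d))
Y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)
h_n = n ** (-1.0 / (d + 2))                          # h_n -> 0, n * h_n^d -> inf
print(partitioning_estimate(np.array([0.3, 0.3]), X, Y, h_n))
```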

10. Kernel estimate
Kernel function $K(x) \ge 0$, bandwidth $h_n > 0$,
$$m_n(x) = \frac{\sum_{i=1}^n Y_i K\!\left( \frac{x - X_i}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{x - X_i}{h_n} \right)} .$$
Theorem 3. If $h_n \to 0$ and $n h_n^d \to \infty$, then under some conditions on $K$ the kernel estimate is weakly universally consistent.
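A minimal sketch of the kernel estimate with the naive (window) kernel $K(u) = I_{\{\|u\| \le 1\}}$, one kernel satisfying the conditions of Theorem 3; the bandwidth rule and the data are illustrative.

```python
import numpy as np

def kernel_estimate(x, X, Y, h):
    """Ratio form: sum_i Y_i K((x - X_i)/h) / sum_i K((x - X_i)/h), naive kernel."""
    weights = (np.linalg.norm((X - x) / h, axis=1) <= 1.0).astype(float)
    denom = weights.sum()
    return Y @ weights / denom if denom > 0 else 0.0   # 0/0 := 0 convention

rng = np.random.default_rng(3)
n, d = 1000, 2
X = rng.normal(size=(n, d))
Y = np.cos(X[:, 0]) + 0.1 * rng.normal(size=n)
h_n = n ** (-1.0 / (d + 4))                            # h_n -> 0, n * h_n^d -> inf
print(kernel_estimate(np.zeros(d), X, Y, h_n))
```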

11. Least squares estimates
Empirical $L_2$ error:
$$\frac{1}{n} \sum_{j=1}^n | f(X_j) - Y_j |^2 .$$
Given a class of functions $\mathcal{F}_n$, select a function from $\mathcal{F}_n$ which minimizes the empirical error: $m_n \in \mathcal{F}_n$ and
$$\frac{1}{n} \sum_{j=1}^n | m_n(X_j) - Y_j |^2 = \min_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{j=1}^n | f(X_j) - Y_j |^2 .$$
The class $\mathcal{F}_n$ grows slowly as $n$ grows.

12. Examples for $\mathcal{F}_n$:
- polynomials
- splines
- neural networks
- radial basis functions
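A minimal sketch of empirical $L_2$ error minimization over the first example class, polynomials of a slowly growing degree (here for one-dimensional $X$); the degree rule $\deg \approx \log n$ is only an illustration of "$\mathcal{F}_n$ grows slowly", and numpy's polynomial least-squares fit plays the role of the minimizer.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
X = rng.uniform(-1, 1, size=n)
Y = np.exp(X) + 0.2 * rng.normal(size=n)

degree = max(1, int(np.log(n)))            # F_n: polynomials of degree ~ log n
coeffs = np.polynomial.polynomial.polyfit(X, Y, degree)   # least-squares fit over F_n
m_n = lambda x: np.polynomial.polynomial.polyval(x, coeffs)

empirical_error = np.mean((m_n(X) - Y) ** 2)
print("degree:", degree, "empirical L2 error:", empirical_error)
```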

13. Dependent data: time series
The data $D_n = \{ (X_1, Y_1), \dots, (X_n, Y_n) \}$ are dependent, possibly long-range dependent, and form a stationary and ergodic process. For given $n$, the problem is the minimization
$$\min_g \mathbf{E}\{ (g(X_{n+1}, D_n) - Y_{n+1})^2 \}.$$
The best predictor is the conditional expectation $\mathbf{E}\{ Y_{n+1} \mid X_{n+1}, D_n \}$, which cannot be learned from data.

14. There is no prediction sequence with
$$\lim_{n \to \infty} \bigl( g_n(X_{n+1}, D_n) - \mathbf{E}\{ Y_{n+1} \mid X_{n+1}, D_n \} \bigr) = 0 \quad \text{a.s.}$$
for all stationary and ergodic sequences. Our aim is to achieve the optimum
$$L^* = \lim_{n \to \infty} \min_g \mathbf{E}\{ (g(X_{n+1}, D_n) - Y_{n+1})^2 \},$$
which is also impossible.

15. Universal consistency
There are universally Cesàro consistent prediction sequences $g_n$:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n (g_i(X_{i+1}, D_i) - Y_{i+1})^2 = L^* \quad \text{a.s.}$$
for all stationary and ergodic sequences. Such a prediction sequence is called universally consistent. We show a construction of a universally consistent predictor by combining predictors (experts).

16. Lemma. Let $\tilde h_1, \tilde h_2, \dots$ be a sequence of prediction strategies (experts), and let $\{q_k\}$ be a probability distribution on the set of positive integers. Assume that $\tilde h_i(y_1^{n-1}) \in [-B, B]$ and $y_1^n \in [-B, B]^n$. Define
$$w_{t,k} = q_k \, e^{-(t-1) L_{t-1}(\tilde h_k)/c} \quad \text{with } c \ge 8 B^2,$$
and
$$v_{t,k} = \frac{w_{t,k}}{\sum_{i=1}^{\infty} w_{t,i}} .$$
Reminder:
$$L_n(g) = \frac{1}{n} \sum_{i=1}^n \bigl( g_i(x_1^i, y_1^{i-1}) - y_i \bigr)^2 .$$

17. If the prediction strategy $\tilde g$ is defined by
$$\tilde g_t(y_1^{t-1}) = \sum_{k=1}^{\infty} v_{t,k} \, \tilde h_k(y_1^{t-1}), \qquad t = 1, 2, \dots,$$
then for every $n \ge 1$,
$$L_n(\tilde g) \le \inf_k \left( L_n(\tilde h_k) - \frac{c \ln q_k}{n} \right).$$

18. Special case: $N$ predictors and $\{q_k\}$ the uniform distribution; then
$$L_n(\tilde g) \le \min_k L_n(\tilde h_k) + \frac{c \ln N}{n} .$$
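A minimal sketch of the exponentially weighted combination of the lemma in this special case of $N$ experts with uniform $q_k = 1/N$ and $c = 8B^2$; the constant experts in the usage example are hypothetical and serve only to exercise the weighting.

```python
import numpy as np

def exp_weighted_prediction(expert_preds, y, B):
    """Combine N experts with weights w_{t,k} = (1/N) * exp(-cumloss_{t-1,k} / c).

    expert_preds: array of shape (T, N), expert_preds[t, k] = expert k's guess of y[t].
    Returns the combined predictions g~_t, t = 0..T-1.
    """
    T, N = expert_preds.shape
    c = 8.0 * B ** 2
    cum_loss = np.zeros(N)                 # (t-1) * L_{t-1}(h_k): cumulative squared loss
    g = np.empty(T)
    for t in range(T):
        w = np.exp(-cum_loss / c) / N      # w_{t,k} with uniform prior q_k = 1/N
        v = w / w.sum()                    # normalized weights v_{t,k}
        g[t] = v @ expert_preds[t]         # g~_t = sum_k v_{t,k} * h_k
        cum_loss += (expert_preds[t] - y[t]) ** 2
    return g

# Hypothetical experts: constant predictions on a grid in [-B, B].
B = 1.0
rng = np.random.default_rng(5)
y = np.clip(np.sin(np.arange(300) / 10.0) + 0.1 * rng.normal(size=300), -B, B)
experts = np.linspace(-B, B, 9)
preds = np.tile(experts, (len(y), 1))
g = exp_weighted_prediction(preds, y, B)
best = min(np.mean((preds[:, k] - y) ** 2) for k in range(len(experts)))
print("combined loss:", np.mean((g - y) ** 2), "best expert loss:", best)
```

The combined loss stays within $c \ln N / n$ of the best expert's loss, as the bound above promises.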

19. Dependent data: time series
Stationary and ergodic data $(X_1, Y_1), \dots, (X_n, Y_n)$. Assume that $|Y_0| \le B$. An elementary predictor (expert) is denoted by $h^{(k,\ell)}$, $k, \ell = 1, 2, \dots$. Let $G_\ell$ be a quantizer of $\mathbf{R}^d$ and $H_\ell$ be a quantizer of $\mathbf{R}$. For given $k, \ell$, let $I_n$ be the set of time instants $k < i < n$ for which there is a match of the length-$k$ quantized sequences:
$$G_\ell(x_{i-k}^i) = G_\ell(x_{n-k}^n) \quad \text{and} \quad H_\ell(y_{i-k}^{i-1}) = H_\ell(y_{n-k}^{n-1}) .$$

20. Then the prediction of this expert is the average of the $y_i$'s with $i \in I_n$:
$$h^{(k,\ell)}(x_1^n, y_1^{n-1}) = \frac{\sum_{i \in I_n} y_i}{|I_n|} .$$
These predictors are not universally consistent: for small $k$ the bias is large, and for large $k$ the variance is large because there are few matches. The same is true for the quantizers. The problem is how to choose $k, \ell$ in a data-dependent way. The solution is the combination of experts.
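A minimal sketch of one elementary expert $h^{(k,\ell)}$; the quantizers $G_\ell$ and $H_\ell$ are taken here as coordinatewise rounding to a grid of width $1/\ell$ (one of many possible choices not fixed by the slides), and the prediction defaults to 0 when no match is found. Data and parameters are illustrative.

```python
import numpy as np

def quantize(a, ell):
    """Hypothetical quantizers G_ell / H_ell: round coordinates to a grid of width 1/ell."""
    return np.round(np.asarray(a) * ell)

def expert_prediction(x, y, k, ell):
    """h^{(k,ell)}(x_1^n, y_1^{n-1}): average of y_i over matching time instants I_n.

    x: array of shape (n, d) holding x_1..x_n;  y: array of length n-1 holding y_1..y_{n-1}.
    """
    n = len(x)
    if n <= k + 1:
        return 0.0                                      # not enough history yet
    gx_now = quantize(x[n - 1 - k:n], ell)              # G_ell(x_{n-k}^n)
    hy_now = quantize(y[n - 1 - k:n - 1], ell)          # H_ell(y_{n-k}^{n-1})
    matches = []
    for i in range(k + 1, n):                           # time instants k < i < n (1-indexed)
        if (np.array_equal(quantize(x[i - 1 - k:i], ell), gx_now)
                and np.array_equal(quantize(y[i - 1 - k:i - 1], ell), hy_now)):
            matches.append(y[i - 1])                    # y_i followed a matching context
    return float(np.mean(matches)) if matches else 0.0  # convention: 0 if I_n is empty

rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=(n, 1))
y = np.tanh(x[:-1, 0]) + 0.1 * rng.normal(size=n - 1)   # y_1..y_{n-1} (toy data)
print(expert_prediction(x, y, k=1, ell=1))
```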

21. The combination of predictors can be derived according to the previous lemma. Let $\{q_{k,\ell}\}$ be a probability distribution over the pairs $(k, \ell)$, and for $c = 8 B^2$ put
$$w_{t,k,\ell} = q_{k,\ell} \, e^{-(t-1) L_{t-1}(h^{(k,\ell)})/c}$$
and
$$v_{t,k,\ell} = \frac{w_{t,k,\ell}}{\sum_{i,j=1}^{\infty} w_{t,i,j}} .$$
Then the combined prediction is
$$g_t(x_1^t, y_1^{t-1}) = \sum_{k,\ell=1}^{\infty} v_{t,k,\ell} \, h^{(k,\ell)}(x_1^t, y_1^{t-1}) .$$

22. Theorem. If the quantizers $G_\ell$ and $H_\ell$ are "asymptotically fine" and $\mathbf{P}\{ Y_i \in [-B, B] \} = 1$, then the combined predictor $g$ is universally consistent.
L. Györfi, G. Lugosi (2001), "Strategies for sequential prediction of stationary time series", in Modelling Uncertainty: An Examination of its Theory, Methods and Applications, M. Dror, P. L'Ecuyer, F. Szidarovszky (Eds.), pp. 225-248, Kluwer Academic Publishers.

23. 0-1 loss
$y_i$ takes values in the finite set $\{1, 2, \dots, M\}$. At time instant $i$ the classifier decides on $y_i$ based on the past observation $(x_1^i, y_1^{i-1})$. After $n$ rounds the empirical error for $x_1^n, y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^n I_{\{ g_i(x_1^i, y_1^{i-1}) \neq y_i \}},$$
i.e., the loss is the 0-1 loss, and $L_n(g)$ is the relative frequency of errors.

24. Pattern recognition
$Y$ is $\{1, 2, \dots, M\}$-valued, $X$ is the feature vector. Classifier: $g : \mathbf{R}^d \to \{1, 2, \dots, M\}$. Probability of error:
$$L_g = \mathbf{P}( g(X) \neq Y ).$$
A posteriori probabilities: $P_i(x) = \mathbf{P}\{ Y = i \mid X = x \}$. Bayes decision:
$$g^*(x) = \arg\max_i P_i(x).$$
Bayes error: $L^*$.

25. Universal consistency
Data: $(X_1, Y_1), \dots, (X_n, Y_n)$,
$$g_n(x) = g_n((X_1, Y_1), \dots, (X_n, Y_n), x).$$
Definition 2. The classifier $g_n$ is called (weakly) universally consistent if
$$\mathbf{P}( g_n(X) \neq Y ) \to L^*$$
for all distributions of $(X, Y)$.

26. Local majority voting
$k$-nearest neighbor rule:
$$g_n(x) = \arg\max_j \sum_{i=1}^n W_{n,i}(x) \, I_{\{Y_i = j\}} .$$
Partitioning rule:
$$g_n(x) = \arg\max_j \sum_{i=1}^n I_{\{X_i \in A_n(x)\}} \, I_{\{Y_i = j\}} .$$
Kernel rule:
$$g_n(x) = \arg\max_j \sum_{i=1}^n K\!\left( \frac{X_i - x}{h} \right) I_{\{Y_i = j\}} .$$
The $k$-NN rule, the partitioning rule and the kernel rule are strongly universally consistent.
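A minimal sketch of local majority voting via the kernel rule with the naive kernel; the $k$-NN and partitioning rules differ only in the voting weights. Labels, bandwidth, and data are illustrative.

```python
import numpy as np

def kernel_rule(x, X, Y, h, num_classes):
    """g_n(x) = argmax_j sum_i K((X_i - x)/h) * I{Y_i = j}, naive kernel K(u) = I{||u|| <= 1}."""
    weights = (np.linalg.norm((X - x) / h, axis=1) <= 1.0).astype(float)
    votes = np.array([weights[Y == j].sum() for j in range(1, num_classes + 1)])
    return int(np.argmax(votes)) + 1       # classes labelled 1..M; ties go to the smallest j

rng = np.random.default_rng(7)
n, d, M = 1000, 2, 2
X = rng.normal(size=(n, d))
Y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 2, 1)   # noisy labels in {1, 2}
print(kernel_rule(np.array([0.5, 0.0]), X, Y, h=0.3, num_classes=M))
```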

27. Empirical error minimization
Empirical error:
$$\frac{1}{n} \sum_{j=1}^n I_{\{ g(X_j) \neq Y_j \}} .$$
Given a class of classifiers $\mathcal{G}_n$, select a classifier from $\mathcal{G}_n$ which minimizes the empirical error: $g_n \in \mathcal{G}_n$ and
$$\frac{1}{n} \sum_{j=1}^n I_{\{ g_n(X_j) \neq Y_j \}} = \min_{g \in \mathcal{G}_n} \frac{1}{n} \sum_{j=1}^n I_{\{ g(X_j) \neq Y_j \}} .$$
The VC dimension of $\mathcal{G}_n$ grows slowly as $n$ grows.

28. Examples for $\mathcal{G}_n$:
- polynomial classifiers
- tree classifiers
- neural network classifiers
- radial basis function classifiers

29. Dependent data: time series
The data $D_n = \{ (X_1, Y_1), \dots, (X_n, Y_n) \}$ form a stationary and ergodic process. For given $n$, the problem is the minimization
$$\min_g \mathbf{P}\{ g(X_{n+1}, D_n) \neq Y_{n+1} \},$$
whose minimizer cannot be learned from data. Our aim is to achieve the optimum
$$R^* = \lim_{n \to \infty} \min_g \mathbf{P}\{ g(X_{n+1}, D_n) \neq Y_{n+1} \},$$
which is also impossible.

30. There are universally Cesàro consistent classifier sequences $g_n$:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n I_{\{ g_i(X_{i+1}, D_i) \neq Y_{i+1} \}} = R^* \quad \text{a.s.}$$
for all stationary and ergodic sequences. Such a classifier sequence is called universally consistent.
