
Information Theory and Feature Selection (Joint Informativeness and Tractability)
Leonidas Lefakis, Zalando Research Labs

Dimensionality Reduction via Feature Construction: from the inputs $X_1, \dots, X_D$, construct features $f_1(X_1, \dots, X_D), \dots$


  1. Addressing Intractability

  $I(X;Y) = H(X) - H(X|Y) = \underbrace{H(X)}_{\text{intractable}} - \sum_y \underbrace{P(Y=y)\, H(X|Y=y)}_{\text{tractable}}$

  Parametric model: $p_{X|Y=y} = \mathcal{N}(\mu_y, \Sigma_y)$, for which

  $H(X|Y=y) = \tfrac{1}{2} \log |\Sigma_y| + \tfrac{n}{2} (\log 2\pi + 1).$
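  As a concrete illustration, here is a minimal numpy sketch of these quantities, assuming samples in an $N \times n$ array X with integer labels y (function names and argument layout are ours, not from the slides):

      import numpy as np

      def gaussian_entropy(cov):
          # Differential entropy of N(mu, cov): 0.5*log|cov| + n/2 * (log(2*pi) + 1)
          cov = np.atleast_2d(cov)
          n = cov.shape[0]
          _, logdet = np.linalg.slogdet(cov)
          return 0.5 * logdet + 0.5 * n * (np.log(2 * np.pi) + 1)

      def conditional_entropy(X, y):
          # H(X|Y) = sum_y P(Y=y) H(X|Y=y) under the class-conditional Gaussian model
          classes, counts = np.unique(y, return_counts=True)
          priors = counts / len(y)
          return sum(p * gaussian_entropy(np.cov(X[y == c].T, bias=True))
                     for c, p in zip(classes, priors))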

  2. Maximum Entropy Distribution

  Given $\mathbb{E}(x) = \mu$ and $\mathbb{E}\left[ (x - \mu)(x - \mu)^T \right] = \Sigma$, the multivariate normal $x \sim \mathcal{N}(\mu, \Sigma)$ is the maximum entropy distribution.


  3. Addressing Intractability

  $I(X;Y) = \underbrace{H(X)}_{\text{intractable}} - \sum_y \underbrace{P(Y=y)\, H(X|Y=y)}_{\text{tractable}}$

  $H(X) = H\!\left( \sum_y \pi_y\, \mathcal{N}(\mu_y, \Sigma_y) \right)$ is the entropy of a mixture of Gaussians, which has no analytical solution. Instead: upper bound or approximate $H\!\left( \sum_y \pi_y\, \mathcal{N}(\mu_y, \Sigma_y) \right)$.

  4. A Normal Upper Bound

  $p_X = \sum_y \pi_y\, p_{X|Y=y}$

  Let $p^* \sim \mathcal{N}(\mu^*, \Sigma^*)$ match the mean and covariance of $p_X$; by the maximum entropy property, $H(p_X) \le H(p^*)$, hence

  $I(X;Y) \le H(p^*) - \sum_y P(Y=y)\, H(X|Y=y)$

  In addition, $I(X;Y) \le H(Y) = -\sum_y p_y \log p_y$.

  5. A Normal Upper Bound

  $I(X;Y) \le H(p^*) - \sum_y P(Y=y)\, H(X|Y=y), \qquad I(X;Y) \le H(Y) = -\sum_y p_y \log p_y$

  Combining the two:

  $I(X;Y) \le \min\left( H(p^*),\ \sum_y p_y \left( H(X|Y=y) - \log p_y \right) \right) - \sum_y p_y\, H(X|Y=y)$

  6. Figure: entropy (y-axis, roughly 1.4 to 2.6) versus mean difference (x-axis, 0 to 8) for curves labeled f, f*, Disjoint, and GC.

  7. An Approximation

  $p_X = \sum_y \pi_y\, p_{X|Y=y}$

  Under mild assumptions, $\forall y,\ H(p^*) > H(p_{X|Y=y})$, and we can use an approximation to $I(X;Y)$:

  $\tilde{I}(X;Y) = \sum_y \min\left( H(p^*),\ H(X|Y=y) - \log p_y \right) p_y - \sum_y H(X|Y=y)\, p_y$
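  Put together, the approximation takes a few lines; this sketch reuses gaussian_entropy from the earlier block (names are again ours):

      def mi_approximation(X, y):
          # Gaussian approximation of I(X;Y):
          #   sum_y p_y * min(H(p*), H(X|Y=y) - log p_y) - sum_y p_y * H(X|Y=y)
          classes, counts = np.unique(y, return_counts=True)
          priors = counts / len(y)
          h_star = gaussian_entropy(np.cov(X.T, bias=True))  # moment-matched Gaussian p*
          h_cond = np.array([gaussian_entropy(np.cov(X[y == c].T, bias=True))
                             for c in classes])
          return (np.sum(priors * np.minimum(h_star, h_cond - np.log(priors)))
                  - np.sum(priors * h_cond))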

  8. Feature Selection Criterion

  Maximize the mutual information approximation over feature subsets of size $K$:

  $S = \operatorname*{argmax}_{S',\ |S'| = K} \tilde{I}\left( X_{S'(1)}, X_{S'(2)}, \dots, X_{S'(K)};\ Y \right)$

  9. Forward Selection

  S_0 ← ∅
  for k = 1 ... K do
      z* ← 0
      for X_j ∈ F \ S_{k-1} do
          S' ← S_{k-1} ∪ {X_j}
          z ← Ĩ(S'; Y)
          if z > z* then
              z* ← z
              S* ← S'
          end if
      end for
      S_k ← S*
  end for
  return S_K

  where, dropping constants, $\tilde{I}(S';Y) \propto \sum_y \min\left( \log|\Sigma^*|,\ \log|\Sigma_y| - \log p_y \right) p_y - \sum_y \log|\Sigma_y|\, p_y$.
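  A direct, unoptimized translation of the loop as a sketch on top of mi_approximation above; this is the naive version whose cost the following slides progressively reduce:

      def forward_selection(X, y, K):
          # Greedy forward selection of K features under the Gaussian MI approximation.
          # Naive: re-estimates all covariances from scratch for every candidate.
          selected = []
          remaining = set(range(X.shape[1]))
          for _ in range(K):
              scores = {j: mi_approximation(X[:, selected + [j]], y) for j in remaining}
              best = max(scores, key=scores.get)
              selected.append(best)
              remaining.discard(best)
          return selected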

  10. Complexity

  At iteration $k$ we need to calculate, for all $j \in F \setminus S_{k-1}$, the determinant $\left| \Sigma_{S_{k-1} \cup X_j} \right|$. The cost of calculating each determinant is $O(k^3)$.

  11. Forward Selection: with an $O(k^3)$ determinant per candidate feature, per class, at each of the $K$ iterations, the overall complexity is $O(|Y|\,|F|\,K^4)$. However...

  12. Complexity

  $\Sigma_{S_{k-1} \cup X_j} = \begin{pmatrix} \Sigma_{S_{k-1}} & \Sigma_{j,S_{k-1}} \\ \Sigma_{j,S_{k-1}}^T & \sigma_j^2 \end{pmatrix}$

  We can exploit the matrix determinant lemma (twice),

  $\left| \Sigma + uv^T \right| = \left( 1 + v^T \Sigma^{-1} u \right) |\Sigma|,$

  to compute each determinant in $O(n^2)$.
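  The two applications of the lemma collapse to the Schur-complement identity $|\Sigma_{S \cup j}| = |\Sigma_S| \left( \sigma_j^2 - \Sigma_{j,S}^T \Sigma_S^{-1} \Sigma_{j,S} \right)$; a sketch, assuming $\log|\Sigma_S|$ and $\Sigma_S^{-1}$ are cached from the previous iteration:

      def grown_logdet(logdet_S, Sigma_S_inv, cov_jS, sigma_j2):
          # log|Sigma_{S union j}| in O(k^2): log|Sigma_S| + log(Schur complement)
          schur = sigma_j2 - cov_jS @ Sigma_S_inv @ cov_jS
          return logdet_S + np.log(schur)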

  13. Forward Selection: with the $O(k^2)$ determinant updates, the overall complexity drops to $O(|Y|\,|F|\,K^3)$. Faster?

  14. Faster?

  By the chain rule, $I(X, Z; Y) = I(Z;Y) + I(X;Y|Z)$, so

  $I(S'_k; Y) = I(X_j; Y | S_{k-1}) + \underbrace{I(S_{k-1}; Y)}_{\text{common}}$

  The second term is common to all candidates and can be dropped:

  $\operatorname*{argmax}_{X_j \in F \setminus S_{k-1}} I(X_j; Y | S_{k-1}) = \operatorname*{argmax}_{X_j \in F \setminus S_{k-1}} \left( H(X_j | S_{k-1}) - H(X_j | Y, S_{k-1}) \right)$

  15. Conditional Entropy

  $H(X_j | S_{k-1}) = \int_{\mathbb{R}^{|S_{k-1}|}} H(X_j | S_{k-1} = s)\, \mu_{S_{k-1}}(s)\, ds = \tfrac{1}{2} \log \sigma^2_{j|S_{k-1}} + \tfrac{1}{2} (\log 2\pi + 1)$

  where $\sigma^2_{j|S_{k-1}} = \sigma_j^2 - \Sigma^T_{j,S_{k-1}} \Sigma^{-1}_{S_{k-1}} \Sigma_{j,S_{k-1}}$.
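  Each candidate's score therefore needs only a scalar conditional variance; a sketch (argument layout ours):

      def conditional_entropy_1d(sigma_j2, cov_jS, Sigma_S_inv):
          # H(X_j | S) for jointly Gaussian variables, via the conditional
          # variance sigma_j^2 - c^T Sigma_S^{-1} c
          var = sigma_j2 - cov_jS @ Sigma_S_inv @ cov_jS
          return 0.5 * np.log(var) + 0.5 * (np.log(2 * np.pi) + 1)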

  16. Updating $\sigma^2_{j|S_{k-1}}$

  $\sigma^2_{j|S_{k-1}} = \sigma_j^2 - \Sigma^T_{j,S_{k-1}} \Sigma^{-1}_{S_{k-1}} \Sigma_{j,S_{k-1}}$

  Assume $X_i$ was chosen at iteration $k-1$. Then

  $\Sigma_{S_{k-1}} = \begin{pmatrix} \Sigma_{S_{k-2}} & \Sigma_{i,S_{k-2}} \\ \Sigma^T_{i,S_{k-2}} & \sigma_i^2 \end{pmatrix} = \begin{pmatrix} \Sigma_{S_{k-2}} & 0 \\ 0^T & \sigma_i^2 \end{pmatrix} + e_{n+1} \Sigma^T_{i,S_{k-2}} + \Sigma_{i,S_{k-2}} e^T_{n+1}$

  (with $\Sigma_{i,S_{k-2}}$ implicitly zero-padded to length $n+1$), i.e. a block-diagonal matrix plus two rank-one updates.

  17. Updating $\Sigma^{-1}_{S_{k-1}}$

  From the Sherman-Morrison formula,

  $\left( \Sigma + uv^T \right)^{-1} = \Sigma^{-1} - \frac{\Sigma^{-1} u v^T \Sigma^{-1}}{1 + v^T \Sigma^{-1} u},$

  applied twice to the decomposition above:

  $\Sigma^{-1}_{S_{n-1}} = \begin{pmatrix} \Sigma^{-1}_{S_{n-2}} & 0 \\ 0^T & 0 \end{pmatrix} + \frac{1}{\beta \sigma_i^2} \begin{pmatrix} u \\ -1 \end{pmatrix} \begin{pmatrix} u^T & -1 \end{pmatrix}$

  where $u = \Sigma^{-1}_{S_{n-2}} \Sigma_{i,S_{n-2}}$ and $\beta \sigma_i^2 = \sigma_i^2 - \Sigma^T_{i,S_{n-2}} u$ (the conditional variance of $X_i$).
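  A sketch of the resulting $O(k^2)$ inverse update (function name and arguments are ours; the identity is the rank-one form above):

      def grow_inverse(Sigma_S_inv, cov_iS, sigma_i2):
          # Sigma_{S union i}^{-1} from Sigma_S^{-1} in O(k^2), no re-inversion
          u = Sigma_S_inv @ cov_iS     # u = Sigma_S^{-1} Sigma_{i,S}
          s = sigma_i2 - cov_iS @ u    # beta * sigma_i^2, conditional variance of X_i
          v = np.append(u, -1.0)       # rank-one direction (u, -1)
          k = Sigma_S_inv.shape[0]
          grown = np.zeros((k + 1, k + 1))
          grown[:k, :k] = Sigma_S_inv
          return grown + np.outer(v, v) / s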

  18. Updating $\Sigma^{-1}_{S_{k-1}}$

  Plugging the rank-one form into the conditional variance:

  $\sigma^2_{j|S_{n-1}} = \underbrace{\sigma_j^2 - \Sigma^T_{j,S_{n-2}} \Sigma^{-1}_{S_{n-2}} \Sigma_{j,S_{n-2}}}_{\text{previous round},\ O(n^2)} - \frac{1}{\beta \sigma_i^2} \left( \begin{pmatrix} u^T & -1 \end{pmatrix} \Sigma_{j,S_{n-1}} \right)^2$

  The first term is cached from the previous round, and the remaining inner product costs $O(n)$, so each candidate is updated in $O(n)$.

  19. Faster! With the $O(n)$ conditional-variance updates, the overall complexity drops to $O(|Y|\,|F|\,K^2)$. Even faster?

  20. Even Faster?

  The main bottleneck is the $O(k)$ computation, for every candidate, of
  $\sigma^2_{j|S_{k-1}} = \sigma_j^2 - \Sigma^T_{j,S_{k-1}} \Sigma^{-1}_{S_{k-1}} \Sigma_{j,S_{k-1}}$.

  Idea: skip non-promising features.

  21. Forward Selection with Pruning

  Before paying for the exact score of a candidate, evaluate a cheap ($O(1)$) score $c$ that upper-bounds the true score $z$: if $c < z^*$ then $z < z^*$, and the candidate can be skipped.

  ...
  z* ← 0
  for X_j ∈ F \ S_{k-1} do
      S' ← S_{k-1} ∪ {X_j}
      c ← cheap O(1) upper bound (defined on the next slide)
      if c > z* then
          z ← Ĩ(S'; Y)
          if z > z* then
              z* ← z
              S* ← S'
          end if
      end if
  end for
  ...

  22. An O(1) Bound

  $\sigma^2_{j|S_{k-1}} = \sigma_j^2 - \Sigma^T_{j,S_{k-1}} \Sigma^{-1}_{S_{k-1}} \Sigma_{j,S_{k-1}}$

  Diagonalizing $\Sigma^{-1}_{S_{k-1}} = U \Lambda U^T$:

  $\Sigma^T_{j,S_{k-1}} \Sigma^{-1}_{S_{k-1}} \Sigma_{j,S_{k-1}} = \Sigma^T_{j,S_{k-1}} U \Lambda U^T \Sigma_{j,S_{k-1}}$

  $\left\| \Sigma_{j,S_{k-1}} \right\|_2^2 \max_i \lambda_i \ \ge\ \Sigma^T_{j,S_{k-1}} \Sigma^{-1}_{S_{k-1}} \Sigma_{j,S_{k-1}} \ \ge\ \underbrace{\left\| \Sigma_{j,S_{k-1}} \right\|_2^2 \min_i \lambda_i}_{O(1)}$
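  Assuming the squared norm is maintained incrementally (appending feature $X_i$ adds one coordinate, so $\|\Sigma_{j,S_{k-1}}\|^2 = \|\Sigma_{j,S_{k-2}}\|^2 + \sigma_{ji}^2$) and the extreme eigenvalues of $\Sigma^{-1}_{S_{k-1}}$ are available, the bound costs $O(1)$ per candidate. A sketch of the variance bound (names ours):

      def variance_upper_bound(sigma_j2, cov_jS_sqnorm, lam_min):
          # O(1) upper bound on the conditional variance sigma_{j|S}^2,
          # since the quadratic form is >= ||Sigma_{j,S}||^2 * lam_min(Sigma_S^{-1})
          return sigma_j2 - cov_jS_sqnorm * lam_min

  Since $H(X_j | S_{k-1})$ grows with the conditional variance, upper-bounding the variance upper-bounds this term of the score (the class-conditional terms are bounded analogously).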

  23. EigenSystem Update Problem

  To keep the extreme eigenvalues in the $O(1)$ bound current, the eigensystem must be updated as $S$ grows. Given $U_n, \Lambda_n$ such that $\Sigma_n = U_n^T \Lambda_n U_n$, find $U_{n+1}, \Lambda_{n+1}$ with

  $\Sigma_{n+1} = \begin{pmatrix} \Sigma_n & v \\ v^T & 1 \end{pmatrix} = U_{n+1}^T \Lambda_{n+1} U_{n+1}$

  (assume $\Sigma_{n+1} \in S^{n+1}_{++}$, i.e. symmetric positive definite).

  24. EigenSystem Update

  $U_n = \left( u^n_1, u^n_2, \dots, u^n_n \right), \qquad \Lambda_n = \begin{pmatrix} \lambda^n_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda^n_n \end{pmatrix}$

  Zero-padding each eigenvector, the block-diagonal extension keeps the old eigenpairs and gains one new one:

  $\begin{pmatrix} \Sigma_n & 0 \\ 0^T & 1 \end{pmatrix} \underbrace{\begin{pmatrix} u^n_i \\ 0 \end{pmatrix}}_{u'^n_i} = \lambda^n_i \begin{pmatrix} u^n_i \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} \Sigma_n & 0 \\ 0^T & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$

  25. EigenSystem Update

  Let $U' = \left( u'^n_1, \dots, u'^n_n, e_{n+1} \right)$ and $\Lambda' = \operatorname{diag}(\lambda^n_1, \dots, \lambda^n_n, 1)$, so that $\Sigma' = \begin{pmatrix} \Sigma_n & 0 \\ 0^T & 1 \end{pmatrix} = U'^T \Lambda' U'$.

  The new matrix is a symmetric rank-two perturbation of $\Sigma'$:

  $\Sigma_{n+1} = \Sigma' + e_{n+1} \begin{pmatrix} v \\ 0 \end{pmatrix}^T + \begin{pmatrix} v \\ 0 \end{pmatrix} e^T_{n+1}$

  and its eigenpairs satisfy

  $\left( \Sigma' + e_{n+1} \begin{pmatrix} v \\ 0 \end{pmatrix}^T + \begin{pmatrix} v \\ 0 \end{pmatrix} e^T_{n+1} \right) u_{n+1} = \lambda_{n+1} u_{n+1}$

  26. EigenSystem Update

  Multiplying on the left by $U'^T$ and inserting $U' U'^T = I$:

  $U'^T \left( \Sigma' + e_{n+1} \begin{pmatrix} v \\ 0 \end{pmatrix}^T + \begin{pmatrix} v \\ 0 \end{pmatrix} e^T_{n+1} \right) U'\, U'^T u_{n+1} = \lambda_{n+1}\, U'^T u_{n+1}$

  Since $U'^T \Sigma' U' = \Lambda'$, and writing $q = U'^T \begin{pmatrix} v \\ 0 \end{pmatrix}$:

  $\left( \Lambda' + e_{n+1} q^T + q\, e^T_{n+1} \right) U'^T u_{n+1} = \lambda_{n+1}\, U'^T u_{n+1}$

  27. EigenSystem Update

  $\underbrace{\left( \Lambda' + e_{n+1} q^T + q\, e^T_{n+1} \right)}_{\Sigma''} U'^T u_{n+1} = \lambda_{n+1}\, U'^T u_{n+1}$

  → $\Sigma_{n+1}$ and $\Sigma''$ share eigenvalues.
  → $U_{n+1} = U' U''$

  28. EigenSystem Update

  $\left| \Sigma'' - \lambda I \right| = (\lambda'_{n+1} - \lambda) \prod_{j < n+1} (\lambda'_j - \lambda) - \sum_{i < n+1} q_i^2 \prod_{j \ne i,\ j < n+1} (\lambda'_j - \lambda)$

  Dividing by $\prod_{j < n+1} (\lambda'_j - \lambda)$ gives the secular function

  $f(\lambda) = \lambda'_{n+1} - \lambda - \sum_i \frac{q_i^2}{\lambda'_i - \lambda}$

  with, for all $i$,

  $\lim_{\lambda \to \lambda'^+_i} f(\lambda) = +\infty, \qquad \lim_{\lambda \to \lambda'^-_i} f(\lambda) = -\infty, \qquad \frac{\partial f(\lambda)}{\partial \lambda} = -1 - \sum_i \frac{q_i^2}{(\lambda'_i - \lambda)^2} \le 0$

  so $f$ is strictly decreasing between consecutive poles and each such interval brackets exactly one eigenvalue of $\Sigma''$.
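  These properties make each eigenvalue easy to bracket; a minimal root-finding sketch with scipy, assuming distinct poles and nonzero $q_i$ (names ours; production code would use relative rather than fixed offsets):

      import numpy as np
      from scipy.optimize import brentq

      def secular_roots(poles, q, lam_last, eps=1e-10):
          # Eigenvalues of Lambda' + e q^T + q e^T as roots of
          #   f(x) = lam_last - x - sum_i q_i^2 / (poles_i - x)
          f = lambda x: lam_last - x - np.sum(q**2 / (poles - x))
          p = np.sort(poles)
          roots = [brentq(f, a + eps, b - eps) for a, b in zip(p[:-1], p[1:])]
          width = 1.0                      # one root above the largest pole ...
          while f(p[-1] + eps + width) > 0:
              width *= 2.0
          roots.append(brentq(f, p[-1] + eps, p[-1] + eps + width))
          width = 1.0                      # ... and one below the smallest
          while f(p[0] - eps - width) < 0:
              width *= 2.0
          roots.insert(0, brentq(f, p[0] - eps - width, p[0] - eps))
          return roots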


  29. Figure: comparison between computing the eigensystem from scratch (Lapack) and the incremental update; y-axis: CPU time in seconds (0 to 30), x-axis: matrix size (0 to 2000).

  30. Figure: numerical stability of the eigenvalue update; y-axis: $\max |\lambda_{\text{update}} - \lambda_{\text{Lapack}}| / \lambda_{\text{Lapack}}$ (of order $10^{-10}$), x-axis: matrix size (0 to 2000).

  31. Set to go... Now that all the machinery is in place: how did we do?

  32. Nice Results

  Results (test accuracy) using SVMLin with 10, 50, and 100 selected features:

                    CIFAR                   STL                   INRIA
  SVMLin        10     50    100       10     50    100       10     50    100
  Fisher      25.19  39.47  48.12    26.09  34.63  38.02    92.55  94.03  94.68
  FCBF        33.65  47.77  54.97    31.74  38.11  40.66    94.14  96.03  96.03
  MRMR        27.94  37.78  43.63    28.26  31.16  33.12    86.03  86.77  86.72
  SBMLR       30.43  51.41  56.81    32.29  43.29  47.15    85.92  88.57  88.64
  tTest       25.69  40.17  45.12    26.72  36.23  39.14    80.01  87.64  89.23
  InfoGain    24.79  37.98  47.37    27.17  33.70  37.84    92.35  93.75  94.68
  Spec. Clus. 17.19  32.78  42.6     18.91  32.65  38.24    92.67  93.64  94.44
  RelieFF     24.56  38.17  46.51    29.16  38.05  42.94    90.99  95.97  96.36
  CFS         31.49  42.17  51.70    28.63  38.54  41.88    88.64  96.11  97.53
  CMTF        21.10  40.39  47.71    27.61  38.99  42.32    79.09  89.49  93.01
  BAHSIC        -      -      -      28.95  39.05  45.49    78.54  89.77  91.96
  GC.E        32.45  50.15  55.06    31.20  43.31  49.75    87.73  91.96  93.13
  GC.MI       36.47  51.44  55.39    32.50  44.15  48.88    89.76  95.71  96.45
  GKL.E       37.51  52.11  56.41    33.44  44.27  50.54    85.31  92.05  96.36
  GKL.MI      33.71  47.17  51.12    32.16  44.87  47.96    85.66  92.14  95.16

  GC.MI was the fastest of the more complex algorithms.

  33. Influence of Sample Size

  We use estimates $\hat\Sigma_N = \frac{1}{N} P^T P$. For (sub-)Gaussian data: if $N \ge C (t/\epsilon)^2 d$, then $\| \hat\Sigma_N - \Sigma \| \le \epsilon$ with probability at least $1 - 2 e^{-c t^2}$.

  However, the faster implementations use $\hat\Sigma_N^{-1}$.

  34. Influence of Sample Size

  Figure: the condition number

  $\kappa(\Sigma) = \left\| \Sigma^{-1} \right\| \left\| \Sigma \right\| = \max_{e, b} \frac{\left\| \Sigma^{-1} e \right\| / \| e \|}{\left\| \Sigma^{-1} b \right\| / \| b \|},$

  for $d = 2048$ and various values of $N$.

  35. Figure: effect of sample size on performance when using the Gaussian approximation on the CIFAR dataset; y-axis: test accuracy (25 to 50), x-axis: number of samples per class (0 to 2500), with curves for 10, 25, and 50 features.
