Active Regression via Linear-Sample Sparsification
Xue Chen, Eric Price (UT Austin)

Agnostic learning


Agnostic learning of linear spaces: results

[Figure: degree-5 polynomial, σ = 1, x ∈ [−1, 1].]

(Matrix) Chernoff bound depends on
    K := sup_x sup_{f ∈ F, ‖f‖_D = 1} f(x)²

O(K log d + K/ε) samples suffice for agnostic learning [Cohen-Davenport-Leviatan '13, Hsu-Sabato '14]
◮ Mean zero noise: ‖f̂ − f*‖²_D ≤ ε ‖f* − y‖²_D
◮ Generic noise: ‖f̂ − f*‖²_D ≤ (1 + ε) ‖f* − y‖²_D

Also necessary (coupon collector).

How can we avoid the dependence on K?
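To make K and κ concrete for the running example (degree-5 polynomials, D = Uniform[−1, 1]), here is a minimal numerical sketch; it is not from the talk, and the Legendre-based basis, grid resolution, and names are my own choices. For this space K = sup_x Σ_j φ_j(x)² = d² = 36 while κ = d = 6, which is why the O(K log d + K/ε) bound pays much more than d.

```python
# A minimal sketch (not from the talk) computing K and kappa for
# F = {degree-5 polynomials}, D = Uniform[-1, 1].  The orthonormal basis is
# phi_j = sqrt(2j+1) * P_j (Legendre); the grid size is an arbitrary choice.
import numpy as np
from numpy.polynomial import legendre

d = 6  # dimension of the space of degree-<=5 polynomials

def basis(x):
    """Orthonormal basis functions w.r.t. D, evaluated at x; shape (d, len(x))."""
    x = np.atleast_1d(x)
    vals = np.stack([legendre.legval(x, np.eye(d)[j]) for j in range(d)])
    return np.sqrt(2 * np.arange(d) + 1)[:, None] * vals

grid = np.linspace(-1, 1, 100_001)
lev = (basis(grid) ** 2).sum(axis=0)   # sup_{||f||_D = 1} f(x)^2 at each grid point

print("K     ~", lev.max())            # worst case, = d^2 = 36 (attained at x = +-1)
print("kappa ~", lev.mean())           # average over D, = d = 6
```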

Our result: avoid K with more powerful access patterns

With more powerful access models, can replace
    K := sup_x sup_{f ∈ F, ‖f‖_D = 1} f(x)²
with
    κ := E_x sup_{f ∈ F, ‖f‖_D = 1} f(x)².

For linear spaces of functions, κ = d.

Query model:
◮ Can pick x_i of our choice, see y_i ∼ (Y | X = x_i).
◮ Know D (which just defines ‖f − f̂‖_D).

Active learning model (illustrated in the sketch below):
◮ Receive x_1, …, x_m ∼ D
◮ Pick S ⊂ [m] of size s
◮ See y_i for i ∈ S.

Some results for non-linear spaces.
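The two access models are easy to state as code. The sketch below is a hypothetical interface of my own; the oracle class, the target sin(3x), the noise level, and the uniform placeholder choice of S are illustrative, not the talk's algorithm. It only pins down what each model is allowed to see.

```python
# Hypothetical interface for the two access models (illustration only).
import numpy as np

class LabelOracle:
    """Returns noisy labels y ~ (Y | X = x); the learner never sees f_star."""
    def __init__(self, f_star, sigma, rng):
        self.f_star, self.sigma, self.rng = f_star, sigma, rng
    def label(self, x):
        x = np.atleast_1d(x)
        return self.f_star(x) + self.sigma * self.rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
oracle = LabelOracle(f_star=lambda x: np.sin(3 * x), sigma=1.0, rng=rng)

# Query model: D is known and we may query any x of our choice.
y_queried = oracle.label(np.array([-1.0, 0.0, 1.0]))

# Active learning model: m unlabeled draws from D; label only a subset S of size s.
m, s = 1000, 10
x_unlabeled = rng.uniform(-1, 1, m)            # x_1, ..., x_m ~ D
S = rng.choice(m, size=s, replace=False)       # placeholder choice of S (uniform)
y_S = oracle.label(x_unlabeled[S])             # labels seen only for i in S
```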

Query model: basic approach

ERM needs the empirical norm ‖f‖_S to approximate ‖f‖_D for all f ∈ F.

This takes O(K log d) samples from D.

Improved by biasing samples towards high-variance points:
    D′(x) = (1/κ) · sup_{f ∈ F, ‖f‖_D = 1} f(x)² · D(x)

Estimate the norm via
    ‖f‖²_{S,D′} := (1/m) Σ_{i=1}^m [D(x_i)/D′(x_i)] f(x_i)²

Still equals ‖f‖²_D in expectation, but now the max contribution is κ.
◮ This gives O(κ log d) sample complexity by matrix Chernoff.
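Concretely, the biased-sampling estimator can be run end to end for the degree-5 example. The sketch below is one plausible instantiation rather than the talk's exact procedure: the Legendre basis, the rejection-based sampler for D′, the toy labels sin(3x) + noise, and all constants are my own choices; only the reweighting by D(x)/D′(x) = κ / sup_f f(x)² follows the slide.

```python
# Sketch of the query-model approach (one plausible instantiation, not the
# authors' code): sample x_i ~ D', query labels, minimize the reweighted
# empirical norm (1/m) sum_i [D(x_i)/D'(x_i)] (f(x_i) - y_i)^2 over f in F.
# F = degree-5 polynomials, D = Uniform[-1, 1], phi_j = sqrt(2j+1) * P_j.
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
d = 6

def phi(x):                                    # (len(x), d) orthonormal basis values
    x = np.atleast_1d(x)
    cols = [np.sqrt(2 * j + 1) * legendre.legval(x, np.eye(d)[j]) for j in range(d)]
    return np.stack(cols, axis=1)

def lev(x):                                    # sup_{||f||_D=1} f(x)^2 = sum_j phi_j(x)^2
    return (phi(x) ** 2).sum(axis=1)

def sample_Dprime(n, K=d * d):                 # D'(x) ∝ lev(x) D(x), via rejection from D
    out = []
    while len(out) < n:
        cand = rng.uniform(-1, 1, 4 * n)
        out.extend(cand[rng.uniform(0, 1, 4 * n) < lev(cand) / K])
    return np.array(out[:n])

m = 200
xs = sample_Dprime(m)
ys = np.sin(3 * xs) + rng.standard_normal(m)   # toy labels y_i ~ (Y | X = x_i)
w = d / lev(xs)                                # importance weights D(x_i)/D'(x_i) = kappa/lev
coef, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * phi(xs), np.sqrt(w) * ys, rcond=None)
f_hat = lambda x: phi(x) @ coef                # ERM fit under the reweighted empirical norm
```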

Bounding κ for linear function spaces

    κ = E_x sup_{f ∈ F, ‖f‖_D = 1} f(x)²

Express f ∈ F via an orthonormal basis:
    f(x) = Σ_j α_j φ_j(x).

Then
    sup_{‖f‖_D = 1} f(x)² = sup_{‖α‖_2 = 1} ⟨α, {φ_j(x)}_{j=1}^d⟩² = Σ_{j=1}^d φ_j(x)².

Hence
    κ = E_x Σ_{j=1}^d φ_j(x)² = d.
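The identity κ = d holds for any d-dimensional linear space and any D, since E_x Σ_j φ_j(x)² is the trace of the (identity) Gram matrix of the orthonormal basis. A quick numerical check, with an arbitrarily chosen distribution and basis (standard normal and monomials here, purely for illustration):

```python
# Numerical check (a sketch) that kappa = d for an arbitrary linear space:
# take any d feature functions, orthonormalize them against samples from D,
# and verify that E_x sum_j phi_j(x)^2 = d.
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 200_000
x = rng.standard_normal(n)                        # D = standard normal (arbitrary choice)
A = np.vander(x, d, increasing=True)              # raw monomial features 1, x, ..., x^5
G = (A.T @ A) / n                                 # empirical Gram matrix E[a(x) a(x)^T]
Phi = A @ np.linalg.inv(np.linalg.cholesky(G)).T  # orthonormalized features phi_j(x_i)
print((Phi ** 2).sum(axis=1).mean())              # = d = 6 (a trace identity)
```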

Query model: so far

Upsampling x proportional to sup_f f(x)² gets O(d log d) sample complexity.
◮ Essentially the same as leverage score sampling (see the matrix-form sketch below).
◮ Also analogous to Spielman-Srivastava graph sparsification.

Can we bring this down to O(d)?
◮ Not with independent sampling (coupon collector).
◮ Analogous to Batson-Spielman-Srivastava linear-size sparsification.
◮ Yes – using Lee-Sun sparsification.

Mean zero noise: E[(f̂(x) − f(x))²] ≤ ε E[(y − f(x))²].
Generic noise: E[(f̂(x) − f(x))²] ≤ (1 + ε) E[(y − f(x))²].
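For readers who prefer the matrix view, the leverage score sampling bullet corresponds to the standard subsampled-regression recipe below (my own illustration; the matrix sizes and subsample size are arbitrary): row i of a tall design matrix A has leverage ℓ_i = a_iᵀ(AᵀA)⁻¹a_i, the leverages sum to d, and sampling about d log d / ε² rows with probability proportional to ℓ_i (with reweighting) preserves ‖Ac‖² for every c, which is exactly what ERM on the subsample needs.

```python
# Sketch of leverage score sampling in matrix form (illustration, arbitrary sizes).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 6
A = np.vander(rng.uniform(-1, 1, n), d, increasing=True)     # tall design matrix

lev = np.einsum('ij,jk,ik->i', A, np.linalg.inv(A.T @ A), A) # leverage scores, sum = d
p = lev / lev.sum()
s = 400                                                      # ~ d log d / eps^2
idx = rng.choice(n, size=s, p=p)
S = A[idx] / np.sqrt(s * p[idx])[:, None]                    # reweighted subsample

c = rng.standard_normal(d)
print(np.linalg.norm(S @ c) / np.linalg.norm(A @ c))         # ~ 1 for any c
```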

Active learning

Query model supposes we know D and can query any point.

Active learning:
◮ Get x_1, …, x_m ∼ D.
◮ Pick S ⊆ [m] of size s.
◮ Learn y_i for i ∈ S.

Minimize s:
◮ m → ∞  ⇒  learn D and query any point  ⇒  query model.
◮ Hence s = Θ(d) optimal.

Minimize m:
◮ Label every point  ⇒  agnostic learning.
◮ Hence m = Θ(K log d + K/ε) optimal.

Our result: both at the same time.
◮ In this talk: mostly the s = O(d log d) version.
◮ Prior work: s = O((d log d)^{5/4}) [Sabato-Munos '14]; s = O(d log d) via "volume sampling" [Derezinski-Warmuth-Hsu '18].

Active learning

Warmup: suppose we know D.

Can simulate the query algorithm via rejection sampling:
    Pr[Label x_i] = (1/K) sup_{f ∈ F, ‖f‖_D = 1} f(x_i)².

Just needs s = O(d log d).

Chance each sample gets labeled is
    E_x[Pr[Label x_i]] = κ/K = d/K.

Gives m = O(K log d) unlabeled samples, s = O(d log d) labeled samples.
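A self-contained sketch of this rejection rule for the degree-5 example (the stream length, the toy labels, and the basis construction are my own illustrative choices): each incoming x_i is labeled with probability lev(x_i)/K, where lev(x) = sup_{‖f‖_D=1} f(x)², so in expectation a κ/K = d/K fraction of the stream gets labeled.

```python
# Sketch of the warmup rejection-sampling rule when D is known (illustration).
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
d = 6

def lev(x):   # sup_{||f||_D=1} f(x)^2 = sum_j (2j+1) P_j(x)^2 for degree-5 polys, D = U[-1,1]
    x = np.atleast_1d(x)
    return sum((2 * j + 1) * legendre.legval(x, np.eye(d)[j]) ** 2 for j in range(d))

K = d * d                                      # sup_x lev(x), attained at x = +-1
m = 2000                                       # roughly K log d unlabeled samples
x_stream = rng.uniform(-1, 1, m)               # x_1, ..., x_m ~ D
S = np.flatnonzero(rng.uniform(0, 1, m) < lev(x_stream) / K)
y_S = np.sin(3 * x_stream[S]) + rng.standard_normal(S.size)   # query labels only on S
print(S.size, "of", m, "points labeled")       # expected m * d / K = m / 6
```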

Active learning without knowing D

Want to perform rejection sampling
    Pr[Label x_i] = (1/K) sup_{f ∈ F, ‖f‖_D = 1} f(x_i)²,
but don't know D.

Just need to estimate ‖f‖_D for all f ∈ F.

Matrix Chernoff gets this with m = O(K log d) unlabeled samples.

Gives m = O(K log d) unlabeled samples, s = O(d log d) labeled samples.

Can improve to m = O(K log d), s = O(d).
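The fix when D is unknown only needs the unlabeled samples: the rejection probabilities depend on D through ‖f‖_D, which can be estimated from an empirical Gram matrix. The sketch below illustrates that idea (the monomial features, sample sizes, and the cap at probability 1 are my own choices); note the leverage-type score aᵀĜ⁻¹a does not require the features to be orthonormal.

```python
# Sketch: estimate ||f||_D from unlabeled samples, then rejection-sample labels.
import numpy as np

rng = np.random.default_rng(0)
d = 6
feat = lambda x: np.vander(np.atleast_1d(x), d, increasing=True)  # any fixed basis of F

x_unlabeled = rng.uniform(-1, 1, 5000)          # ~ K log d unlabeled draws, no labels used
F0 = feat(x_unlabeled)
G_hat = F0.T @ F0 / x_unlabeled.size            # empirical Gram matrix ~ E_D[a(x) a(x)^T]

def lev_hat(x):                                 # estimate of sup_{||f||_D=1} f(x)^2
    A = feat(x)
    return np.einsum('ij,jk,ik->i', A, np.linalg.inv(G_hat), A)

x_stream = rng.uniform(-1, 1, 2000)
p_label = np.minimum(1.0, lev_hat(x_stream) / (d * d))           # ~ lev(x)/K
S = np.flatnonzero(rng.uniform(0, 1, x_stream.size) < p_label)
print(S.size, "points selected for labeling")
```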

Getting to s = O(d)

Based on Lee-Sun '15.

O(d log d) comes from coupon collector.

Change to non-independent sampling:
◮ x_i ∼ D_i where D_i depends on x_1, …, x_{i−1}.
◮ D_1 = D′, D_2 avoids points near x_1, etc.
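The toy below is not the Lee-Sun algorithm (which needs a more careful barrier/potential argument to reach O(d) with the (1 + ε) guarantee); it only illustrates what "D_i depends on x_1, …, x_{i−1}" can look like: after each pick, directions already covered are downweighted, so the next draw avoids points similar to earlier ones. All sizes and constants here are my own.

```python
# Toy illustration of non-independent sampling (NOT Lee-Sun; sketch only).
import numpy as np

rng = np.random.default_rng(1)
m, d, s = 5000, 6, 12
X = rng.uniform(-1, 1, m)
A = np.vander(X, d, increasing=True)          # candidate feature vectors a_i

M = 1e-6 * np.eye(d)                          # directions covered by picks so far
chosen = []
for _ in range(s):
    # score_i = a_i^T M^{-1} a_i: large when a_i points in an uncovered direction
    scores = np.einsum('ij,jk,ik->i', A, np.linalg.inv(M), A)
    scores[chosen] = 0.0                      # never re-pick the same point
    i = rng.choice(m, p=scores / scores.sum())
    chosen.append(i)
    M += np.outer(A[i], A[i])
print(np.sort(np.round(X[chosen], 2)))        # inspect which points were picked
```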
