
Adaptive Sampling Strategies for Stochastic Optimization
Raghu (Vijaya Raghavendra) Bollapragada (Northwestern University), Richard Byrd (University of Colorado Boulder), Jorge Nocedal (Northwestern University)
January 8, 2018, 11th US-Mexico Workshop on Optimization


Inner Product Test

First-order descent condition: $\nabla F_{S_k}(x_k)^T \nabla F(x_k) > 0$

This holds in expectation:
$$\mathbb{E}\big[\nabla F_{S_k}(x_k)^T \nabla F(x_k)\big] = \|\nabla F(x_k)\|^2 > 0$$

For the descent condition to hold at most iterations, we impose a bound on the variance:
$$\frac{\mathbb{E}\Big[\big(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]}{|S_k|} \leq \theta_{ip}^2\,\|\nabla F(x_k)\|^4, \qquad \theta_{ip} \in [0,1)$$

The test is designed to produce descent directions sufficiently often.

The sample sizes required to satisfy this condition are smaller than those required for the norm condition.
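How often is "sufficiently often"? The slides do not spell this out, but a standard Chebyshev argument, assuming the bound above applies to the deviation of $\nabla F_{S_k}(x_k)^T \nabla F(x_k)$ from its mean (as it does for an i.i.d. sample average), gives an explicit bound on the failure probability:
\begin{align*}
\mathbb{P}\big(\nabla F_{S_k}(x_k)^T \nabla F(x_k) \le 0\big)
  &\le \mathbb{P}\Big(\big|\nabla F_{S_k}(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big| \ge \|\nabla F(x_k)\|^2\Big) \\
  &\le \frac{\mathbb{E}\Big[\big(\nabla F_{S_k}(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]}{\|\nabla F(x_k)\|^4}
  \;\le\; \theta_{ip}^2 .
\end{align*}
So, when the inner product test holds, a non-descent direction is produced with probability at most $\theta_{ip}^2$.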

Comparison

Norm test: $\|g(x) - \nabla F(x)\| \leq \theta_n\,\|\nabla F(x)\|$
Inner product test: $\big|g(x)^T \nabla F(x) - \|\nabla F(x)\|^2\big| \leq \theta_{ip}\,\|\nabla F(x)\|^2$

Figure: Given a gradient $\nabla F$, the shaded areas denote the sets of vectors $g$ satisfying the deterministic (a) norm test and (b) inner product test.

Comparison

Lemma. Let $|S_{ip}|$ and $|S_n|$ denote the minimum number of samples required to satisfy the inner product test and the norm test, respectively, at any given iterate $x$ with $\theta_{ip} = \theta_n < 1$. Then
$$\frac{|S_{ip}|}{|S_n|} = \beta(x) \leq 1, \qquad \text{where } \beta(x) = \frac{\mathbb{E}\big[\|\nabla F_i(x)\|^2 \cos^2(\gamma_i)\big] - \|\nabla F(x)\|^2}{\mathbb{E}\big[\|\nabla F_i(x)\|^2\big] - \|\nabla F(x)\|^2},$$
and $\gamma_i$ is the angle made by $\nabla F_i(x)$ with $\nabla F(x)$.

Test Approximation

$$\frac{\mathbb{E}\Big[\big(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]}{|S_k|} \leq \theta_{ip}^2\,\|\nabla F(x_k)\|^4, \qquad \theta_{ip} \in [0,1)$$

Computing the true gradient is expensive.

We approximate the population variance with the sample variance and the true gradient with the sampled gradient:
$$\frac{\mathrm{Var}_{i \in S_k}\big(\nabla F_i(x_k)^T \nabla F_{S_k}(x_k)\big)}{|S_k|} \leq \theta_{ip}^2\,\|\nabla F_{S_k}(x_k)\|^4,$$
$$\mathrm{Var}_{i \in S_k}\big(\nabla F_i(x_k)^T \nabla F_{S_k}(x_k)\big) = \frac{1}{|S_k|-1}\sum_{i \in S_k}\big(\nabla F_i(x_k)^T \nabla F_{S_k}(x_k) - \|\nabla F_{S_k}(x_k)\|^2\big)^2$$

Whenever the condition is not satisfied, we increase the sample size so that it is satisfied (see the backup mechanism in the appendix).
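As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of this practical test. The function name, the defaults, and the rule for picking the next sample size are assumptions; the rule simply chooses the smallest $|S|$ for which the sample-variance bound above would hold.

```python
import numpy as np

def inner_product_test(grad_samples, theta_ip=0.9):
    """Practical inner product test (sample approximation from the slide).

    grad_samples: array of shape (|S_k|, d); row i holds grad F_i(x_k), i in S_k.
    Returns (passed, suggested_size).
    """
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)                          # sampled gradient grad F_{S_k}(x_k)
    inner = grad_samples @ g_S                               # grad F_i(x_k)^T grad F_{S_k}(x_k)
    var = np.sum((inner - g_S @ g_S) ** 2) / max(S - 1, 1)   # sample variance of the inner products
    bound = theta_ip ** 2 * np.linalg.norm(g_S) ** 4
    passed = var / S <= bound
    # Assumed rule (consistent with the test): smallest |S| with var / |S| <= bound.
    suggested_size = S if passed else int(np.ceil(var / bound))
    return passed, suggested_size
```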

Overview

1. Adaptive Sampling Tests: Norm Test, Inner Product Test
2. Convergence Analysis: Orthogonal Test, Linear Convergence
3. Practical Implementation: Step-Length Strategy, Parameter Selection
4. Numerical Experiments
5. Summary

Orthogonal Test

Although the inner product test is practical, convergence cannot be established because it allows sample gradients that are arbitrarily long relative to $\|\nabla F(x_k)\|$.

There is no restriction on near orthogonality of the sampled gradient and the true gradient.

The component of the sampled gradient orthogonal to the true gradient is 0 in expectation.

We therefore control the variance in the orthogonal components of the sampled gradients:
$$\frac{\mathbb{E}\left[\left\|\nabla F_i(x_k) - \frac{\nabla F_i(x_k)^T \nabla F(x_k)}{\|\nabla F(x_k)\|^2}\,\nabla F(x_k)\right\|^2\right]}{|S_k|} \leq \nu^2\,\|\nabla F(x_k)\|^2, \qquad \nu > 0$$
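The slide states this test with the true gradient. A computable version, sketched below, substitutes the sampled gradient and a sample variance in the same spirit as the inner product test; that substitution is my assumption, not a formula quoted from the slides. The default $\nu = \tan(80°)$ follows the parameter-selection slide later on.

```python
import numpy as np

def orthogonality_test(grad_samples, nu=np.tan(np.radians(80.0))):
    """Sample-approximated orthogonal test: bound the variance of the
    components of grad F_i(x_k) orthogonal to the (sampled) gradient."""
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)
    g_norm2 = float(g_S @ g_S)
    proj = (grad_samples @ g_S) / g_norm2                # projection coefficients onto g_S
    orth = grad_samples - np.outer(proj, g_S)            # orthogonal components
    var = np.sum(np.linalg.norm(orth, axis=1) ** 2) / max(S - 1, 1)
    return var / S <= nu ** 2 * g_norm2
```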

Linear Convergence

Theorem. Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \leq L$ such that $\mu I \preceq \nabla^2 F(x) \preceq LI$ for all $x \in \mathbb{R}^d$. Let $\{x_k\}$ be the iterates generated by the subsampled gradient method from any $x_0$, where $|S_k|$ is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given $\theta_{ip} > 0$ and $\nu > 0$. Then, if the steplength satisfies
$$\alpha_k = \alpha = \frac{1}{(1 + \theta_{ip}^2 + \nu^2)L},$$
we have
$$\mathbb{E}\big[F(x_k) - F(x^*)\big] \leq \rho^k\big(F(x_0) - F(x^*)\big), \qquad \text{where } \rho = 1 - \frac{\mu}{(1 + \theta_{ip}^2 + \nu^2)L}.$$
(Convex and non-convex cases: see the backup slides.)

Step-Length Selection

$$x_{k+1} = x_k - \alpha_k \nabla F_{S_k}(x_k)$$

The stochastic gradient method is typically employed with diminishing stepsizes.

An exact line search could be used to determine the stepsize, but it is too expensive.

A constant stepsize $\alpha_k = 1/L$ can be employed, but one needs to know the Lipschitz constant of the problem.

We propose to estimate the Lipschitz constant as we proceed, resulting in adaptive stepsizes (see the loop sketched below and Algorithm 1 that follows).
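To make the pieces concrete, here is a sketch (entirely illustrative, not from the slides) of the resulting loop: grow $|S_k|$ via the practical inner product test sketched earlier, then take a step with $\alpha_k = 1/L_k$, where $L_k$ comes from the Lipschitz-estimation routine sketched after Algorithm 1 below. `loss_fn`, `grad_fn`, and all defaults are placeholder names and values.

```python
import numpy as np

def adaptive_sampling_gd(loss_fn, grad_fn, x0, n, batch0=2, theta_ip=0.9,
                         L0=1.0, max_iter=100, seed=0):
    """Subsampled gradient method with adaptive sample sizes and stepsizes.

    loss_fn(x, idx): sampled objective F_S(x) on the index set idx.
    grad_fn(x, idx): per-sample gradients, shape (len(idx), d).
    """
    rng = np.random.default_rng(seed)
    x, batch, L = np.asarray(x0, dtype=float), batch0, L0
    for _ in range(max_iter):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        G = grad_fn(x, idx)
        passed, suggested = inner_product_test(G, theta_ip)        # sketched earlier
        if not passed:
            batch = min(suggested, n)                               # grow the sample size
            idx = rng.choice(n, size=batch, replace=False)
            G = grad_fn(x, idx)
        g = G.mean(axis=0)
        L = estimate_lipschitz(x, lambda y: loss_fn(y, idx), g, L)  # Algorithm 1, sketched below
        x = x - g / L                                               # alpha_k = 1 / L_k
    return x
```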

Step-Length Selection

Algorithm 1: Estimating the Lipschitz constant
Input: $L_{k-1} > 0$, some $\eta > 1$
1: Compute parameter $\zeta_k > 1$
2: Set $L_k = L_{k-1}/\zeta_k$   (decrease the Lipschitz estimate)
3: Compute $F_{new} = F_{S_k}\big(x_k - \tfrac{1}{L_k}\nabla F_{S_k}(x_k)\big)$
4: while $F_{new} > F_{S_k}(x_k) - \tfrac{1}{2L_k}\|\nabla F_{S_k}(x_k)\|^2$ do   (sufficient decrease)
5:     Set $L_k = \eta L_k$   (increase the Lipschitz estimate)
6:     Compute $F_{new} = F_{S_k}\big(x_k - \tfrac{1}{L_k}\nabla F_{S_k}(x_k)\big)$
7: end while

Beck and Teboulle [2009] proposed this procedure to estimate the Lipschitz constant in deterministic settings.

Schmidt et al. [2016] adapted it to stochastic algorithms such as SAG.

This procedure is similar to a line search on the sampled functions, with memory.

It only needs access to sampled function values.
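A minimal NumPy rendering of Algorithm 1, as a sketch under assumptions: `fS` is the sampled objective on the current batch, the increase step is written multiplicatively so the backtracking loop terminates, and the defaults are illustrative.

```python
def estimate_lipschitz(x, fS, gS, L_prev, zeta=2.0, eta=2.0):
    """Algorithm 1 (sketch): adapt the Lipschitz estimate on the sampled function.

    fS:   callable, fS(y) = F_{S_k}(y).
    gS:   array, grad F_{S_k}(x_k).
    zeta: decrease factor zeta_k > 1 (the slides compute it adaptively; see the
          aggressive-steps heuristic below).
    eta:  increase factor > 1.
    """
    L = L_prev / zeta                                    # tentatively decrease the estimate
    f_x = fS(x)
    g_norm2 = float(gS @ gS)
    f_new = fS(x - gS / L)
    # Backtrack until the sufficient-decrease condition on F_{S_k} holds.
    while f_new > f_x - g_norm2 / (2.0 * L):
        L *= eta                                         # increase the estimate
        f_new = fS(x - gS / L)
    return L
```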

Aggressive Steps Heuristic

It is well known and easy to show that
$$\mathbb{E}[F(x_{k+1})] - F(x_k) \leq -\alpha_k\|\nabla F(x_k)\|^2 + \frac{\alpha_k^2 L}{2}\,\mathbb{E}\big[\|\nabla F_{S_k}(x_k)\|^2\big].$$

Thus, we obtain a decrease in the true objective, in expectation, if
$$\frac{\alpha_k^2 L}{2}\Big(\mathbb{E}\big[\|\nabla F_{S_k}(x_k) - \nabla F(x_k)\|^2\big] + \|\nabla F(x_k)\|^2\Big) \leq \alpha_k\|\nabla F(x_k)\|^2.$$

Aggressive Steps Heuristic

$$\frac{\alpha_k^2 L}{2}\Big(\mathbb{E}\big[\|\nabla F_{S_k}(x_k) - \nabla F(x_k)\|^2\big] + \|\nabla F(x_k)\|^2\Big) \leq \alpha_k\|\nabla F(x_k)\|^2$$

Using $\alpha_k = 1/L_k$, assuming $L_{k-1} \geq L$, and using sample approximations, this holds if
$$L_k \geq \frac{L_{k-1}}{2}\left(\frac{\mathrm{Var}_{i \in S_k}(\nabla F_i(x_k))}{|S_k|\,\|\nabla F_{S_k}(x_k)\|^2} + 1\right),
\qquad \text{where } \mathrm{Var}_{i \in S_k}(\nabla F_i(x_k)) = \frac{1}{|S_k|-1}\sum_{i \in S_k}\|\nabla F_i(x_k) - \nabla F_{S_k}(x_k)\|^2.$$

Therefore,
$$\zeta_k = \max\left\{1,\; \frac{2}{\dfrac{\mathrm{Var}_{i \in S_k}(\nabla F_i(x_k))}{|S_k|\,\|\nabla F_{S_k}(x_k)\|^2} + 1}\right\}.$$
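A short sketch of this heuristic (the function name is mine) that computes $\zeta_k$ from the per-sample gradients of the current batch; its output can be passed as the `zeta` argument of the `estimate_lipschitz` sketch above.

```python
import numpy as np

def aggressive_zeta(grad_samples):
    """zeta_k = max{1, 2 / (Var_{i in S_k}(grad F_i) / (|S_k| ||grad F_S||^2) + 1)}."""
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)
    var = np.sum(np.linalg.norm(grad_samples - g_S, axis=1) ** 2) / max(S - 1, 1)
    ratio = var / (S * float(g_S @ g_S))
    return max(1.0, 2.0 / (ratio + 1.0))
```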

Parameter Selection

By the central limit theorem,
$$\frac{\nabla F_{S_k}(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2}{\sigma/\sqrt{|S_k|}} \sim N(0,1),
\qquad \text{where } \sigma^2 = \mathbb{E}\Big[\big(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]$$
is the true variance.

The parameter $\theta_{ip}$ directly controls the probability of obtaining a descent direction.

$\theta_{ip} = 0.7$ corresponds to a probability of about $0.9$.

$\theta_{ip} = 0.9$ works well in practice.

The orthogonal test is seldom active in practice, and we choose $\nu = \tan(80°)$ for all problems.
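A quick check of the stated correspondence, assuming the normal approximation above and that the inner product test holds with equality (so $\sigma/\sqrt{|S_k|} = \theta_{ip}\|\nabla F(x_k)\|^2$): the probability of a descent direction is then $\Phi(1/\theta_{ip})$, and it is at least this value when the test holds with strict inequality.

```python
from scipy.stats import norm

def descent_probability(theta_ip):
    """P(grad F_S^T grad F > 0) under the normal approximation, with the
    inner product test active (held with equality)."""
    return norm.cdf(1.0 / theta_ip)

print(descent_probability(0.7))   # ~0.92, consistent with the slide's ~0.9
print(descent_probability(0.9))   # ~0.87, the value used in practice
```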

Results: Constant Step-Length Strategy

Regularized logistic regression objective:
$$R(x) = \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-z_i\,x^T y_i)\big) + \frac{\lambda}{2}\|x\|^2$$

Figure: Norm test vs. inner product test with $\theta = 0.90$ and $\alpha_n = \alpha_{ip} = 2^0$; $R(x) - R^*$ plotted against effective gradient evaluations and against iterations. Synthetic dataset ($n = 7000$).
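For reference, a sketch of the per-sample objective and gradients that could be plugged into the loop sketched earlier for this experiment; the value of `lam` and the splitting of the regularizer across samples are assumptions (the slides do not state $\lambda$).

```python
import numpy as np

def make_logreg(Y, z, lam=1e-4):
    """Regularized logistic regression: F_i(x) = log(1 + exp(-z_i x^T y_i)) + (lam/2)||x||^2.

    Y: (n, d) data matrix (rows y_i), z: labels in {-1, +1}.
    Returns (loss_fn, grad_fn) in the form used by adaptive_sampling_gd above.
    """
    def loss_fn(x, idx):
        margins = z[idx] * (Y[idx] @ x)
        return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * float(x @ x)

    def grad_fn(x, idx):
        margins = z[idx] * (Y[idx] @ x)
        coeff = -z[idx] / (1.0 + np.exp(margins))         # derivative of the log-loss term
        return coeff[:, None] * Y[idx] + lam * x          # per-sample gradients (incl. regularizer)
    return loss_fn, grad_fn
```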

Results: Constant Step-Length Strategy

Figure: Batch sizes (%) and the angle between the sampled and true gradients plotted against iterations, norm test vs. inner product test ($\theta = 0.90$, $\alpha_n = \alpha_{ip} = 2^0$). Synthetic dataset ($n = 7000$). Results for other datasets are in the backup slides.

Results: Adaptive Step-Length Strategy

Figure: Batch sizes (%) against iterations and $R(x) - R^*$ against effective gradient evaluations, norm test vs. inner product test ($\theta = 0.90$). Synthetic dataset ($n = 7000$).

Results: Adaptive Step-Length Strategy

Figure: Stepsizes against iterations ($\theta = 0.90$) for the inner product test, the norm test, and the optimal constant steplength. Synthetic dataset ($n = 7000$). Results for other datasets are in the backup slides.

Summary

Adaptive sampling methods are alternative methods for noise reduction.

These methods can lead to speedups when implemented in parallel environments.

We propose a practical inner product test that is better at controlling the sample sizes than the existing norm test.

These methods can use adaptive stepsizes, and second-order information can be incorporated.

We are currently working on practical sampling tests to control sample sizes in stochastic quasi-Newton methods.

Questions???

Thank You

Appendix

Backup Mechanism

Sample approximations are not accurate when the samples are very small (say 1, 5, or 10).

Our tests may not be accurate in controlling the sample sizes in such situations.

We need more accurate approximations in such scenarios:
$$g_{avg} \overset{\text{def}}{=} \frac{1}{r}\sum_{j=k-r+1}^{k}\nabla F_{S_j}(x_j)$$

$r$ should be chosen such that the iterates in the summation are close enough to each other and there are enough samples for $g_{avg}$ to be accurate ($r = 10$). If $\|g_{avg}\| < \gamma\|\nabla F_{S_k}(x_k)\|$ for some $\gamma \in (0,1)$, we use $g_{avg}$ instead of $\nabla F_{S_k}(x_k)$ in the tests.
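A minimal sketch of this mechanism (the class name, window handling, and the value of $\gamma$ are illustrative): keep the last $r$ sampled gradients and, when their average is sufficiently smaller than the current sampled gradient, use the average in the adaptive-sampling tests instead.

```python
from collections import deque
import numpy as np

class GradientBackup:
    """Running average g_avg of the last r sampled gradients (backup mechanism)."""

    def __init__(self, r=10, gamma=0.9):     # gamma in (0, 1); 0.9 is a guess, not from the slides
        self.window = deque(maxlen=r)
        self.gamma = gamma

    def reference_gradient(self, g_Sk):
        """Gradient to use in the adaptive-sampling tests at iteration k."""
        self.window.append(np.asarray(g_Sk, dtype=float))
        g_avg = np.mean(self.window, axis=0)
        if np.linalg.norm(g_avg) < self.gamma * np.linalg.norm(g_Sk):
            return g_avg
        return g_Sk
```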

Convex Functions

Theorem (General Convex Objective). Suppose that $F$ is twice continuously differentiable and convex, and that there exists a constant $L > 0$ such that $\nabla^2 F(x) \preceq LI$ for all $x \in \mathbb{R}^d$. Let $\{x_k\}$ be the iterates generated by the subsampled gradient method from any $x_0$, where $|S_k|$ is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given $\theta_{ip} > 0$ and $\nu > 0$. Then, if the steplength satisfies
$$\alpha_k = \alpha < \frac{1}{(1 + \theta_{ip}^2 + \nu^2)L},$$
we have for any positive integer $T$,
$$\min_{0 \leq k \leq T-1}\;\mathbb{E}[F(x_k)] - F^* \leq \frac{1}{2\alpha c T}\,\|x_0 - x^*\|^2,$$
where the constant $c > 0$ is given by $c = 1 - L\alpha(1 + \theta_{ip}^2 + \nu^2)$.
