
Adaptive Sampling Strategies for Stochastic Optimization
Raghu (Vijaya Raghavendra) Bollapragada (Northwestern University), Richard Byrd (University of Colorado Boulder), Jorge Nocedal (Northwestern University)
January 8, 2018, 11th US-Mexico Workshop on Optimization


Inner Product Test

First-order descent condition: $\nabla F_{S_k}(x_k)^T \nabla F(x_k) > 0$

This holds in expectation:
$$\mathbb{E}\big[\nabla F_{S_k}(x_k)^T \nabla F(x_k)\big] = \|\nabla F(x_k)\|^2 > 0$$

For the descent condition to hold at most iterations, we impose a bound on the variance:
$$\frac{\mathbb{E}\Big[\big(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]}{|S_k|} \leq \theta_{ip}^2\,\|\nabla F(x_k)\|^4, \qquad \theta_{ip} \in [0,1)$$

The test is designed to produce descent directions sufficiently often.

The sample sizes required to satisfy this condition are smaller than those required for the norm condition.
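How often is "sufficiently often"? The slides do not spell this out, but a standard Chebyshev argument, assuming the bound above applies to the deviation of $\nabla F_{S_k}(x_k)^T \nabla F(x_k)$ from its mean (as it does for an i.i.d. sample average), gives an explicit bound on the failure probability:
\begin{align*}
\mathbb{P}\big(\nabla F_{S_k}(x_k)^T \nabla F(x_k) \le 0\big)
  &\le \mathbb{P}\Big(\big|\nabla F_{S_k}(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big| \ge \|\nabla F(x_k)\|^2\Big) \\
  &\le \frac{\mathbb{E}\Big[\big(\nabla F_{S_k}(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]}{\|\nabla F(x_k)\|^4}
  \;\le\; \theta_{ip}^2 .
\end{align*}
So, when the inner product test holds, a non-descent direction is produced with probability at most $\theta_{ip}^2$.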

Comparison

Norm test: $\|g(x) - \nabla F(x)\| \leq \theta_n\,\|\nabla F(x)\|$
Inner product test: $\big|g(x)^T \nabla F(x) - \|\nabla F(x)\|^2\big| \leq \theta_{ip}\,\|\nabla F(x)\|^2$

Figure: Given a gradient $\nabla F$, the shaded areas denote the sets of vectors $g$ satisfying the deterministic (a) norm test and (b) inner product test.

Comparison

Lemma. Let $|S_{ip}|$ and $|S_n|$ denote the minimum number of samples required to satisfy the inner product test and the norm test, respectively, at any given iterate $x$ with $\theta_{ip} = \theta_n < 1$. Then
$$\frac{|S_{ip}|}{|S_n|} = \beta(x) \leq 1, \qquad \text{where } \beta(x) = \frac{\mathbb{E}\big[\|\nabla F_i(x)\|^2 \cos^2(\gamma_i)\big] - \|\nabla F(x)\|^2}{\mathbb{E}\big[\|\nabla F_i(x)\|^2\big] - \|\nabla F(x)\|^2},$$
and $\gamma_i$ is the angle made by $\nabla F_i(x)$ with $\nabla F(x)$.

Test Approximation

$$\frac{\mathbb{E}\Big[\big(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]}{|S_k|} \leq \theta_{ip}^2\,\|\nabla F(x_k)\|^4, \qquad \theta_{ip} \in [0,1)$$

Computing the true gradient is expensive.

We approximate the population variance with the sample variance and the true gradient with the sampled gradient:
$$\frac{\mathrm{Var}_{i \in S_k}\big(\nabla F_i(x_k)^T \nabla F_{S_k}(x_k)\big)}{|S_k|} \leq \theta_{ip}^2\,\|\nabla F_{S_k}(x_k)\|^4,$$
$$\mathrm{Var}_{i \in S_k}\big(\nabla F_i(x_k)^T \nabla F_{S_k}(x_k)\big) = \frac{1}{|S_k|-1}\sum_{i \in S_k}\big(\nabla F_i(x_k)^T \nabla F_{S_k}(x_k) - \|\nabla F_{S_k}(x_k)\|^2\big)^2$$

Whenever the condition is not satisfied, we increase the sample size so that it is satisfied (see the backup mechanism in the appendix).
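As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of this practical test. The function name, the defaults, and the rule for picking the next sample size are assumptions; the rule simply chooses the smallest $|S|$ for which the sample-variance bound above would hold.

```python
import numpy as np

def inner_product_test(grad_samples, theta_ip=0.9):
    """Practical inner product test (sample approximation from the slide).

    grad_samples: array of shape (|S_k|, d); row i holds grad F_i(x_k), i in S_k.
    Returns (passed, suggested_size).
    """
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)                          # sampled gradient grad F_{S_k}(x_k)
    inner = grad_samples @ g_S                               # grad F_i(x_k)^T grad F_{S_k}(x_k)
    var = np.sum((inner - g_S @ g_S) ** 2) / max(S - 1, 1)   # sample variance of the inner products
    bound = theta_ip ** 2 * np.linalg.norm(g_S) ** 4
    passed = var / S <= bound
    # Assumed rule (consistent with the test): smallest |S| with var / |S| <= bound.
    suggested_size = S if passed else int(np.ceil(var / bound))
    return passed, suggested_size
```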

Overview

1. Adaptive Sampling Tests: Norm Test, Inner Product Test
2. Convergence Analysis: Orthogonal Test, Linear Convergence
3. Practical Implementation: Step-Length Strategy, Parameter Selection
4. Numerical Experiments
5. Summary

Orthogonal Test

Although the inner product test is practical, convergence cannot be established because it allows sample gradients that are arbitrarily long relative to $\|\nabla F(x_k)\|$.

There is no restriction on near orthogonality of the sampled gradient and the true gradient.

The component of the sampled gradient orthogonal to the true gradient is 0 in expectation.

We therefore control the variance in the orthogonal components of the sampled gradients:
$$\frac{\mathbb{E}\left[\left\|\nabla F_i(x_k) - \frac{\nabla F_i(x_k)^T \nabla F(x_k)}{\|\nabla F(x_k)\|^2}\,\nabla F(x_k)\right\|^2\right]}{|S_k|} \leq \nu^2\,\|\nabla F(x_k)\|^2, \qquad \nu > 0$$
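The slide states this test with the true gradient. A computable version, sketched below, substitutes the sampled gradient and a sample variance in the same spirit as the inner product test; that substitution is my assumption, not a formula quoted from the slides. The default $\nu = \tan(80°)$ follows the parameter-selection slide later on.

```python
import numpy as np

def orthogonality_test(grad_samples, nu=np.tan(np.radians(80.0))):
    """Sample-approximated orthogonal test: bound the variance of the
    components of grad F_i(x_k) orthogonal to the (sampled) gradient."""
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)
    g_norm2 = float(g_S @ g_S)
    proj = (grad_samples @ g_S) / g_norm2                # projection coefficients onto g_S
    orth = grad_samples - np.outer(proj, g_S)            # orthogonal components
    var = np.sum(np.linalg.norm(orth, axis=1) ** 2) / max(S - 1, 1)
    return var / S <= nu ** 2 * g_norm2
```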

Linear Convergence

Theorem. Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \leq L$ such that $\mu I \preceq \nabla^2 F(x) \preceq LI$ for all $x \in \mathbb{R}^d$. Let $\{x_k\}$ be the iterates generated by the subsampled gradient method from any $x_0$, where $|S_k|$ is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given $\theta_{ip} > 0$ and $\nu > 0$. Then, if the steplength satisfies
$$\alpha_k = \alpha = \frac{1}{(1 + \theta_{ip}^2 + \nu^2)L},$$
we have
$$\mathbb{E}\big[F(x_k) - F(x^*)\big] \leq \rho^k\big(F(x_0) - F(x^*)\big), \qquad \text{where } \rho = 1 - \frac{\mu}{(1 + \theta_{ip}^2 + \nu^2)L}.$$
(Convex and non-convex cases: see the backup slides.)

Step-Length Selection

$$x_{k+1} = x_k - \alpha_k \nabla F_{S_k}(x_k)$$

The stochastic gradient method is typically employed with diminishing stepsizes.

An exact line search could be used to determine the stepsize, but it is too expensive.

A constant stepsize $\alpha_k = 1/L$ can be employed, but one needs to know the Lipschitz constant of the problem.

We propose to estimate the Lipschitz constant as we proceed, resulting in adaptive stepsizes (see the loop sketched below and Algorithm 1 that follows).
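To make the pieces concrete, here is a sketch (entirely illustrative, not from the slides) of the resulting loop: grow $|S_k|$ via the practical inner product test sketched earlier, then take a step with $\alpha_k = 1/L_k$, where $L_k$ comes from the Lipschitz-estimation routine sketched after Algorithm 1 below. `loss_fn`, `grad_fn`, and all defaults are placeholder names and values.

```python
import numpy as np

def adaptive_sampling_gd(loss_fn, grad_fn, x0, n, batch0=2, theta_ip=0.9,
                         L0=1.0, max_iter=100, seed=0):
    """Subsampled gradient method with adaptive sample sizes and stepsizes.

    loss_fn(x, idx): sampled objective F_S(x) on the index set idx.
    grad_fn(x, idx): per-sample gradients, shape (len(idx), d).
    """
    rng = np.random.default_rng(seed)
    x, batch, L = np.asarray(x0, dtype=float), batch0, L0
    for _ in range(max_iter):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        G = grad_fn(x, idx)
        passed, suggested = inner_product_test(G, theta_ip)        # sketched earlier
        if not passed:
            batch = min(suggested, n)                               # grow the sample size
            idx = rng.choice(n, size=batch, replace=False)
            G = grad_fn(x, idx)
        g = G.mean(axis=0)
        L = estimate_lipschitz(x, lambda y: loss_fn(y, idx), g, L)  # Algorithm 1, sketched below
        x = x - g / L                                               # alpha_k = 1 / L_k
    return x
```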

Step-Length Selection

Algorithm 1: Estimating the Lipschitz constant
Input: $L_{k-1} > 0$, some $\eta > 1$
1: Compute parameter $\zeta_k > 1$
2: Set $L_k = L_{k-1}/\zeta_k$   (decrease the Lipschitz estimate)
3: Compute $F_{new} = F_{S_k}\big(x_k - \tfrac{1}{L_k}\nabla F_{S_k}(x_k)\big)$
4: while $F_{new} > F_{S_k}(x_k) - \tfrac{1}{2L_k}\|\nabla F_{S_k}(x_k)\|^2$ do   (sufficient decrease)
5:     Set $L_k = \eta L_k$   (increase the Lipschitz estimate)
6:     Compute $F_{new} = F_{S_k}\big(x_k - \tfrac{1}{L_k}\nabla F_{S_k}(x_k)\big)$
7: end while

Beck and Teboulle [2009] proposed this procedure to estimate the Lipschitz constant in deterministic settings.

Schmidt et al. [2016] adapted it to stochastic algorithms such as SAG.

This procedure is similar to a line search on the sampled functions, with memory.

It only needs access to sampled function values.
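A minimal NumPy rendering of Algorithm 1, as a sketch under assumptions: `fS` is the sampled objective on the current batch, the increase step is written multiplicatively so the backtracking loop terminates, and the defaults are illustrative.

```python
def estimate_lipschitz(x, fS, gS, L_prev, zeta=2.0, eta=2.0):
    """Algorithm 1 (sketch): adapt the Lipschitz estimate on the sampled function.

    fS:   callable, fS(y) = F_{S_k}(y).
    gS:   array, grad F_{S_k}(x_k).
    zeta: decrease factor zeta_k > 1 (the slides compute it adaptively; see the
          aggressive-steps heuristic below).
    eta:  increase factor > 1.
    """
    L = L_prev / zeta                                    # tentatively decrease the estimate
    f_x = fS(x)
    g_norm2 = float(gS @ gS)
    f_new = fS(x - gS / L)
    # Backtrack until the sufficient-decrease condition on F_{S_k} holds.
    while f_new > f_x - g_norm2 / (2.0 * L):
        L *= eta                                         # increase the estimate
        f_new = fS(x - gS / L)
    return L
```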

Aggressive Steps Heuristic

It is well known and easy to show that
$$\mathbb{E}[F(x_{k+1})] - F(x_k) \leq -\alpha_k\|\nabla F(x_k)\|^2 + \frac{\alpha_k^2 L}{2}\,\mathbb{E}\big[\|\nabla F_{S_k}(x_k)\|^2\big].$$

Thus, we obtain a decrease in the true objective, in expectation, if
$$\frac{\alpha_k^2 L}{2}\Big(\mathbb{E}\big[\|\nabla F_{S_k}(x_k) - \nabla F(x_k)\|^2\big] + \|\nabla F(x_k)\|^2\Big) \leq \alpha_k\|\nabla F(x_k)\|^2.$$

Aggressive Steps Heuristic

$$\frac{\alpha_k^2 L}{2}\Big(\mathbb{E}\big[\|\nabla F_{S_k}(x_k) - \nabla F(x_k)\|^2\big] + \|\nabla F(x_k)\|^2\Big) \leq \alpha_k\|\nabla F(x_k)\|^2$$

Using $\alpha_k = 1/L_k$, assuming $L_{k-1} \geq L$, and using sample approximations, this holds if
$$L_k \geq \frac{L_{k-1}}{2}\left(\frac{\mathrm{Var}_{i \in S_k}(\nabla F_i(x_k))}{|S_k|\,\|\nabla F_{S_k}(x_k)\|^2} + 1\right),
\qquad \text{where } \mathrm{Var}_{i \in S_k}(\nabla F_i(x_k)) = \frac{1}{|S_k|-1}\sum_{i \in S_k}\|\nabla F_i(x_k) - \nabla F_{S_k}(x_k)\|^2.$$

Therefore,
$$\zeta_k = \max\left\{1,\; \frac{2}{\dfrac{\mathrm{Var}_{i \in S_k}(\nabla F_i(x_k))}{|S_k|\,\|\nabla F_{S_k}(x_k)\|^2} + 1}\right\}.$$
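A short sketch of this heuristic (the function name is mine) that computes $\zeta_k$ from the per-sample gradients of the current batch; its output can be passed as the `zeta` argument of the `estimate_lipschitz` sketch above.

```python
import numpy as np

def aggressive_zeta(grad_samples):
    """zeta_k = max{1, 2 / (Var_{i in S_k}(grad F_i) / (|S_k| ||grad F_S||^2) + 1)}."""
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)
    var = np.sum(np.linalg.norm(grad_samples - g_S, axis=1) ** 2) / max(S - 1, 1)
    ratio = var / (S * float(g_S @ g_S))
    return max(1.0, 2.0 / (ratio + 1.0))
```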

Parameter Selection

By the central limit theorem,
$$\frac{\nabla F_{S_k}(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2}{\sigma/\sqrt{|S_k|}} \sim N(0,1),
\qquad \text{where } \sigma^2 = \mathbb{E}\Big[\big(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\big)^2\Big]$$
is the true variance.

The parameter $\theta_{ip}$ directly controls the probability of obtaining a descent direction.

$\theta_{ip} = 0.7$ corresponds to a probability of about $0.9$.

$\theta_{ip} = 0.9$ works well in practice.

The orthogonal test is seldom active in practice, and we choose $\nu = \tan(80°)$ for all problems.
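A quick check of the stated correspondence, assuming the normal approximation above and that the inner product test holds with equality (so $\sigma/\sqrt{|S_k|} = \theta_{ip}\|\nabla F(x_k)\|^2$): the probability of a descent direction is then $\Phi(1/\theta_{ip})$, and it is at least this value when the test holds with strict inequality.

```python
from scipy.stats import norm

def descent_probability(theta_ip):
    """P(grad F_S^T grad F > 0) under the normal approximation, with the
    inner product test active (held with equality)."""
    return norm.cdf(1.0 / theta_ip)

print(descent_probability(0.7))   # ~0.92, consistent with the slide's ~0.9
print(descent_probability(0.9))   # ~0.87, the value used in practice
```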

Results: Constant Step-Length Strategy

Regularized logistic regression objective:
$$R(x) = \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-z_i\,x^T y_i)\big) + \frac{\lambda}{2}\|x\|^2$$

Figure: Norm test vs. inner product test with $\theta = 0.90$ and $\alpha_n = \alpha_{ip} = 2^0$; $R(x) - R^*$ plotted against effective gradient evaluations and against iterations. Synthetic dataset ($n = 7000$).
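For reference, a sketch of the per-sample objective and gradients that could be plugged into the loop sketched earlier for this experiment; the value of `lam` and the splitting of the regularizer across samples are assumptions (the slides do not state $\lambda$).

```python
import numpy as np

def make_logreg(Y, z, lam=1e-4):
    """Regularized logistic regression: F_i(x) = log(1 + exp(-z_i x^T y_i)) + (lam/2)||x||^2.

    Y: (n, d) data matrix (rows y_i), z: labels in {-1, +1}.
    Returns (loss_fn, grad_fn) in the form used by adaptive_sampling_gd above.
    """
    def loss_fn(x, idx):
        margins = z[idx] * (Y[idx] @ x)
        return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * float(x @ x)

    def grad_fn(x, idx):
        margins = z[idx] * (Y[idx] @ x)
        coeff = -z[idx] / (1.0 + np.exp(margins))         # derivative of the log-loss term
        return coeff[:, None] * Y[idx] + lam * x          # per-sample gradients (incl. regularizer)
    return loss_fn, grad_fn
```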

Results: Constant Step-Length Strategy

Figure: Batch sizes (%) and the angle between the sampled and true gradients plotted against iterations, norm test vs. inner product test ($\theta = 0.90$, $\alpha_n = \alpha_{ip} = 2^0$). Synthetic dataset ($n = 7000$). Results for other datasets are in the backup slides.

Results: Adaptive Step-Length Strategy

Figure: Batch sizes (%) against iterations and $R(x) - R^*$ against effective gradient evaluations, norm test vs. inner product test ($\theta = 0.90$). Synthetic dataset ($n = 7000$).

Results: Adaptive Step-Length Strategy

Figure: Stepsizes against iterations ($\theta = 0.90$) for the inner product test, the norm test, and the optimal constant steplength. Synthetic dataset ($n = 7000$). Results for other datasets are in the backup slides.

Summary

Adaptive sampling methods are alternative methods for noise reduction.

These methods can lead to speedups when implemented in parallel environments.

We propose a practical inner product test that is better at controlling the sample sizes than the existing norm test.

These methods can use adaptive stepsizes, and second-order information can be incorporated.

We are currently working on practical sampling tests to control sample sizes in stochastic quasi-Newton methods.

Questions???

Thank You

Appendix

Backup Mechanism

Sample approximations are not accurate when the samples are very small (say 1, 5, or 10).

Our tests may not be accurate in controlling the sample sizes in such situations.

We need more accurate approximations in such scenarios:
$$g_{avg} \overset{\text{def}}{=} \frac{1}{r}\sum_{j=k-r+1}^{k}\nabla F_{S_j}(x_j)$$

$r$ should be chosen such that the iterates in the summation are close enough to each other and there are enough samples for $g_{avg}$ to be accurate ($r = 10$). If $\|g_{avg}\| < \gamma\|\nabla F_{S_k}(x_k)\|$ for some $\gamma \in (0,1)$, we use $g_{avg}$ instead of $\nabla F_{S_k}(x_k)$ in the tests.
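A minimal sketch of this mechanism (the class name, window handling, and the value of $\gamma$ are illustrative): keep the last $r$ sampled gradients and, when their average is sufficiently smaller than the current sampled gradient, use the average in the adaptive-sampling tests instead.

```python
from collections import deque
import numpy as np

class GradientBackup:
    """Running average g_avg of the last r sampled gradients (backup mechanism)."""

    def __init__(self, r=10, gamma=0.9):     # gamma in (0, 1); 0.9 is a guess, not from the slides
        self.window = deque(maxlen=r)
        self.gamma = gamma

    def reference_gradient(self, g_Sk):
        """Gradient to use in the adaptive-sampling tests at iteration k."""
        self.window.append(np.asarray(g_Sk, dtype=float))
        g_avg = np.mean(self.window, axis=0)
        if np.linalg.norm(g_avg) < self.gamma * np.linalg.norm(g_Sk):
            return g_avg
        return g_Sk
```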

Convex Functions

Theorem (General Convex Objective). Suppose that $F$ is twice continuously differentiable and convex, and that there exists a constant $L > 0$ such that $\nabla^2 F(x) \preceq LI$ for all $x \in \mathbb{R}^d$. Let $\{x_k\}$ be the iterates generated by the subsampled gradient method from any $x_0$, where $|S_k|$ is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given $\theta_{ip} > 0$ and $\nu > 0$. Then, if the steplength satisfies
$$\alpha_k = \alpha < \frac{1}{(1 + \theta_{ip}^2 + \nu^2)L},$$
we have for any positive integer $T$,
$$\min_{0 \leq k \leq T-1}\;\mathbb{E}[F(x_k)] - F^* \leq \frac{1}{2\alpha c T}\,\|x_0 - x^*\|^2,$$
where the constant $c > 0$ is given by $c = 1 - L\alpha(1 + \theta_{ip}^2 + \nu^2)$.
