Robust ML Training with Conditional Gradients
Sebastian Pokutta
Technische Universität Berlin and Zuse Institute Berlin
pokutta@math.tu-berlin.de · @spokutta
CO@Work 2020 Summer School, September 2020
Berlin Mathematics Research Center MATH+
Opportunities in Berlin (shameless plug)
Postdoc and PhD positions in optimization/ML at Zuse Institute Berlin and TU Berlin.
What is this talk about?
Introduction: Can we train, e.g., Neural Networks so that they are (more) robust to noise and adversarial attacks?

Outline
• A simple example
• The basic setup of supervised Machine Learning
• Stochastic Gradient Descent
• Stochastic Conditional Gradient Descent

(Hyperlinked) references are not exhaustive; check the references contained therein. Statements are simplified for the sake of exposition.
Supervised Machine Learning and ERM
A simple example

Consider the following simple learning problem, a.k.a. linear regression [Wikipedia]:

Given: a set of points X := {x_1, ..., x_k} ⊆ ℝ^n and a vector y := (y_1, ..., y_k) ∈ ℝ^k.
Find: a linear function θ ∈ ℝ^n such that x_i θ ≈ y_i for all i ∈ [k], or in matrix form, Xθ ≈ y.

The search for the best θ can be naturally cast as an optimization problem:

min_θ ∑_{i∈[k]} |x_i θ − y_i|² = min_θ ‖Xθ − y‖²    (linReg)
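To make (linReg) concrete, here is a minimal least-squares sketch in Python. It is not part of the slides; the synthetic data and the variable names (X, y, theta) are only illustrative and mirror the slide's notation.

```python
# Minimal numerical sketch of (linReg); synthetic data, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
k, n = 100, 5                                    # k points in R^n
X = rng.normal(size=(k, n))                      # rows are the points x_1, ..., x_k
theta_true = rng.normal(size=n)
y = X @ theta_true + 0.1 * rng.normal(size=k)    # noisy labels with y_i ≈ x_i θ

# Solve min_θ ‖Xθ − y‖² via least squares
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(X @ theta_hat - y))         # residual of the fitted model
```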
Supervised Machine Learning and ERM
Empirical Risk Minimization

More generally, we are interested in the Empirical Risk Minimization problem:

min_θ L̂(θ) := min_θ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y)    (ERM)

The ERM approximates the General Risk Minimization problem:

min_θ L(θ) := min_θ E_{(x,y)∼D} [ℓ(f(x,θ), y)]    (GRM)

Note: If D is chosen large enough, then under relatively mild assumptions a solution to (ERM) is a good approximation to a solution to (GRM):

L(θ) ≤ L̂(θ) + √( (log|Θ| + log(1/δ)) / |D| ),

with probability 1 − δ. This bound is typically very loose.
[e.g., Suriya Gunasekar's lecture notes; The Elements of Statistical Learning, Hastie et al.]
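As an illustration of the finite-sum structure of (ERM), the following sketch (not from the slides) evaluates the empirical risk for a generic model and loss; `f` and `ell` are placeholder functions, instantiated here with the linear-regression example from the previous slide.

```python
# Illustrative sketch of the finite-sum objective in (ERM).
# `f` and `ell` are placeholders for the model f(θ, x) and the loss ℓ(z, y).
import numpy as np

def empirical_risk(theta, data, f, ell):
    """L̂(θ) = (1/|D|) Σ_{(x,y) ∈ D} ℓ(f(θ, x), y) over a list of (x, y) pairs."""
    return np.mean([ell(f(theta, x), y) for x, y in data])

# Example instantiation with the linear-regression loss:
f = lambda theta, x: x @ theta
ell = lambda z, y: (z - y) ** 2
```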
Supervised Machine Learning and ERM
Empirical Risk Minimization: Examples

1. Linear Regression: ℓ(z_i, y_i) := |z_i − y_i|² and z_i = f(θ, x_i) := x_i θ
2. Classification / Logistic Regression over classes C: ℓ(z_i, y_i) := −∑_{c∈[C]} y_{i,c} log z_{i,c} and, e.g., z_i = f(θ, x_i) := x_i θ (or a neural network)
3. Support Vector Machines: ℓ(z_i, y_i) := y_i max(0, 1 − z_i) + (1 − y_i) max(0, 1 + z_i) and z_i = f(θ, x_i) := x_i θ
4. Neural Networks: ℓ(z_i, y_i) some loss function and z_i = f(θ, x_i) a neural network with weights θ

...and many more choices and combinations are possible (the losses in items 1–3 are written out in a short code sketch below).
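A direct transcription of the losses in items 1–3 into Python; this is a sketch rather than part of the slides, and it assumes a scalar prediction z_i for items 1 and 3 and a probability vector z_i over the C classes for item 2.

```python
# The example losses from this slide as plain functions.
import numpy as np

def squared_loss(z, y):                  # 1. linear regression
    return (z - y) ** 2

def cross_entropy(z, y):                 # 2. classification over C classes;
    return -np.sum(y * np.log(z))        #    z and y are length-C vectors, z a probability vector

def svm_loss(z, y):                      # 3. SVM-style loss with labels y ∈ {0, 1}
    return y * max(0.0, 1.0 - z) + (1 - y) * max(0.0, 1.0 + z)
```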
Optimizing the ERM Problem
Stochastic Gradient Descent

How to solve Problem (ERM)? [see blog for background on convex optimization]

Simple idea: Gradient Descent

θ_{t+1} ← θ_t − η ∇L̂(θ_t)    (GD)

Unfortunately, this might be too expensive if (ERM) has a lot of summands. However, reexamine:

∇L̂(θ) = ∇ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y) = (1/|D|) ∑_{(x,y)∈D} ∇ℓ(f(x,θ), y)    (ERMgrad)

Thus, if we sample (x, y) ∈ D uniformly at random, then ∇ℓ(f(x,θ), y) is an unbiased estimator of ∇L̂(θ)    (gradEst)
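A short sketch contrasting the full-batch step (GD) with a single-sample step built from the estimator in (gradEst). This is not from the slides: `grad_ell(theta, x, y)` is a placeholder for ∇ℓ(f(x,θ), y), `data` is assumed to be a list of (x, y) pairs, and the single-sample update shown in `sgd_step` is the standard SGD rule, assumed here for illustration.

```python
# Sketch of (GD) versus a single-sample stochastic step based on (gradEst).
import numpy as np

def full_gradient(theta, data, grad_ell):
    # (ERMgrad): the gradient of the average is the average of the gradients
    return np.mean([grad_ell(theta, x, y) for x, y in data], axis=0)

def gd_step(theta, data, grad_ell, eta=0.1):
    return theta - eta * full_gradient(theta, data, grad_ell)      # (GD)

def sgd_step(theta, data, grad_ell, eta=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, y = data[rng.integers(len(data))]          # sample (x, y) ∈ D uniformly at random
    return theta - eta * grad_ell(theta, x, y)    # unbiased estimate of ∇L̂(θ), cf. (gradEst)
```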