
Robust ML Training with Conditional Gradients - PowerPoint PPT Presentation



  1. Robust ML Training with Conditional Gradients. Sebastian Pokutta, Technische Universität Berlin and Zuse Institute Berlin. pokutta@math.tu-berlin.de, @spokutta. CO@Work 2020 Summer School, September 2020. Berlin Mathematics Research Center MATH+.

  2. Opportunities in Berlin (shameless plug): Postdoc and PhD positions in optimization/ML at Zuse Institute Berlin and TU Berlin.

  3-4. Introduction: What is this talk about?
  Can we train, e.g., Neural Networks so that they are (more) robust to noise and adversarial attacks?
  Outline:
  • A simple example
  • The basic setup of supervised Machine Learning
  • Stochastic Gradient Descent
  • Stochastic Conditional Gradient Descent
  (Hyperlinked) references are not exhaustive; check the references contained therein. Statements are simplified for the sake of exposition.

  5-6. Supervised Machine Learning and ERM: A simple example
  Consider the following simple learning problem, a.k.a. linear regression [Wikipedia]:
  Given: a set of points X ≔ {x_1, ..., x_k} ⊆ R^n and a vector y ≔ (y_1, ..., y_k) ∈ R^k.
  Find: a linear function θ ∈ R^n such that x_i θ ≈ y_i for all i ∈ [k], or in matrix form, X θ ≈ y.
  The search for the best θ can be naturally cast as an optimization problem:
  (linReg)   min_θ Σ_{i ∈ [k]} |x_i θ − y_i|² = min_θ ‖X θ − y‖²
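
A minimal NumPy sketch of the least-squares problem (linReg) above; the synthetic data, the shapes, and the use of np.linalg.lstsq are illustrative assumptions, not taken from the slides.

    import numpy as np

    # Synthetic instance of (linReg): k points x_i in R^n with noisy linear labels y_i.
    rng = np.random.default_rng(0)
    k, n = 100, 5
    X = rng.normal(size=(k, n))                    # rows are the points x_i
    theta_true = rng.normal(size=n)
    y = X @ theta_true + 0.1 * rng.normal(size=k)

    # Solve min_theta ||X theta - y||^2 in closed form via least squares.
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("squared error:", np.sum((X @ theta_hat - y) ** 2))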

  7-9. Supervised Machine Learning and ERM: Empirical Risk Minimization
  More generally, we are interested in the Empirical Risk Minimization problem
  (ERM)   min_θ L̂(θ) ≔ min_θ (1/|D|) Σ_{(x,y) ∈ D} ℓ(f(x, θ), y).
  The ERM approximates the General Risk Minimization problem
  (GRM)   min_θ L(θ) ≔ min_θ E_{(x,y) ∼ 𝒟} ℓ(f(x, θ), y),
  where 𝒟 is the underlying data distribution from which the sample D is drawn.
  Note: If D is chosen large enough, under relatively mild assumptions, a solution to (ERM) is a good approximation to a solution to (GRM):
  L(θ) ≤ L̂(θ) + √((log|Θ| + log(1/δ)) / |D|)   with probability 1 − δ.
  This bound is typically very loose. [e.g., Suriya Gunasekar's lecture notes; The Elements of Statistical Learning, Hastie et al.]
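
A small sketch of evaluating the empirical risk L̂(θ) from (ERM), assuming squared loss and the linear model f(x, θ) = xθ; the dataset D and all names here are hypothetical, not from the slides.

    import numpy as np

    def empirical_risk(theta, D, loss):
        # L_hat(theta) = (1/|D|) * sum of loss(f(x, theta), y) over (x, y) in D,
        # here with the linear model f(x, theta) = x @ theta.
        return np.mean([loss(x @ theta, y) for x, y in D])

    squared_loss = lambda z, y: (z - y) ** 2

    rng = np.random.default_rng(1)
    D = [(rng.normal(size=3), rng.normal()) for _ in range(50)]   # hypothetical finite sample
    print("empirical risk at theta = 0:", empirical_risk(np.zeros(3), D, squared_loss))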

  10-13. Supervised Machine Learning and ERM: Empirical Risk Minimization: Examples
  1. Linear Regression: ℓ(z_i, y_i) ≔ |z_i − y_i|² and z_i = f(θ, x_i) ≔ x_i θ
  2. Classification / Logistic Regression over classes C: ℓ(z_i, y_i) ≔ −Σ_{c ∈ [C]} y_{i,c} log z_{i,c} and, e.g., z_i = f(θ, x_i) ≔ x_i θ (or a neural network)
  3. Support Vector Machines: ℓ(z_i, y_i) ≔ y_i max(0, 1 − z_i) + (1 − y_i) max(0, 1 + z_i) and z_i = f(θ, x_i) ≔ x_i θ
  4. Neural Networks: ℓ(z_i, y_i) some loss function and z_i = f(θ, x_i) a neural network with weights θ
  ...and many more choices and combinations possible.
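
Hedged sketches of the losses listed above as plain NumPy functions; the vectorization and the label encodings (one-hot y for cross-entropy, y ∈ {0, 1} for the hinge loss) are assumptions for illustration, not prescribed by the slides.

    import numpy as np

    def squared_loss(z, y):
        # 1. Linear regression: |z - y|^2
        return np.abs(z - y) ** 2

    def cross_entropy_loss(z, y):
        # 2. Classification over C classes: -sum_c y_c * log(z_c),
        #    with z the predicted class probabilities and y a one-hot label.
        return -np.sum(y * np.log(z))

    def hinge_loss(z, y):
        # 3. Support vector machines with labels y in {0, 1}:
        #    y * max(0, 1 - z) + (1 - y) * max(0, 1 + z)
        return y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z)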

  14-17. Optimizing the ERM Problem: Stochastic Gradient Descent
  How to solve Problem (ERM)? [see blog for background on convex optimization]
  Simple idea: Gradient Descent
  (GD)   θ_{t+1} ← θ_t − η ∇L̂(θ_t)
  Unfortunately, this might be too expensive if (ERM) has a lot of summands. However, reexamine:
  (ERMgrad)   ∇L̂(θ) = ∇ (1/|D|) Σ_{(x,y) ∈ D} ℓ(f(x, θ), y) = (1/|D|) Σ_{(x,y) ∈ D} ∇ℓ(f(x, θ), y).
  Thus if we sample (x, y) ∈ D uniformly at random, then
  (gradEst)   E_{(x,y)} [∇ℓ(f(x, θ), y)] = ∇L̂(θ),
  i.e., a single sampled gradient is an unbiased estimator of the full gradient.
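
A minimal sketch contrasting the full gradient (ERMgrad) with the single-sample estimator (gradEst) inside a stochastic gradient descent loop, on the least-squares problem from earlier; the step size, iteration count, and data are illustrative assumptions, not from the slides.

    import numpy as np

    rng = np.random.default_rng(2)
    k, n = 200, 5
    X = rng.normal(size=(k, n))
    y = X @ rng.normal(size=n)

    def full_gradient(theta):
        # (ERMgrad): grad L_hat(theta) = (1/k) * sum_i grad of (x_i theta - y_i)^2
        return 2.0 / k * X.T @ (X @ theta - y)

    def sampled_gradient(theta):
        # (gradEst): gradient at a single (x_i, y_i) drawn uniformly at random;
        # its expectation equals the full gradient above.
        i = rng.integers(k)
        return 2.0 * X[i] * (X[i] @ theta - y[i])

    theta, eta = np.zeros(n), 0.01
    for t in range(2000):
        theta -= eta * sampled_gradient(theta)    # SGD step: theta <- theta - eta * gradient estimate
    print("ERM objective after SGD:", np.mean((X @ theta - y) ** 2))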
