TIME/ACCURACY TRADEOFFS FOR LEARNING A RELU WITH RESPECT TO GAUSSIAN MARGINALS
Surbhi Goel, Sushrut Karmalkar, Adam Klivans
The University of Texas at Austin
WHAT IS RELU REGRESSION?
Given: samples (x, y) drawn from a distribution with arbitrary labels
Output: ŵ ∈ ℝ^d such that 𝔼[(relu(ŵ · x) − y)²] ≤ opt + ϵ   (test error)
opt := min_w 𝔼[(relu(w · x) − y)²]   (loss of the best-fitting ReLU)
relu(a) = max(0, a)
The underlying optimization problem is non-convex!
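As a concrete rendering of this objective (my own sketch, not from the slides), the test error of a candidate ŵ can be estimated on held-out samples as follows.

import numpy as np

def relu(a):
    # relu(a) = max(0, a), as defined on the slide
    return np.maximum(a, 0.0)

def relu_test_error(w_hat, X, y):
    # Empirical version of E[(relu(w_hat . x) - y)^2] over held-out samples (X, y).
    # The goal of ReLU regression is to output w_hat making this at most opt + eps.
    return np.mean((relu(X @ w_hat) - y) ** 2)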
PRIOR WORK - POSITIVE
Mean-zero noise: isotonic regression over the sphere [Kalai-Sastry’08, Kakade-Kalai-Kanade-Shamir’11]
Noiseless: gradient descent over Gaussian input [Soltanolkotabi’17]
Results require strong restrictions on the input or the labels
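For intuition, a minimal sketch of the GLMtron-style update behind the isotonic-regression line of work [Kakade-Kalai-Kanade-Shamir’11]; the step size, step count, and stopping rule below are illustrative placeholders, not the analyzed settings, and this is not the algorithm of the present paper.

import numpy as np

def glmtron_relu(X, y, steps=100, eta=1.0):
    # GLMtron-style update for ReLU regression:
    #   w <- w + eta * average of (y_i - relu(w . x_i)) x_i
    # Sketch only: eta and steps are placeholder choices.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        preds = np.maximum(X @ w, 0.0)          # relu(w . x_i) for each sample
        w = w + eta * (X.T @ (y - preds)) / n   # isotonic-regression-style correction
    return w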
PRIOR WORK - NEGATIVE
Minimizing training loss is NP-hard [Manurangsi-Reichman’18]
Hardness over the uniform distribution on the boolean cube [G-Kanade-K-Thaler’17]
Results use special discrete distributions to prove hardness
DISTRIBUTION ASSUMPTION
Assumption: for every (x, y) drawn from the distribution, x ∼ 𝒩(0, I_d) and y ∈ [0, 1]
Gaussian input allows for positive results in the noiseless setting [Tian’17, Soltanolkotabi’17, Li-Yuan’17, Zhong-Song-Jain-Bartlett-Dhillon’17, Brutzkus-Globerson’17, Zhong-Song-Dhillon’17, Du-Lee-Tian-Poczos-Singh’18, Zhang-Yu-Wang-Gu’19, Fu-Chi-Liang’19, …]
Gaussian input lets one explicitly compute closed-form expressions for the loss/gradient
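An illustrative identity of this kind (my addition, not from the slides; it is the standard degree-one arc-cosine kernel formula, with θ the angle between w and v):

\[
\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\!\left[\mathrm{relu}(w \cdot x)\,\mathrm{relu}(v \cdot x)\right]
= \frac{\lVert w \rVert\,\lVert v \rVert}{2\pi}\bigl(\sin\theta + (\pi - \theta)\cos\theta\bigr).
\]

Expanding 𝔼[(relu(w · x) − y)²] term by term with identities of this form is what yields closed-form expressions for the population loss and its gradient.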
HARDNESS RESULT
There exists NO algorithm for ReLU regression up to error ϵ in time d^{o(log(1/ϵ))}, under standard computational hardness assumptions.
The problem is as hard as learning sparse parities with noise!
First hardness result under the Gaussian assumption!
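For context, a gloss of the assumed-hard problem (standard definition, not text from the slides): in learning k-sparse parities with noise, examples look like

\[
x \sim \{\pm 1\}^d \ \text{uniform}, \qquad
y = \begin{cases} \prod_{i \in S} x_i & \text{with probability } 1 - \eta,\\ -\prod_{i \in S} x_i & \text{with probability } \eta, \end{cases}
\qquad |S| = k,
\]

and the learner must recover the hidden set S. This is widely conjectured to require d^{Ω(k)} time; presumably the reduction takes k on the order of log(1/ϵ), which is where the d^{o(log(1/ϵ))} time barrier comes from.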
HARDNESS FOR GRADIENT DESCENT
Unconditionally, NO statistical query (SQ) algorithm with bounded-norm queries can perform ReLU regression up to error ϵ using only d^{o(log(1/ϵ))} queries.
Gradient Descent (GD) is well known to be an SQ algorithm
GD can NOT solve ReLU regression in polynomial time
Recall that GD works in the noiseless setting [Soltanolkotabi’17]
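To illustrate why gradient descent fits the SQ framework (my own sketch; the oracle name and tolerance are hypothetical): each coordinate of the population gradient of the squared loss is an expectation of a function of (x, y), so one GD step can be simulated with one statistical query per coordinate.

import numpy as np

def sq_oracle(phi, samples, tol):
    # Hypothetical SQ oracle: returns E[phi(x, y)] up to additive tolerance tol.
    # Simulated here by a sample average plus bounded perturbation.
    X, y = samples
    vals = np.array([phi(x_i, y_i) for x_i, y_i in zip(X, y)])
    return vals.mean() + np.random.uniform(-tol, tol)

def gd_step_via_sq(w, samples, eta=0.1, tol=1e-3):
    # One gradient-descent step on the squared ReLU loss, expressed as d statistical
    # queries: coordinate j of the gradient is E[2 (relu(w.x) - y) 1[w.x > 0] x_j].
    d = len(w)
    grad = np.zeros(d)
    for j in range(d):
        phi_j = lambda x, yv, j=j: 2.0 * (max(w @ x, 0.0) - yv) * (1.0 if w @ x > 0 else 0.0) * x[j]
        grad[j] = sq_oracle(phi_j, samples, tol)
    return w - eta * grad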
APPROXIMATION RESULT
There exists an algorithm for ReLU regression with error O(opt^{2/3}) + ϵ in time poly(d, 1/ϵ).
Can get O(opt) + ϵ in time poly(d, 1/ϵ) [Diakonikolas-G-K-K-Soltanolkotabi ’TBD]
Finding approximate solutions is tractable!
THANK YOU! Poster @ East Exhibition Hall B + C #235