Machine Learning 2 (DS 4420), Spring 2020
Neural Networks & Backprop
Byron C. Wallace
Neural Networks!
• In 2020, neural networks are the dominant technology in machine learning (for better or worse)!
• Today, we'll go over some of the fundamentals of NNs and modern libraries (we saw a preview last week, with auto-diff)!
• This will also serve as a refresher on gradient descent
Gradient Descent in Linear Models
Last time we thought in probabilistic terms and discussed maximum likelihood estimation for "generative" models.
Today we'll take the view of learning as search/optimization.
We'll start with linear models, review gradient descent, and then talk about neural nets + backprop.
Loss
The simplest loss is probably 0/1 loss:
0 if we're correct
1 if we're wrong
What's an algorithm that minimizes this?
The Perceptron!
Training data: (x, y) pairs
Consider a simple linear model with parameters w:
ŷ = sign(w·x), i.e., +1 if w·x ≥ 0 and −1 otherwise
(assumes bias term moved into x or omitted)
The learning problem is to estimate w.
What is our criterion for a good w? Minimal loss
Perceptron!
Algorithm 5 PerceptronTrain(D, MaxIter)
1: w_d ← 0, for all d = 1 … D // initialize weights
2: b ← 0 // initialize bias
3: for iter = 1 … MaxIter do
4:   for all (x, y) ∈ D do
5:     a ← ∑_{d=1}^{D} w_d x_d + b // compute activation for this example
6:     if ya ≤ 0 then
7:       w_d ← w_d + y x_d, for all d = 1 … D // update weights
8:       b ← b + y // update bias
9:     end if
10:   end for
11: end for
12: return w_0, w_1, …, w_D, b
Fig and Alg from CIML [Daume]
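A minimal NumPy sketch of this training loop; the toy dataset at the end is made up for illustration:

import numpy as np

def perceptron_train(X, y, max_iter=100):
    """Train a perceptron. X: (n, d) features; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)   # initialize weights
    b = 0.0           # initialize bias
    for _ in range(max_iter):
        for x_i, y_i in zip(X, y):
            a = np.dot(w, x_i) + b   # activation for this example
            if y_i * a <= 0:         # mistake (or on the boundary)
                w += y_i * x_i       # update weights
                b += y_i             # update bias
    return w, b

# Toy usage: two linearly separable points
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1, -1])
w, b = perceptron_train(X, y)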
Problems with 0/1 loss
• If we're wrong by .0001 it is "as bad" as being wrong by .9999
• Because it is discrete, optimization is hard if the instances are not linearly separable
Smooth loss
Idea: Introduce a "smooth" loss function to make optimization easier.
Example: Hinge loss
ℓ(hin)(y, ŷ) = max{0, 1 − yŷ}
Here ŷ = w·x is the raw (signed) model output: the loss is 0 when we are correct with margin at least 1, and grows linearly as we get more wrong.
[Figure: hinge loss plotted against the signed margin yŷ]
Losses
Zero/one: ℓ(0/1)(y, ŷ) = 1[yŷ ≤ 0]
Hinge: ℓ(hin)(y, ŷ) = max{0, 1 − yŷ}
Logistic: ℓ(log)(y, ŷ) = (1/log 2) log(1 + exp[−yŷ])
Exponential: ℓ(exp)(y, ŷ) = exp[−yŷ]
Squared: ℓ(sqr)(y, ŷ) = (y − ŷ)²
Fig and Eq's from CIML [Daume]
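Written out in NumPy for a quick sanity check (here ŷ is the raw signed output w·x + b, not a thresholded label):

import numpy as np

def zero_one(y, yhat):    return (y * yhat <= 0).astype(float)
def hinge(y, yhat):       return np.maximum(0.0, 1.0 - y * yhat)
def logistic(y, yhat):    return np.log1p(np.exp(-y * yhat)) / np.log(2.0)
def exponential(y, yhat): return np.exp(-y * yhat)
def squared(y, yhat):     return (y - yhat) ** 2

# All of these upper-bound (or smoothly approximate) 0/1 loss near the margin:
y, yhat = 1.0, np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for loss in (zero_one, hinge, logistic, exponential, squared):
    print(loss.__name__, loss(y, yhat))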
Regularization
min_{w,b} ∑_n ℓ(y_n, w·x_n + b) + λ R(w, b)
Prevent w from "getting too crazy"
Gradient descent
[Figure: gradient descent on a loss surface. Gradient_descent.png, original uploader Olegalexandrov at English Wikipedia; derivative work by Zerodamage. Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]
Algorithm 21 GradientDescent(F, K, η1, …)
1: z(0) ← ⟨0, 0, …, 0⟩ // initialize variable we are optimizing
2: for k = 1 … K do
3:   g(k) ← ∇_z F |_{z(k−1)} // compute gradient at current location
4:   z(k) ← z(k−1) − η(k) g(k) // take a step down the gradient
5: end for
6: return z(K)
Alg from CIML [Daume]
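A direct NumPy translation of this loop; the quadratic objective and step size in the toy example below are illustrative assumptions:

import numpy as np

def gradient_descent(grad_F, z0, K=100, eta=0.1):
    """Minimize F by taking K steps down its gradient."""
    z = z0.copy()          # z^(0): starting point
    for k in range(K):
        g = grad_F(z)      # g^(k): gradient at current location
        z = z - eta * g    # z^(k): take a step down the gradient
    return z

# Toy example: F(z) = ||z - c||^2 has gradient 2(z - c); minimum at c
c = np.array([3.0, -1.0])
z_star = gradient_descent(lambda z: 2 * (z - c), z0=np.zeros(2))
print(z_star)  # close to [3, -1]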
Gradient of the exponential loss with L2 regularization:
∇_w L = ∇_w [ ∑_n exp(−y_n(w·x_n + b)) + (λ/2)‖w‖² ]
      = ∑_n (∇_w −y_n(w·x_n + b)) exp(−y_n(w·x_n + b)) + λw
      = −∑_n y_n x_n exp(−y_n(w·x_n + b)) + λw
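That final expression is easy to check numerically; a sketch assuming a small random dataset, verified against finite differences:

import numpy as np

def exp_loss_grad_w(w, b, X, y, lam):
    """Gradient of sum_n exp(-y_n (w.x_n + b)) + (lam/2)||w||^2 w.r.t. w."""
    margins = y * (X @ w + b)          # y_n (w.x_n + b)
    coefs = np.exp(-margins)           # exp[-y_n (w.x_n + b)]
    return -(y * coefs) @ X + lam * w  # -sum_n y_n x_n exp[...] + lam w

# Check against finite differences on made-up data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([1, -1, 1, 1, -1])
w, b, lam = rng.normal(size=3), 0.5, 0.1
g = exp_loss_grad_w(w, b, X, y, lam)

def L(w):
    return np.exp(-y * (X @ w + b)).sum() + lam / 2 * (w @ w)

eps = 1e-6
num = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(g, num, atol=1e-5)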
Limitations of linear models
A linear model can only represent linearly separable decision boundaries (the classic example: it cannot compute XOR).
Neural networks
Idea: Basically stack together a bunch of linear models. This introduces hidden units which are neither observations (x) nor outputs (y).
[Figure: inputs x feed hidden units h through weights W1; hidden units feed the output y through weights w2]
(Non-linear) activation functions
The challenge: How do we update weights associated with each node in this multi-layer regime?
back-propagation = gradient descent + chain rule
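As a concrete (assumed) example using PyTorch, an auto-diff library like the one previewed last week: calling .backward() runs back-propagation through the computation graph, and we then take a gradient descent step by hand.

import torch

# A tiny two-layer network, written out by hand
x = torch.tensor([1.0, -2.0])
W1 = torch.randn(3, 2, requires_grad=True)  # input -> hidden weights
w2 = torch.randn(3, requires_grad=True)     # hidden -> output weights
y = torch.tensor(1.0)

h = torch.tanh(W1 @ x)    # hidden activations
y_hat = w2 @ h            # network output
loss = (y - y_hat) ** 2   # squared loss

loss.backward()           # chain rule: fills in W1.grad and w2.grad
with torch.no_grad():     # one gradient descent step on each parameter
    W1 -= 0.1 * W1.grad
    w2 -= 0.1 * w2.grad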
Algorithm 27 ForwardPropagation(x)
1: for all input nodes u do
2:   h_u ← corresponding feature of x
3: end for
4: for all nodes v in the network whose parents are computed do
5:   a_v ← ∑_{u ∈ par(v)} w_{(u,v)} h_u
6:   h_v ← tanh(a_v)
7: end for
8: return a_y
Tanh is another common activation function
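A NumPy sketch of the same computation, specialized to fully connected layers (the general algorithm works on any feed-forward graph; the layer sizes here are made up):

import numpy as np

def forward_propagation(x, Ws):
    """Ws: list of weight matrices, one per layer; tanh hidden activations."""
    h = x
    for W in Ws[:-1]:
        h = np.tanh(W @ h)  # h_v = tanh(a_v), where a_v = sum_u w_(u,v) h_u
    return Ws[-1] @ h       # a_y: linear output activation

x = np.array([0.5, -1.0])
Ws = [np.random.randn(3, 2), np.random.randn(1, 3)]
print(forward_propagation(x, Ws))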
Algorithm 28 BackPropagation(x, y)
1: run ForwardPropagation(x) to compute activations
2: e_y ← y − a_y // compute overall network error
3: for all nodes v in the network whose error e_v is computed do
4:   for all u ∈ par(v) do
5:     g_{u,v} ← −e_v h_u // compute gradient of this edge
6:     e_u ← e_u + e_v w_{u,v} (1 − tanh²(a_u)) // compute the "error" of the parent node
7:   end for
8: end for
9: return all gradients g_e
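Putting the two algorithms together for a two-layer network; a hedged end-to-end sketch with made-up sizes and data, computing the same edge gradients by hand (the error e_y = y − a_y corresponds to squared loss):

import numpy as np

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0]), 1.0
W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=3)

# Forward propagation: compute activations
a1 = W1 @ x      # hidden pre-activations
h1 = np.tanh(a1) # hidden activations
a_y = w2 @ h1    # network output

# Back propagation
e_y = y - a_y                # overall network error
g_w2 = -e_y * h1             # gradient of each hidden -> output edge
e1 = e_y * w2 * (1 - h1**2)  # "error" at each hidden node; tanh'(a) = 1 - tanh(a)^2
g_W1 = -np.outer(e1, x)      # gradient of each input -> hidden edge

# One gradient descent step on all weights
eta = 0.1
w2 -= eta * g_w2
W1 -= eta * g_W1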
What are we doing with these gradients again?
Gradient descent (same figure as before)
Neural Networks! If you’re interested in learning more…