CSC321 Lecture 5: Multilayer Perceptrons
Roger Grosse
Overview

Recall the simple neuron-like unit, which applies a nonlinearity g to a weighted sum of the inputs plus a bias:

y = g(b + Σ_i w_i x_i)

[Figure: a single unit with inputs x_1, x_2, x_3, weights w_1, w_2, w_3, bias b, nonlinearity g, and output y.]

These units are much more powerful if we connect many of them into a neural network.
Overview

Design choices so far:
- Task: regression, binary classification, multiway classification
- Model/Architecture: linear, log-linear, feed-forward neural network
- Loss function: squared error, 0–1 loss, cross-entropy, hinge loss
- Optimization algorithm: direct solution, gradient descent, perceptron
Multilayer Perceptrons

We can connect lots of units together into a directed acyclic graph. This gives a feed-forward neural network. That's in contrast to recurrent neural networks, which can have cycles. (We'll talk about those later.)

Typically, units are grouped together into layers.
Multilayer Perceptrons

Each layer connects N input units to M output units. In the simplest case, all input units are connected to all output units. We call this a fully connected layer. We'll consider other layer types later.

Note: the inputs and outputs for a layer are distinct from the inputs and outputs to the network.

Recall from multiway logistic regression: this means we need an M × N weight matrix. The output units are a function of the input units:

y = f(x) = φ(Wx + b)

A multilayer network consisting of fully connected layers is called a multilayer perceptron. Despite the name, it has nothing to do with perceptrons!
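As a concrete illustration, here is a minimal NumPy sketch of a fully connected layer. The function and variable names are my own, not from the course code.

```python
import numpy as np

def fully_connected(x, W, b, phi=np.tanh):
    """One fully connected layer: y = phi(W x + b).

    x:   input vector of length N
    W:   M x N weight matrix
    b:   bias vector of length M
    phi: elementwise activation function
    """
    return phi(W @ x + b)

# Example: a layer mapping N = 3 inputs to M = 2 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = np.zeros(2)
x = np.array([1.0, -0.5, 2.0])
y = fully_connected(x, W, b)   # vector of length 2
```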
Multilayer Perceptrons

Some activation functions:

Linear: y = z
Rectified Linear Unit (ReLU): y = max(0, z)
Soft ReLU: y = log(1 + e^z)
Multilayer Perceptrons

Some activation functions:

Hard Threshold: y = 1 if z > 0, y = 0 if z ≤ 0
Logistic: y = 1 / (1 + e^(−z))
Hyperbolic Tangent (tanh): y = (e^z − e^(−z)) / (e^z + e^(−z))
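These are easy to write down in NumPy. The definitions below are a sketch written for this note, not the course's own code.

```python
import numpy as np

linear         = lambda z: z
relu           = lambda z: np.maximum(0.0, z)
soft_relu      = lambda z: np.logaddexp(0.0, z)   # log(1 + e^z), numerically stable
hard_threshold = lambda z: (z > 0).astype(float)
logistic       = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh           = np.tanh                          # (e^z - e^-z) / (e^z + e^-z)

z = np.linspace(-3, 3, 7)
print(relu(z))
print(logistic(z))
```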
Multilayer Perceptrons

Designing a network to compute XOR:

Assume a hard threshold activation function.
Multilayer Perceptrons

[Figure: a network of hard-threshold units computing XOR.]
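One standard construction, sketched here with weights I've chosen (the lecture's figure may use different values), has two hidden hard-threshold units computing OR and AND, and an output unit that fires when OR is on but AND is off:

```python
import numpy as np

def step(z):
    """Hard threshold: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output: fires when OR is on but AND is off, i.e. XOR.
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return step(w2 @ h + b2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))   # prints 0, 1, 1, 0 (the XOR truth table)
```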
Multilayer Perceptrons

Each layer computes a function, so the network computes a composition of functions:

h^(1) = f^(1)(x)
h^(2) = f^(2)(h^(1))
...
y = f^(L)(h^(L−1))

Or more simply: y = f^(L) ∘ ··· ∘ f^(1)(x).

Neural nets provide modularity: we can implement each layer's computations as a black box.
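A minimal sketch of this modularity (names and sizes are illustrative, not from the course code): each layer is just a function, and the network composes them.

```python
import numpy as np

def layer(W, b, phi):
    """Return a function computing one layer: h -> phi(W h + b)."""
    return lambda h: phi(W @ h + b)

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# A small 3-layer network: 4 -> 5 -> 5 -> 2.
f1 = layer(rng.standard_normal((5, 4)), np.zeros(5), relu)
f2 = layer(rng.standard_normal((5, 5)), np.zeros(5), relu)
f3 = layer(rng.standard_normal((2, 5)), np.zeros(2), lambda z: z)

def network(x):
    # y = f3 ∘ f2 ∘ f1 (x)
    return f3(f2(f1(x)))

y = network(np.ones(4))   # vector of length 2
```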
Feature Learning

Neural nets can be viewed as a way of learning features.

[Figure: the goal of feature learning.]
Feature Learning

Input representation of a digit: a 784-dimensional vector.
Feature Learning

Each first-layer hidden unit computes σ(w_i^T x).

Here is one of the weight vectors (also called a feature). It's reshaped into an image, with gray = 0, white = positive, black = negative.

To compute w_i^T x, multiply the corresponding pixels and sum the result.
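A small sketch of this computation (the 28 × 28 shape follows from the 784-dimensional input; the variable names are my own):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w_i = rng.standard_normal(784)       # one first-layer weight vector ("feature")
x = rng.random(784)                  # one input digit, flattened

feature_image = w_i.reshape(28, 28)  # view the feature as an image

# w_i^T x: multiply corresponding pixels and sum the result.
activation = sigma(np.dot(w_i, x))
```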
Feature Learning

There are 256 first-level features in total. Here are some of them.
Levels of Abstraction

"The psychological profiling [of a programmer] is mostly the ability to shift levels of abstraction, from low level to high level. To see something in the small and to see something in the large."
— Don Knuth
Levels of Abstraction

When you design neural networks and machine learning algorithms, you'll need to think at multiple levels of abstraction.
Expressive Power

We've seen that there are some functions that linear classifiers can't represent. Are deep networks any better?

Any sequence of linear layers can be equivalently represented with a single linear layer:

y = W^(3) W^(2) W^(1) x = W′ x,  where W′ = W^(3) W^(2) W^(1)

Deep linear networks are no more expressive than linear regression!

Linear layers do have their uses; stay tuned!
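A quick numerical check of this claim (a sketch; the matrix sizes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((5, 4))
W3 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

# Three linear layers applied in sequence...
y_deep = W3 @ (W2 @ (W1 @ x))

# ...equal a single linear layer with W' = W3 W2 W1.
W_prime = W3 @ W2 @ W1
y_single = W_prime @ x

assert np.allclose(y_deep, y_single)
```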
Expressive Power

Multilayer feed-forward neural nets with nonlinear activation functions are universal approximators: they can approximate any function arbitrarily well.

This has been shown for various activation functions (thresholds, logistic, ReLU, etc.). Even though ReLU is "almost" linear, it's nonlinear enough!
Expressive Power

Universality for binary inputs and targets:
- Hard threshold hidden units, linear output
- Strategy: 2^D hidden units, each of which responds to one particular input configuration
- Only requires one hidden layer, though it needs to be extremely wide!
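Here is a small sketch of that construction for D = 3 with an arbitrary Boolean target; the weights and variable names are my own choices, written to illustrate the strategy above.

```python
import numpy as np
from itertools import product

def step(z):
    return (z > 0).astype(float)

D = 3
configs = np.array(list(product([0, 1], repeat=D)))   # all 2^D input configurations

# Hidden layer: unit c fires exactly when the input equals configuration c.
# Weights are +1 where c has a 1 and -1 where it has a 0; the bias is set
# so that only an exact match crosses the threshold.
W1 = 2 * configs - 1                      # shape (2^D, D)
b1 = -configs.sum(axis=1) + 0.5           # shape (2^D,)

# Target function: one binary label per configuration (here, parity).
targets = configs.sum(axis=1) % 2

# Linear output layer: reads off the target of whichever hidden unit fired.
w2 = targets.astype(float)

def net(x):
    h = step(W1 @ x + b1)    # one-hot vector picking out x's configuration
    return w2 @ h

for c, t in zip(configs, targets):
    assert net(c.astype(float)) == t
```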
Expressive Power

What about the logistic activation function?

You can approximate a hard threshold by scaling up the weights and biases:

[Figures: y = σ(x) and y = σ(5x).]

This is good: logistic units are differentiable, so we can tune them with gradient descent. (Stay tuned!)
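A quick numerical illustration of the scaling trick (the grid and scale factors are my own choices): away from z = 0, σ(kz) approaches the hard threshold as k grows.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
step  = lambda z: (z > 0).astype(float)

x = np.array([-2.0, -1.0, -0.25, 0.25, 1.0, 2.0])
for k in (1, 5, 50):
    gap = np.max(np.abs(sigma(k * x) - step(x)))
    print(f"k = {k:2d}, worst-case gap on this grid = {gap:.4f}")
```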
Expressive Power

Limits of universality:
- You may need to represent an exponentially large network.
- If you can learn any function, you'll just overfit.
- Really, we desire a compact representation!

We've derived units which compute the functions AND, OR, and NOT. Therefore, any Boolean circuit can be translated into a feed-forward neural net. This suggests you might be able to learn compact representations of some complicated functions.

The view of neural nets as "differentiable computers" is starting to take hold. More about this when we talk about recurrent neural nets.
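For concreteness, here is one way such gate-like units can be written with hard thresholds; the particular weights and biases are my own illustrative choices.

```python
def step(z):
    return float(z > 0)

# Hard-threshold units computing basic Boolean functions.
AND = lambda a, b: step(a + b - 1.5)   # fires only when both inputs are 1
OR  = lambda a, b: step(a + b - 0.5)   # fires when at least one input is 1
NOT = lambda a:    step(0.5 - a)       # fires when the input is 0

# Any Boolean circuit can be wired up from these, e.g. XOR:
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))
print([XOR(a, b) for a in (0, 1) for b in (0, 1)])   # [0.0, 1.0, 1.0, 0.0]
```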