Multi-Layer Networks and Backpropagation Algorithm, M. Soleymani (PowerPoint presentation)





SLIDE 1

Multi-Layer Networks and Backpropagation Algorithm

  • M. Soleymani

Sharif University of Technology, Fall 2017. Most slides have been adapted from Fei-Fei Li's lectures, cs231n, Stanford 2017, and some from Hinton's lectures, “NN for Machine Learning” course, 2015.

SLIDE 2

Reasons to study neural computation

  • Neuroscience: To understand how the brain actually works.

– It is very big and very complicated, and made of stuff that dies when you poke it around. So we need to use computer simulations.

  • AI: To solve practical problems by using novel learning algorithms

inspired by the brain

– Learning algorithms can be very useful even if they are not how the brain actually works.

SLIDE 4

A typical cortical neuron

  • Gross physical structure:

– There is one axon that branches.
– There is a dendritic tree that collects input from other neurons.

  • Axons typically contact dendritic trees at synapses

– A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.

  • Spike generation:

– There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

SLIDE 5

A mathematical model for biological neurons

[Figure: a neuron with inputs x_1, x_2, x_3 and weights w_1, w_2, w_3, computing the products w_1 x_1, w_2 x_2, w_3 x_3]

SLIDE 6

How the brain works

  • Each neuron receives inputs from other neurons
  • The effect of each input line on the neuron is controlled by a synaptic weight
  • The synaptic weights adapt so that the whole network learns to perform useful computations

– Recognizing objects, understanding language, making plans, controlling the body.

  • You have about 10^11 neurons, each with about 10^4 weights.

– A huge number of weights can affect the computation in a very short time. Much better bandwidth than a workstation.

SLIDE 7

Be very careful with your brain analogies!

  • Biological Neurons:

– Many different types
– Dendrites can perform complex non-linear computations
– Synapses are not a single weight but a complex non-linear dynamical system
– Rate code may not be adequate

[Dendritic Computation. London and Hausser]

SLIDE 8

Binary threshold neurons

  • McCulloch-Pitts (1943): influenced Von Neumann.

– First compute a weighted sum of the inputs.
– Send out a spike of activity if the weighted sum exceeds a threshold.
– McCulloch and Pitts thought that each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!

[Figure: inputs x_1, x_2, …, x_d with weights w_1, w_2, …, w_d feed a sum Σ_j w_j x_j, which is passed through an activation function f]

SLIDE 9

McCulloch-Pitts neuron: binary threshold

  • Neuron, unit, or processing element:

y = 1 if z ≥ θ, 0 if z < θ   (z = Σ_j w_j x_j; θ: activation threshold)

Equivalently, add a constant input 1 whose weight is the bias b = −θ:

y = 1 if Σ_j w_j x_j + b ≥ 0, 0 otherwise

Equivalent to the binary McCulloch-Pitts neuron.
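As a sanity check, the threshold unit and its equivalent bias form can be sketched in a few lines of Python (a minimal illustration of my own, not from the slides):

```python
import numpy as np

def binary_threshold(x, w, theta):
    """McCulloch-Pitts unit: fire iff the weighted input sum reaches theta."""
    z = np.dot(w, x)
    return 1 if z >= theta else 0

def binary_threshold_bias(x, w, b):
    """Same unit with the threshold folded into a bias b = -theta."""
    return 1 if np.dot(w, x) + b >= 0 else 0
```

With b = −θ, both forms give identical outputs for every input.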

SLIDE 10

AND & OR networks


  • For -1 and 1 inputs:
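The slide's figure is not reproduced here; the weights and thresholds below are one possible choice (my own) that realizes AND and OR with a single threshold unit on {-1, +1} inputs:

```python
import numpy as np

def threshold_unit(x, w, theta):
    """Binary threshold neuron: output 1 iff the weighted sum reaches theta."""
    return 1 if np.dot(w, x) >= theta else 0

# For inputs in {-1, +1}, equal weights with a high/low threshold suffice:
AND = lambda x: threshold_unit(x, np.array([1.0, 1.0]), 1.5)
OR = lambda x: threshold_unit(x, np.array([1.0, 1.0]), -1.5)
```

AND fires only when both inputs are +1 (sum = 2 ≥ 1.5); OR fails only when both are −1 (sum = −2 < −1.5).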
SLIDE 11

Sigmoid neurons

  • These give a real-valued output that is a smooth and bounded function of their total input.

  • Typically they use the logistic function y = 1 / (1 + e^(−z))

– They have nice derivatives: dy/dz = y(1 − y)

SLIDE 12

Rectified Linear Units (ReLU)

  • They compute a linear weighted sum of their inputs.
  • The output is a non-linear function of the total input.
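A minimal sketch of a single ReLU unit (my own illustration, not from the slide):

```python
import numpy as np

def relu_unit(x, w):
    """Linear weighted sum of the inputs, then the ReLU non-linearity."""
    z = np.dot(w, x)       # linear weighted sum (total input)
    return max(0.0, z)     # output max(0, z) is non-linear in z
```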
SLIDE 13

Adjusting weights

  • Types of single layer networks:

– Perceptron (Rosenblatt, 1962)
– ADALINE (Widrow and Hoff, 1960)

SLIDE 14

The standard Perceptron architecture

  • Learn how to weight each of the feature activations to get desirable outputs.

  • If the output is above some threshold, decide that the input vector is a positive example of the target class.

SLIDE 15

The perceptron convergence procedure

  • Perceptron trains binary output neurons as classifiers
  • Pick training cases (until convergence):

– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the input vector to it.
– If the output unit incorrectly outputs a 1, subtract the input vector from it.

  • This is guaranteed to find a set of weights that gets the right answer for all the training cases, if any such set exists.
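The three-case procedure above can be sketched as follows (a minimal illustration; the bias is assumed to be folded into the weight vector via a constant-1 feature):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron convergence procedure for 0/1 labels.
    X is assumed to include a constant-1 column so the bias is learned
    as an ordinary weight."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            pred = 1 if np.dot(w, x_n) >= 0 else 0
            if pred == y_n:
                continue          # correct: leave the weights alone
            if y_n == 1:
                w = w + x_n       # incorrectly output a zero: add the input
            else:
                w = w - x_n       # incorrectly output a 1: subtract the input
            mistakes += 1
        if mistakes == 0:         # every case correct: converged
            break
    return w
```

On linearly separable data (e.g. the AND function) this loop stops after a few epochs, as the convergence theorem guarantees.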

SLIDE 16

Adjusting weights

  • Weight update for a training pair (x^(n), y^(n)):

– Perceptron: if sign(w^T x^(n)) ≠ y^(n) then Δw = x^(n) y^(n), else Δw = 0
– ADALINE: Δw = η (y^(n) − w^T x^(n)) x^(n)

  • Widrow-Hoff, LMS, or delta rule:

w^(t+1) = w^(t) − η ∇E_n(w^(t)),  E_n(w) = (y^(n) − w^T x^(n))²
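The delta rule above can be sketched as a one-line update (a minimal illustration; the learning rate η = 0.1 is my own choice):

```python
import numpy as np

def adaline_step(w, x_n, y_n, eta=0.1):
    """One delta-rule (LMS) update: w <- w + eta * (y - w.x) * x,
    i.e. gradient descent on E_n(w) = (y - w.x)^2 up to a factor of 2."""
    return w + eta * (y_n - np.dot(w, x_n)) * x_n
```

Each step reduces the squared error on the training pair it was computed from.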

SLIDE 17

How to learn the weights: multi class example

SLIDE 18

How to learn the weights: multi class example

  • If correct: no change
  • If wrong:

– lower the score of the wrong answer (by subtracting the input from the weight vector of the wrong answer)
– raise the score of the target (by adding the input to the weight vector of the target class)


SLIDE 24

Single layer networks as template matching

  • The weights for each class act as a template (sometimes also called a prototype) for that class.

– The winner is the most similar template.

  • The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes.

  • To capture all the allowable variations of a digit, we need to learn the features that it is composed of.

SLIDE 25

The history of perceptrons

  • They were popularised by Frank Rosenblatt in the early 1960s.

– They appeared to have a very powerful learning algorithm.
– Lots of grand claims were made for what they could learn to do.

  • In 1969, Minsky and Papert published a book called “Perceptrons” that analyzed what they could do and showed their limitations.

– Many people thought these limitations applied to all neural network models.

SLIDE 26

What binary threshold neurons cannot do

  • A binary threshold output unit cannot even tell if two single-bit features are the same!

  • A geometric view of what binary threshold neurons cannot do
  • The positive and negative cases cannot be separated by a plane
SLIDE 27

What binary threshold neurons cannot do

  • Positive cases (same): (1,1)->1; (0,0)->1
  • Negative cases (different): (1,0)->0; (0,1)->0
  • The four input-output pairs give four inequalities that are impossible to satisfy:

– w1 + w2 ≥ θ
– 0 ≥ θ
– w1 < θ
– w2 < θ

SLIDE 28

Discriminating simple patterns under translation with wrap-around

  • Suppose we just use pixels as the features.

  • A binary decision unit cannot discriminate patterns with the same number of on pixels

– if the patterns can translate with wrap-around!

SLIDE 29

Sketch of a proof

  • For pattern A, use training cases in all possible translations.

– Each pixel will be activated by 4 different translations of pattern A.
– So the total input received by the decision unit over all these patterns will be four times the sum of all the weights.

  • For pattern B, use training cases in all possible translations.

– Each pixel will be activated by 4 different translations of pattern B.
– So the total input received by the decision unit over all these patterns will be four times the sum of all the weights.

  • But to discriminate correctly, every single case of pattern A must provide

more input to the decision unit than every single case of pattern B.

  • This is impossible if the sums over cases are the same.
SLIDE 30

Networks with hidden units

  • Networks without hidden units are very limited in the input-output mappings they can learn to model.

– More layers of linear units do not help: it's still linear.
– Fixed output non-linearities are not enough.

  • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

SLIDE 31

Feed-forward neural networks

  • Also called Multi-Layer Perceptron (MLP)
SLIDE 32

General approximator

  • If the decision boundary is smooth, then a 3-layer network (i.e., one with 2 hidden layers) can come arbitrarily close to the target classifier.

SLIDE 33


MLP with Different Number of Layers

– Single layer (no hidden layer): decision regions are half-spaces, i.e., regions found by a hyperplane.
– Two layers (one hidden layer): polyhedral (open or closed) regions, i.e., intersections of half-spaces.
– Three layers (two hidden layers): arbitrary regions, i.e., unions of polyhedra.

(For an MLP with unit-step activation function; each decision region is the one found by an output unit.)

SLIDE 34

Beyond linear models

SLIDE 35

Beyond linear models

SLIDE 36

MLP with single hidden layer

  • Two-layer MLP (the number of layers of adaptive weights is counted)

o_k(x) = Σ_{j=0..M} w_jk^[2] z_j  ⇒  o_k(x) = Σ_{j=0..M} w_jk^[2] f( Σ_{i=0..d} w_ij^[1] x_i )

[Figure: inputs x_0 = 1, x_1, …, x_d; hidden units z_0 = 1, z_1, …, z_M with activation f; outputs o_1, …, o_K; first-layer weights w_ij^[1] (i = 0, …, d; j = 1, …, M), second-layer weights w_jk^[2] (j = 0, …, M; k = 1, …, K)]

SLIDE 37

MLP learns to extract features

  • An MLP with one hidden layer is a generalized linear model:

– o_k(x) = Σ_{j=1..M} w_jk^[2] φ_j(x)
– φ_j(x) = f( Σ_{i=0..d} w_ij^[1] x_i )
– The form of the nonlinearity (the basis functions φ_j) is adapted from the training data (not fixed in advance).

  • φ_j is defined by parameters that can also be adapted during training.

  • Thus, we don't need expert knowledge or time-consuming tuning of hand-crafted features.

SLIDE 38

Deep networks

  • That deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation

– despite the fact that their representational power is equal.

  • In practice, 3-layer neural networks will usually outperform 2-layer nets, but going even deeper rarely helps much more.

– This is in stark contrast to Convolutional Networks.

SLIDE 39

How to adjust weights for multi-layer networks?

  • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

– We need an efficient way of adapting all the weights, not just the last layer.
– Learning the weights going into hidden units is equivalent to learning features.
– This is difficult because nobody is telling us directly what the hidden units should do.

SLIDE 41

Gradient descent

  • We want ∇_W L(W), the gradient of the loss with respect to the weights
  • Numerical gradient:

– slow :(
– approximate :(
– easy to write :)

  • Analytic gradient:

– fast :)
– exact :)
– error-prone :(

  • In practice: derive the analytic gradient, then check your implementation with the numerical gradient
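A minimal sketch of such a numerical gradient check (centered differences; the test function f(w) = Σ w² is my own choice):

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered finite differences: slow and approximate, but easy to write."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w[i]
        w[i] = old + h
        fp = f(w)
        w[i] = old - h
        fm = f(w)
        w[i] = old                 # restore the coordinate
        grad[i] = (fp - fm) / (2 * h)
    return grad

# Check a hand-derived analytic gradient, here for f(w) = sum(w^2):
w = np.array([1.0, -2.0, 3.0])
analytic = 2 * w
numeric = numerical_gradient(lambda v: np.sum(v ** 2), w)
```

If the analytic derivation is right, the relative error between the two gradients is tiny.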

SLIDE 42

Training multi-layer networks

  • Backpropagation

– Training algorithm used to adjust the weights in multi-layer networks (based on the training data)
– The backpropagation algorithm is based on gradient descent
– Uses the chain rule and dynamic programming to efficiently compute gradients

SLIDE 43

Computational graphs

SLIDE 44

Backpropagation: a simple example
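The slides for this example are image-only. The worked example in the cs231n lecture these slides are adapted from is f(x, y, z) = (x + y)z, which I assume is the one shown; traced by hand:

```python
# Forward pass for f(x, y, z) = (x + y) * z at x = -2, y = 5, z = -4
x, y, z = -2.0, 5.0, -4.0
q = x + y              # intermediate node q = 3
f = q * z              # output f = -12

# Backward pass: start from df/df = 1 and apply the chain rule per node
df_dq = z              # f = q * z  =>  df/dq = z = -4
df_dz = q              #            =>  df/dz = q = 3
df_dx = df_dq * 1.0    # q = x + y  =>  dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0    #            =>  dq/dy = 1, so df/dy = -4
```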


SLIDE 51

How to propagate the gradients backward


SLIDE 53

Another example


SLIDE 66

Another example

[local gradient] × [upstream gradient]:
x0: [2] × [0.2] = 0.4
w0: [-1] × [0.2] = -0.2
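These numbers match the cs231n sigmoid-gate example f(w, x) = 1 / (1 + e^(−(w0·x0 + w1·x1 + w2))) with w = (2, −3, −3) and x = (−1, −2), which I assume is the figure shown here:

```python
import math

# Assumed inputs, reverse-engineered from the gradients above
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

s = w0 * x0 + w1 * x1 + w2      # total input to the sigmoid gate: 1.0
f = 1.0 / (1.0 + math.exp(-s))  # sigmoid output, about 0.73

# The sigmoid gate's local gradient is (1 - f) * f, about 0.2
upstream = (1.0 - f) * f
dx0 = w0 * upstream             # [local 2]  x [upstream 0.2] =  0.4
dw0 = x0 * upstream             # [local -1] x [upstream 0.2] = -0.2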

SLIDE 67

Derivative of sigmoid function


SLIDE 69

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
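These two patterns can be sketched directly (a minimal illustration of my own):

```python
def add_backward(x, y, dout):
    """Add gate distributes the upstream gradient unchanged to both inputs."""
    return dout, dout

def max_backward(x, y, dout):
    """Max gate routes the upstream gradient to the input that was larger;
    the other input had no effect on the output, so it gets zero gradient."""
    return (dout, 0.0) if x >= y else (0.0, dout)
```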
SLIDE 70

Gradients add at branches

SLIDE 71

Simple chain rule

  • z = f(g(x))
  • y = g(x), so z = f(y)
  • Chain rule: dz/dx = (dz/dy) × (dy/dx)
SLIDE 72

Multiple paths chain rule

SLIDE 73

Modularized implementation: forward / backward API

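The forward/backward API can be sketched with a single multiply gate (a minimal illustration, not the code from the slides):

```python
class MultiplyGate:
    """One graph node exposing the forward/backward API: forward caches
    the values the chain rule will need; backward multiplies each local
    gradient by the upstream gradient dz."""

    def forward(self, x, y):
        self.x, self.y = x, y   # save intermediates for the backward pass
        return x * y

    def backward(self, dz):
        dx = self.y * dz        # d(xy)/dx = y
        dy = self.x * dz        # d(xy)/dy = x
        return dx, dy
```

A full graph chains many such nodes, calling forward left-to-right and backward right-to-left.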

SLIDE 76

Caffe Layers

SLIDE 77

Caffe sigmoid layer

SLIDE 78

Output as a composite function

Output = a^[L] = f(z^[L]) = f(W^[L] a^[L−1]) = f(W^[L] f(W^[L−1] a^[L−2])) = ⋯ = f(W^[L] f(W^[L−1] ⋯ f(W^[2] f(W^[1] x))))

For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly do not need an activation function (they give class scores or real-valued targets).

[Figure: x →(× W^[1])→ z^[1] →f→ a^[1] →(× W^[2])→ ⋯ →(× W^[L])→ z^[L] →f→ a^[L] = output]

SLIDE 79

Backpropagation: Notation

  • a^[0] ← input
  • output ← a^[L]

[Figure: a chain of layers, each applying f(·): a^[l−1] → z^[l] → a^[l] = f(z^[l])]

SLIDE 80

Backpropagation: Last layer gradient

For squared error loss E_n = (a^[L] − y^(n))²:

∂E_n/∂w_ij^[L] = ∂E_n/∂a_j^[L] × ∂a_j^[L]/∂w_ij^[L]

∂E_n/∂a_j^[L] = 2(a_j^[L] − y_j^(n))

∂a_j^[L]/∂w_ij^[L] = f′(z_j^[L]) × ∂z_j^[L]/∂w_ij^[L] = f′(z_j^[L]) a_i^[L−1]

where a_j^[l] = f(z_j^[l]) and z_j^[l] = Σ_i w_ij^[l] a_i^[l−1].

SLIDE 81

Backpropagation: gradient for layer l

∂E_n/∂w_ij^[l] = ∂E_n/∂a_j^[l] × ∂a_j^[l]/∂w_ij^[l] = δ_j^[l] × f′(z_j^[l]) × a_i^[l−1]

 δ_j^[l] = ∂E_n/∂a_j^[l] is the sensitivity of the error to a_j^[l]

 Sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation.)

where a_j^[l] = f(z_j^[l]) and z_j^[l] = Σ_i w_ij^[l] a_i^[l−1].

SLIDE 82

δ_j^[l−1] from δ_j^[l]

We will compute δ^[l−1] from δ^[l]:

δ_j^[l−1] = ∂E_n/∂a_j^[l−1]
          = Σ_{k=1..d^[l]} ∂E_n/∂a_k^[l] × ∂a_k^[l]/∂z_k^[l] × ∂z_k^[l]/∂a_j^[l−1]
          = Σ_{k=1..d^[l]} δ_k^[l] × f′(z_k^[l]) × w_jk^[l]

where a_j^[l−1] = f(z_j^[l−1]) and z_k^[l] = Σ_j w_jk^[l] a_j^[l−1].

SLIDE 83

Find and save δ^[L]

  • Called the error, computed recursively in a backward manner
  • For the final layer l = L:

δ_k^[L] = ∂E_n/∂a_k^[L]

SLIDE 84

Backpropagation of Errors

For E_n = Σ_k (a_k^[L] − y_k^(n))² and a network with L = 2 layers:

δ_k^[2] = 2(a_k^[2] − y_k^(n))

δ_j^[1] = Σ_{k=1..d^[2]} δ_k^[2] × f′(z_k^[2]) × w_jk^[2]

[Figure: output-layer errors δ_1^[2], …, δ_K^[2] are propagated backward through the weights w_jk^[2] to give δ_j^[1]]

SLIDE 85

Gradients for vectorized code

SLIDE 86

Vectorized operations


SLIDE 97

Always check: the gradient with respect to a variable should have the same shape as the variable.

SLIDE 99

SVM example

SLIDE 100

Summary

  • Neural nets may be very large: it is impractical to write down the gradient formula by hand for all parameters.

  • Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates.

  • Implementations maintain a graph structure, where the nodes implement the forward() / backward() API:

– forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
– backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

SLIDE 101

Converting error derivatives into a learning procedure

  • The backpropagation algorithm is an efficient way of computing the error derivative dE/dw for every weight on a single training case.

  • To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:

– Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?
– Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?

SLIDE 102

Optimization issues in using the weight derivatives

  • How to initialize weights
  • How often to update the weights

– Batch size

  • How much to update

– Use a fixed learning rate?
– Adapt the global learning rate?
– Adapt the learning rate on each connection separately?
– Don't use steepest descent?

SLIDE 103

Overfitting: The downside of using powerful models

  • A model is convincing when it fits a lot of data surprisingly well.

– It is not surprising that a complicated model can fit a small amount of data well.

SLIDE 104

Ways to reduce overfitting

  • A large number of different methods have been developed.

– Weight-decay
– Weight-sharing
– Early stopping
– Model averaging
– Dropout
– Generative pre-training
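The first of these, weight-decay, amounts to a one-line change to the update rule (a minimal sketch; the learning rate and decay strength are my own illustrative choices):

```python
import numpy as np

def sgd_step_weight_decay(w, grad, lr=0.1, lam=1e-4):
    """One SGD step with L2 weight-decay: the penalty lam * ||w||^2 adds
    2 * lam * w to the gradient, shrinking the weights toward zero."""
    return w - lr * (grad + 2.0 * lam * w)
```

Even with a zero data gradient, each step slightly shrinks the weights, discouraging the large weights that overfit models tend to grow.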

SLIDE 105

Resources

  • Deep Learning Book, Chapter 6.
  • Please see the following note:

– http://cs231n.github.io/optimization-2/