

SLIDE 1

On the interplay of network structure and gradient convergence in deep learning

Vamsi K. Ithapu⋆   Sathya N. Ravi⋆   Vikas Singh†,⋆

⋆ Computer Sciences   † Biostatistics and Medical Informatics

University of Wisconsin–Madison

Sep 28, 2016

SLIDE 2

Overview

1. Background
   Motivation
2. Problem
   Solution strategy
   Single-layer Networks
   Multi-layer Networks
3. Discussion

SLIDES 3–5

Background

Deep Learning – Neural Networks

x: inputs, h: hidden representations, y: outputs. Training data {x, y} ∈ X.

h1 = σ1(W1, h0), with h0 = x: W1 is a linear map, σ1(·) a non-linearity.

SLIDES 6–8

Background

Deep Learning – Neural Networks

x: inputs, h: hidden representations, y: outputs. Depth-L network:

x → h1 = σ1(W1, h0) → h2 = σ2(W2, h1) → ... → hL−1 = σL−1(WL−1, hL−2) → ŷ = σL(WL, hL−1)
(Layer 1, Layer 2, ..., Layer L)

σ(·): nonlinear, monotonic, non-convex, non-smooth. Typical choices of σ(·): sigmoid or hyperbolic tangent, rectified linear unit (ReLU), convolution + sub-sampling.
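To make the layer recursion concrete, here is a minimal NumPy sketch of the depth-L forward pass with sigmoid activations (the bias-free form h = σ(W h) and all shapes are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Depth-L forward pass: h_l = sigma_l(W_l, h_{l-1}), with h_0 = x."""
    h = x
    for W in weights:          # weights = [W_1, ..., W_L]
        h = sigmoid(W @ h)     # linear map, then the non-linearity
    return h                   # the prediction y_hat = sigma_L(W_L, h_{L-1})

# Toy usage: d0 = 4 inputs, d1 = 3 hidden units, d2 = 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
y_hat = forward(rng.normal(size=4), weights)
```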

SLIDES 9–11

Background

Deep Learning – Neural Networks

Learning objective:  min_W E_{x,y∼X} L(x, y; W),  W := {W1, ..., WL}

The objective is non-convex. Stochastic gradients are used, with gradient backpropagation.
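A minimal sketch of one stochastic-gradient update for this objective, written out for a single sigmoid layer with the ℓ2 loss used later in the talk (the explicit chain rule below relies on the standard sigmoid derivative σ′ = σ(1 − σ); the single-layer setup is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(W, x, y, gamma):
    """One stochastic-gradient step on L(x, y; W) = ||y - sigmoid(W x)||^2."""
    h = sigmoid(W @ x)
    # Chain rule: dL/dW = outer(2 (h - y) * sigma'(W x), x), sigma' = h (1 - h).
    delta = 2.0 * (h - y) * h * (1.0 - h)
    return W - gamma * np.outer(delta, x)
```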

SLIDES 12–16

Background

Deep Learning – Neural Networks

Stochastic gradients are used ... with some tricks!

• Appropriate nonlinearities: ReLU, log-sigmoid, max-pooling, etc.
• Initializations: pretrain (warm-start) the network layers, using unlabeled data (unsupervised pretraining).
• Learning mechanisms: stochastically learn parts of the network (Dropout, DropConnect).
• Large dataset sizes.

SLIDES 17–21

Background

Deep Learning – Neural Networks

Attractive empirical success ... and some interesting theoretical results (Arora et al. 2013, Dauphin et al. 2014, Patel et al. 2015).

Theme of most works:
→ Analyze a given architecture/structure: the depth L, the hidden layer lengths (d1, ..., dL−1), and the hidden layer activations are known.
→ The existence of some network structure is proven.

SLIDE 22

Background · Motivation

The Problem

What is the best possible network for the given task?

SLIDES 23–31

Background · Motivation

The Motivating Application

Amyloid PET images, collected from middle-aged adults, are fed to a deep network predictor that outputs the probability of disease in the future; based on this, a subject is either sent to trial or not sent to trial.

Constraints of the application:
• Bottleneck on the available #instances: brain image acquisition is costly!
• Cheapest – #computations, $cost: a dollar value is attached to each hour of computation (e.g., using Amazon Web Services).
• Richer (larger) models are desired.
• Some false-positives are allowed.
• A non-expert is going to set up the learning.

SLIDE 32

Problem

The Problem – reformulated

We need informed, systematic design strategies for choosing the network structure.

SLIDES 33–37

Problem · Solution strategy

The Solution strategy – This work

What is the best possible network for the given task? We need informed design strategies.

Part I: Construct the relevant bounds
• Gradient convergence + Learning Mechanism + Network/Data Statistics

Part II: Construct design procedures using the bounds
• For the given dataset and a pre-specified convergence level, find the depth, hidden layer lengths, etc.

SLIDES 38–44

Problem · Solution strategy

The Interplay

Gradient convergence + Learning Mechanism + Network/Data Statistics
→ The depth parameter L
→ The layer lengths (d0, d1, ..., dL−1, dL)
→ The activation functions (σ1, ..., σL): bounded and smooth; focus on sigmoid
→ Average first moments of the data: µx = (1/d0) Σj E[xj],  τx = (1/d0) Σj (E[xj])²
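These data statistics are plain coordinate-wise averages; a sketch, assuming the data sits in an n × d0 matrix and reading τx as the averaged squared expectation (that reading of the second statistic is an assumption):

```python
import numpy as np

def data_stats(X):
    """X: (n, d0) data matrix. Returns (mu_x, tau_x) as on the slide."""
    Ex = X.mean(axis=0)              # per-coordinate E[x_j]
    mu_x = Ex.mean()                 # (1/d0) sum_j E[x_j]
    tau_x = (Ex ** 2).mean()         # (1/d0) sum_j (E[x_j])^2  (assumed reading)
    return mu_x, tau_x
```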

SLIDES 45–47

Problem · Solution strategy

The Interplay

Gradient convergence + Learning Mechanism + Network/Data Statistics

min_W f(W) := E_{x,y∼X} L(x, y; W)
→ L := ℓ2 loss
→ Stochastic gradients (W ∈ R^d) OR projected gradients (W ∈ Ω := box-constraint [−w, w]^d)
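For the projected variant, projection onto the box Ω = [−w, w]^d is a coordinate-wise clip; a one-step sketch (the gradient oracle `grad` is a placeholder for whatever stochastic gradient is available):

```python
import numpy as np

def projected_sgd_step(W, grad, gamma, w):
    """Stochastic-gradient step followed by projection onto [-w, w]^d."""
    return np.clip(W - gamma * grad, -w, w)
```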

SLIDES 48–53

Problem · Solution strategy

The Interplay – Gradient Convergence

Ideally we are interested in generalization. Convergence instead?
→ R: the last iteration (in general, the training time is fixed a priori).
→ The expected gradients: ∆ := E_{R,x,y} ‖∇W f(W^R)‖²
→ Control on the last/stopping iteration: under mild assumptions, ∆ can be bounded whenever R is chosen randomly [Ghadimi and Lan 2013].

SLIDE 54

Problem · Solution strategy

The Interplay – Gradient Convergence

Gradient backpropagation + randomly stop after some iterations.
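A sketch of this recipe in the spirit of Ghadimi and Lan: fix a budget N, sample the stopping iteration R from the distribution PR(k) ∝ γk(1 − 0.75γk) used on the following slides, run SGD, and return the iterate W^R. The update `step` is any stochastic-gradient step, e.g., the `sgd_step` sketched earlier:

```python
import numpy as np

def train_with_random_stop(W, data, step, gamma0, rho, N, rng):
    """Run N SGD iterations, return the iterate at a randomly chosen stop R.

    `step(W, x, y, gamma)` is any stochastic-gradient update. Assumes
    gamma0 is small enough that every P_R(k) below is nonnegative.
    """
    gammas = gamma0 / np.arange(1, N + 1) ** rho     # gamma_k = gamma / k^rho
    p = gammas * (1.0 - 0.75 * gammas)               # P_R(k), up to normalization
    R = rng.choice(N, p=p / p.sum())                 # sample the stopping iteration
    W_R = W
    for k in range(N):
        x, y = data[rng.integers(len(data))]         # one stochastic sample
        W = step(W, x, y, gammas[k])
        if k == R:
            W_R = W.copy()                           # snapshot the returned iterate
    return W_R
```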

SLIDES 55–69

Problem · Single-layer Networks

The Interplay – Gradient Convergence

Single-layer Network – Expected Gradients:
For a 1-layer network with stepsizes γk = γ/k^ρ (ρ > 0) and stopping distribution PR(k) ∝ γk(1 − 0.75γk), we have

  ∆ ≤ Df/HN + Ψ

• Decreasing stepsizes; the stopping iteration R ∈ [1, N] (N ≫ R), where N is the maximum allowable number of iterations, and ∆ is the expected gradients.
• Df ≈ f(W1): goodness of fit – the influence of the initialization W1.
• HN ≈ 0.2γ · GenHar(N, ρ): sublinear decay vs. N (GenHar(N, ρ) is the generalized harmonic number).
• Ψ ≈ q · d0d1γ/B (0.05 < q < 0.25), with d0d1 := #unknowns: the influence of the #free parameters (degrees of freedom), and the bias from the mini-batch size B.
• Ideal scenario: large #samples, small network. Realistic scenario: reasonable network size, large B with a long training time.
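To see how the pieces trade off numerically, a sketch that evaluates the bound ∆ ≤ Df/HN + Ψ from the approximations on these slides (the inputs Df and q are assumptions chosen for illustration, not values from the paper):

```python
import numpy as np

def gen_har(N, rho):
    """Generalized harmonic number: sum_{k=1}^{N} k^(-rho)."""
    return np.sum(np.arange(1, N + 1, dtype=float) ** -rho)

def single_layer_bound(Df, gamma, rho, N, d0, d1, B, q=0.15):
    H_N = 0.2 * gamma * gen_har(N, rho)   # H_N ~ 0.2 * gamma * GenHar(N, rho)
    Psi = q * d0 * d1 * gamma / B         # Psi ~ q * d0 d1 gamma / B
    return Df / H_N + Psi

# Larger B and N tighten the bound; more unknowns d0*d1 loosen it.
print(single_layer_bound(Df=1.0, gamma=0.1, rho=0.5, N=4000, d0=256, d1=75, B=32))
```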

SLIDES 70–74

Problem · Single-layer Networks

The Interplay – Gradient Convergence

For small ρ, i.e., slow stepsize decay, PR(k) approaches a uniform distribution:

  ∆ ≲ 5Df/(Nγ) + Ψ

When ρ = 0, i.e., a constant stepsize, PR(k) := UNIF[1, N]:

  ∆ ≤ Df/(Nγ) + Ψ

Uniform stopping may not be interesting!

SLIDES 75–81

Problem · Single-layer Networks

The Interplay – Gradient Convergence

Single-layer network + customized PR(k): push R to be as close as possible to N, e.g., PR(k) = 0 on the early iterations and PR(k) = ν/N on the later ones.

Expected Gradients + PR(·) from the above example:
For a 1-layer network with constant stepsize γ, we have

  ∆ ≤ ν · 5Df/(Nγ) + Ψ

• Require PR(k) ≤ PR(k + 1).
• For ν ≫ 1, R → N, but the bound becomes too loose.
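A sketch of one such monotone stopping distribution: zero mass on the early iterations, mass ν/N on the tail. The cut-point below, giving a tail of N/ν iterations so the mass sums to one, is an assumption consistent with the slide; it needs 1 ≤ ν ≤ N:

```python
import numpy as np

def tail_stopping_distribution(N, nu):
    """P_R(k) = 0 early, P_R(k) = nu/N on the last ~N/nu iterations."""
    p = np.zeros(N)
    cut = N - int(round(N / nu))   # tail length N/nu (assumes 1 <= nu <= N)
    p[cut:] = nu / N               # nondecreasing, so P_R(k) <= P_R(k+1)
    return p / p.sum()             # renormalize to absorb rounding
```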

SLIDES 82–84

Problem · Single-layer Networks

The Interplay – Gradient Convergence

Single-layer network, using T independent random stopping iterations: a large-deviation estimate.

Let ǫ > 0 and 0 < δ ≪ 1. An (ǫ, δ)-solution guarantees

  Pr( min_t ‖∇W f(W^{R_t})‖² ≤ ǫ ) ≥ 1 − δ
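A sketch of the T-run recipe: run the randomized-stopping procedure T times independently and keep the iterate with the smallest gradient norm. `train_with_random_stop` is the sketch above; `grad_norm_sq` is an assumed oracle for evaluating ‖∇W f(·)‖² (in practice, an estimate):

```python
def eps_delta_solution(W0, data, step, grad_norm_sq, gamma0, rho, N, T, rng):
    """Best of T independent randomly-stopped runs (large-deviation recipe)."""
    runs = [train_with_random_stop(W0.copy(), data, step, gamma0, rho, N, rng)
            for _ in range(T)]
    # Pr( min_t ||grad f(W^{R_t})||^2 <= eps ) >= 1 - delta for suitable T.
    return min(runs, key=grad_norm_sq)
```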

SLIDES 85–91

Problem · Multi-layer Networks

The Interplay

Gradient convergence + Learning Mechanism + Network/Data Statistics

Multi-layer neural network: L − 1 single-layer networks put together.

Typical mechanism:
• Initialize (warm-start or pretrain) each of the layers sequentially:
  x → x̃ (w.p. 1 − ζ, the jth unit is set to 0)
  h1 = σ(W1 x̃), with loss L(x, W) = ‖x − h1‖² and W ∈ [−w, w]^d; this is referred to as a Denoising Autoencoder (DA).
• L − 1 such DAs are learned: x → h1 → ... → hL−2 → hL−1.
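A sketch of one denoising-autoencoder pretraining step as described here: zero each input unit w.p. 1 − ζ, encode with a sigmoid layer, take a gradient step on the reconstruction loss ‖x − h1‖², and project back onto the box (the square W and the bias-free encoder are simplifying assumptions):

```python
import numpy as np

def da_pretrain_step(W, x, zeta, gamma, w, rng):
    """One DA step on L(x, W) = ||x - sigmoid(W x_tilde)||^2, W square."""
    mask = rng.random(x.shape) < zeta        # keep each unit w.p. zeta
    x_tilde = x * mask                       # corrupted input (zeroed w.p. 1 - zeta)
    h = 1.0 / (1.0 + np.exp(-(W @ x_tilde)))
    delta = 2.0 * (h - x) * h * (1.0 - h)    # chain rule through the sigmoid
    W = W - gamma * np.outer(delta, x_tilde)
    return np.clip(W, -w, w)                 # keep W in the box [-w, w]^d
```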

SLIDES 92–95

Problem · Multi-layer Networks

The Interplay

Typical mechanism (continued):
• Bring in the ys; perform backpropagation:
  use stochastic gradients, starting at the Lth layer, and propagate the gradients.
→ Dropout: update only a fraction (ζ) of all the parameters.
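A sketch of a dropout-style update in the sense used here, where each iteration touches only a ζ-fraction of the parameters (the per-parameter Bernoulli mask on the gradient is one simple realization, not necessarily the paper's exact implementation):

```python
import numpy as np

def dropout_update(W, grad, gamma, zeta, rng):
    """Update only a zeta-fraction of the parameters in this iteration."""
    mask = rng.random(W.shape) < zeta     # each parameter active w.p. zeta
    return W - gamma * (mask * grad)      # masked stochastic-gradient step
```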

SLIDES 96–99

Problem · Multi-layer Networks

The Interplay – Learning Mechanism

Multi-layer neural network. The new mechanism: a randomized stopping strategy at all stages.
• The L − 1 layers are initialized to (α, δα)-solutions; α is the goodness of pretraining.
• Gradient backpropagation is then performed to an (ǫ, δ)-solution.

SLIDES 100–117

Problem · Multi-layer Networks

The Interplay – The most general result

Multi-layer Neural Network – Expected Gradients:
For an L-layered network with dropout rate ζ and constant stepsize γ, pretrained to (α, δα), we have

  ∆ ≤ Df/(N e) + Π

This is the first known result for multi-layer deep networks, relating unsupervised pretraining + dropout learning + network structure to convergence and estimation.

• ∆: the expected projected gradients.
• Df ≈ f(W1) (after pretraining); N: the number of backpropagation iterations.
• e := ζ²g(α, γ, w): encodes the influence of pretraining, the stepsize and the box-constraint.
• Usefulness of the representations, i.e., is hL−1 already good enough for predicting y?
• The noise added by dropout.
• Π := Π(α, ζ, γ, B, w, #freedom): polynomial in d0, ..., dL and in L; linear in α, polynomial in ζ; a complex interplay of the learning modules and the network hyper-parameters.

SLIDES 118–125

Discussion

The Interplay – Some Implications

Multi-layer neural network: ∆ ≤ Df/(N e) + Π. Interesting trends/outcomes (first theoretical results):

→ Dropout compensates pretraining:
  Small α ⟹ ζ ∼ 1 (faster convergence)
  Large α ⟹ ζ ∼ 0 (slower convergence)
  No control on α ⟹ set ζ to 0.5
→ Pretraining can be bypassed for small networks.
→ Everything breaks loose for large networks; the only restoration is very large datasets and large N.

SLIDES 126–131

Discussion

The Interplay – Some Implications

Interesting trends/outcomes (first theoretical results):

→ A tall-lean network is equivalent to a short-fat one.
→ Depth hurts – but maybe not too much.
→ A short-fat network asks for a large sample size.
→ Small networks on small samples may be a bad combination.
→ There is a family of networks that guarantee the same convergence level (see the design sketch below).
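Part II uses such bounds generatively: fix a convergence target and enumerate the structures whose bound meets it, as referenced in the last item above. A sketch under clearly hypothetical stand-ins; `bound` is a placeholder for the ∆ ≤ Df/(N e) + Π expression, whose exact constants live in the paper and are not reproduced here:

```python
import itertools

def feasible_designs(bound, target, depths, widths, **kwargs):
    """Enumerate (L, hidden widths) whose convergence bound meets `target`.

    `bound(L, dims, **kwargs)` is a hypothetical callable standing in for
    Df/(N e) + Pi evaluated at a candidate design.
    """
    designs = []
    for L in depths:
        for dims in itertools.product(widths, repeat=L - 1):
            if bound(L, dims, **kwargs) <= target:
                designs.append((L, dims))
    return designs
```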

SLIDES 132–133

Discussion

The Interplay – Experiments

[Two plots of expected gradients ∆̂ vs. the number of iterations: left, ∆̂ vs. L and the layer lengths, for (d0, d1) ∈ {(1024, 350), (256, 350), (256, 75)} and L ∈ {2, 3}; right, ∆̂ vs. the dropout rate, for ζ ∈ {1, 0.85, 0.65, 0.5, 0.3, 0.15}.]

SLIDE 134

Discussion

The Interplay – Experiments

[A depth-5 network x → h1 → h2 → h3 → h4 → ŷ with layer lengths d0, ..., d5, and two design plots of ∆ vs. the layer lengths (log10 scale), with curves for d0, d1 = d4, d2, d3, d5: designs given d5, ζ and L; and designs given L.]

SLIDES 135–136

Discussion

Conclusions & Ongoing Work

Conclusions: Gradient convergence + Learning mechanisms + Network/Data structure
→ Small tweaks to existing procedures
→ Theoretical understanding for many existing empirical studies
→ New trends/outcomes

Ongoing work:
→ Extensions to non-smooth σl(·)'s and complex Ω(W)
→ Part II: find the best network for the given task

SLIDE 137

Discussion

The end... Thank you! Questions?

Supported by NIH AG040396, NSF CAREER 1252725, NSF CCF 1320755, and the UW grants ADRC AG033514, ICTR 1UL1RR025011 and CPCP AI117924.