Deep Learning Basics, Lecture 1: Feedforward. Princeton University, COS 495. Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3 • Collect data and extract features • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ • Optimization: minimize the empirical loss
Features • Pipeline: input $x$ → extract features (e.g., a color histogram over red, green, blue) → $\phi(x)$ → build hypothesis → $y = w^T \phi(x)$
Features: part of the model • The feature map $\phi(x)$ is a nonlinear model component • The hypothesis $y = w^T \phi(x)$ built on top of it is a linear model
Example: polynomial kernel SVM • $y = \mathrm{sign}(w^T \phi(x) + b)$ on inputs $x = (x_1, x_2)$ • The feature map $\phi(x)$ is fixed
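For concreteness, a minimal sketch of one fixed degree-2 polynomial feature map on a 2-dimensional input (this explicit expansion and the weights are illustrative assumptions, not from the slides; kernel SVMs normally work with the kernel $k(x, x') = (x^T x' + 1)^2$ rather than the explicit $\phi$):

```python
import numpy as np

def phi(x):
    # fixed, hand-designed degree-2 feature map for x = (x1, x2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# phi is chosen so that phi(x) . phi(x') == (x . x' + 1) ** 2
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(np.isclose(phi(x) @ phi(xp), (x @ xp + 1) ** 2))

# the SVM prediction uses the fixed features: y = sign(w^T phi(x) + b)
w, b = np.ones(6), -2.0                # illustrative weights, not learned here
print(np.sign(w @ phi(x) + b))
```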
Motivation: representation learning • Why don't we also learn $\phi(x)$? Learn both $\phi(x)$ and $w$ in $y = w^T \phi(x)$
Feedforward networks • View each dimension of $\phi(x)$ as something to be learned; the output is still $y = w^T \phi(x)$
Feedforward networks • Linear functions $\phi(x) = \theta^T x$ don't work: need some nonlinearity (otherwise $y = w^T \theta^T x$ is still just a linear function of $x$)
Feedforward networks • Typically, set $\phi(x) = r(\theta^T x)$ where $r(\cdot)$ is some nonlinear function applied element-wise
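A minimal NumPy sketch of this one-hidden-layer construction (dimensions, initialization, and the choice of ReLU as $r$ are illustrative assumptions): the learned features $\phi(x) = r(\theta^T x)$ feed a linear predictor $y = w^T \phi(x) + b$.

```python
import numpy as np

def relu(z):
    # one common choice of nonlinearity r(.)
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, k = 4, 8                      # input dimension, number of learned features
theta = rng.normal(size=(d, k))  # parameters of the feature map phi
w = rng.normal(size=(k,))        # weights of the linear predictor on top
b = 0.0

x = rng.normal(size=(d,))
phi_x = relu(theta.T @ x)        # phi(x) = r(theta^T x), learned features
y = w @ phi_x + b                # y = w^T phi(x) + b
print(y)
```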
Feedforward deep networks • What if we go deeper? Stack hidden layers $h^1, h^2, \dots, h^L$ between the input $x$ and the output $y$
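A short sketch of stacking several hidden layers before the output (the layer sizes, initialization scale, and ReLU activations are assumptions for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 16, 1]       # input, three hidden layers, scalar output
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)      # hidden layers h^1, ..., h^L
    W, b = params[-1]
    return h @ W + b             # linear output layer

print(forward(rng.normal(size=(4,))))
```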
Figure from Deep Learning, by Goodfellow, Bengio, Courville. Dark boxes are things to be learned.
Motivation II: neurons
Motivation: neurons Figure from Wikipedia
Motivation: abstract neuron model • A neuron is activated when the correlation between the input $x = (x_1, \dots, x_d)$ and a pattern $w$ exceeds some threshold $b$ • $y = \mathrm{threshold}(w^T x - b)$ or $y = r(w^T x - b)$ • $r(\cdot)$ is called the activation function
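A tiny sketch of this abstract neuron (the pattern, threshold, and input values are made up for illustration):

```python
import numpy as np

def neuron(x, w, b, r=lambda z: float(z >= 0)):
    # fires when the correlation w^T x exceeds the threshold b
    return r(w @ x - b)

w = np.array([0.5, -0.2, 0.8])       # pattern the neuron responds to
x = np.array([1.0, 0.0, 1.0])        # input
print(neuron(x, w, b=1.0))                       # hard threshold unit
print(neuron(x, w, b=1.0,
             r=lambda z: 1 / (1 + np.exp(-z))))  # sigmoid activation
```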
Motivation: artificial neural networks
Motivation: artificial neural networks • Put neurons into layers: feedforward deep networks with hidden layers $h^1, h^2, \dots, h^L$ between input $x$ and output $y$
Components in Feedforward networks
Components • Representations: input, hidden variables • Layers/weights: hidden layers, output layer
Components • Input $x$; hidden variables $h^1, h^2, \dots, h^L$; output $y$ • The first layer maps the input $x$ to $h^1$; the output layer maps $h^L$ to $y$
Input • Represented as a vector • Sometimes requires some preprocessing, e.g., • Subtract the mean • Normalize to [-1, 1]
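A minimal sketch of these two preprocessing steps, assuming a data matrix X with one example per row (the array and names are illustrative):

```python
import numpy as np

X = np.array([[0., 10., 5.],
              [2., 20., 1.],
              [4., 30., 3.]])          # rows are examples

X_centered = X - X.mean(axis=0)        # subtract the per-feature mean

# scale each feature into [-1, 1]
max_abs = np.abs(X_centered).max(axis=0)
X_scaled = X_centered / np.where(max_abs == 0, 1.0, max_abs)
print(X_scaled)
```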
Output layers • Regression: $y = w^T h + b$ • Linear units: no nonlinearity
Output layers • Multi-dimensional regression: $y = W^T h + b$ • Linear units: no nonlinearity
Output layers • Binary classification: $y = \sigma(w^T h + b)$ • Corresponds to using logistic regression on $h$
Output layers • Multi-class classification: $y = \mathrm{softmax}(z)$ where $z = W^T h + b$ • Corresponds to using multi-class logistic regression on $h$
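A sketch of these output heads on top of the last hidden layer $h$ (the shapes and random values are assumptions for illustration; the softmax subtracts the max for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                    # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(8,))              # last hidden layer

w, b = rng.normal(size=(8,)), 0.0
print("regression:", w @ h + b)              # linear unit, no nonlinearity
print("binary:", sigmoid(w @ h + b))         # logistic regression on h

W, c = rng.normal(size=(8, 3)), np.zeros(3)
print("multi-class:", softmax(W.T @ h + c))  # multi-class logistic regression on h
```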
Hidden layers • Each neuron takes a weighted linear combination of the previous layer $h^i$ • So it can be thought of as outputting one value for the next layer $h^{i+1}$
Hidden layers • $y = r(w^T x + b)$ • Typical activation functions $r(\cdot)$: • Threshold: $t(z) = \mathbb{1}[z \ge 0]$ • Sigmoid: $\sigma(z) = 1/(1 + \exp(-z))$ • Tanh: $\tanh(z) = 2\sigma(2z) - 1$
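The three activations above as a NumPy sketch (the test inputs are arbitrary):

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)        # t(z) = 1[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + exp(-z))

def tanh(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0  # tanh(z) = 2*sigma(2z) - 1

z = np.array([-2.0, 0.0, 2.0])
print(threshold(z), sigmoid(z), tanh(z))
# check the tanh identity against NumPy's implementation
print(np.allclose(tanh(z), np.tanh(z)))
```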
Hidden layers • Problem: saturation of $r(\cdot)$ in its flat regions, where the gradient is too small. Figure borrowed from Pattern Recognition and Machine Learning, Bishop.
Hidden layers • Activation function ReLU (rectified linear unit) • $\mathrm{ReLU}(z) = \max\{z, 0\}$. Figure from Deep Learning, by Goodfellow, Bengio, Courville.
Hidden layers • Activation function ReLU (rectified linear unit) • $\mathrm{ReLU}(z) = \max\{z, 0\}$: gradient 1 for $z > 0$, gradient 0 for $z < 0$
Hidden layers • Generalizations of ReLU: $\mathrm{gReLU}(z) = \max\{z, 0\} + \alpha \min\{z, 0\}$ • Leaky-ReLU: $\max\{z, 0\} + 0.01 \min\{z, 0\}$ • Parametric-ReLU: $\alpha$ learnable
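The ReLU family from the slides as a short sketch (the test values and the sample $\alpha$ are arbitrary):

```python
import numpy as np

def g_relu(z, alpha):
    # gReLU(z) = max{z, 0} + alpha * min{z, 0}
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def relu(z):
    return g_relu(z, alpha=0.0)          # plain ReLU: alpha = 0

def leaky_relu(z):
    return g_relu(z, alpha=0.01)         # leaky ReLU: fixed small negative slope

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z), leaky_relu(z))
# parametric ReLU: same formula, but alpha is a learnable parameter
alpha = 0.25                             # would be updated by gradient descent
print(g_relu(z, alpha))
```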