Deep Learning Tutorial, Part I
Greg Shakhnarovich, TTI-Chicago
December 2016

Overview. Goals of the tutorial: a somewhat organized overview of the basics, and some more advanced topics; demystify jargon; pointers for ...


  1. Linear classification / Learning by gradient descent / Stochastic gradient descent: intuition
     Computing the gradient on all N examples is expensive and may be wasteful: many data points provide similar information.
     Idea: present examples one at a time, and pretend that the gradient on the entire set is the same as the gradient on one example. Formally, estimate the gradient of the loss by that on a single example:
     $$\frac{1}{N}\sum_{i=1}^{N} \nabla_\Theta L(y_i, x_i; \Theta) \approx \nabla_\Theta L(y_t, x_t; \Theta)$$
     Mini-batch version: for some $B \subset [N]$, $|B| \ll N$,
     $$\frac{1}{N}\sum_{i=1}^{N} \nabla_\Theta L(y_i, x_i; \Theta) \approx \frac{1}{|B|}\sum_{t \in B} \nabla_\Theta L(y_t, x_t; \Theta)$$
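
A minimal numerical sketch of this approximation; the toy data, squared loss, and batch size are assumptions for illustration, not from the slides:

```python
import numpy as np

# Toy setup (assumption: linear regression with squared loss, random data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # N = 1000 examples, d = 5 features
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

def grad_single(w, x_i, y_i):
    # Gradient of 0.5 * (w.x - y)^2 with respect to w, on one example.
    return (x_i @ w - y_i) * x_i

full_grad = np.mean([grad_single(w, x, t) for x, t in zip(X, y)], axis=0)
batch = rng.choice(len(X), size=32, replace=False)    # a mini-batch B, |B| << N
batch_grad = np.mean([grad_single(w, X[i], y[i]) for i in batch], axis=0)
print(full_grad, batch_grad)   # the mini-batch estimate is close on average
```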

  2. Linear classification / Learning by gradient descent / Stochastic gradient descent
     An incremental algorithm:
     present examples $(x_i, y_i)$ one at a time;
     modify $w$ slightly to increase the log-probability of the observed $y_i$:
     $$w := w + \eta \frac{\partial}{\partial w} \log p(y_i \mid x_i; w)$$
     where the learning rate $\eta$ determines how "slightly".
     An epoch (full pass through the data) contains N updates instead of one.
     Good practice: shuffle the data each epoch.

  3. Linear classification / Learning by gradient descent / Gradient check
     When implementing gradient-based methods: always include a numerical gradient check (gradcheck).
     Numerical approximation of the partial derivative:
     $$\frac{\partial f(x)}{\partial x_j} \approx \frac{f(x + \delta e_j) - f(x - \delta e_j)}{2\delta}$$
     Note: this centered difference is better than the non-centered $\frac{f(x + \delta e_j) - f(x)}{\delta}$.
     Can compute this for each parameter in a model, with $\delta \approx 10^{-6}$.
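
A sketch of a centered-difference gradcheck; the test function and inputs are hypothetical:

```python
import numpy as np

def numerical_gradient(f, x, delta=1e-6):
    """Centered-difference approximation of the gradient of f at x (use doubles)."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e.flat[j] = delta                 # perturb one coordinate at a time
        grad.flat[j] = (f(x + e) - f(x - e)) / (2 * delta)
    return grad

# Example: f(x) = sum(x^2) has gradient 2x.
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v ** 2), x))   # approx [2, -4, 6]
```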

  4. Linear classification / Learning by gradient descent / Gradient check: tips
     Make sure to use double precision.
     Run on a few data points, at random points in the parameter space.
     Caveat: it may be important to run around important points, e.g., during convergence.
     Find a way to run on a subset of parameters, but be careful how you select them: a subset of weights for each class is OK; weights for a subset of classes is not OK.

  5. Linear classification / Learning by gradient descent / Gradient check evaluation
     Suppose you get the gradient vector $g$ from the (analytic) calculation in your code, and $g'$ from gradcheck. A good value to look at is the relative error
     $$\max_i \frac{|g_i - g'_i|}{\max(|g_i|, |g'_i|)}$$
     Suggested by Andrej Karpathy, who says:
     relative error > 1e-2 usually means the gradient is probably wrong
     1e-2 > relative error > 1e-4 should make you feel uncomfortable
     1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks [soft objective], then 1e-4 is too high.
     1e-7 and less: you should be happy.
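
The relative-error check from this slide, as a small sketch; the small epsilon guard is an added assumption to avoid 0/0:

```python
import numpy as np

def relative_error(g, g_num, eps=1e-12):
    """Max elementwise relative error between analytic and numerical gradients."""
    num = np.abs(g - g_num)
    den = np.maximum(np.abs(g), np.abs(g_num)) + eps   # eps guards 0/0
    return np.max(num / den)

rel = relative_error(np.array([2.0, -4.0, 6.0]),
                     np.array([2.000001, -4.000002, 6.0]))
assert rel < 1e-4   # "usually okay"; aim for < 1e-7 on smooth objectives
```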

  6. Deep learning: introduction / Feature functions
     Machine learning relies almost entirely on linear predictors, but often applied to non-linear features of the data.
     Feature transform $\phi: \mathcal{X} \to \mathbb{R}^d$:
     $$f_y(x; w, b) = w_y \cdot \phi(x) + b_y$$
     Shallow learning: hand-crafted, non-hierarchical $\phi$.
     Basic example: polynomial regression, $\phi_j(x) = x^j$, $j = 0, \ldots, d$, $\hat{y} = w \cdot \phi(x)$.
     Kernel SVM: employing a kernel $K$ corresponds to (some) feature space such that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$; the SVM is just a linear classifier in that space.

  7. Deep learning: introduction / Shallow learning in vision
     Image classification with spatial pyramids: $\phi$ is based on (1) computing SIFT descriptors over a set of points, (2) clustering the descriptors, (3) computing cluster-assignment histograms over various regions, (4) concatenating the histograms.
     Deformable parts model: $\phi$ is based on a set of filters, with a linear classifier on top. No hierarchy.

  8. Deep learning: introduction / Deep learning: definition
     A system that employs a hierarchy of features of the input, learned end-to-end jointly with the predictor:
     $$f_y(x) = F_L(F_{L-1}(F_{L-2}(\cdots F_1(x) \cdots)))$$
     Learning methods that are not deep: SVMs, nearest neighbor classifiers, decision trees, the perceptron.

  9. Deep learning: introduction / Power of two layers
     Theoretical result [Cybenko, 1989]: a 2-layer net with linear output (and sigmoid hidden units) can approximate any continuous function over a compact domain to arbitrary accuracy (given enough hidden units!).
     Example: 3 hidden units with $\tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}$ activation [figure from Bishop].

  10. Deep learning: introduction / Intuition: advantages of depth
      What can we gain from depth? Example: parity of n-bit numbers, with AND, OR, NOT, XOR gates.
      Trivial shallow architecture: express parity as DNF or CNF, but this needs an exponential number of gates!
      Deep architecture: a tree of XOR gates.

  11. Deep learning: introduction / Advantages of depth
      Distributed representations through a hierarchy of features [Y. Bengio].

  12. Deep learning: introduction / History of deep learning
      1950s: Perceptron (Rosenblatt)
      1960s: first AI winter? Minsky and Papert
      1970s-1980s: connectionist models; backprop
      late 1980s: second AI winter (most of modern deep learning already discovered!)
      early 2000s: revival of interest (CIFAR groups)
      ca. 2005: layer-wise pretraining of deep-ish nets
      2010: progress in speech and vision with deep neural nets
      2012: Krizhevsky et al. win ImageNet

  13. Deep learning: introduction / Neural networks
      General form of shallow linear classifiers: the score is computed as
      $$f_y(x; w, b) = w_y \cdot \phi(x) + b_y$$
      Representation as a neural network: input units $x_1, \ldots, x_d$, feature units $\phi_1, \ldots, \phi_m$ (with $\phi_0 \equiv 1$), and output units $y = 1, \ldots, C$ [network diagram].
      Weights $w = [w_1, \ldots, w_C]$, $w_c \in \mathbb{R}^m$; biases $b = [b_1, \ldots, b_C]$.

  14. Deep learning: introduction / Two-layer network
      Idea: learn parametric features
      $$\phi_j(x) = h(w_j^{(1)} \cdot x + b_j^{(1)})$$
      for some nonlinear function $h$ [network diagram with weights $w^{(1)}, w^{(2)}$ and biases $b^{(1)}, b^{(2)}$].

  15. Deep learning: introduction / Feed-forward networks
      Feedforward operation, from input $x$ to output $\hat{y}$:
      $$f_y(x) = \sum_{j=1}^{m} w_{j,y}^{(2)} \, h\!\left(\sum_{i=1}^{d} w_{i,j}^{(1)} x_i + b_j^{(1)}\right) + b_y^{(2)}$$
      In matrix form:
      $$f(x) = W_2 \cdot h(W_1 \cdot x + b_1) + b_2$$
      where $h$ is applied elementwise; $x \in \mathbb{R}^d$, $W_1 \in \mathbb{R}^{m \times d}$, $W_2 \in \mathbb{R}^{C \times m}$, $b_1 \in \mathbb{R}^m$, $b_2 \in \mathbb{R}^C$.
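
The matrix form above as a minimal numpy sketch; the dimensions and the choice of tanh for $h$ are illustrative assumptions:

```python
import numpy as np

d, m, C = 4, 8, 3                        # input dim, hidden units, classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, d)), np.zeros(m)
W2, b2 = rng.normal(size=(C, m)), np.zeros(C)

def forward(x):
    # f(x) = W2 . h(W1 . x + b1) + b2, with h = tanh applied elementwise
    return W2 @ np.tanh(W1 @ x + b1) + b2

scores = forward(rng.normal(size=d))     # one score per class, shape (C,)
```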

  16. Deep learning: introduction / Learning a neural network
      $$f(x) = W_2 \cdot h(W_1 \cdot x + b_1) + b_2$$
      Recall: $\hat{p}(y = c \mid x) = \exp(f_c(x)) / \sum_j \exp(f_j(x))$.
      Softmax loss computed on $f(x)$ vs. the true label $y$:
      $$L(x, y) = -\log \hat{p}(y \mid x) = -f_y(x) + \log \sum_c \exp(f_c(x))$$
      Learning the network: initialize, then run [stochastic] gradient descent, updating according to $\frac{\partial L}{\partial W_2}, \frac{\partial L}{\partial b_2}, \frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial b_1}$.
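
A sketch of the softmax loss; the max-subtraction is a standard numerical-stability trick, not something the slide specifies:

```python
import numpy as np

def softmax_loss(scores, y):
    """L = -f_y + log sum_c exp(f_c), computed stably."""
    shifted = scores - np.max(scores)     # does not change the loss value
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

print(softmax_loss(np.array([2.0, 1.0, -1.0]), y=0))
```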

  17. Deep learning: introduction / Chain rule review: vectors
      Consider the chain (stage-wise) mapping
      $$x \in \mathbb{R}^d \xrightarrow{f} u \in \mathbb{R}^m \xrightarrow{g} v \in \mathbb{R}^c \xrightarrow{h} z \in \mathbb{R}$$
      Computing partial gradients: $\nabla_v z = \frac{\partial z}{\partial v}$;
      $$\frac{\partial z}{\partial u_i} = \sum_j \frac{\partial z}{\partial v_j} \frac{\partial v_j}{\partial u_i} \;\Rightarrow\; \nabla_u z = \left(\frac{\partial v}{\partial u}\right)' \nabla_v z$$
      $$\frac{\partial z}{\partial x_k} = \sum_q \frac{\partial z}{\partial u_q} \frac{\partial u_q}{\partial x_k} \;\Rightarrow\; \nabla_x z = \left(\frac{\partial u}{\partial x}\right)' \nabla_u z$$

  18. Deep learning: introduction / Chain rule review: tensors
      More generally, some of the variables are tensors:
      $$X \in \mathbb{R}^{d_1 \times \cdots \times d_x} \xrightarrow{f} U \in \mathbb{R}^{m_1 \times \cdots \times m_u} \xrightarrow{g} v \in \mathbb{R}^c \xrightarrow{h} z \in \mathbb{R}$$
      $\nabla_X z$ is a tensor of the same dimensions as $X$. Use a single index to indicate index tuples: e.g., if $X$ is 3D, $i = (i_1, i_2, i_3)$ and
      $$(\nabla_X z)_i = \frac{\partial z}{\partial x_{i_1, i_2, i_3}}$$
      Now,
      $$\nabla_X z = \sum_j (\nabla_X U_j)(\nabla_U z)_j$$

  19. Deep learning: introduction / Staged feedforward computation
      To make derivations more convenient, we express the forward computation $(x, y) \to L$ in more detail:
      $$x, W_1, b_1 \to a_1 = W_1 \cdot x + b_1$$
      $$a_1 \to z_1 = h(a_1)$$
      $$z_1, W_2, b_2 \to a_2 = W_2 \cdot z_1 + b_2$$
      $$a_2, y \to L = -e_y \cdot a_2 + \log[\mathbf{1} \cdot \exp(a_2)]$$
      Now we have, e.g.,
      $$\nabla_{z_1} L = \left(\frac{\partial a_2}{\partial z_1}\right)' \nabla_{a_2} L, \qquad \nabla_{W_1} L = \sum_j (\nabla_{W_1} z_{1,j})(\nabla_{z_1} L)_j$$
      What is $\nabla_{W_1} z_{1,j}$ like?

  20. Backpropagation / Backpropagation: general network
      General unit activation in a network (ignoring biases). Unit $t$ receives input from $I(t) = \{i_1, \ldots, i_S\}$ and sends output to $O(t) = \{o_1, \ldots, o_R\}$:
      $$a_t = \sum_{j \in I(t)} w_{jt} z_j, \qquad z_t = h(a_t)$$
      The loss $L$ depends on $w_{jt}$ only through $a_t$:
      $$\frac{\partial L}{\partial w_{jt}} = \frac{\partial L}{\partial a_t} \frac{\partial a_t}{\partial w_{jt}} = \frac{\partial L}{\partial a_t} z_j$$

  21. Backpropagation / Backpropagation: general network
      Starting with $L$, compute the backward (gradient) flow. Note:
      $$a_j = \sum_{i \in I(j)} w_{i,j} h(a_i)$$
      Notation: $d_t = \frac{\partial L}{\partial a_t}$. The backward flow comes to unit $t$ from $O(t)$:
      $$d_t = \sum_{o \in O(t)} \frac{\partial L}{\partial a_o} \frac{\partial a_o}{\partial a_t} = \sum_{o \in O(t)} d_o w_{t,o} h'(a_t) = h'(a_t) \sum_{o \in O(t)} d_o w_{t,o}$$

  22. Backpropagation / Multilayer networks
      Consider a layer $t$ with $n_t$ units:
      $$z_t = h(W_t z_{t-1} + b_t)$$
      where $z_t \in \mathbb{R}^{n_t}$, $b_t \in \mathbb{R}^{n_t}$, $W_t \in \mathbb{R}^{n_t \times n_{t-1}}$, and $h$ is applied element-wise.
      Layer zero reads off the input: $z_0 \equiv x$.
      The last layer $T$ produces a linear output $z_T = W_T z_{T-1} + b_T$, which is used to predict/assess loss (a.k.a. $f$).

  23. Backpropagation / Feed-forward pass
      Compute
      $$a_1 = W_1 \cdot x + b_1, \quad z_1 = h(a_1)$$
      $$a_2 = W_2 \cdot z_1 + b_2, \quad z_2 = h(a_2)$$
      $$\vdots$$
      $$a_{T-1} = W_{T-1} \cdot z_{T-2} + b_{T-1}, \quad z_{T-1} = h(a_{T-1})$$
      $$a_T = W_T \cdot z_{T-1} + b_T, \quad z_T = a_T$$
      Training: compute $L(z_T, y)$. Testing: make the inference $\hat{y}(x) = \arg\max_c z_{T,c}$.
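
A sketch of this pass for a list of layers, caching $a$ and $z$ for the backward pass; parameter shapes follow the multilayer-network slide, and tanh for $h$ is an assumption:

```python
import numpy as np

def forward_pass(x, Ws, bs, h=np.tanh):
    """Feed-forward pass; returns cached raw activations a and outputs z."""
    a_cache, z_cache = [], [x]           # z_0 = x
    z = x
    for t, (W, b) in enumerate(zip(Ws, bs), start=1):
        a = W @ z + b
        z = a if t == len(Ws) else h(a)  # last layer is linear: z_T = a_T
        a_cache.append(a)
        z_cache.append(z)
    return a_cache, z_cache
```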

  24. Backpropagation / Backward pass
      The main backprop equations:
      $$d_t = h'(a_t) \sum_{j \in O(t)} d_j w_{t,j}, \qquad \frac{\partial L}{\partial w_{i,t}} = d_t z_i$$
      Compute gradient information, using the cached $z$ and $a$:
      $$d_T = \frac{\partial L}{\partial a_T}, \qquad \nabla_{W_T} L = d_T \otimes z_{T-1}$$
      $$d_{T-1} = h'(a_{T-1}) * (d_T' W_T), \qquad \nabla_{W_{T-1}} L = d_{T-1} \otimes z_{T-2}$$
      $$\vdots$$
      $$d_t = h'(a_t) * (d_{t+1}' W_{t+1}), \qquad \nabla_{W_t} L = d_t \otimes z_{t-1}$$
      ($u \otimes v$: outer product; $u * v$: element-wise product)
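
A matching backward-pass sketch for the same stack, with the softmax loss at the top; the tanh derivative $1 - z^2$ is tied to the $h$ assumed in the forward sketch:

```python
import numpy as np

def backward_pass(y, a_cache, z_cache, Ws):
    """Backprop through the layers of forward_pass; returns dL/dW per layer."""
    # d_T for the softmax loss is p - e_y, where p is the softmax of the scores.
    scores = a_cache[-1]
    p = np.exp(scores - np.max(scores))
    p /= p.sum()
    d = p.copy()
    d[y] -= 1.0
    grads = []
    for t in range(len(Ws) - 1, -1, -1):
        grads.append(np.outer(d, z_cache[t]))      # grad W_t = d_t (outer) z_{t-1}
        if t > 0:
            # d_{t-1} = h'(a_{t-1}) * (W_t' d_t); with h = tanh, h'(a) = 1 - z^2
            d = (1 - z_cache[t] ** 2) * (Ws[t].T @ d)
    return grads[::-1]
```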

  25. Backpropagation / Modularity
      Basic building block of a neural network: a layer, which defines two directions of computation.
      Forward: pull activations from input units; compute and cache the "raw" activation $a$; compute and output the activation $z = h(a)$.
      Backward: collect gradient information $d$ from output units; calculate the gradient w.r.t. $a$; calculate the gradients w.r.t. weights and biases.
      The only connection between layers is in communicating $z$ and $d$.

  26. Backpropagation / Computational graph
      When implementing backpropagation, existing software falls into two groups:
      Numerical: the interface between layers handles numbers; the computation must be fully specified. Torch, MatConvNet, Caffe.
      Symbolic: the interface includes derivatives (and intermediate stages) as first-class citizens; the computation is specified by a computational graph. Theano, TensorFlow, Caffe2. [Goodfellow et al.]

  27. Activation functions / Choice of non-linearity: 1970s-2010
      sigmoid: $h(a) = \frac{1}{1 + \exp(-a)}$; tanh: $h(a) = \tanh(a)$
      Good: squashes activations to a fixed range.
      Bad: the gradient is nearly zero far away from the midpoint,
      $$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial h(a)} \frac{dh}{da} \approx 0$$
      which can make learning very, very slow. tanh (zero-centered) is preferable to sigmoid.

  28. Activation functions / 2010: ReLU
      Intuition: make the non-linearity non-saturating, at least in part of the range.
      Rectified linear units: $h(a) = \max(0, a)$
      Good: non-saturating; cheap to compute; greatly speeds up convergence compared to sigmoid (by an order of magnitude).

  29. Activation functions / ReLU and dead units
      Problem: if a ReLU unit gets into a state in which all batches in an epoch have zero activation, the unit becomes stuck with zero gradient (it "dies"). [A. Karpathy]

  30. Activation functions / ReLU variants
      Many attempts to improve ReLUs:
      Leaky ReLU: $h(a) = \max(\alpha a, a)$
      Learning $\alpha$: parametric ReLU
      Exponential linear units (ELU):
      $$h(a) = \begin{cases} a & \text{if } a \ge 0, \\ \alpha(\exp(a) - 1) & \text{if } a < 0 \end{cases}$$
      [A. Karpathy]
      ELUs are promising, but more expensive to compute. ReLU is still the default choice; none of the variants are consistently better.
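
The activations from the last few slides as one-line numpy sketches; the $\alpha$ defaults are common illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    return np.maximum(alpha * a, a)

def elu(a, alpha=1.0):
    return np.where(a >= 0, a, alpha * (np.exp(a) - 1.0))
```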

  31. Initialization / Random initialization
      The objective is non-convex; initialization is important.
      Can we initialize with all zeros? Bad idea: all units will learn the same thing!
      Can initialize the weights with small real numbers, e.g., drawn from a Gaussian with zero mean and variance 0.01.
      Problem: the variance of the activations grows with the number of inputs. [A. Karpathy]

  32. Initialization / Xavier initialization
      Idea: normalize the scale to provide roughly equal variance throughout the network.
      The Xavier initialization [Glorot et al.]: if a unit has n inputs, draw from a zero-mean distribution with variance 1/n. [A. Karpathy]

  33. Initialization / Initialization and ReLU
      Assumptions behind Xavier: (1) linear activations, (2) zero-mean activations. This breaks when using ReLUs. [A. Karpathy]

  34. Initialization / Kaiming initialization
      An initialization scheme specifically for ReLUs [He et al.]: zero mean, variance 2/n, where n is the number of inputs.
      The Kaiming initialization is currently recommended for ReLU units. [A. Karpathy]
      Note: it is still OK to initialize biases with zeros.
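
Both schemes as numpy sketches; Gaussian draws with fan-in n (the number of inputs per unit) are the standard reading of these slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_out, n_in):
    # Variance 1/n, assuming (roughly) linear, zero-mean activations.
    return rng.normal(size=(n_out, n_in)) * np.sqrt(1.0 / n_in)

def kaiming_init(n_out, n_in):
    # Variance 2/n, compensating for ReLU zeroing out half the activations.
    return rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)

W1 = kaiming_init(256, 784)   # hypothetical layer sizes
b1 = np.zeros(256)            # biases: zeros are fine
```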

  35. Optimization tricks / Basic stochastic gradient descent
      Model hyperparameters: architecture, regularizer $R$. Optimization hyperparameters: learning rate $\eta$, batch size $B$.
      Initialize the weights and biases. Each epoch: shuffle the data, partition it into batches, and iterate over batches $b$:
      $$w = w - \eta [\nabla_w L(X_b, Y_b) + \nabla_w R(w)]$$
      We have covered initialization. Next: optimization.
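
A minimal epoch loop implementing this update; the loss_grad function is a hypothetical stand-in for backprop, and the $L_2$ regularizer matches the weight-decay slide later in the deck:

```python
import numpy as np

def sgd_train(w, X, Y, loss_grad, lr=0.1, batch_size=32, epochs=10, lam=1e-4):
    """Plain SGD: shuffle each epoch, then iterate over mini-batches."""
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = loss_grad(w, X[idx], Y[idx])     # assumed: mean batch gradient
            w = w - lr * (grad + 2 * lam * w)       # + gradient of lam * ||w||^2
    return w
```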

  36. Optimization tricks / Learning rate
      Generally, for convex functions, gradient descent will converge. Setting the learning rate $\eta$ may be very important to ensure rapid convergence. [from LeCun et al., 1996]

  37. Optimization tricks / Learning rate for neural networks
      For deep networks, setting the right learning rate is crucial. Typical behaviors, monitoring the training loss: [A. Karpathy]

  38. Optimization tricks / Learning rate schedules
      Generally, as with convex functions, we want the learning rate to decay with time.
      Could set up an automatic schedule, e.g., drop by a factor of $\alpha$ every $\beta$ epochs.
      Most common in practice: some degree of babysitting.
      start with a reasonable learning rate
      monitor the training loss
      drop the LR (typically to 1/10) when learning appears stuck
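
Both options as tiny sketches; the drop factor, interval, and the "stuck" heuristic are illustrative assumptions:

```python
def stepped_lr(lr0, epoch, alpha=0.1, beta=30):
    """Automatic schedule: drop by a factor of alpha every beta epochs."""
    return lr0 * alpha ** (epoch // beta)

def babysit_lr(lr, loss_history, patience=5, factor=0.1):
    """Babysitting-style rule: drop the LR when the loss stops improving."""
    if len(loss_history) > patience and \
            min(loss_history[-patience:]) >= min(loss_history[:-patience]):
        lr *= factor
    return lr
```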

  39. Optimization tricks / Monitoring training loss
      Too expensive to evaluate on the entire training set frequently; instead, use a rolling average of the batch loss values. Typical behavior: the red line. [Larsson et al.]
      A few caveats:
      wait a bit before dropping; remember that this is a surrogate loss on the training set (monitor validation accuracy as a precaution)
      better yet, drop the LR based on validation accuracy, not training loss
      do a sanity check on the loss values
      Crashes due to NaNs etc. are often due to a high LR.

  40. Optimization tricks / Gradient Descent with Momentum
      SGD (and GD) has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. [A. Karpathy]
      SGD oscillates across the slopes of the ravine, making only hesitant progress towards the local optimum. Momentum is a method that helps accelerate SGD (and GD) in the relevant direction and dampens oscillations:
      $$\Delta w_t = \gamma \Delta w_{t-1} + \eta_t \nabla f(w_t)$$
      $$w_{t+1} = w_t - \Delta w_t$$
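
The update above as a sketch; gamma = 0.9 is a common choice, not specified on the slide:

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.01, gamma=0.9):
    """One momentum update: velocity accumulates a decaying sum of gradients."""
    velocity = gamma * velocity + lr * grad
    return w - velocity, velocity

w, v = np.zeros(5), np.zeros(5)
# per iteration: w, v = momentum_step(w, v, compute_grad(w))
```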

  41. Optimization tricks / Gradient Descent with Momentum
      Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity, if there is air resistance, i.e. $\gamma < 1$).
      The momentum term increases for dimensions whose gradients point in the same direction, and reduces updates for dimensions whose gradients change direction. Faster convergence, reduced oscillation. [Goodfellow et al.]

  42. Optimization tricks / AdaGrad
      Intuition [Duchi et al.]: parameters (directions of the parameter space) are updated with varying frequency.
      Idea: reduce the learning rate in proportion to the updates. Maintain a cache $s_i$ for each parameter $\theta_i$; when updating,
      $$s_i = s_i + \left(\frac{\partial L}{\partial \theta_i}\right)^2, \qquad w_i = w_i - \eta \frac{\partial L}{\partial \theta_i} \Big/ (\sqrt{s_i} + \epsilon)$$
      Rarely used today (reduces the rate too aggressively).
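
AdaGrad as a sketch; epsilon just guards the division:

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.01, eps=1e-8):
    """Per-parameter LR shrinks with the accumulated squared gradients."""
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache
```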

  43. Optimization tricks / RMSprop
      A modified idea from AdaGrad ["published" in Hinton's Coursera slides]. The cached rate allows for "forgetting":
      $$s_i = \delta s_i + (1 - \delta) \left(\frac{\partial L}{\partial \theta_i}\right)^2, \qquad w_i = w_i - \eta \frac{\partial L}{\partial \theta_i} \Big/ (\sqrt{s_i} + \epsilon)$$
      The decay rate $\delta$ is typically 0.9-0.99.
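
The same step with a leaky cache; delta = 0.9 is taken from the typical range quoted above:

```python
import numpy as np

def rmsprop_step(w, cache, grad, lr=0.001, delta=0.9, eps=1e-8):
    """Like AdaGrad, but the squared-gradient cache decays ("forgets")."""
    cache = delta * cache + (1 - delta) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache
```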

  44. Optimization tricks / Adam optimizer
      Kind of like RMSprop with momentum [Kingma et al.].
      First-order moment update for $w_i$:
      $$m_i = \beta_1 m_i + (1 - \beta_1) \frac{\partial L}{\partial \theta_i}$$
      Second-order:
      $$v_i = \beta_2 v_i + (1 - \beta_2) \left(\frac{\partial L}{\partial \theta_i}\right)^2$$
      Parameter update:
      $$w_i = w_i - \eta \frac{m_i}{\sqrt{v_i} + \epsilon}$$
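
Adam as sketched on the slide, without the bias-correction terms of the full published algorithm; the beta values are the common defaults, not from the slide:

```python
import numpy as np

def adam_step(w, m, v, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """First- and second-moment estimates drive a per-parameter update."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return w - lr * m / (np.sqrt(v) + eps), m, v
```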

  45. Optimization tricks / Warm start
      Suppose we want to continue training a network for more epochs. All significant platforms allow for saving snapshots and resuming.
      Need to be careful with the learning rate: if we re-initialize it too high, we might lose our place in parameter space.
      Technical issues with momentum, Adam, etc.: need to save the relevant state in the snapshots to resume!

  46. Regularization / Review: regularization
      The main challenge in machine learning: overfitting.
      Bias-variance tradeoff: complex models reduce bias (approximation error), but increase variance (estimation error).
      Optimization error: a source of concern when dealing with non-convex models.
      Bayes error: presumed low in vision tasks (?)

  47. Regularization / Review: regularization
      Regularization is a way to control the bias-variance tradeoff. General form of regularized ERM for a model class $\mathcal{M}$:
      $$\min_{M \in \mathcal{M}} \left[\sum_i L(y_i, M(x_i))\right] + \lambda R(M)$$
      For parametric models, the choice of $M$ is determined by setting the value of some $w \in \mathbb{R}^D$.
      Most common form of regularizer $R$: a norm $\sum_d |w_d|^p$ (shrinkage).

  48. Regularization / Review: geometry of regularization
      Can write the unconstrained optimization problem
      $$\min_w \; -\sum_{i=1}^{N} \log \hat{p}(y_i \mid x_i; w) + \lambda \sum_{j=1}^{m} |w_j|^p$$
      as an equivalent constrained problem:
      $$\min_w \; -\sum_{i=1}^{N} \log \hat{p}(y_i \mid x_i; w) \quad \text{subject to} \quad \sum_{j=1}^{m} |w_j|^p \le t$$
      [figure: $\ell_1$ ball $|w_1| + |w_2| \le t$ vs. $\ell_2$ ball $w_1^2 + w_2^2 \le t$, with $\hat{w}_{ML}$, $\hat{w}_{lasso}$, $\hat{w}_{ridge}$]
      $p = 1$ may lead to sparsity; $p = 2$ generally won't.

  49. Regularization / Effect of regularization
      A cartoon of the effect of regularization on bias and variance: [figure from Bishop: $(\text{bias})^2$, variance, $(\text{bias})^2 + \text{variance}$, and test error as functions of $\ln \lambda$]
      In practice, the curves are less clean.

  50. Regularization / Weight decay
      In neural networks, $L_2$ regularization is called weight decay. Note: biases are normally not regularized.
      It is easy to incorporate weight decay into the gradient calculation:
      $$\nabla_w \, \lambda \|w\|^2 = 2\lambda w$$
      One more hyperparameter $\lambda$ to tune; with large data sets, it typically seems inconsequential.

  51. Regularization / Dropout
      Part of overfitting in a neural net: unit co-adaptation. Idea: prevent it by disrupting co-firing patterns.
      Dropout [Srivastava et al.]: during each training iteration, randomly "remove" units, just for that update.

  52. Regularization / Dropout as regularizer
      With each particular dropout set, we have a different network.
      Interpretation: training an ensemble of networks with shared parameters. [Goodfellow et al.]

  53. Regularization / Dropout: implementation
      Dropout introduces a discrepancy between train and test.
      Correction: suppose the survival rate of a unit is $p$; we must scale the weights from that unit (after training) by $p$.
      Modern version: "inverted dropout" scales the activations by $1/p$ during training, and does not scale at test time.
      Typically employed in the top layers; $p = 0.5$ is most common.
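
Inverted dropout as a sketch, with a fresh mask drawn per training update:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(z, p=0.5, train=True):
    """Inverted dropout: scale by 1/p at train time, identity at test time."""
    if not train:
        return z                           # no scaling needed at test time
    mask = (rng.random(z.shape) < p) / p   # keep with prob p, scale survivors
    return z * mask
```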

  54. Convolutional networks / Sparse weight patterns
      One way to regularize is to sparsify the parameters: we can set a subset of the weights to zero.
      If the input has "spatial" semantics, we can keep only the weights for a contiguous set of inputs. [Goodfellow et al.]

  55. Convolutional networks / Receptive field
      Each unit in the upper layer is affected only by a subset of units in the lower layer: its receptive field. [Goodfellow et al.]
      Conversely, each unit in the lower layer only influences a subset of units in the upper layer.

  56. Convolutional networks / Receptive field growth
      Very important: the receptive field size w.r.t. the input in locally connected networks grows with layers, even if within each layer it is fixed. [Goodfellow et al.]

  57. Convolutional networks / Locally connected + parameter sharing
      We can further reduce network complexity by tying the weights of all receptive fields in a layer [figure: untied vs. tied weights, Goodfellow et al.].
      We have now introduced convolutional layers; weight sharing induces equivariance to translation!

  58. Convolutional networks / 2D convolutions
      Note: the filters are not flipped. [Goodfellow et al.]

  59. Convolutional networks / Convolutional layer operation
      Suppose the input to the layer (the output of the previous layer) has C channels: a $W \times H \times C$ tensor. We convolve only along the spatial dimensions: assuming valid convolution with a $k \times k \times C$ filter (the filter must match the number of channels!), we get a $(W - k + 1) \times (H - k + 1) \times 1$ activation map. [A. Karpathy]
      If we have m filters, we get a $(W - k + 1) \times (H - k + 1) \times m$ map as the output of the layer.
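
A naive valid convolution over channels, just to make the shapes concrete; it is deliberately loop-based and slow, unlike the efficient implementations discussed next:

```python
import numpy as np

def conv_layer_valid(x, filters):
    """x: (W, H, C); filters: (m, k, k, C) -> output (W-k+1, H-k+1, m)."""
    W, H, C = x.shape
    m, k, _, _ = filters.shape
    out = np.zeros((W - k + 1, H - k + 1, m))
    for f in range(m):
        for i in range(W - k + 1):
            for j in range(H - k + 1):
                # Sum over the k x k x C receptive field (no filter flipping).
                out[i, j, f] = np.sum(x[i:i+k, j:j+k, :] * filters[f])
    return out

x = np.ones((32, 32, 3))
print(conv_layer_valid(x, np.ones((16, 5, 5, 3))).shape)   # (28, 28, 16)
```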

  60. Convolutional networks / Implementation: efficient convolutions
      Most common: convert the convolution into a matrix multiplication; parallel and GPU-friendly!
      Suppose we have m filters $k_1, \ldots, k_m$ of size $f \times f$, with c channels. Basic idea: pre-compute an index mapping
      $$\text{im2col}: Z \in \mathbb{R}^{S \times S \times c} \to M \in \mathbb{R}^{(S - f + 1)^2 \times f^2 c}$$
      that maps every receptive field to a row of $M$. Collect the filters as columns of $K \in \mathbb{R}^{f^2 c \times m}$. Now simply compute $MK + b$ and reshape to $(S - f + 1) \times (S - f + 1) \times m$.
      Notably, for some cases (in particular small filters, $3 \times 3$ or $5 \times 5$) more efficient implementations use FFT.
      Most software uses 3rd-party (Nvidia, Nervana) implementations under the hood.
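
An im2col-based convolution sketch matching the shapes above; for simplicity the indexing is built on the fly rather than pre-computed:

```python
import numpy as np

def im2col(x, f):
    """x: (S, S, c) -> M: ((S-f+1)^2, f*f*c); one receptive field per row."""
    S, _, c = x.shape
    out = S - f + 1
    M = np.zeros((out * out, f * f * c))
    for i in range(out):
        for j in range(out):
            M[i * out + j] = x[i:i+f, j:j+f, :].ravel()
    return M

def conv_im2col(x, filters, b):
    """filters: (m, f, f, c); returns (S-f+1, S-f+1, m) via one matmul."""
    m, f = filters.shape[0], filters.shape[1]
    K = filters.reshape(m, -1).T           # (f*f*c, m): one filter per column
    out = x.shape[0] - f + 1
    return (im2col(x, f) @ K + b).reshape(out, out, m)

y = conv_im2col(np.ones((8, 8, 3)), np.ones((4, 3, 3, 3)), np.zeros(4))
print(y.shape)   # (6, 6, 4)
```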

  61. Convolutional networks / Conv layer sizing
      If we simply rely on valid convolutions, the maps will quickly shrink. [Goodfellow et al.]
      Instead, we usually pad with zeros. [Goodfellow et al.]
      The usual padding is symmetric, with $(f - 1)/2$ zeros per side: "same" convolution in Matlab speak.

  62. Convolutional networks / Convolution size
      Two extreme cases:
      Filter size equal to the output map size of the previous layer ⇒ a fully connected layer (more on this later).
      Filter size $1 \times 1$ ⇒ the layer simply computes a (non-linear) projection of the features computed by the previous layer. [A. Karpathy]

  63. Convolutional networks / Stride and size
      Convolution with stride > 1 is a cheap way to reduce the spatial dimension of the output.
      It can be implemented as convolution followed by downsampling (as used in LeNet), but that is wasteful! [Goodfellow et al.]
      Modern implementations specify the stride explicitly. [Goodfellow et al.]
      Note: this matches the matrix-multiplication implementation of convolution well!
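
The standard output-size arithmetic for a conv layer with padding and stride, as a sketch; the formula is a well-known sizing fact rather than something stated on this slide:

```python
def conv_output_size(S, f, pad, stride):
    """Spatial output size: floor((S - f + 2*pad) / stride) + 1."""
    return (S - f + 2 * pad) // stride + 1

assert conv_output_size(32, 5, pad=0, stride=1) == 28   # valid convolution
assert conv_output_size(32, 3, pad=1, stride=1) == 32   # "same" convolution
assert conv_output_size(32, 3, pad=1, stride=2) == 16   # strided downsampling
```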

  64. Convolutional networks / Pooling
      Pooling applies a non-parameterized operation to a receptive field. The most common operations: max and average.
      Typically, pooling is applied with a stride > 1, to reduce spatial resolution, but a stride of 1 is possible!
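
A 2x2 max-pooling sketch with stride 2, the common configuration; for brevity it assumes the input dimensions divide evenly by the stride:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """x: (W, H, C) -> (W//stride, H//stride, C); max over each window."""
    W, H, C = x.shape
    out = np.zeros((W // stride, H // stride, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size, :]
            out[i, j] = window.max(axis=(0, 1))    # per-channel max
    return out

print(max_pool(np.arange(32.0).reshape(4, 4, 2)).shape)   # (2, 2, 2)
```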

  65. Convolutional networks / Maxout pooling
      Idea: pool over feature channels, not spatially. This introduces invariance w.r.t. a family of filters. [Goodfellow et al.]
