  1. Deep Learning Basics. Rachel Hu and Zhi Zhang, Amazon AI. d2l.ai

  2. Outline
   • Installations
   • Deep Learning Motivations
   • DeepNumpy & Calculus
   • Regression
   • Optimization
   • Softmax Regression
   • Multilayer Perceptron (train MNIST)

  3. Installations

  4. Installations
   • Python: everyone is using it in machine learning
   • Miniconda: package manager (for simplicity)
   • Jupyter Notebook: so much easier to keep track of your experiments

  5. Installations. Detailed step-by-step instructions for a local install (Mac or Linux): https://d2l.ai/chapter_install/install.html

  6. Deep Learning

  7. Classify Images. http://www.image-net.org/

  8. Classify Images. http://www.image-net.org/ Yanofsky, Quartz: https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/

  9. Detect and Segment Objects. https://github.com/matterport/Mask_RCNN

  10. Style Transfer. https://github.com/zhanghang1989/MXNet-Gluon-Style-Transfer/

  11. Synthesize Faces. Karras et al., arXiv 2019

  12. Analogies. https://nlp.stanford.edu/projects/glove/

  13. Machine Translation. https://www.pcmag.com/news/349610/google-expands-neural-networks-for-language-translation

  14. Text Synthesis. Li et al., NAACL 2018

  15. Question Answering. Q: "What's her mustache made of?" A: "Banana". Pipeline (figure): question type ("Subordinate Object Recognition") guides attention over a vision feature extractor; a text feature extractor encodes the question; the two are combined and fed to a predictor. Shi et al., 2018, arXiv

  16. Image Captioning. Shallue et al., 2016. https://ai.googleblog.com/2016/09/show-and-tell-image-captioning-open.html

  17. Problems we will solve: Classification. Given image x, estimate label y = f(x), where y ∈ {1, …, N} (e.g. cat, dog, rabbit, gerbil).

  18. Problems we will solve: Regression. Given image x, estimate label y = f(x), where y ∈ ℝ (e.g. weights 0.4 kg, 2 kg, 4 kg, 10 kg).

  19. Problems we will solve today: Sequence Models. GPT-2, 2019

  20. DeepNumpy

  21. N-dimensional Arrays. N-dimensional arrays are the main data structure for machine learning and neural networks.
   • 0-d (scalar): 1.0 (a class label)
   • 1-d (vector): [1.0, 2.7, 3.4] (a feature vector)
   • 2-d (matrix): [[1.0, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]] (an example-by-feature matrix)

  22. N-dimensional Arrays
   • 3-d: an RGB image (width x height x channels)
   • 4-d: a batch of RGB images (batch-size x width x height x channels)
   • 5-d: a batch of videos (batch-size x time x width x height x channels)
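The shapes above can be checked directly in NumPy (DeepNumpy's `mxnet.np` mirrors this interface); a minimal sketch, with made-up sizes for the image and video examples:

```python
import numpy as np

scalar = np.array(1.0)                      # 0-d: a class label
vector = np.array([1.0, 2.7, 3.4])          # 1-d: a feature vector
matrix = np.array([[1.0, 2.7, 3.4],
                   [5.0, 0.2, 4.6],
                   [4.3, 8.5, 0.2]])        # 2-d: example-by-feature matrix
image  = np.zeros((32, 32, 3))              # 3-d: width x height x channels
batch  = np.zeros((8, 32, 32, 3))           # 4-d: a batch of RGB images
video  = np.zeros((8, 16, 32, 32, 3))       # 5-d: a batch of videos

print(scalar.ndim, vector.shape, matrix.shape, batch.shape, video.ndim)
```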

  23. Element-wise access, illustrated on a 4 x 4 matrix with entries 1–16:
   • element [1, 2]: the entry in row 1, column 2 (here 7)
   • row [1, :]: all of row 1 (here [5, 6, 7, 8])
   • column [:, 1]: all of column 1 (here [2, 6, 10, 14])
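The same accesses in NumPy, on the 1–16 grid from the slide:

```python
import numpy as np

A = np.arange(1, 17).reshape(4, 4)
# [[ 1,  2,  3,  4],
#  [ 5,  6,  7,  8],
#  [ 9, 10, 11, 12],
#  [13, 14, 15, 16]]

element = A[1, 2]      # row 1, column 2 -> 7
row     = A[1, :]      # all of row 1 -> [5, 6, 7, 8]
column  = A[:, 1]      # all of column 1 -> [2, 6, 10, 14]
block   = A[1:3, 1:3]  # sub-matrix, rows 1-2 and columns 1-2
print(element, row, column)
```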


  25. Calculus: Derivatives. The derivative measures the sensitivity to change of the output value with respect to a change in its input value, e.g. the slope of the tangent (d/dx x² = 2x).

   y:      a    xⁿ       sin(x)   exp(x)   log(x)
   dy/dx:  0    n xⁿ⁻¹   cos(x)   exp(x)   1/x

   y:      u + v            uv                       f(u), u = g(x)
   dy/dx:  du/dx + dv/dx    (du/dx) v + (dv/dx) u    (dy/du)(du/dx)
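A quick numerical sanity check of the table (`numdiff` is a helper we introduce here, using symmetric finite differences; it is not part of the slides):

```python
import math

def numdiff(f, x, h=1e-6):
    """Symmetric finite difference: approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx x^n = n x^(n-1), e.g. n = 3 at x = 2 gives 3 * 2^2 = 12
print(numdiff(lambda x: x ** 3, 2.0))  # ~ 12.0
print(numdiff(math.sin, 1.0))          # ~ cos(1)
print(numdiff(math.log, 2.0))          # ~ 1/2
```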

  26. Calculus: Non-differentiable. Extend the derivative to non-differentiable cases. E.g. y = |x| (slope −1 for x < 0, slope 1 for x > 0):

   ∂|x|/∂x = 1 if x > 0;  −1 if x < 0;  a ∈ [−1, 1] if x = 0

   Another example:

   ∂max(x, 0)/∂x = 1 if x > 0;  0 if x < 0;  a ∈ [0, 1] if x = 0

  27. Calculus: Gradients. The gradient is a multi-variable generalization of the derivative. Shapes of ∂y/∂x:

                    Scalar x         Vector x ∈ ℝⁿ
   Scalar y         (1,)             (1, n)
   Vector y ∈ ℝᵐ    (m, 1)           (m, n)

  28. Derivatives for vectors. For x ∈ ℝⁿ and y ∈ ℝᵐ, with

   x = [x₁, x₂, …, xₙ]ᵀ,  y = [y₁, y₂, …, yₘ]ᵀ,

   the derivative ∂y/∂x ∈ ℝ^(m×n) is the matrix with entries

   [∂y/∂x]ᵢⱼ = ∂yᵢ/∂xⱼ,

   i.e. row i is (∂yᵢ/∂x₁, ∂yᵢ/∂x₂, …, ∂yᵢ/∂xₙ).

  29. Derivatives for vectors. Let's do some exercises! For x ∈ ℝⁿ, y ∈ ℝ, we have ∂y/∂x ∈ ℝ^(1×n) (a is not a function of x; 0 and 1 are vectors):

   y:      a     sum(x)   ‖x‖²    au
   ∂y/∂x:  0ᵀ    1ᵀ       2xᵀ     a ∂u/∂x

   y:      u + v             uv                        ⟨u, v⟩
   ∂y/∂x:  ∂u/∂x + ∂v/∂x     (∂u/∂x) v + (∂v/∂x) u     uᵀ ∂v/∂x + vᵀ ∂u/∂x
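Two rows of the table can be verified numerically (`numgrad` is a finite-difference helper we define for this check, not part of the slides):

```python
import numpy as np

def numgrad(f, x, h=1e-6):
    """Approximate the gradient of a scalar function f at x, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 3.0])
print(numgrad(np.sum, x))           # ~ [1, 1, 1]  (∂ sum(x)/∂x = 1ᵀ)
print(numgrad(lambda v: v @ v, x))  # ~ 2x         (∂ ‖x‖²/∂x = 2xᵀ)
```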

  30. Derivatives for vectors. Let's do some exercises! For x ∈ ℝⁿ, y ∈ ℝᵐ, we have ∂y/∂x ∈ ℝ^(m×n) (a, a, and A are not functions of x; 0 and I are matrices):

   y:      a    x    Ax   xᵀA    au         Au         u + v
   ∂y/∂x:  0    I    A    Aᵀ     a ∂u/∂x    A ∂u/∂x    ∂u/∂x + ∂v/∂x
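Again a numerical check: the Jacobian of y = Ax should be A itself, and the Jacobian of y = x should be the identity (`numjac` is a finite-difference helper we define here, not part of the slides):

```python
import numpy as np

def numjac(f, x, h=1e-6):
    """Approximate the Jacobian ∂y/∂x, shape (m, n), column by column."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
x = np.array([0.5, 1.5, -2.0])

print(numjac(lambda v: A @ v, x))  # ~ A  (∂(Ax)/∂x = A)
print(numjac(lambda v: v, x))      # ~ I  (∂x/∂x = I)
```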

  31. Generalize to Matrices. Shapes of ∂y/∂x:

                     Scalar x (1,)   Vector x (n, 1)   Matrix X (n, k)
   Scalar y (1,)     (1,)            (1, n)            (k, n)
   Vector y (m, 1)   (m, 1)          (m, n)            (m, k, n)
   Matrix Y (m, l)   (m, l)          (m, l, n)         (m, l, k, n)

  32. Chain Rule
   • Scalars: y = f(u), u = g(x):  ∂y/∂x = (∂y/∂u)(∂u/∂x)
   • Vectors: ∂y/∂x = (∂y/∂u)(∂u/∂x), with shapes
     (1, n) = (1,) (1, n),   (1, n) = (1, k) (k, n),   (m, n) = (m, k) (k, n)
   Too many shapes to memorize …

  33. Automatic Differentiation. Computing derivatives by hand is HARD. Chain rule (evaluate e.g. via backprop):

   ∂y/∂x = (∂y/∂uₙ)(∂uₙ/∂uₙ₋₁) … (∂u₂/∂u₁)(∂u₁/∂x)

   Compute graph:
   • Build explicitly (TensorFlow, MXNet Symbol)
   • Build implicitly by tracing (Chainer, PyTorch, DeepNumpy)

  34. Automatic Differentiation. Example: z = (⟨x, w⟩ − y)². Compute graph, from inputs w, x, y:

   • a = ⟨x, w⟩
   • b = a − y
   • z = b²

   Backprop evaluates the chain rule through this graph in reverse.
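Backprop on this small graph can be written out by hand; a sketch in plain NumPy (the input values are made up for illustration), with ∂z/∂w checked against finite differences:

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.5, 2.0])
y = 1.5

# forward pass through the graph: a -> b -> z
a = x @ w        # a = <x, w>
b = a - y        # b = a - y
z = b ** 2       # z = b^2

# backward pass: apply the chain rule node by node, in reverse
dz_db = 2 * b    # ∂z/∂b = 2b
dz_da = dz_db    # ∂b/∂a = 1
dz_dw = dz_da * x  # ∂a/∂w = xᵀ

# finite-difference check of ∂z/∂w
h = 1e-6
num = np.array([((x @ (w + h * e) - y) ** 2 - (x @ (w - h * e) - y) ** 2) / (2 * h)
                for e in np.eye(3)])
print(np.allclose(dz_dw, num, atol=1e-4))  # True
```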

  35. NumPy & AutoGrad notebook

  36. Regression

  37. Can we estimate prices from (time, server, region)?

   g3.4xlarge    $0.41
   g3.8xlarge    $0.73
   g3.16xlarge   $1.37

   p = w_time · t + w_server · s + w_region · r

  38. Linear Model
   • Basic version: ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b
   • Vectorized version: ŷ = ⟨w, x⟩ + b
   • n-dimensional inputs: x = [x₁, x₂, …, xₙ]ᵀ
   • Weights: w = [w₁, w₂, …, wₙ]ᵀ
   • Bias: b
   • Vectorized version (closed form): absorb the bias into the weights, w ← [w; b] and X ← [X, 1], so that ŷ = ⟨w, x⟩
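The bias-absorption trick is easy to confirm in NumPy; a sketch with made-up weights and inputs:

```python
import numpy as np

w = np.array([2.0, -3.4])
b = 4.2
X = np.array([[1.0, 2.0],
              [3.0, 0.5]])

# basic version: y_hat = Xw + b
y_hat = X @ w + b

# closed-form trick: absorb the bias into the weights
w_aug = np.append(w, b)                  # w <- [w; b]
X_aug = np.hstack([X, np.ones((2, 1))])  # X <- [X, 1]
y_aug = X_aug @ w_aug

print(y_hat, y_aug)  # identical predictions
```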

  39. Loss (ℓ₂)
   • Basic version: ℓ(y, ŷ) = (1/2n) Σᵢ (yᵢ − ŷᵢ)²,  with ŷ = w₁x₁ + … + wₙxₙ + b
   • Vectorized version: ℓ(X, y, w, b) = (1/2n) ‖y − Xw − b‖²,  with ŷ = ⟨w, x⟩ + b
   • Vectorized version (closed form): with w ← [w; b] and X ← [X, 1], ℓ(X, y, w) = (1/2n) ‖y − Xw‖²

  40. Objective Function
   • Objective: minimize the training loss

     argmin_w loss ⇔ argmin_w (1/2n) ‖y − Xw‖²
     ⇔ ∂ℓ(X, y, w)/∂w = 0
     ⇔ (1/n) (y − Xw)ᵀ X = 0
     ⇔ w* = (XᵀX)⁻¹ Xᵀ y
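The closed-form solution can be checked on synthetic data (the true weights and noise level here are made up); solving the normal equations recovers the weights used to generate y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
w_true = np.array([2.0, -3.4, 1.1])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)  # small label noise

# normal equations: solve XᵀX w = Xᵀy, i.e. w* = (XᵀX)⁻¹ Xᵀ y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # close to w_true
```

In practice `np.linalg.solve` on the normal equations (or `np.linalg.lstsq`) is preferred over forming the explicit inverse.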

  41. Linear Model as a Single-layer Neural Network. We can stack multiple layers to get deep neural networks.

  42. Linear Regression notebook

  43. Optimization (figure: negative-gradient and momentum directions on a loss surface)

  44. Gradient Descent in 1D. Consider some continuously differentiable real-valued function f : ℝ → ℝ. Using a Taylor expansion we obtain:

   f(x + ϵ) = f(x) + ϵ f′(x) + O(ϵ²)

   Assume we pick a fixed step size η > 0 and choose ϵ = −η f′(x):

   f(x − η f′(x)) = f(x) − η f′(x)² + O(η² f′(x)²)

  45. Gradient Descent in 1D. If the derivative does not vanish, f′(x) ≠ 0, we make progress, since η f′(x)² > 0. Moreover, we can always choose η small enough for the higher-order terms to become irrelevant. Hence we arrive at:

   f(x − η f′(x)) ⪅ f(x)

   This means that, if we use x ← x − η f′(x) to iterate, the value of the function f(x) might decline.
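The update rule x ← x − η f′(x) in a few lines, on a made-up quadratic f(x) = (x − 3)² with minimum at x = 3:

```python
def f(x):
    return (x - 3.0) ** 2  # minimum at x = 3

def df(x):
    return 2 * (x - 3.0)   # f'(x)

eta = 0.1                  # fixed step size eta > 0
x = 10.0                   # arbitrary starting point
for _ in range(100):
    x = x - eta * df(x)    # x <- x - eta * f'(x)
print(x)  # ~ 3.0
```

Each step shrinks the distance to the minimum by a factor of |1 − 2η|; too large a step size would overshoot or diverge.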
