
  1. Applied Math for Deep Learning Prof. Kuan-Ting Lai 2020/3/10

  2. Applied Math for Deep Learning • Linear Algebra • Probability • Calculus • Optimization

  3. Linear Algebra • Scalar − real numbers • Vector (1D) − Has a magnitude & a direction • Matrix (2D) − An array of numbers arranged in rows & columns • Tensor (>=3D) − Multi-dimensional arrays of numbers

  4. Real-world examples of Data Tensors • Timeseries Data – 3D (samples, timesteps, features) • Images – 4D (samples, height, width, channels) • Video – 5D (samples, frames, height, width, channels)

  5. The Matrix

  6. Matrix • Define a matrix with m rows and n columns − Santanu Pattanayak, "Pro Deep Learning with TensorFlow," Apress, 2017

  7. Matrix Operations • Addition and Subtraction

  8. Matrix Multiplication • Given two matrices A (m × n) and B (p × q) • The number of columns of A must equal the number of rows of B, i.e. n == p • A * B = C, where C is m × q
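
  For illustration, a minimal NumPy check of the shape rule (array sizes chosen arbitrarily):

      import numpy as np

      A = np.random.rand(2, 3)     # m x n = 2 x 3
      B = np.random.rand(3, 4)     # p x q = 3 x 4, so n == p
      C = A @ B                    # same as np.matmul(A, B)
      print(C.shape)               # (2, 4) -> m x q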

  9. Example of Matrix Multiplication (3-1) https://www.mathsisfun.com/algebra/matrix-multiplying.html

  10. Example of Matrix Multiplication (3-2) https://www.mathsisfun.com/algebra/matrix-multiplying.html

  11. Example of Matrix Multiplication (3-3) https://www.mathsisfun.com/algebra/matrix-multiplying.html

  12. Matrix Transpose https://en.wikipedia.org/wiki/Transpose

  13. Dot Product • The dot product of two vectors is a scalar • Notation: v₁ ∙ v₂ or v₁^T v₂
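
  For example, with NumPy:

      import numpy as np

      v1 = np.array([1.0, 2.0, 3.0])
      v2 = np.array([4.0, 5.0, 6.0])
      print(np.dot(v1, v2))   # 32.0, a scalar
      print(v1 @ v2)          # same result with the @ operator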

  14. Dot Product of Matrix

  15. Linear Independence • A vector is linearly dependent on other vectors if it can be expressed as a linear combination of the other vectors • A set of vectors v₁, v₂, ⋯, vₙ is linearly independent if a₁v₁ + a₂v₂ + ⋯ + aₙvₙ = 0 implies aᵢ = 0, ∀i ∈ {1, 2, ⋯, n}

  16. Span the Vector Space • n linearly independent vectors can span an n-dimensional space

  17. Rank of a Matrix • Rank is: − The number of linearly independent row or column vectors − The dimension of the vector space generated by its columns • Row rank = Column rank • Example: Row-echelon form https://en.wikipedia.org/wiki/Rank_(linear_algebra)

  18. Identity Matrix I • Any vector or matrix multiplied by I remains unchanged • For a matrix A of size m × n: A Iₙ = Iₘ A = A

  19. Inverse of a Matrix • The product of a square matrix A and its inverse matrix A^(-1) is the identity matrix I • A A^(-1) = A^(-1) A = I • An inverse matrix is square, but not all square matrices have inverses

  20. Pseudo Inverse • A non-square matrix can have a left-inverse or right-inverse matrix • Example: Ax = b, A ∈ ℝ^(m×n), b ∈ ℝ^m − Form the square matrix A^T A: A^T A x = A^T b − Multiply both sides by the inverse matrix (A^T A)^(-1): x = (A^T A)^(-1) A^T b − (A^T A)^(-1) A^T is the (left) pseudo-inverse of A
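
  A small NumPy sketch of this construction on random data (assuming A^T A is invertible):

      import numpy as np

      A = np.random.rand(5, 3)                 # tall matrix: more rows than columns
      b = np.random.rand(5)

      x1 = np.linalg.inv(A.T @ A) @ A.T @ b    # (A^T A)^(-1) A^T b
      x2 = np.linalg.pinv(A) @ b               # NumPy's built-in pseudo-inverse
      print(np.allclose(x1, x2))               # True when A has full column rank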

  21. Norm • A norm is a measure of a vector's magnitude • ℓ₂ norm • ℓ₁ norm • ℓₚ norm • ℓ∞ norm
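
  For example, computing these norms for a 2D vector with NumPy:

      import numpy as np

      x = np.array([3.0, -4.0])
      print(np.linalg.norm(x, 2))       # l2 norm: sqrt(3**2 + 4**2) = 5.0
      print(np.linalg.norm(x, 1))       # l1 norm: |3| + |-4| = 7.0
      print(np.linalg.norm(x, np.inf))  # l-infinity norm: max(|3|, |-4|) = 4.0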

  22. Unit norms in 2D Vectors • The set of all vectors of norm 1 in different 2D norms

  23. L1 and L2 Regularization • Constrained view: minimize the loss subject to a bound on the weight norm (ℓ₁ norm for L1, ℓ₂ norm for L2) https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when

  24. Eigenvectors • An eigenvector is a non-zero vector that is changed only by a scalar factor λ when the linear transform A is applied to it: Ax = λx, A ∈ ℝ^(n×n), x ∈ ℝ^n • x are eigenvectors and λ are eigenvalues • One of the most important concepts in machine learning, e.g.: − Principal Component Analysis (PCA) − Eigenvector centrality − PageRank − …
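
  The defining equation can be checked numerically with NumPy:

      import numpy as np

      A = np.array([[2.0, 0.0],
                    [0.0, 3.0]])
      eigenvalues, eigenvectors = np.linalg.eig(A)   # columns are the eigenvectors
      x = eigenvectors[:, 0]
      print(np.allclose(A @ x, eigenvalues[0] * x))  # True: A x = lambda x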

  25. Example: Shear Mapping • Horizontal axis is the Eigenvector

  26. Power Iteration Method for Computing Eigenvectors 1. Start with a random vector v 2. Calculate iteratively: v^(k+1) = A v^(k) 3. After v^(k) converges, v^(k+1) ≅ v^(k) 4. v^(k) will be the eigenvector with the largest eigenvalue
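
  A minimal sketch of the method in NumPy (with per-step normalization added here for numerical stability):

      import numpy as np

      def power_iteration(A, num_iters=100):
          """Return an approximation of the dominant eigenvector of A."""
          v = np.random.rand(A.shape[0])
          for _ in range(num_iters):
              v = A @ v
              v = v / np.linalg.norm(v)     # keep the vector at unit length
          return v

      A = np.array([[2.0, 1.0],
                    [1.0, 3.0]])
      v = power_iteration(A)
      print(v, v @ A @ v)                   # dominant eigenvector and its eigenvalue (Rayleigh quotient)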

  27. NumPy for Linear Algebra • NumPy is the fundamental package for scientific computing with Python. It contains among other things: − a powerful N-dimensional array object − sophisticated (broadcasting) functions − tools for integrating C/C++ and Fortran code − useful linear algebra, Fourier transform, and random number capabilities

  28. Python & NumPy tutorial • http://cs231n.github.io/python-numpy-tutorial/ • Stanford CS231n: Convolutional Neural Networks for Visual Recognition − http://cs231n.stanford.edu/

  29. Create Tensors • Scalars (0D tensors) • Vectors (1D tensors) • Matrices (2D tensors)
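
  For example, with NumPy:

      import numpy as np

      x0 = np.array(12)                 # scalar (0D tensor)
      x1 = np.array([12, 3, 6, 14])     # vector (1D tensor)
      x2 = np.array([[5, 78, 2],
                     [6, 79, 3]])       # matrix (2D tensor)
      print(x0.ndim, x1.ndim, x2.ndim)  # 0 1 2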

  30. Create 3D Tensor
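
  Stacking 2D tensors along a new axis gives a 3D tensor, for example:

      import numpy as np

      x3 = np.array([[[5, 78, 2],
                      [6, 79, 3]],
                     [[7, 80, 4],
                      [8, 81, 5]]])     # a stack of two 2x3 matrices
      print(x3.ndim, x3.shape)          # 3 (2, 2, 3)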

  31. Attributes of a Tensor • Number of axes (dimensions) − x.ndim • Shape − A tuple of integers giving the number of entries the tensor has along each axis • Data type − uint8, float32 or float64
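
  These attributes can be inspected directly, for instance:

      import numpy as np

      x = np.zeros((60000, 28, 28), dtype=np.uint8)   # e.g. 60,000 grayscale 28x28 images
      print(x.ndim)    # 3 -> number of axes
      print(x.shape)   # (60000, 28, 28)
      print(x.dtype)   # uint8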

  32. Manipulating Tensors in Numpy
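
  The slide's own code is not reproduced here; tensor slicing in NumPy looks like this (using a placeholder array in place of a real dataset):

      import numpy as np

      images = np.zeros((60000, 28, 28), dtype=np.uint8)  # stand-in for a dataset such as MNIST
      batch = images[10:100]                               # select samples 10..99
      print(batch.shape)                                   # (90, 28, 28)
      corner = images[:, :14, :14]                         # top-left 14x14 patch of every image
      print(corner.shape)                                  # (60000, 14, 14)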

  33. Displaying the Fourth Digit

  34. Real-world examples of Data Tensors • Vector data – 2D (samples, features) • Timeseries Data – 3D (samples, timesteps, features) • Images – 4D (samples, height, width, channels) • Video – 5D (samples, frames, height, width, channels)

  35. Batch size & Epochs • A sample − A sample is a single row of data • Batch size − Number of samples used for one iteration of gradient descent − Batch size = 1: stochastic gradient descent − 1 < Batch size < all: mini-batch gradient descent − Batch size = all: batch gradient descent • Epoch − Number of times that the learning algorithm works through all training samples

  36. Element-wise Operations for Matrix • Operate on each element independently
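
  For example, NumPy applies these operations to every element independently:

      import numpy as np

      x = np.array([[1.0, -2.0], [3.0, -4.0]])
      y = np.array([[5.0,  6.0], [7.0,  8.0]])
      print(x + y)             # element-wise addition
      print(x * y)             # element-wise product, not matrix multiplication
      print(np.maximum(x, 0))  # element-wise relu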

  37. NumPy Operation for Matrix • Leverages the Basic Linear Algebra Subprograms (BLAS) • BLAS is implemented in highly optimized C or Fortran

  38. Broadcasting • The smaller tensor is repeated along the missing axes so that its shape matches the larger tensor
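
  A typical case is adding a vector to every row of a matrix:

      import numpy as np

      X = np.random.rand(32, 10)   # e.g. a batch of 32 samples with 10 features
      y = np.random.rand(10)       # a single vector of 10 values
      Z = X + y                    # y is broadcast (virtually repeated) across the 32 rows
      print(Z.shape)               # (32, 10)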

  39. Tensor Dot

  40. Implementation of Dot Product
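
  The original slide's code is not reproduced here; a naive implementation in that spirit might look like:

      import numpy as np

      def naive_vector_dot(x, y):
          """Dot product of two 1D tensors, written out explicitly."""
          assert x.ndim == 1 and y.ndim == 1 and x.shape[0] == y.shape[0]
          z = 0.0
          for i in range(x.shape[0]):
              z += x[i] * y[i]
          return z

      x = np.array([1.0, 2.0, 3.0])
      y = np.array([4.0, 5.0, 6.0])
      print(naive_vector_dot(x, y), np.dot(x, y))   # both print 32.0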

  41. Tensor Reshaping • Rearrange a tensor’s rows and columns to match a target shape
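
  For example:

      import numpy as np

      x = np.arange(6)           # [0 1 2 3 4 5]
      print(x.reshape(2, 3))     # 2 rows, 3 columns
      print(x.reshape(3, 2))     # 3 rows, 2 columns
      # The total number of elements must stay the same (2*3 == 3*2 == 6).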

  42. Matrix Transposition • Transposing a matrix means exchanging its rows and its columns

  43. Unfolding the Manifold • Tensor operations are complex geometric transformations in high-dimensional space − Dimension reduction

  44. Differentiation

  45. Gradient of a Function • The gradient is the multi-variable generalization of the derivative • It is built from partial derivatives: ∇f(x) = (∂f/∂x₁, ∂f/∂x₂, ⋯, ∂f/∂xₙ)

  46. Hessian Matrix • Second-order partial derivatives

  47. Maxima and Minima for a Univariate Function • If df(x)/dx = 0, it is a minimum or a maximum point; we then study the second derivative: − If d²f(x)/dx² < 0 ⇒ maximum − If d²f(x)/dx² > 0 ⇒ minimum − If d²f(x)/dx² = 0 ⇒ point of inflection

  48. Maxima and Minima for a Multivariate Function • Computing the gradient and setting it to the zero vector gives the list of stationary points • For a stationary point x₀ ∈ ℝⁿ − If the Hessian matrix of the function at x₀ has both positive and negative eigenvalues, then x₀ is a saddle point − If the eigenvalues of the Hessian matrix are all positive, the stationary point is a local minimum − If the eigenvalues are all negative, the stationary point is a local maximum

  49. Chain Rule

  50. Symbolic Differentiation • Computation graph: c = a + b, d = b + 1, e = c * d
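
  Working through that graph with the chain rule (input values a = 2, b = 1 chosen arbitrarily):

      # Forward pass through the graph c = a + b, d = b + 1, e = c * d
      a, b = 2.0, 1.0
      c = a + b                           # 3.0
      d = b + 1                           # 2.0
      e = c * d                           # 6.0

      # Backward pass: chain rule along every path from e back to the inputs
      de_dc = d                           # de/dc = d
      de_dd = c                           # de/dd = c
      de_da = de_dc * 1.0                 # de/da = de/dc * dc/da
      de_db = de_dc * 1.0 + de_dd * 1.0   # de/db sums the paths through c and d
      print(de_da, de_db)                 # 2.0 5.0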

  51. Stochastic Gradient Descent 1. Draw a batch of training samples x and corresponding targets y 2. Run the network on x to obtain predictions y_pred 3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y 4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass). 5. Move the parameters a little in the opposite direction from the gradient: W -= step * gradient
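
  A minimal NumPy sketch of one such step for a single-layer linear model (an illustration, not the networks used later in the course):

      import numpy as np

      # One SGD step for a tiny linear model y_pred = x @ W with mean-squared-error loss.
      np.random.seed(0)
      x = np.random.rand(8, 3)                      # 1. a batch of 8 samples, 3 features
      y = np.random.rand(8, 1)                      #    corresponding targets
      W = np.zeros((3, 1))                          # the model's parameters
      step = 0.1

      y_pred = x @ W                                # 2. run the model on x
      loss = np.mean((y_pred - y) ** 2)             # 3. mismatch between y_pred and y
      gradient = 2 * x.T @ (y_pred - y) / len(x)    # 4. gradient of the loss w.r.t. W
      W -= step * gradient                          # 5. move W opposite to the gradient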

  52. Gradient Descent along a 2D Surface

  53. Avoid Local Minimum using Momentum
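
  One common formulation of the momentum update, shown on a toy one-dimensional loss (the coefficients here are illustrative, not taken from the slide):

      # SGD with momentum on the toy loss f(w) = w**2
      momentum, step = 0.9, 0.1
      w, velocity = 5.0, 0.0

      def gradient(w):
          return 2 * w                 # derivative of f(w) = w**2

      for _ in range(200):
          velocity = momentum * velocity - step * gradient(w)  # accumulate past gradients
          w = w + velocity                                      # move by the accumulated velocity
      print(w)                         # close to 0.0, the minimum of f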

  54. Basics of Probability

  55. Three Axioms of Probability • Given an event E in a sample space S, S = ⋃ᵢ₌₁ᴺ Eᵢ • First axiom − P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1 • Second axiom − P(S) = 1 • Third axiom − Additivity: for any countable sequence of mutually exclusive events Eᵢ, P(⋃ᵢ₌₁ⁿ Eᵢ) = P(E₁) + P(E₂) + ⋯ + P(Eₙ) = Σᵢ₌₁ⁿ P(Eᵢ)

  56. Union, Intersection, and Conditional Probability • P(A ∪ B) = P(A) + P(B) − P(A ∩ B) • P(A ∩ B) is abbreviated as P(AB) • Conditional probability P(A|B) is the probability of event A given that B has occurred − P(A|B) = P(AB) / P(B) − P(AB) = P(A|B) P(B) = P(B|A) P(A)

  57. Chain Rule of Probability • The joint probability can be expressed as a chain rule of conditional probabilities: P(A₁A₂⋯Aₙ) = P(A₁) P(A₂|A₁) P(A₃|A₁A₂) ⋯ P(Aₙ|A₁A₂⋯Aₙ₋₁)

  58. Mutually Exclusive Events • P(AB) = 0 • P(A ∪ B) = P(A) + P(B)
