
CS 559: Machine Learning Fundamentals and Applications



  1. CS 559: Machine Learning Fundamentals and Applications, 12th Set of Notes. Instructor: Philippos Mordohai. Webpage: www.cs.stevens.edu/~mordohai. E-mail: Philippos.Mordohai@stevens.edu. Office: Lieb 215

  2. Overview • Deep Learning – Based on slides by M. Ranzato (mainly), S. Lazebnik, R. Fergus and Q. Zhang

  3. Natural Neurons • Human recognition of digits – visual cortices – neuron interaction

  4. Recognizing Handwritten Digits • How to describe a digit to a computer – "a 9 has a loop at the top, and a vertical stroke in the bottom right" – Algorithmically difficult to describe the many variations of a 9

  5. Perceptrons • 1950s–1960s: Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts • Standard model of artificial neurons

  6. Binary Perceptrons • Inputs: multiple binary inputs • Parameters: thresholds & weights • Outputs: thresholded weighted linear combination
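A minimal sketch of such a binary perceptron, assuming NumPy; the specific weights and threshold below are my own illustration, not taken from the slides:

```python
import numpy as np

def perceptron(inputs, weights, threshold):
    """Binary perceptron: output 1 if the weighted sum of the
    binary inputs exceeds the threshold, else 0."""
    return int(np.dot(weights, inputs) > threshold)

# Illustrative parameters: with weights (-2, -2) and threshold -3,
# the perceptron computes NAND of its two binary inputs
# (the NAND similarity is noted on the next slide).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron(np.array([a, b]), np.array([-2.0, -2.0]), -3.0))
```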

  7. Layered Perceptrons • Layered, complex model: 1st layer, 2nd layer of perceptrons • Perceptron rule • Weights, thresholds • Similarity to logical functions (NAND)

  8. Sigmoid Neurons • Stability: small perturbation, small output change • Continuous inputs • Continuous outputs • Soft thresholds
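A hedged sketch contrasting the sigmoid neuron with the hard-threshold perceptron above; the function names and example values are my own:

```python
import numpy as np

def sigmoid(z):
    # Smooth "soft threshold": maps any real-valued z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    """Like a perceptron, but with continuous inputs and a continuous output.
    A small change in the weights or bias produces a small change in the
    output, which is the stability property mentioned above."""
    return sigmoid(np.dot(weights, inputs) + bias)

print(sigmoid_neuron(np.array([0.3, 0.7]), np.array([0.5, -1.0]), 0.1))
```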

  9. Output Functions • Sigmoid neurons • Output • Sigmoid vs. conventional thresholds

  10. Smoothness & Differentiability • Perturbations and derivatives • Continuous function • Differentiable • Layers: input layers, output layers, hidden layers

  11. Layer Structure Design • Design of the hidden layers • Heuristic rules • Number of hidden layers vs. computational resources • Feedforward network: no loops involved

  12. Cost Function & Optimization • Learning with gradient descent • Cost function: Euclidean loss • Non-negative, smooth, differentiable
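As a sketch, one cost matching the bullets above is the quadratic (Euclidean) loss averaged over the training set; the exact normalization used here is an assumption, not taken from the slides:

```python
import numpy as np

def euclidean_cost(predictions, targets):
    """Quadratic (Euclidean) cost, averaged over P training examples.

    predictions, targets: arrays of shape (P, output_dim).
    The cost is non-negative, smooth, and differentiable in the network
    parameters -- the properties gradient descent relies on.
    """
    return 0.5 * np.mean(np.sum((predictions - targets) ** 2, axis=1))

print(euclidean_cost(np.array([[0.9, 0.1]]), np.array([[1.0, 0.0]])))  # 0.01
```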

  13. Cost Function & Optimization • Gradient descent • Gradient vector

  14. Cost Function & Optimization • Extension to multiple dimensions: m variables • Small change in the variables → small change in the cost
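A minimal gradient-descent sketch over m variables: for a small change Δv the cost changes by ΔC ≈ ∇C · Δv, so stepping along −η∇C decreases the cost. The toy cost and learning rate below are illustrative assumptions:

```python
import numpy as np

def gradient_descent_step(v, grad_C, eta=0.1):
    """One update v <- v - eta * grad_C(v) on a vector of m variables.

    Choosing the change Δv = -eta * ∇C gives ΔC ≈ -eta * ||∇C||^2 <= 0,
    so each small step (approximately) reduces the cost.
    """
    return v - eta * grad_C(v)

# Toy cost C(v) = v1^2 + 2*v2^2, whose gradient is known in closed form.
grad_C = lambda v: np.array([2.0 * v[0], 4.0 * v[1]])
v = np.array([1.0, 1.0])
for _ in range(100):
    v = gradient_descent_step(v, grad_C)
print(v)  # converges toward the minimizer [0, 0]
```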

  15. Neural Nets for Computer Vision • Based on tutorials at CVPR 2012 and 2014 by Marc’Aurelio Ranzato

  16. Building an Object Recognition System • IDEA: use data to optimize features for the given task

  17. Building an Object Recognition System • What we want: a parameterized function such that a) features are computed efficiently, and b) features can be trained efficiently

  18. Building an Object Recognition System • Everything becomes adaptive • No distinction between feature extractor and classifier • Big non-linear system trained from raw pixels to labels

  19. Building an Object Recognition System • Q: How can we build such a highly non-linear system? • A: By combining simple building blocks we can make more and more complex systems

  20. Building a Complicated Function • Function composition is at the core of deep learning methods • Each “simple function” will have parameters subject to training
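A small sketch of the composition idea; in a real network each "simple function" would carry trainable parameters, whereas the toy functions here are fixed and purely illustrative:

```python
def compose(*funcs):
    """Compose simple functions f1, f2, ..., fn into x -> f1(f2(...fn(x)...))."""
    def composed(x):
        for f in reversed(funcs):
            x = f(x)
        return x
    return composed

# Toy building blocks (stand-ins for parameterized layers).
scale = lambda x: 2 * x
shift = lambda x: x + 1
clip  = lambda x: max(0, x)      # a ReLU-like non-linearity

f = compose(clip, shift, scale)  # f(x) = max(0, 2*x + 1)
print(f(-3), f(2))               # 0 5
```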

  21. Implementing a Complicated Function

  22. Intuition Behind Deep Neural Nets

  23. Intuition Behind Deep Neural Nets • Each black box can have trainable parameters. Their composition makes a highly non-linear system.

  24. Intuition Behind Deep Neural Nets • The system produces a hierarchy of features

  25. Intuition Behind Deep Neural Nets

  26. Intuition Behind Deep Neural Nets

  27. Intuition Behind Deep Neural Nets

  28. Key Ideas of Neural Nets • IDEA #1: Learn features from data • IDEA #2: Use differentiable functions that produce features efficiently • IDEA #3: End-to-end learning: no distinction between feature extractor and classifier • IDEA #4: “Deep” architectures: a cascade of simpler non-linear modules

  29. Key Questions • What is the input-output mapping? • How are parameters trained? • How computationally expensive is it? • How well does it work?

  30. Supervised Deep Learning • Marc’Aurelio Ranzato

  31. Supervised Learning • Training set: {(x_i, y_i), i = 1, ..., P} • x_i: i-th input training example • y_i: i-th target label • P: number of training examples • Goal: predict the target label of unseen inputs

  32. Supervised Learning Examples

  33. Supervised Deep Learning

  34. Neural Networks • Assumptions (for the next few slides): the input image is vectorized (disregard the spatial layout of pixels); the target label is discrete (classification) • Question: what class of functions shall we consider to map the input into the output? • Answer: composition of simpler functions • Follow-up questions: Why not a linear combination? What are the “simpler” functions? What is the interpretation? • Answer: later...

  35. Neural Networks: Example • x: input • h1: 1st-layer hidden units • h2: 2nd-layer hidden units • o: output • Example of a 2-hidden-layer neural network (or 4-layer network, counting also input and output)

  36. Forward Propagation • Forward propagation is the process of computing the output of the network given its input

  37. Forward Propagation • W1: 1st-layer weight matrix (weights) • b1: 1st-layer biases • The non-linearity u = max(0, v) is called ReLU in the DL literature • Each output hidden unit takes as input all the units at the previous layer: each such layer is called “fully connected”

  38. Rectified Linear Unit (ReLU)

  39. Forward Propagation • W2: 2nd-layer weight matrix (weights) • b2: 2nd-layer biases

  40. Forward Propagation • W3: 3rd-layer weight matrix (weights) • b3: 3rd-layer biases
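Putting slides 36-40 together, a minimal forward-propagation sketch for the 2-hidden-layer network of slide 35, assuming NumPy; the layer sizes below are arbitrary illustrations:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)              # the ReLU non-linearity u = max(0, v)

def forward(x, params):
    """Forward propagation: compute the output o from the input x.

    params holds (W1, b1), (W2, b2), (W3, b3). Every hidden unit takes all
    units of the previous layer as input, i.e. each layer is fully connected.
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(W1 @ x + b1)                 # 1st-layer hidden units
    h2 = relu(W2 @ h1 + b2)                # 2nd-layer hidden units
    o = W3 @ h2 + b3                       # output (class scores)
    return o

# Illustrative sizes: 4 inputs -> 5 hidden -> 3 hidden -> 2 outputs.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((5, 4)), np.zeros(5)),
          (rng.standard_normal((3, 5)), np.zeros(3)),
          (rng.standard_normal((2, 3)), np.zeros(2))]
print(forward(rng.standard_normal(4), params))
```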

  41. Alternative Graphical Representations

  42. Interpretation • Question: Why can't the mapping between layers be linear? • Answer: Because the composition of linear functions is a linear function; the neural network would reduce to (1-layer) logistic regression • Question: What do ReLU layers accomplish? • Answer: Piece-wise linear tiling: the mapping is locally linear
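A quick numerical check of the first answer above (illustrative shapes, assuming NumPy): stacking two linear layers without a non-linearity collapses into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 4))   # "1st layer" weights
W2 = rng.standard_normal((2, 3))   # "2nd layer" weights
x = rng.standard_normal(4)

# Two linear layers are equivalent to one linear layer with weights W2 @ W1.
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True
```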

  43. Interpretation • Question: Why do we need many layers? • Answer: When the input has hierarchical structure, the use of a hierarchical architecture is potentially more efficient because intermediate computations can be re-used • DL architectures are also efficient because they use distributed representations which are shared across classes

  44. Interpretation

  45. Interpretation • Distributed representations • Feature sharing • Compositionality
