ECE 5984: Introduction to Machine Learning
Topics: Neural Networks, Backprop


  1. ECE 5984: Introduction to Machine Learning Topics: – Neural Networks – Backprop Readings: Murphy 16.5 Dhruv Batra Virginia Tech

  2. Administrativia • HW3 – Due: in 2 weeks – You will implement primal & dual SVMs – Kaggle competition: Higgs Boson Signal vs Background classification – https://inclass.kaggle.com/c/2015-Spring-vt-ece-machine-learning-hw3 – https://www.kaggle.com/c/higgs-boson (C) Dhruv Batra 2

  3. Administrativia • Project Mid-Sem Spotlight Presentations – Friday: 5-7pm, 3-5pm, Whittemore 654 – 5 slides (recommended) – 4-minute time limit (STRICT) + 1-2 min Q&A – Tell the class what you’re working on – Any results yet? – Problems faced? – Upload slides on Scholar (C) Dhruv Batra 3

  4. Recap of Last Time (C) Dhruv Batra 4

  5. Not linearly separable data • Some datasets are not linearly separable! – http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html

  6. Addressing non-linearly separable data – Option 1: non-linear features • Choose non-linear features, e.g., – Typical linear features: w_0 + Σ_i w_i x_i – Example of non-linear features: • Degree-2 polynomials, w_0 + Σ_i w_i x_i + Σ_ij w_ij x_i x_j • Classifier h_w(x) is still linear in the parameters w – As easy to learn – Data is linearly separable in higher-dimensional spaces – Express via kernels (C) Dhruv Batra Slide Credit: Carlos Guestrin 6
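As a concrete illustration of the degree-2 polynomial features mentioned on this slide, here is a minimal NumPy sketch (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def degree2_features(x):
    """Map x = (x_1, ..., x_D) to [1, x_i, x_i * x_j] so that a classifier
    linear in the new features is quadratic in the original inputs."""
    x = np.asarray(x, dtype=float)
    pairwise = np.outer(x, x)[np.triu_indices(len(x))]  # all x_i * x_j terms (i <= j)
    return np.concatenate(([1.0], x, pairwise))

# Example: the 2-D point (1, -1) becomes a 6-dimensional feature vector
print(degree2_features([1.0, -1.0]))  # [ 1.  1. -1.  1. -1.  1.]
```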

  7. Addressing non-linearly separable data – Option 2: non-linear classifier • Choose a classifier h_w(x) that is non-linear in the parameters w, e.g., – Decision trees, neural networks, … • More general than linear classifiers • But can often be harder to learn (non-convex optimization required) • Often very useful (outperforms linear classifiers) • In a way, both ideas are related (C) Dhruv Batra Slide Credit: Carlos Guestrin 7

  8. Biological Neuron (C) Dhruv Batra 8

  9. Recall: The Neuron Metaphor • Neurons – accept information from multiple inputs, – transmit information to other neurons. • Multiply inputs by weights along edges • Apply some function to the set of inputs at each node Slide Credit: HKUST 9

  10. Types of Neurons • Linear Neuron, Perceptron, Logistic Neuron (diagrams: each takes inputs x_1 … x_D plus a constant input 1, weights θ_0 … θ_D, and outputs f(x, θ)) • Potentially more • Require a convex loss function for gradient descent training Slide Credit: HKUST 10
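To make the three neuron types concrete, here is a small NumPy sketch (my reconstruction; the slide shows only diagrams):

```python
import numpy as np

def pre_activation(x, theta):
    """theta[0] weights the constant input 1 (bias); theta[1:] weight the inputs x."""
    return theta[0] + np.dot(theta[1:], x)

def linear_neuron(x, theta):      # identity activation
    return pre_activation(x, theta)

def perceptron(x, theta):         # hard threshold activation
    return 1.0 if pre_activation(x, theta) >= 0 else 0.0

def logistic_neuron(x, theta):    # sigmoid activation
    return 1.0 / (1.0 + np.exp(-pre_activation(x, theta)))
```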

  11. Limitation • A single “neuron” is still a linear decision boundary • What to do? • Idea: Stack a bunch of them together! (C) Dhruv Batra 11

  12. Multilayer Networks • Cascade Neurons together • The output from one layer is the input to the next • Each layer has its own set of weights (diagram: inputs x_0 … x_P feed hidden units with weight vectors θ_{0,j}; their outputs feed a second layer with weight vectors θ_{1,j}; output weights θ_{2,j} combine these into f(x, θ)) Slide Credit: HKUST 12

  13. Universal Function Approximators • Theorem – A 3-layer network with linear outputs can uniformly approximate any continuous function to arbitrary accuracy, given enough hidden units [Funahashi ’89] (C) Dhruv Batra 13
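Stated slightly more formally (a standard phrasing of the result, not copied from the slide): for any continuous target on a compact domain and any tolerance, enough sigmoidal hidden units suffice:

```latex
\forall\, f \in C(K),\ K \subset \mathbb{R}^{D} \text{ compact},\ \forall\, \varepsilon > 0:\quad
\exists\, H,\ \{v_h, w_h, b_h\}_{h=1}^{H} \ \text{such that}\
\sup_{x \in K}\Big|\, f(x) - \sum_{h=1}^{H} v_h\, \sigma\!\big(w_h^{\top} x + b_h\big) \,\Big| < \varepsilon
```

where σ is a sigmoid; this is the sense in which a 3-layer network (input, hidden layer, linear output) is a universal approximator.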

  14. Plan for Today • Neural Networks – Parameter learning – Backpropagation (C) Dhruv Batra 14

  15. Forward Propagation • On board (C) Dhruv Batra 15
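Since the forward pass is worked out on the board, here is a minimal NumPy sketch of what it computes for a network of logistic units (layer sizes, sigmoid activations, and names are my assumptions, not the lecture's derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """thetas is a list of weight matrices, one per layer; each layer's input
    is augmented with a constant 1 so the bias sits in the first column."""
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = sigmoid(theta @ np.concatenate(([1.0], a)))  # z = θ · [1; a],  a = σ(z)
    return a

# Example: 2 inputs -> 3 hidden units -> 1 output (random weights)
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
print(forward([0.5, -1.0], thetas))
```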

  16.–21. Feed-Forward Networks • Predictions are fed forward through the network to classify (the network diagram from slide 12, repeated across six animation steps) Slide Credit: HKUST

  22. Gradient Computation • First let’s try: – Single Neuron for Linear Regression – Single Neuron for Logistic Regression (C) Dhruv Batra 22

  23. Logistic regression • Learning rule – MLE: (C) Dhruv Batra Slide Credit: Carlos Guestrin 23
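The MLE-based learning rule on this slide appears only as an image; a standard form (my reconstruction, with σ the logistic function and η the learning rate) is gradient ascent on the conditional log-likelihood:

```latex
\ell(w) = \sum_{n} \Big[ y^{(n)} \log \sigma\!\big(w^{\top} x^{(n)}\big)
      + \big(1 - y^{(n)}\big) \log\!\big(1 - \sigma(w^{\top} x^{(n)})\big) \Big],
\qquad
\frac{\partial \ell}{\partial w_j} = \sum_{n} x_j^{(n)} \big( y^{(n)} - \sigma(w^{\top} x^{(n)}) \big),
\qquad
w_j \leftarrow w_j + \eta \sum_{n} x_j^{(n)} \big( y^{(n)} - \sigma(w^{\top} x^{(n)}) \big)
```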

  24. Gradient Computation • First let’s try: – Single Neuron for Linear Regression – Single Neuron for Logistic Regression • Now let’s try the general case • Backpropagation! – Really efficient (C) Dhruv Batra 24
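As a rough illustration of what backprop computes, here is a minimal NumPy sketch of the backward pass for the logistic network from the forward-pass sketch above, using a squared-error loss (the loss choice and all names are my assumptions, not the lecture's derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, thetas):
    """Return the squared-error loss and its gradient w.r.t. each weight matrix."""
    # Forward pass: remember each layer's (bias-augmented) input and activation.
    activations = [np.asarray(x, dtype=float)]
    inputs = []
    for theta in thetas:
        a_aug = np.concatenate(([1.0], activations[-1]))       # prepend bias input
        inputs.append(a_aug)
        activations.append(sigmoid(theta @ a_aug))
    # Backward pass: propagate the error signal from the output back down.
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
    grads = [None] * len(thetas)
    for l in reversed(range(len(thetas))):
        grads[l] = np.outer(delta, inputs[l])                   # dL/dθ_l
        if l > 0:
            back = thetas[l].T @ delta                          # error sent to layer below
            delta = back[1:] * activations[l] * (1 - activations[l])  # drop the bias row
    loss = 0.5 * np.sum((activations[-1] - y) ** 2)
    return loss, grads
```

The point of the algorithm is exactly this reuse: every quantity needed for the gradients is computed once on the way down, instead of re-deriving each partial derivative from scratch.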

  25. Neural Nets • Best performers on OCR – http://yann.lecun.com/exdb/lenet/index.html • NetTalk – Text to Speech system from 1987 – http://youtu.be/tXMaFhO6dIY?t=45m15s • Rick Rashid speaks Mandarin – http://youtu.be/Nu-nlQqFCKg?t=7m30s (C) Dhruv Batra 25

  26. Neural Networks • Demo – http://neuron.eng.wayne.edu/bpFunctionApprox/ bpFunctionApprox.html (C) Dhruv Batra 26

  27. Historical Perspective (C) Dhruv Batra 27

  28. Convergence of backprop • Perceptron leads to a convex optimization problem – Gradient descent reaches the global minimum • Multilayer neural nets are not convex – Gradient descent gets stuck in local minima – Hard to set the learning rate – Selecting the number of hidden units and layers = fuzzy process – NNs fell out of fashion in the 90s and early 2000s – Back with a new name and significantly improved performance!!!! • Deep networks – Dropout and training on much larger corpora (C) Dhruv Batra Slide Credit: Carlos Guestrin 28

  29. Overfitting • Many many many parameters • Avoiding overfitting? – More training data – Regularization – Early stopping (C) Dhruv Batra 29
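A minimal sketch of the early-stopping idea from this slide: monitor a held-out set and keep the best weights seen so far (the train_epoch and validation_loss helpers are placeholders I am assuming, not part of the lecture):

```python
import copy

def train_with_early_stopping(model, train_epoch, validation_loss,
                              max_epochs=200, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_model, epochs_since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_epoch(model)                        # one pass of gradient descent (assumed helper)
        loss = validation_loss(model)             # loss on held-out data (assumed helper)
        if loss < best_loss:
            best_loss, best_model, epochs_since_best = loss, copy.deepcopy(model), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:     # no improvement for a while: stop
                break
    return best_model
```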

  30. A quick note (C) Dhruv Batra Image Credit: LeCun et al. ‘98 30

  31. Rectified Linear Units (ReLU) (C) Dhruv Batra 31
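This slide's content is a figure; the unit it refers to is (standard definition, not reproduced from the slide):

```latex
\mathrm{ReLU}(z) = \max(0, z),
\qquad
\frac{d}{dz}\,\mathrm{ReLU}(z) =
\begin{cases}
1 & z > 0 \\
0 & z < 0
\end{cases}
```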

  32. Convolutional Nets • Basic Idea – On board – Assumptions: • Local Receptive Fields • Weight Sharing / Translational Invariance / Stationarity – Each layer is just a convolution! (figure: input image → convolutional layer → sub-sampling layer) (C) Dhruv Batra Image Credit: Chris Bishop 32
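To show what "each layer is just a convolution" means with local receptive fields and weight sharing, here is a minimal NumPy sketch of one convolutional layer followed by 2x2 sub-sampling (the kernel size and average pooling are my choices for illustration):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image: every output unit sees only a
    local patch (receptive field), and all units reuse the same weights."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def subsample2x2(fmap):
    """Average-pool non-overlapping 2x2 blocks (the 'sub-sampling' layer)."""
    H, W = fmap.shape
    return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

image = np.random.rand(28, 28)
kernel = np.random.randn(5, 5)          # one shared 5x5 filter
fmap = conv2d_valid(image, kernel)      # 24x24 feature map
pooled = subsample2x2(fmap)             # 12x12 after sub-sampling
print(fmap.shape, pooled.shape)
```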

  33.–41. [Image-only slides; no extractable text] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato

  42. Convolutional Nets • Example: LeNet – http://yann.lecun.com/exdb/lenet/index.html – Architecture (from the figure): INPUT 32x32 → convolutions → C1: feature maps 6@28x28 → subsampling → S2: f. maps 6@14x14 → convolutions → C3: f. maps 16@10x10 → subsampling → S4: f. maps 16@5x5 → full connection → C5: layer 120 → F6: layer 84 → Gaussian connections → OUTPUT 10 (C) Dhruv Batra Image Credit: Yann LeCun, Kevin Murphy 42
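The spatial sizes in the figure are consistent with 5x5 valid convolutions and 2x2 subsampling (a quick check of the arithmetic, not stated on the slide):

```latex
32 - 5 + 1 = 28, \qquad 28 / 2 = 14, \qquad 14 - 5 + 1 = 10, \qquad 10 / 2 = 5
```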

  43.–45. [Image-only slides; no extractable text] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato

  46.–48. Visualizing Learned Filters (three figure slides) (C) Dhruv Batra Figure Credit: [Zeiler & Fergus ECCV14]

  49. Autoencoders • Goal – Compression: Output tries to predict input (C) Dhruv Batra Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders 49
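A minimal sketch of what "output tries to predict input" means, as a one-hidden-layer autoencoder scored by reconstruction error (the sizes and names are assumptions for illustration, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_loss(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into a low-dimensional code, decode it back,
    and measure how well the reconstruction matches the input."""
    h = sigmoid(W_enc @ x + b_enc)     # low-dimensional code ("basis" coefficients)
    x_hat = W_dec @ h + b_dec          # reconstruction of the input
    return 0.5 * np.sum((x_hat - x) ** 2), h

# Example: compress a 64-dimensional input to a 16-dimensional code
rng = np.random.default_rng(0)
x = rng.random(64)
W_enc, b_enc = rng.normal(size=(16, 64)), np.zeros(16)
W_dec, b_dec = rng.normal(size=(64, 16)), np.zeros(64)
loss, code = autoencoder_loss(x, W_enc, b_enc, W_dec, b_dec)
print(loss, code.shape)
```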

  50. Autoencoders • Goal – Learns a low-dimensional “basis” for the data (C) Dhruv Batra Image Credit: Andrew Ng 50

  51. Stacked Autoencoders • How about we compress the low-dim features more? (C) Dhruv Batra Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders 51

  52. Sparse DBNs [Lee et al. ICML ‘09] Figure courtesy: Quoc Le (C) Dhruv Batra 52

  53. Stacked Autoencoders • Finally perform classification with these low-dim features. (C) Dhruv Batra Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders 53

  54. What you need to know about neural networks • Perceptron: – Representation – Derivation • Multilayer neural nets – Representation – Derivation of backprop – Learning rule – Expressive power
