CSE 152: Computer Vision, Lecture 9: Convolutional Neural Network and Learning (Hao Su)


  1. CSE 152: Computer Vision Hao Su Lecture 9: Convolutional Neural Network and Learning

  2. Recap: Bias and Variance • Bias – error caused because the model lacks the ability to represent the (complex) concept • Variance – error caused because the learning algorithm overreacts to small changes (noise) in the training data. Total loss = Bias + Variance (+ noise)
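
For squared-error loss, the slide's identity can be written out precisely (a standard result, stated here only for reference), with f the true function, \hat{f} the learned model, and \sigma^2 the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```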

  3. Credit: Elements of Statistical Learning, Second edition

  4. Recap: Universality Theorem Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
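
A minimal PyTorch sketch of the kind of one-hidden-layer network the theorem refers to; the dimensions and activation below are illustrative choices, not from the slides:

```python
import torch
import torch.nn as nn

# One hidden layer mapping R^N -> R^M; with enough hidden units such a
# network can approximate any continuous function on a compact set.
N, M, hidden = 3, 2, 256   # sizes are illustrative

one_hidden_layer_net = nn.Sequential(
    nn.Linear(N, hidden),  # affine map into the hidden layer
    nn.Sigmoid(),          # any non-polynomial activation works (e.g. ReLU)
    nn.Linear(hidden, M),  # affine read-out to R^M
)

x = torch.randn(8, N)          # a batch of 8 input points
y = one_hidden_layer_net(x)    # shape: (8, M)
```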

  5. Recap: Universality is Not Enough • Neural networks have very high capacity (millions of parameters) • By the bias-variance tradeoff, so many parameters should imply very low bias but very high variance, so the test loss may not be small • Much of the effort in deep learning goes into mitigating overfitting!

  6. Address Overfitting for NN • Use larger training data set • Design better network architecture

  7. Address Overfitting for NN • Use larger training data set • Design better network architecture

  8. Convolutional Neural Network

  9. Images as input to neural networks

  10. Images as input to neural networks

  11. Images as input to neural networks

  12. Convolutional Neural Networks • CNN = a multi-layer neural network with – Local connectivity: neurons in a layer are connected only to a small region of the layer before it – Shared weight parameters across spatial positions: learning shift-invariant filter kernels. Image credit: A. Karpathy; Jia-Bin Huang and Derek Hoiem, UIUC
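
A minimal PyTorch sketch of these two properties; the channel counts and image size are illustrative:

```python
import torch
import torch.nn as nn

# A convolutional layer: each output value looks at a local 3x3 window
# (local connectivity), and the same 3x3 kernel is reused at every spatial
# position (weight sharing).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)      # one 32x32 RGB image
y = conv(x)                        # shape: (1, 16, 32, 32)

# The whole layer holds only 16 * (3*3*3) weights + 16 biases,
# regardless of the image size - that is the effect of weight sharing.
print(conv.weight.shape)           # torch.Size([16, 3, 3, 3])
```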

  13. Perceptron: This is convolution!

  14. Recap: Image filtering [figure: a 3 x 3 box (averaging) filter applied to an image containing a bright 90-valued square; the hard edges are smoothed into 10/20/30/60/90 ramps] Credit: S. Seitz
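
The slide's example can be reproduced in a few lines of PyTorch; the image below is a rough stand-in for the 90-valued square on the slide, not an exact copy:

```python
import torch
import torch.nn.functional as F

# Smoothing with a 3x3 box (averaging) filter, implemented via conv2d.
# True convolution flips the kernel, but for a symmetric box filter the
# result is the same as correlation/filtering.
image = torch.zeros(1, 1, 10, 10)
image[0, 0, 2:7, 3:8] = 90.0             # a bright 90-valued square

box = torch.full((1, 1, 3, 3), 1.0 / 9)  # 3x3 kernel of ones, normalized
smoothed = F.conv2d(image, box, padding=1)
print(smoothed[0, 0])                    # edges blur into 10/20/30/60/90 ramps
```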

  15. Stride = 3

  16. Stride = 3

  17. Stride = 3

  18. Stride = 3

  19. Stride = 3

  20. 2D spatial filters

  21. k-D spatial filters

  22. Dimensions of convolution [figure: an image entering a convolutional layer] Slide: Lazebnik

  23. Dimensions of convolution [figure: learned weights applied to the image to produce a feature map] Slide: Lazebnik

  24. Dimensions of convolution [figure: learned weights applied to the image to produce a feature map] Slide: Lazebnik

  25. Dimensions of convolution [figure: the feature maps feed into the next layer] Slide: Lazebnik

  26. Dimensions of convolution Stride s

  27. Number of weights

  28. Number of weights
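
Slides 26-28 present the output-size and weight-count results as figures; for reference, the standard formulas and a quick PyTorch check follow (the AlexNet-style numbers are only an example):

```python
import torch.nn as nn

# Output spatial size of a convolution:
#   H_out = floor((H_in + 2*P - K) / S) + 1   (same for width)
# Number of parameters in the layer:
#   C_out * (K * K * C_in) weights + C_out biases
def conv_output_size(h_in, kernel, stride, padding=0):
    return (h_in + 2 * padding - kernel) // stride + 1

conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
print(conv_output_size(227, kernel=11, stride=4))     # 55
print(sum(p.numel() for p in conv.parameters()))      # 96*11*11*3 + 96 = 34944
```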

  29. Convolutional Neural Networks [Slides credit: Efstratios Gavves]

  30. Local connectivity

  31. Pooling operations • Aggregate multiple values into a single value

  32. Pooling operations • Aggregate multiple values into a single value • Invariance to small transformations • Keep only the most important information for the next layer • Reduce the size of the next layer: fewer parameters, faster computation • Observe a larger receptive field in the next layer • Hierarchically extract more abstract features
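
A minimal PyTorch illustration of pooling; the 2 x 2 window and tensor sizes are illustrative:

```python
import torch
import torch.nn as nn

# Max pooling aggregates each 2x2 window into its maximum value,
# halving the spatial resolution (fewer values to process) and enlarging
# the receptive field seen by the next layer.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 16, 32, 32)    # 16 feature maps of size 32x32
y = pool(x)                       # shape: (1, 16, 16, 16)
print(y.shape)
```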

  33. Yann LeCun’s MNIST CNN architecture

  34. AlexNet for ImageNet Layers - Kernel sizes - Strides - # channels - # kernels - Max pooling

  35. [Krizhevsky et al. 2012] AlexNet diagram (simplified) Input size 227 x 227 x 3. Conv 1: 11 x 11 x 3, stride 4, 96 filters; 3 x 3 max pooling, stride 2. Conv 2: 5 x 5 x 48, stride 1, 256 filters; 3 x 3 max pooling, stride 2. Conv 3: 3 x 3 x 256, stride 1, 384 filters. Conv 4: 3 x 3 x 192, stride 1, 384 filters. Conv 5: 3 x 3 x 192, stride 1, 256 filters.
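
As a hedged pointer (not part of the lecture), a close single-GPU variant of this architecture can be inspected directly in torchvision, assuming a reasonably recent torchvision version; its channel counts differ slightly from the two-GPU figures above:

```python
import torch
from torchvision import models

# torchvision's AlexNet follows the same kernel-size / stride pattern as the
# slide, though the per-layer channel counts are the merged single-GPU ones.
alexnet = models.alexnet(weights=None)   # random weights; pretrained also available
print(alexnet.features)                  # Conv 11x11/4 -> pool -> Conv 5x5 -> pool -> 3x3 convs

x = torch.randn(1, 3, 227, 227)
print(alexnet(x).shape)                  # torch.Size([1, 1000])
```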

  36. Learning Neural Networks

  37. Practice I: Setting Hyperparameters (Slides: Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 2, April 5, 2018)

  38. Practice I: Setting Hyperparameters Idea #1: Choose hyperparameters that work best on the data (Your Dataset)

  39. Practice I: Setting Hyperparameters Idea #1: Choose hyperparameters that work best on the data. BAD: a big network always works perfectly on the training data

  40. Practice I: Setting Hyperparameters Idea #2: Split data into train and test; choose hyperparameters that work best on the test data

  41. Practice I: Setting Hyperparameters Idea #2: Split data into train and test; choose hyperparameters that work best on the test data. BAD: no idea how the algorithm will perform on new data

  42. Practice I: Setting Hyperparameters Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test. Better! (Slides: Fei-Fei Li, Justin Johnson & Serena Yeung)
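
A minimal, self-contained sketch of Idea #3 on synthetic data; the linear model, the 700/150/150 split, the five epochs, and the candidate learning rates are all illustrative choices, not from the lecture:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

torch.manual_seed(0)
X = torch.randn(1000, 20)
y = (X[:, 0] > 0).long()                 # a trivially learnable label
data = TensorDataset(X, y)
train_set, val_set, test_set = random_split(data, [700, 150, 150])

def fit_and_score(lr, train_split, eval_split):
    """Train a small model with this learning rate, return eval accuracy."""
    model = nn.Linear(20, 2)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(5):                   # a few epochs are enough here
        for xb, yb in DataLoader(train_split, batch_size=64, shuffle=True):
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in DataLoader(eval_split, batch_size=256):
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
            total += len(yb)
    return correct / total

# Choose the hyperparameter on the validation split...
best_lr = max([1e-1, 1e-2, 1e-3],
              key=lambda lr: fit_and_score(lr, train_set, val_set))
# ...and report final performance on the test split exactly once.
print(best_lr, fit_and_score(best_lr, train_set, test_set))
```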

  43. Practice II: Select Optimizer (Slides: Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 2, April 5, 2018)

  44. Stochastic gradient descent • Gradient from the entire training set: for large training data, the gradient computation takes a long time, leading to "slow learning" • Instead, consider a mini-batch with m samples • If the sample size is large enough, its properties approximate those of the full dataset
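
A minimal sketch of one epoch of mini-batch SGD in PyTorch; the model, synthetic data, batch size m = 32, and learning rate are illustrative choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Each update uses the gradient of the loss on a small batch of m samples
# as a cheap estimate of the full-dataset gradient.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))
loader = DataLoader(data, batch_size=32, shuffle=True)   # m = 32 per step

for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)   # loss on the mini-batch only
    loss.backward()                 # gradient estimated from m samples
    optimizer.step()                # one parameter update per mini-batch
```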

  45. Stochastic gradient descent

  46. Stochastic gradient descent

  47. Stochastic gradient descent

  48. Stochastic gradient descent Build up velocity as a running mean of gradients.
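
Slide 48 states the momentum idea only in words; below is a minimal hand-written sketch of that update on a toy objective, with the usual velocity/decay names and a 0.9 coefficient chosen purely for illustration:

```python
import torch

# Momentum SGD by hand: the velocity is a running (exponentially decayed)
# average of past gradients, and the parameters move along the velocity.
w = torch.randn(5, requires_grad=True)
velocity = torch.zeros_like(w)
lr, mu = 0.01, 0.9                      # 0.9 is a common default, illustrative here

for _ in range(100):
    loss = ((w - 3.0) ** 2).sum()       # a toy quadratic objective
    loss.backward()
    with torch.no_grad():
        velocity = mu * velocity + w.grad   # accumulate gradient history
        w -= lr * velocity                  # step along the velocity
        w.grad.zero_()
print(w)                                 # close to 3.0
```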

  49. Many variations of using momentum • In PyTorch, you can manually specify the momentum of SGD • Or you can use other optimization algorithms with "adaptive" momentum, e.g., Adam • Adam: Adaptive Moment Estimation • Empirically, Adam usually converges faster, but SGD tends to find local minima that generalize better
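
A minimal PyTorch sketch of the two options mentioned above, using a stand-in linear model just to have parameters to optimize; the learning rates are illustrative defaults:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in model

# Option 1: plain SGD with manually specified momentum.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Option 2: Adam (adaptive moment estimation), which keeps per-parameter
# running estimates of the first and second moments of the gradient.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```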

  50. Practice III: Data Augmentation (Slides: Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 2, April 5, 2018)

  51. Horizontal flips

  52. Random crops and scales

  53. Color jitter

  54. Color jitter • Can do a lot more: rotation, shear, non-rigid deformations, motion blur, lens distortions, …
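
A hedged torchvision sketch covering the augmentations in slides 51-54 (horizontal flips, random crops and scales, color jitter); the specific parameter values are illustrative, not prescribed by the lecture:

```python
from torchvision import transforms

# Typical training-time augmentation pipeline applied on the fly to each image.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + rescale
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# Used when constructing the training dataset, e.g.:
# datasets.ImageFolder(root, transform=train_transform)
```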

  55. Exam • Linear algebra: rank, null space, range, invertibility, eigendecomposition, SVD, pseudoinverse, basic matrix calculus • Optimization: least squares, low-rank approximation, statistical interpretation of PCA • Image formation: diffuse/specular reflection, Lambertian lighting equation • Filtering: linear filters, filtering vs. convolution, properties of filters, filter banks, uses of filters, median filter • Statistics: bias, variance, bias-variance tradeoff, overfitting, underfitting • Neural networks: linear classifier, softmax, why a linear classifier is insufficient, activation functions, feed-forward pass, universality theorem, what backpropagation does, stochastic gradient descent, concepts in neural networks, why CNNs, concepts in CNNs, how to set hyperparameters, momentum in SGD, data augmentation
