

  1. Announcements
  • Class size is 170.
  • Matlab Grader homeworks 1 and 2 (of fewer than 9 homeworks total) are due tonight, 22 April; binary graded. For HW1, please keep the word count under 100. So far 167, 165, and 164 students have completed them. (If you have not done it, talk to me or a TA!)
  • Homework 3 (released ~tomorrow) is due ~5 May.
  • The Jupyter “GPU” homework will be released Wednesday; due 10 May.
  • Projects: 27 groups formed. Look at Piazza for help; the guidelines are on Piazza. Proposals are due 5 May; the TAs and Peter can approve them.
  • Today: Stanford CNN lecture 9, kernel methods (Bishop Ch. 6), linear models for classification, backpropagation.
  • Monday: Stanford CNN lecture 10, kernel methods (Bishop Ch. 6), SVM. Play with the TensorFlow playground before class: http://playground.tensorflow.org

  2. Projects
  • Groups of 3–4 people preferred.
  • Deliverables: poster, report, and main code (plus proposal and midterm slide).
  • Topics: your own, or choose from the suggested topics. Some are physics inspired.
  • April 26: groups due to the TAs (if you don’t have a group, ask on Piazza and we can help). TAs will construct groups after that.
  • May 5: proposal due. The TAs and Peter can approve.
  • Proposal: one page with title, a large paragraph, data, web links, and references.
  • Something physical.

  3. Datasets
  • 80% data preparation, 20% ML.
  • Kaggle: https://inclass.kaggle.com/datasets and https://www.kaggle.com
  • UCI datasets: http://archive.ics.uci.edu/ml/index.php
  • Past projects…
  • Ocean acoustics data

  4. In 2017, many groups chose the source-localization topic; two were CNN projects.

  5. 2018: Best reports: 6, 10, 12, 15. Interesting: 19, 47. Poor: 17. Working alone is hard: 20.

  6. Bayes and Softmax (Bishop p. 198)
  • Bayes’ rule: p(y|x) = p(x|y) p(y) / p(x), with p(x) = Σ_{y∈Y} p(x, y).
  • Parametric approach, linear classifier (Stanford CS231n, Lecture 2): f(x, W) = Wx + b, where x is the image flattened into a 3072×1 array (32×32×3 numbers), W is 10×3072, b is 10×1, and f(x, W) is the 10×1 vector of class scores.
  • Classification into N classes: the parameters (weights) are p(x|C_n) p(C_n), and
    p(C_n|x) = p(x|C_n) p(C_n) / Σ_{k=1}^N p(x|C_k) p(C_k) = exp(a_n) / Σ_{k=1}^N exp(a_k),  with a_n = ln( p(x|C_n) p(C_n) ).
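
A shape-level sketch of the linear classifier and softmax on this slide (the random numbers stand in for a real image and learned weights; this is my own illustration, not code from the lecture):

    import numpy as np

    x = np.random.rand(3072)                 # a 32x32x3 image flattened to 3072 numbers
    W = np.random.randn(10, 3072) * 0.01     # weights: 10 classes x 3072 inputs
    b = np.zeros(10)

    scores = W @ x + b                       # f(x, W) = Wx + b, 10 class scores
    a = scores                               # a_n plays the role of ln(p(x|C_n) p(C_n))
    p = np.exp(a - a.max()) / np.sum(np.exp(a - a.max()))   # softmax: p(C_n|x)
    print(p.sum())                           # probabilities sum to 1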

  7. Softmax to Logistic Regression (Bishop p. 198)
  • For two classes the softmax reduces to the logistic sigmoid:
    p(C_1|x) = p(x|C_1) p(C_1) / Σ_{k=1}^2 p(x|C_k) p(C_k) = exp(a_1) / Σ_{k=1}^2 exp(a_k) = 1 / (1 + exp(−a)),
    with a_k = ln( p(x|C_k) p(C_k) ) and a = a_1 − a_2 = ln[ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ].
  • So for binary classification we should use logistic regression: p(C_1|x) = 1 / (1 + exp(−a)).
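
A two-line numerical check (my own, not from the slide) that the two-class softmax equals the logistic sigmoid of a = a_1 − a_2:

    import numpy as np

    a1, a2 = 1.3, -0.4                            # arbitrary log-scores
    softmax_c1 = np.exp(a1) / (np.exp(a1) + np.exp(a2))
    sigmoid = 1.0 / (1.0 + np.exp(-(a1 - a2)))
    print(softmax_c1, sigmoid)                    # identical values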

  8. The Kullback–Leibler Divergence: p is the true distribution, q is the approximating distribution.

  9. Cross entropy
  • KL divergence (p true, q approximating):
    D_KL(p||q) = Σ_i p_i ln(p_i) − Σ_i p_i ln(q_i) = −H(p) + H(p, q)
  • Cross entropy:
    H(p, q) = −Σ_i p_i ln(q_i) = H(p) + D_KL(p||q)
  • Implementations: tf.keras.losses.CategoricalCrossentropy(), tf.losses.sparse_softmax_cross_entropy, torch.nn.CrossEntropyLoss()
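
A minimal numerical check of the relation above (a sketch; the distribution values are made up for illustration):

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])            # true distribution
    q = np.array([0.5, 0.3, 0.2])            # approximating distribution

    entropy_p = -np.sum(p * np.log(p))       # H(p)
    cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
    kl = np.sum(p * np.log(p / q))           # D_KL(p||q)

    # cross entropy = entropy of p + KL divergence
    assert np.isclose(cross_entropy, entropy_p + kl)
    print(cross_entropy, entropy_p, kl)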

  10. Cross-entropy or “softmax” function for multi-class classification
  • The output units use a non-local non-linearity (softmax over the logits z):
    y_i = exp(z_i) / Σ_j exp(z_j),   with   ∂y_i/∂z_i = y_i (1 − y_i)
  • The natural cost function is the negative log probability of the right answer:
    E = −Σ_j t_j ln(y_j)
  • Combining the two gives a simple gradient:
    ∂E/∂z_i = Σ_j (∂E/∂y_j)(∂y_j/∂z_i) = y_i − t_i
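
A small sketch (my own illustration, not from the slide) verifying the gradient ∂E/∂z = y − t with a finite-difference check:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())              # subtract max for numerical stability
        return e / e.sum()

    def cross_entropy(z, t):
        return -np.sum(t * np.log(softmax(z)))

    z = np.array([1.0, -0.5, 2.0])           # logits (made-up values)
    t = np.array([0.0, 0.0, 1.0])            # one-hot target

    analytic = softmax(z) - t                # the gradient y - t from the slide

    eps = 1e-6                               # central finite differences
    numeric = np.array([(cross_entropy(z + eps * np.eye(3)[i], t)
                         - cross_entropy(z - eps * np.eye(3)[i], t)) / (2 * eps)
                        for i in range(3)])
    assert np.allclose(analytic, numeric, atol=1e-5)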

  11. Reminder: 1x1 convolutions (Stanford CS231n, Lecture 9)
  • A 1x1 CONV with 32 filters applied to a 56x56x64 input preserves the spatial dimensions (56x56) but reduces the depth to 32: it projects the depth to a lower dimension (a combination of the feature maps).
  • Each filter has size 1x1x64 and performs a 64-dimensional dot product at every spatial location.
  • Summary: CNN architectures. Case studies: AlexNet, VGG, GoogLeNet, ResNet. Also: NiN (Network in Network), DenseNet, Wide ResNet, FractalNet, ResNeXT, SqueezeNet, Stochastic Depth.
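
A quick shape check of the 1x1 convolution described above, sketched in PyTorch (assumed available; not part of the slides):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 56, 56)           # one 56x56 input with 64 channels
    conv1x1 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)

    y = conv1x1(x)
    print(y.shape)                           # torch.Size([1, 32, 56, 56]): spatial dims kept, depth 64 -> 32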

  12. Case Study: ResNet [He et al., 2015] (Stanford CS231n, Lecture 9)
  • Very deep networks using residual connections: a residual block computes F(x) through conv → relu → conv, then adds the identity input x, so the block outputs F(x) + x (followed by relu).
  • 152-layer model for ImageNet.
  • ILSVRC’15 classification winner (3.57% top-5 error).
  • Swept all classification and detection competitions in ILSVRC’15 and COCO’15!
  • Architecture (input to output): 7x7 conv 64 /2, pool, stacks of 3x3 conv layers (64, 128, … filters, with stride-2 downsampling between stages), pool, FC 1000, softmax.

  13. Case Study: ResNet [He et al., 2015]
  • What happens when we continue stacking deeper layers on a “plain” convolutional neural network? Comparing a 20-layer and a 56-layer plain network, the 56-layer model has higher training error and higher test error (over the training iterations).
  • The deeper model performs worse, but it is not caused by overfitting: it is worse even on the training set.
  • Hypothesis: the problem is an optimization problem; deeper plain models are harder to optimize.

  14. Case Study: ResNet [He et al., 2015]
  • Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping H(x).
  • A “plain” block tries to learn H(x) directly; a residual block learns F(x) = H(x) − x and outputs H(x) = F(x) + x (conv → relu → conv gives F(x), then add the identity x and apply relu).
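
A minimal residual block sketched in PyTorch, following the diagram above (it assumes equal input/output channels so the identity shortcut needs no projection, and omits the batch norm used in the real ResNet; my own illustration, not code from the lecture):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Basic residual block: out = relu(residual(x) + x), residual = conv -> relu -> conv."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            out = F.relu(self.conv1(x))
            out = self.conv2(out)            # the residual mapping F(x)
            return F.relu(out + x)           # F(x) + x, then relu

    block = ResidualBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])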

  15. Kernels
  • Kernel function: k(x, x′) = φ(x)^T φ(x′)   (Bishop Eq. 6.1). The kernel is a symmetric function of its arguments.
  • Kernel trick: substitute the inner product of feature vectors with the kernel function.

  16. Kernels
  • We might want to consider something more complicated than a linear model.
  • Example 1: map [x(1), x(2)] → Φ([x(1), x(2)]) = [x(1)², x(2)², x(1)x(2)]. The information is unchanged, but now we have a linear classifier on the transformed points (input space → feature space; image by MIT OpenCourseWare).
  • With the kernel trick, we just need the kernel: k(a, b) = Φ(a)^T Φ(b), i.e. k(x, x′) = φ(x)^T φ(x′)   (Bishop Eq. 6.1), a symmetric function of its arguments.
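
A small numerical sketch of the kernel trick (my own example, not from the slides): with the feature map Φ(x) = [x(1)², x(2)², √2·x(1)x(2)] (the √2 factor, absent from the slide’s version, makes the identity exact), the inner product in feature space equals the simple kernel k(a, b) = (a·b)².

    import numpy as np

    def phi(x):
        # explicit feature map: [x1^2, x2^2, sqrt(2)*x1*x2]
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    def k(a, b):
        # kernel computed directly in input space
        return np.dot(a, b) ** 2

    a = np.array([1.0, 2.0])
    b = np.array([3.0, -1.0])

    print(np.dot(phi(a), phi(b)))   # inner product in feature space
    print(k(a, b))                  # same value, without ever forming phi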

  17. Basis expansion

  18. Gaussian Process (Bishop 6.4, Murphy Ch. 15)
  • Observation model: t_n = y_n + ε_n
  • Prior over functions: f(x) ∼ GP( m(x), κ(x, x′) )
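
A minimal GP-regression sketch under these definitions (the RBF kernel, noise level, and toy data are my own choices for illustration, not the lecture’s code):

    import numpy as np

    def rbf(X1, X2, length_scale=1.0):
        # squared-exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 l^2))
        d2 = (X1[:, None] - X2[None, :]) ** 2
        return np.exp(-0.5 * d2 / length_scale**2)

    # training data t_n = y_n + eps_n (toy 1-D example)
    X = np.linspace(0, 5, 8)
    t = np.sin(X) + 0.1 * np.random.randn(8)
    Xs = np.linspace(0, 5, 100)              # test inputs

    noise = 0.1**2
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)

    # GP predictive mean: m(x*) = k(x*, X) [K + sigma^2 I]^{-1} t
    mean = Ks.T @ np.linalg.solve(K, t)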

  19. Dual representation (Bishop Sec. 6.2)
  • Primal problem: min_w E(w), with
    E(w) = (1/2) Σ_n ( w^T φ(x_n) − t_n )² + (λ/2) ||w||² = (1/2) ||Φw − t||² + (λ/2) ||w||²
  • Solution:
    w = (Φ^T Φ + λ I_M)^{−1} Φ^T t = Φ^T (ΦΦ^T + λ I_N)^{−1} t = Φ^T (K + λ I_N)^{−1} t = Φ^T a
    The kernel (Gram) matrix is K = ΦΦ^T.
  • Dual representation: min_a E(a), with
    E(a) = (1/2) ||Ka − t||² + (λ/2) a^T K a
  • a is found by inverting an N×N matrix; w is found by inverting an M×M matrix.
  • Only kernels, no feature vectors.

  20. Dual representation (Bishop Sec. 6.2)
  • Dual representation: min_a E(a), with
    E(a) = (1/2) ||Ka − t||² + (λ/2) a^T K a
  • Prediction:
    y(x) = w^T φ(x) = a^T Φ φ(x) = Σ_n a_n φ(x_n)^T φ(x) = Σ_n a_n k(x_n, x)
  • Often a is sparse (… support vector machines).
  • We don’t need to know x or φ(x). Just the kernel.
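
A kernel-ridge sketch of the dual solution and prediction above (the Gaussian kernel, toy data, and λ value are made up for illustration; this is my own sketch rather than the lecture’s code):

    import numpy as np

    def kernel(X1, X2):
        # Gaussian kernel k(x, x') = exp(-(x - x')^2 / (2 sigma^2)), sigma = 1
        d2 = (X1[:, None] - X2[None, :]) ** 2
        return np.exp(-0.5 * d2)

    X = np.linspace(-3, 3, 20)                        # training inputs
    t = np.tanh(X) + 0.05 * np.random.randn(20)       # noisy targets
    lam = 0.1                                         # regularization lambda

    K = kernel(X, X)                                  # N x N Gram matrix
    a = np.linalg.solve(K + lam * np.eye(len(X)), t)  # dual variables a = (K + lam I)^{-1} t

    Xnew = np.array([0.5, 2.0])
    y = kernel(Xnew, X) @ a                           # y(x) = sum_n a_n k(x_n, x)
    print(y)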

  21. Gaussian Kernels

  22. Gaussian Kernels

  23. Commonly used kernels
  • Polynomial: K(x, y) = (x·y + 1)^p
  • Gaussian radial basis function: K(x, y) = exp( −||x − y||² / (2σ²) )
  • Neural net: K(x, y) = tanh( κ x·y − δ )
  • In each case there are parameters that the user must choose.
  • For the neural network kernel, there is one “hidden unit” per support vector, so the process of fitting the maximum-margin hyperplane decides how many hidden units to use. Also, it may violate Mercer’s condition.
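
The three kernels above written out as functions (a sketch; the parameter values in the example are arbitrary user choices):

    import numpy as np

    def polynomial_kernel(x, y, p=2):
        return (np.dot(x, y) + 1.0) ** p

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

    def neural_net_kernel(x, y, kappa=1.0, delta=0.5):
        # tanh kernel; may violate Mercer's condition
        return np.tanh(kappa * np.dot(x, y) - delta)

    x = np.array([1.0, 0.5])
    y = np.array([0.2, -1.0])
    print(polynomial_kernel(x, y), gaussian_kernel(x, y), neural_net_kernel(x, y))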
