Neural Networks CE417: Introduction to Artificial Intelligence


  1. Neural Networks. CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Fall 2019, Soleymani. Some slides are based on Anca Dragan’s slides (CS188, UC Berkeley) and some are adapted from Bhiksha Raj (11-785, CMU, 2019).

  2. 2

  3. McCulloch-Pitts neuron: binary threshold. Inputs $x_1, \ldots, x_N$ enter with weights $w_1, \ldots, w_N$; $\theta$ is the activation threshold. The unit outputs $y = 1$ if $z = \sum_i w_i x_i \geq \theta$ and $y = 0$ if $z < \theta$. Equivalent to a unit with a constant input $1$ and bias $b = -\theta$, compared against $0$.
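
A minimal sketch (not from the slides) of the two equivalent formulations of this unit; variable names and the example values are illustrative.

```python
import numpy as np

def mcp_threshold(x, w, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted sum reaches the threshold theta."""
    z = np.dot(w, x)
    return 1 if z >= theta else 0

def mcp_bias(x, w, b):
    """Equivalent formulation with bias b = -theta and the comparison moved to 0."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, 0.5, 0.5])
theta = 1.0
assert mcp_threshold(x, w, theta) == mcp_bias(x, w, b=-theta)  # same decision
```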

  4. Neural nets and the brain. [Figure: a single unit summing weighted inputs $w_1 x_1 + \cdots + w_N x_N + b$.] β€’ Neural nets are composed of networks of computational models of neurons called perceptrons.

  5. The perceptron. $y = 1$ if $\sum_i w_i x_i \geq \theta$, else $0$; equivalently, with bias $b = -\theta$. β€’ A threshold unit – β€œFires” if the weighted sum of inputs exceeds a threshold – Electrical engineers call this a threshold gate β€’ A basic unit of Boolean circuits
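
As an illustration of the β€œbasic unit of Boolean circuits” point, here is a small sketch showing that suitable weights and thresholds realize AND and OR gates; the particular weights and thresholds are chosen for illustration, not taken from the slides.

```python
import numpy as np

def threshold_unit(x, w, theta):
    # fires iff the weighted input sum reaches the threshold
    return int(np.dot(w, x) >= theta)

# AND: both inputs must be on, so threshold 2 with unit weights.
# OR: a single active input suffices, so threshold 1.
for a in (0, 1):
    for b in (0, 1):
        x = np.array([a, b])
        assert threshold_unit(x, np.array([1, 1]), theta=2) == (a and b)
        assert threshold_unit(x, np.array([1, 1]), theta=1) == (a or b)
```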

  6. Perceptron. $z = \sum_j w_j x_j + b$. [Figure: the decision boundary in the $(x_1, x_2)$ plane.] β€’ Learn this function β€’ A step function across a hyperplane

  7. Learning the perceptron. $y = 1$ if $\sum_j w_j x_j + b \geq 0$, and $0$ otherwise. β€’ Given a number of input-output pairs, learn the weights and bias – Learn $W = [w_1, \ldots, w_N]$ and $b$, given several $(X, y)$ pairs

  8. Restating the perceptron. Restating the perceptron equation by adding another dimension to $X$: $y = 1$ if $\sum_{i=1}^{d+1} w_i x_i \geq 0$, and $0$ otherwise, where $x_{d+1} = 1$ and $w_{d+1} = b$. β€’ Note that the boundary $\sum_{i=1}^{d+1} w_i x_i = 0$ is now a hyperplane through the origin.
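
A small sketch of this restated form: appending a constant-1 feature folds the bias into the weight vector, so the decision boundary passes through the origin of the augmented space. The array shapes and values below are illustrative.

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [2.0, -0.3]])          # two examples, d = 2 features
w, b = np.array([1.0, -2.0]), 0.5

# Augment: x_{d+1} = 1 and w_{d+1} = b.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

assert np.allclose(X @ w + b, X_aug @ w_aug)   # identical activations
```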

  9. The Perceptron Problem. $\sum_{i=1}^{d+1} w_i x_i = 0$ β€’ Find the hyperplane that perfectly separates the two groups of points.

  10. Perceptron Algorithm: Summary β€’ Cycle through the training instances β€’ Only update $\mathbf{w}$ on misclassified instances β€’ If an instance is misclassified: – If the instance is positive class: $\mathbf{w} = \mathbf{w} + \mathbf{x}$ – If the instance is negative class: $\mathbf{w} = \mathbf{w} - \mathbf{x}$
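
A compact sketch of this loop, assuming labels in {-1, +1} and the bias folded into the weights as on the previous slide; the epoch cap is an added assumption so the loop terminates even on non-separable data, and the toy dataset is illustrative.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """X: (n, d) inputs with a trailing constant-1 column; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):          # cycle through the training instances
            if y_i * (w @ x_i) <= 0:        # misclassified (or on the boundary)
                w += y_i * x_i              # +x for positive class, -x for negative
                mistakes += 1
        if mistakes == 0:                   # every instance classified correctly
            break
    return w

# Tiny linearly separable example (last column is the constant-1 bias feature).
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))   # expected: [ 1.  1. -1. -1.]
```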

  11. Training of Single Layer. β€’ Weight update for a training pair $(\mathbf{x}^{(n)}, y^{(n)})$: $\mathbf{w}^{k+1} = \mathbf{w}^k - \eta \nabla E^{(n)}(\mathbf{w}^k)$ β€’ Perceptron: if $\mathrm{sign}(\mathbf{w}^\top \mathbf{x}^{(n)}) \neq y^{(n)}$, i.e. the instance is misclassified, then $E^{(n)}(\mathbf{w}) = -\mathbf{w}^\top \mathbf{x}^{(n)} y^{(n)}$ and $\eta \nabla E^{(n)}(\mathbf{w}^k) = -\eta\, \mathbf{x}^{(n)} y^{(n)}$ β€’ ADALINE (Widrow-Hoff, LMS, or delta rule): $E^{(n)}(\mathbf{w}) = \big(y^{(n)} - \mathbf{w}^\top \mathbf{x}^{(n)}\big)^2$ and $\eta \nabla E^{(n)}(\mathbf{w}^k) = -\eta\, \big(y^{(n)} - \mathbf{w}^\top \mathbf{x}^{(n)}\big)\, \mathbf{x}^{(n)}$
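
A side-by-side sketch of the two per-example updates above; the learning rates are illustrative, labels are assumed to be in {-1, +1}, and the factor of 2 from the squared error is absorbed into the learning rate, as on the slide.

```python
import numpy as np

def perceptron_step(w, x, y, eta=1.0):
    # update only when sign(w.x) disagrees with the label
    if np.sign(w @ x) != y:
        w = w + eta * y * x
    return w

def adaline_step(w, x, y, eta=0.1):
    # delta rule / Widrow-Hoff / LMS: move toward the target of the linear output
    error = y - w @ x
    return w + eta * error * x
```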

  12. Perceptron vs. Delta Rule β€’ Perceptron learning rule: guaranteed to succeed if the training examples are linearly separable β€’ Delta rule: guaranteed to converge to the hypothesis with the minimum squared error; can also be used for regression problems

  13. Reminder: Linear Classifiers β€’ Inputs are feature values β€’ Each feature has a weight β€’ The sum is the activation: $\sum_i w_i f_i$ β€’ If the activation is: – Positive, output +1 – Negative, output -1 [Figure: features $f_1, f_2, f_3$ weighted by $w_1, w_2, w_3$, summed, and compared to 0.]

  14. The β€œsoft” perceptron (logistic). $z = \sum_j w_j x_j - \theta$ (bias $b = -\theta$), $y = \frac{1}{1 + \exp(-z)}$. β€’ A β€œsquashing” function instead of a threshold at the output – The sigmoid β€œactivation” replaces the threshold β€’ Activation: the function that acts on the weighted combination of inputs (and threshold)

  15. Sigmoid neurons β€’ These give a real-valued output that is a smooth and bounded function of their total input β€’ Typically they use the logistic function β€’ They have nice derivatives

  16. Other β€œactivations”: sigmoid $\frac{1}{1 + \exp(-z)}$, $\tanh(z)$, $\log(1 + e^z)$. β€’ Does not always have to be a squashing function – We will hear more about activations later β€’ We will continue to assume a β€œthreshold” activation in this lecture
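
A quick sketch of the three activations listed on this slide; the stable formulation of the last one via logaddexp is an implementation choice, not something the slide specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def log1p_exp(z):
    # log(1 + e^z), computed as logaddexp(0, z) to avoid overflow for large z
    return np.logaddexp(0.0, z)

z = np.linspace(-5, 5, 5)
print(sigmoid(z), tanh(z), log1p_exp(z))
```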

  17. How to get probabilistic decisions? β€’ Activation: $z = \mathbf{w}^\top \mathbf{x}$ β€’ If $z = \mathbf{w}^\top \mathbf{x}$ is very positive β†’ want the probability to go to 1 β€’ If $z = \mathbf{w}^\top \mathbf{x}$ is very negative β†’ want the probability to go to 0 β€’ Sigmoid function: $\phi(z) = \frac{1}{1 + e^{-z}}$

  18. Best w? β€’ Maximum likelihood estimation: $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ with $P(y^{(i)} = +1 \mid x^{(i)}; w) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}^{(i)}}}$ and $P(y^{(i)} = -1 \mid x^{(i)}; w) = 1 - \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}^{(i)}}}$ = Logistic Regression
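
A sketch of this log likelihood for labels in {-1, +1}; it relies on the identity $P(y \mid x; w) = \sigma(y\, \mathbf{w}^\top \mathbf{x})$, which follows from the two cases above, and the logaddexp form is an added numerical-stability choice.

```python
import numpy as np

def log_likelihood(w, X, y):
    """Binary logistic regression log likelihood, labels y in {-1, +1}.

    P(y | x; w) = sigmoid(y * w.x), so
    ll(w) = sum_i log sigmoid(y_i * w.x_i) = -sum_i log(1 + exp(-y_i * w.x_i)).
    """
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))   # numerically stable form
```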

  19. Multiclass Logistic Regression β€’ Multi-class linear classification – A weight vector for each class: $\mathbf{w}_y$ – Score (activation) of a class $y$: $\mathbf{w}_y^\top \mathbf{x}$ – Prediction: the class with the highest score wins β€’ How to make the scores into probabilities? $z_1, z_2, z_3 \rightarrow \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$ (original activations β†’ softmax activations)
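
A small sketch of the softmax map from scores to probabilities; subtracting the maximum score is an added safeguard against overflow and does not change the result, since the shift cancels in the ratio.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                 # shift for numerical stability; ratios are unchanged
    e = np.exp(z)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))     # sums to 1, ordered like the original activations
```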

  20. Best w? β€’ Maximum likelihood estimation: $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ with $P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{\mathbf{w}_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{\mathbf{w}_y \cdot f(x^{(i)})}}$ = Multi-Class Logistic Regression

  21. Batch Gradient Ascent on the Log Likelihood Objective. $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w) = g(w)$ β€’ init $w$ β€’ for iter = 1, 2, …: $w \leftarrow w + \alpha \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$
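
A sketch of batch gradient ascent for the binary case with labels in {-1, +1}, using $\nabla_w \log \sigma(y\, \mathbf{w}^\top \mathbf{x}) = (1 - \sigma(y\, \mathbf{w}^\top \mathbf{x}))\, y\, \mathbf{x}$; the step size, iteration count, and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.1, iters=500):
    """Maximize sum_i log P(y_i | x_i; w) for binary labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # d/dw log sigmoid(y_i w.x_i) = (1 - sigmoid(y_i w.x_i)) * y_i * x_i
        grad = X.T @ ((1.0 - sigmoid(margins)) * y)
        w = w + alpha * grad          # full-batch ascent step
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = batch_gradient_ascent(X, y)
print(np.sign(X @ w))                 # should match y on this toy set
```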

  22. Logistic regression: multi-class. Gradient update for the weight vector of class $c$: $\mathbf{w}_c^{t+1} = \mathbf{w}_c^t - \eta\, \nabla_{\mathbf{w}_c} J(\mathbf{W}^t)$, where $\nabla_{\mathbf{w}_c} J(\mathbf{W}) = \sum_{i=1}^n \big(P(c \mid \mathbf{x}^{(i)}; \mathbf{W}) - y_c^{(i)}\big)\, \mathbf{x}^{(i)}$
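
A sketch of this multi-class update assuming one-hot targets and a weight matrix W with one row per class; the shapes and step size are illustrative.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_lr_step(W, X, Y, eta=0.1):
    """One gradient step. W: (C, d); X: (n, d); Y: (n, C) one-hot targets."""
    P = softmax_rows(X @ W.T)             # P[i, c] = P(c | x_i; W)
    grad = (P - Y).T @ X                  # sum_i (P(c | x_i) - y_c^(i)) x_i, per class c
    return W - eta * grad                 # descend on the cost J
```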

  23. Stochastic Gradient Ascent on the Log Likelihood Objective. $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ β€’ Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one β€’ init $w$ β€’ for iter = 1, 2, …: pick random $j$; $w \leftarrow w + \alpha\, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$
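
The same objective with a stochastic step (one randomly chosen example per update), sketched for the binary case using the per-example gradient from the batch version above; step size, iteration count, and seed are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ascent(X, y, alpha=0.1, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        j = rng.integers(len(y))                               # pick random j
        grad_j = (1.0 - sigmoid(y[j] * (X[j] @ w))) * y[j] * X[j]
        w = w + alpha * grad_j                                 # incorporate immediately
    return w
```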

  24. Mini-Batch Gradient Ascent on the Log Likelihood Objective. $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ β€’ Observation: the gradient over a small set of training examples (= a mini-batch) can also be computed, so might as well do that instead of using a single example β€’ init $w$ β€’ for iter = 1, 2, …: pick a random subset of training examples $J$; $w \leftarrow w + \alpha \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$
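
And a sketch of the mini-batch variant, which sums the gradient over a small random subset J each step; the batch size of 2 and other settings are illustrative (the subset draw assumes batch_size does not exceed the dataset size).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_ascent(X, y, alpha=0.1, iters=500, batch_size=2, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        J = rng.choice(len(y), size=batch_size, replace=False)   # random subset J
        margins = y[J] * (X[J] @ w)
        grad = X[J].T @ ((1.0 - sigmoid(margins)) * y[J])        # sum over j in J
        w = w + alpha * grad
    return w
```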

  25. Networks with hidden units β€’ Networks without hidden units are very limited in the input-output mappings they can learn to model β€’ A simple function such as XOR cannot be modeled with a single-layer network β€’ More layers of linear units do not help: it is still linear β€’ Fixed output non-linearities are not enough β€’ We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
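
A hand-weighted sketch of the XOR point: one hidden layer of threshold units computing OR and AND of the inputs, followed by an output unit that fires when OR is on and AND is off, realizes XOR, which no single threshold unit can. The weights here are chosen by hand for illustration, not learned.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2), thresholds folded into biases.
    h = step(np.array([[1, 1], [1, 1]]) @ x + np.array([-1, -2]))
    # Output: fires iff OR is on and AND is off, i.e. exactly one input is 1.
    return int(step(np.array([1, -1]) @ h + np.array([-1]))[0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table
```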

  26. Neural Networks

  27. The multi-layer perceptron

  28. Neural Networks Properties β€’ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy. β€’ Practical considerations – Can be seen as learning the features – Large number of neurons – Danger of overfitting (hence early stopping!)

  29. Universal Function Approximation Theorem* β€’ In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x). Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”; Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”; Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

  30. Universal Function Approximation Theorem* Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”; Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”; Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

  31. Expressiveness of neural networks β€’ All Boolean functions can be represented by a network with a single hidden layer – But it might require exponentially many (in the number of inputs) hidden units β€’ Continuous functions: – Any continuous function on a compact domain can be approximated to arbitrary accuracy by a network with one hidden layer [Cybenko 1989] – Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
