  1. Neural Networks
     Philipp Koehn, Artificial Intelligence lecture, 14 April 2020

  2. Supervised Learning
     ● Examples described by attribute values (Boolean, discrete, continuous, etc.)
     ● E.g., situations where I will/won't wait for a table:

       Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
       X1       T    F    F    T    Some  $$$    F     T    French   0–10   T
       X2       T    F    F    T    Full  $      F     F    Thai     30–60  F
       X3       F    T    F    F    Some  $      F     F    Burger   0–10   T
       X4       T    F    T    T    Full  $      F     F    Thai     10–30  T
       X5       T    F    T    F    Full  $$$    F     T    French   >60    F
       X6       F    T    F    T    Some  $$     T     T    Italian  0–10   T
       X7       F    T    F    F    None  $      T     F    Burger   0–10   F
       X8       F    F    F    T    Some  $$     T     T    Thai     0–10   T
       X9       F    T    T    F    Full  $      T     F    Burger   >60    F
       X10      T    T    T    T    Full  $$$    F     T    Italian  10–30  F
       X11      F    F    F    F    None  $      F     F    Thai     0–10   F
       X12      T    T    T    T    Full  $      F     F    Burger   30–60  T

     ● Classification of examples is positive (T) or negative (F)

  3. Naive Bayes Models
     ● Bayes rule: $p(C|A) = \frac{1}{Z}\, p(A|C)\, p(C)$
     ● Independence assumption: $p(A|C) = p(a_1, a_2, a_3, ..., a_n|C) \simeq \prod_i p(a_i|C)$
     ● Weights: $p(A|C) = \prod_i p(a_i|C)^{\lambda_i}$
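A minimal Python sketch of the unweighted naive Bayes decomposition on this slide, assuming made-up toy probabilities (the attribute names and numbers below are illustrative, not from the lecture's restaurant data).

```python
# Hypothetical toy probabilities: prior p(C) and per-attribute
# conditionals p(a_i | C) for a two-class problem.
prior = {"T": 0.5, "F": 0.5}
cond_prob = {
    "T": {"Pat=Full": 0.4, "Price=$": 0.5},
    "F": {"Pat=Full": 0.6, "Price=$": 0.3},
}

def naive_bayes_score(attributes, c):
    """Unnormalized p(C|A): p(C) * product over i of p(a_i|C)."""
    score = prior[c]
    for a in attributes:
        score *= cond_prob[c][a]
    return score

attributes = ["Pat=Full", "Price=$"]
scores = {c: naive_bayes_score(attributes, c) for c in prior}
Z = sum(scores.values())                       # normalization constant 1/Z
posterior = {c: s / Z for c, s in scores.items()}
print(posterior)                               # {'T': ~0.53, 'F': ~0.47}
```

The weighted variant on the slide would simply raise each factor to the power $\lambda_i$ before multiplying.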

  4. Naive Bayes Models
     ● Linear model: $p(A|C) = \prod_i p(a_i|C)^{\lambda_i} = \exp \sum_i \lambda_i \log p(a_i|C)$
     ● Probability distributions as features: $h_i(A,C) = \log p(a_i|C)$, $h_0(A,C) = \log p(C)$
     ● Linear model with features: $p(C|A) \propto \sum_i \lambda_i h_i(A,C)$

  5. Linear Model
     ● Weighted linear combination of feature values $h_j$ and weights $\lambda_j$ for example $d_i$:
       $\text{score}(\lambda, d_i) = \sum_j \lambda_j h_j(d_i)$
     ● Such models can be illustrated as a "network" (see the sketch below)
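A minimal Python sketch of this weighted linear combination; the weights and feature values are made-up illustrative numbers.

```python
def score(weights, features):
    """score(lambda, d_i) = sum over j of lambda_j * h_j(d_i)"""
    return sum(l * h for l, h in zip(weights, features))

weights  = [0.8, -1.2, 0.3]       # lambda_j (illustrative)
features = [1.0,  0.5, 2.0]       # h_j(d_i) (illustrative)
print(score(weights, features))   # 0.8 - 0.6 + 0.6 = 0.8
```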

  6. Limits of Linearity
     ● We can give each feature a weight
     ● But we cannot model more complex value relationships, e.g.:
       – any value in the range [0;5] is equally good
       – values over 8 are bad
       – higher than 10 is not worse

  7. XOR
     ● Linear models cannot model XOR
       [figure: the four XOR inputs on a 2×2 grid, with diagonally opposite corners labeled "good" and "bad"]

  8. Multiple Layers
     ● Add an intermediate ("hidden") layer of processing (each arrow is a weight)
     ● Have we gained anything so far?

  9. Non-Linearity
     ● Instead of computing a linear combination
       $\text{score}(\lambda, d_i) = \sum_j \lambda_j h_j(d_i)$
     ● Add a non-linear function
       $\text{score}(\lambda, d_i) = f\Big(\sum_j \lambda_j h_j(d_i)\Big)$
     ● Popular choices: $\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$ and $\tanh(x)$
       (sigmoid is also called the "logistic function")
       [figure: plots of tanh(x) and sigmoid(x)]
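A minimal Python sketch of the two activations named on this slide, assuming the linear score has already been computed; the score value is an illustrative number.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x}); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

s = 0.8                       # illustrative linear score
print(sigmoid(s))             # ~0.69
print(math.tanh(s))           # ~0.66, output lies in (-1, 1)
```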

  10. Deep Learning
      ● More layers = deep learning

  11. Example

  12. Simple Neural Network
      [figure: network with two input nodes, two hidden nodes, and one output node; the edge labels give the weights (3.7, 3.7 and 2.9, 2.9 into the hidden nodes, 4.5 and -5.2 into the output) and the bias-unit weights used in the computations on the following slides]
      ● One innovation: bias units (no inputs, always value 1)

  13. Sample Input
      [figure: the same network with input values 1.0 and 0.0]
      ● Try out two input values
      ● Hidden unit computation
        $\text{sigmoid}(1.0 \times 3.7 + 0.0 \times 3.7 + 1 \times -1.5) = \text{sigmoid}(2.2) = \frac{1}{1 + e^{-2.2}} = 0.90$
        $\text{sigmoid}(1.0 \times 2.9 + 0.0 \times 2.9 + 1 \times -4.5) = \text{sigmoid}(-1.6) = \frac{1}{1 + e^{1.6}} = 0.17$

  14. Computed Hidden
      [figure: the same network with the computed hidden values 0.90 and 0.17]
      ● Hidden unit computation
        $\text{sigmoid}(1.0 \times 3.7 + 0.0 \times 3.7 + 1 \times -1.5) = \text{sigmoid}(2.2) = \frac{1}{1 + e^{-2.2}} = 0.90$
        $\text{sigmoid}(1.0 \times 2.9 + 0.0 \times 2.9 + 1 \times -4.5) = \text{sigmoid}(-1.6) = \frac{1}{1 + e^{1.6}} = 0.17$

  15. Compute Output
      [figure: the same network with hidden values 0.90 and 0.17]
      ● Output unit computation
        $\text{sigmoid}(0.90 \times 4.5 + 0.17 \times -5.2 + 1 \times -2.0) = \text{sigmoid}(1.17) = \frac{1}{1 + e^{-1.17}} = 0.76$

  16. Computed Output
      [figure: the same network with the computed output value 0.76]
      ● Output unit computation
        $\text{sigmoid}(0.90 \times 4.5 + 0.17 \times -5.2 + 1 \times -2.0) = \text{sigmoid}(1.17) = \frac{1}{1 + e^{-1.17}} = 0.76$
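A minimal Python sketch of the full forward pass from slides 12–16, using the weights and biases that appear in the computations above; it reproduces the hidden values 0.90 and 0.17 and the output 0.76.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [1.0, 0.0]                        # sample input
w_hidden = [[3.7, 3.7], [2.9, 2.9]]   # rows: weights into h1, h2
b_hidden = [-1.5, -4.5]               # bias-unit weights into h1, h2
w_output = [4.5, -5.2]                # weights from h1, h2 to the output
b_output = -2.0                       # bias-unit weight into the output

# Hidden layer: h_j = sigmoid(sum_i w_ji * x_i + b_j)
h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
     for ws, b in zip(w_hidden, b_hidden)]
# Output: y = sigmoid(sum_j v_j * h_j + b)
y = sigmoid(sum(v * hj for v, hj in zip(w_output, h)) + b_output)

print([round(v, 2) for v in h], round(y, 2))   # [0.9, 0.17] 0.76
```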

  17. Why "Neural" Networks?

  18. Neuron in the Brain
      ● The human brain is made up of about 100 billion neurons
        [figure: a neuron, with the soma, nucleus, dendrites, axon, and axon terminals labeled]
      ● Neurons receive electric signals at the dendrites and send them to the axon

  19. The Brain vs. Artificial Neural Networks
      ● Similarities
        – neurons, connections between neurons
        – learning = change of connections, not change of neurons
        – massively parallel processing
      ● But artificial neural networks are much simpler
        – computation within a neuron is vastly simplified
        – discrete time steps
        – typically some form of supervised learning with a massive number of stimuli

  20. Back-Propagation Training

  21. Error
      [figure: the network with input (1.0, 0.0), hidden values 0.90 and 0.17, and output 0.76]
      ● Computed output: y = 0.76
      ● Correct output: t = 1.0
      ⇒ How do we adjust the weights?

  22. Key Concepts
      ● Gradient descent
        – error is a function of the weights
        – we want to reduce the error
        – gradient descent: move towards the error minimum
        – compute gradient → get direction to the error minimum
        – adjust weights towards direction of lower error
      ● Back-propagation
        – first adjust the last set of weights
        – propagate the error back to each previous layer
        – adjust their weights
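A minimal Python sketch of the gradient descent idea from this slide: repeatedly step against the gradient of the error. The quadratic one-parameter error function and the learning rate are made-up illustrations, not from the lecture.

```python
def error(lam):
    return (lam - 3.0) ** 2          # illustrative error, minimum at lambda = 3

def gradient(lam):
    return 2.0 * (lam - 3.0)         # dE/dlambda

lam = 0.0                            # current lambda
mu = 0.1                             # learning rate (assumed)
for _ in range(50):
    lam -= mu * gradient(lam)        # move towards lower error
print(round(lam, 3), round(error(lam), 6))   # lambda approaches 3, error -> 0
```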

  23. Gradient Descent
      [figure: one-dimensional error curve error(λ), with the current λ, the optimal λ, and the gradient at the current point marked]

  24. Gradient Descent
      [figure: two-dimensional error surface, showing the current point, the optimum, the gradients for w1 and w2, and the combined gradient]

  25. Derivative of Sigmoid
      ● Sigmoid: $\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
      ● Reminder: quotient rule
        $\left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}$
      ● Derivative
        $\frac{d\, \text{sigmoid}(x)}{dx} = \frac{d}{dx}\, \frac{1}{1 + e^{-x}}$
        $= \frac{0 \times (1 + e^{-x}) - (-e^{-x})}{(1 + e^{-x})^2}$
        $= \frac{1}{1 + e^{-x}} \left(\frac{e^{-x}}{1 + e^{-x}}\right)$
        $= \frac{1}{1 + e^{-x}} \left(1 - \frac{1}{1 + e^{-x}}\right)$
        $= \text{sigmoid}(x)(1 - \text{sigmoid}(x))$
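A small Python check of the identity derived on this slide, comparing a finite-difference estimate against sigmoid(x)(1 − sigmoid(x)); the test points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Finite-difference check of d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
eps = 1e-6
for x in (-2.0, 0.0, 1.17, 3.0):
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    analytic = sigmoid(x) * (1 - sigmoid(x))
    print(x, round(numeric, 6), round(analytic, 6))   # the two columns agree
```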

  26. Final Layer Update
      ● Linear combination of weights: $s = \sum_k w_k h_k$
      ● Activation function: $y = \text{sigmoid}(s)$
      ● Error (L2 norm): $E = \frac{1}{2}(t - y)^2$
      ● Derivative of error with regard to one weight $w_k$:
        $\frac{dE}{dw_k} = \frac{dE}{dy} \frac{dy}{ds} \frac{ds}{dw_k}$

  27. Final Layer Update (1)
      ● Linear combination of weights: $s = \sum_k w_k h_k$
      ● Activation function: $y = \text{sigmoid}(s)$
      ● Error (L2 norm): $E = \frac{1}{2}(t - y)^2$
      ● Derivative of error with regard to one weight $w_k$:
        $\frac{dE}{dw_k} = \frac{dE}{dy} \frac{dy}{ds} \frac{ds}{dw_k}$
      ● Error E is defined with respect to y
        $\frac{dE}{dy} = \frac{d}{dy}\, \frac{1}{2}(t - y)^2 = -(t - y)$

  28. Final Layer Update (2)
      ● Linear combination of weights: $s = \sum_k w_k h_k$
      ● Activation function: $y = \text{sigmoid}(s)$
      ● Error (L2 norm): $E = \frac{1}{2}(t - y)^2$
      ● Derivative of error with regard to one weight $w_k$:
        $\frac{dE}{dw_k} = \frac{dE}{dy} \frac{dy}{ds} \frac{ds}{dw_k}$
      ● y with respect to s is sigmoid(s)
        $\frac{dy}{ds} = \frac{d\, \text{sigmoid}(s)}{ds} = \text{sigmoid}(s)(1 - \text{sigmoid}(s)) = y(1 - y)$

  29. Final Layer Update (3)
      ● Linear combination of weights: $s = \sum_k w_k h_k$
      ● Activation function: $y = \text{sigmoid}(s)$
      ● Error (L2 norm): $E = \frac{1}{2}(t - y)^2$
      ● Derivative of error with regard to one weight $w_k$:
        $\frac{dE}{dw_k} = \frac{dE}{dy} \frac{dy}{ds} \frac{ds}{dw_k}$
      ● s is the weighted linear combination of the hidden node values $h_k$
        $\frac{ds}{dw_k} = \frac{d}{dw_k} \sum_k w_k h_k = h_k$

  30. Putting it All Together
      ● Derivative of error with regard to one weight $w_k$:
        $\frac{dE}{dw_k} = \frac{dE}{dy} \frac{dy}{ds} \frac{ds}{dw_k} = -(t - y)\; y(1 - y)\; h_k$
        – error: $(t - y)$
        – derivative of sigmoid: $y'$
      ● Weight adjustment will be scaled by a fixed learning rate $\mu$:
        $\Delta w_k = \mu\, (t - y)\, y'\, h_k$
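A minimal Python sketch applying this final-layer update to the worked example above (t = 1.0, y = 0.76, hidden values 0.90 and 0.17 plus the bias unit); the learning rate μ = 1.0 is an assumed value, not from the slides.

```python
t, y = 1.0, 0.76                    # target and computed output
h = [0.90, 0.17, 1.0]               # hidden values and bias unit
w = [4.5, -5.2, -2.0]               # current output-layer weights
mu = 1.0                            # learning rate (assumed)

# Delta w_k = mu * (t - y) * y(1 - y) * h_k
delta = mu * (t - y) * y * (1 - y)  # shared error term
w_new = [wk + delta * hk for wk, hk in zip(w, h)]
print([round(wk, 3) for wk in w_new])   # ~[4.539, -5.193, -1.956]
```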
