Neural Networks CE417: Introduction to Artificial Intelligence


  1. Neural Networks. CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Fall 2019, Soleymani. Some slides are based on Anca Dragan’s slides (CS188, UC Berkeley) and some are adapted from Bhiksha Raj (11-785, CMU, 2019).

  2. 2

  3. McCulloch-Pitts neuron: binary threshold. Inputs $x_1, \ldots, x_N$ enter with weights $w_1, \ldots, w_N$; $\theta$ is the activation threshold. The unit outputs $y = 1$ if $z = \sum_i w_i x_i \geq \theta$ and $y = 0$ if $z < \theta$. Equivalent to a unit with a constant input $1$ and bias $b = -\theta$, compared against $0$.
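
A minimal sketch (not from the slides) of the two equivalent formulations of this unit; variable names and the example values are illustrative.

```python
import numpy as np

def mcp_threshold(x, w, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted sum reaches the threshold theta."""
    z = np.dot(w, x)
    return 1 if z >= theta else 0

def mcp_bias(x, w, b):
    """Equivalent formulation with bias b = -theta and the comparison moved to 0."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, 0.5, 0.5])
theta = 1.0
assert mcp_threshold(x, w, theta) == mcp_bias(x, w, b=-theta)  # same decision
```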

  4. Neural nets and the brain. [Figure: a single unit summing weighted inputs $w_1 x_1 + \cdots + w_N x_N + b$.] β€’ Neural nets are composed of networks of computational models of neurons called perceptrons.

  5. The perceptron. $y = 1$ if $\sum_i w_i x_i \geq \theta$, else $0$; equivalently, with bias $b = -\theta$. β€’ A threshold unit – β€œFires” if the weighted sum of inputs exceeds a threshold – Electrical engineers call this a threshold gate β€’ A basic unit of Boolean circuits
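
As an illustration of the β€œbasic unit of Boolean circuits” point, here is a small sketch showing that suitable weights and thresholds realize AND and OR gates; the particular weights and thresholds are chosen for illustration, not taken from the slides.

```python
import numpy as np

def threshold_unit(x, w, theta):
    # fires iff the weighted input sum reaches the threshold
    return int(np.dot(w, x) >= theta)

# AND: both inputs must be on, so threshold 2 with unit weights.
# OR: a single active input suffices, so threshold 1.
for a in (0, 1):
    for b in (0, 1):
        x = np.array([a, b])
        assert threshold_unit(x, np.array([1, 1]), theta=2) == (a and b)
        assert threshold_unit(x, np.array([1, 1]), theta=1) == (a or b)
```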

  6. Perceptron. $z = \sum_j w_j x_j + b$. [Figure: the decision boundary in the $(x_1, x_2)$ plane.] β€’ Learn this function β€’ A step function across a hyperplane

  7. Learning the perceptron. $y = 1$ if $\sum_j w_j x_j + b \geq 0$, and $0$ otherwise. β€’ Given a number of input-output pairs, learn the weights and bias – Learn $W = [w_1, \ldots, w_N]$ and $b$, given several $(X, y)$ pairs

  8. Restating the perceptron. Restating the perceptron equation by adding another dimension to $X$: $y = 1$ if $\sum_{i=1}^{d+1} w_i x_i \geq 0$, and $0$ otherwise, where $x_{d+1} = 1$ and $w_{d+1} = b$. β€’ Note that the boundary $\sum_{i=1}^{d+1} w_i x_i = 0$ is now a hyperplane through the origin.
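
A small sketch of this restated form: appending a constant-1 feature folds the bias into the weight vector, so the decision boundary passes through the origin of the augmented space. The array shapes and values below are illustrative.

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [2.0, -0.3]])          # two examples, d = 2 features
w, b = np.array([1.0, -2.0]), 0.5

# Augment: x_{d+1} = 1 and w_{d+1} = b.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

assert np.allclose(X @ w + b, X_aug @ w_aug)   # identical activations
```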

  9. The Perceptron Problem. $\sum_{i=1}^{d+1} w_i x_i = 0$ β€’ Find the hyperplane that perfectly separates the two groups of points.

  10. Perceptron Algorithm: Summary β€’ Cycle through the training instances β€’ Only update $\mathbf{w}$ on misclassified instances β€’ If an instance is misclassified: – If the instance is positive class: $\mathbf{w} = \mathbf{w} + \mathbf{x}$ – If the instance is negative class: $\mathbf{w} = \mathbf{w} - \mathbf{x}$
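
A compact sketch of this loop, assuming labels in {-1, +1} and the bias folded into the weights as on the previous slide; the epoch cap is an added assumption so the loop terminates even on non-separable data, and the toy dataset is illustrative.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """X: (n, d) inputs with a trailing constant-1 column; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):          # cycle through the training instances
            if y_i * (w @ x_i) <= 0:        # misclassified (or on the boundary)
                w += y_i * x_i              # +x for positive class, -x for negative
                mistakes += 1
        if mistakes == 0:                   # every instance classified correctly
            break
    return w

# Tiny linearly separable example (last column is the constant-1 bias feature).
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))   # expected: [ 1.  1. -1. -1.]
```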

  11. Training of Single Layer. β€’ Weight update for a training pair $(\mathbf{x}^{(n)}, y^{(n)})$: $\mathbf{w}^{k+1} = \mathbf{w}^k - \eta \nabla E^{(n)}(\mathbf{w}^k)$ β€’ Perceptron: if $\mathrm{sign}(\mathbf{w}^\top \mathbf{x}^{(n)}) \neq y^{(n)}$, i.e. the instance is misclassified, then $E^{(n)}(\mathbf{w}) = -\mathbf{w}^\top \mathbf{x}^{(n)} y^{(n)}$ and $\eta \nabla E^{(n)}(\mathbf{w}^k) = -\eta\, \mathbf{x}^{(n)} y^{(n)}$ β€’ ADALINE (Widrow-Hoff, LMS, or delta rule): $E^{(n)}(\mathbf{w}) = \big(y^{(n)} - \mathbf{w}^\top \mathbf{x}^{(n)}\big)^2$ and $\eta \nabla E^{(n)}(\mathbf{w}^k) = -\eta\, \big(y^{(n)} - \mathbf{w}^\top \mathbf{x}^{(n)}\big)\, \mathbf{x}^{(n)}$
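
A side-by-side sketch of the two per-example updates above; the learning rates are illustrative, labels are assumed to be in {-1, +1}, and the factor of 2 from the squared error is absorbed into the learning rate, as on the slide.

```python
import numpy as np

def perceptron_step(w, x, y, eta=1.0):
    # update only when sign(w.x) disagrees with the label
    if np.sign(w @ x) != y:
        w = w + eta * y * x
    return w

def adaline_step(w, x, y, eta=0.1):
    # delta rule / Widrow-Hoff / LMS: move toward the target of the linear output
    error = y - w @ x
    return w + eta * error * x
```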

  12. Perceptron vs. Delta Rule β€’ Perceptron learning rule: guaranteed to succeed if the training examples are linearly separable β€’ Delta rule: guaranteed to converge to the hypothesis with the minimum squared error; can also be used for regression problems

  13. Reminder: Linear Classifiers β€’ Inputs are feature values β€’ Each feature has a weight β€’ The sum is the activation: $\sum_i w_i f_i$ β€’ If the activation is: – Positive, output +1 – Negative, output -1 [Figure: features $f_1, f_2, f_3$ weighted by $w_1, w_2, w_3$, summed, and compared to 0.]

  14. The β€œsoft” perceptron (logistic). $z = \sum_j w_j x_j - \theta$ (bias $b = -\theta$), $y = \frac{1}{1 + \exp(-z)}$. β€’ A β€œsquashing” function instead of a threshold at the output – The sigmoid β€œactivation” replaces the threshold β€’ Activation: the function that acts on the weighted combination of inputs (and threshold)

  15. Sigmoid neurons β€’ These give a real-valued output that is a smooth and bounded function of their total input β€’ Typically they use the logistic function β€’ They have nice derivatives

  16. Other β€œactivations”: sigmoid $\frac{1}{1 + \exp(-z)}$, $\tanh(z)$, $\log(1 + e^z)$. β€’ Does not always have to be a squashing function – We will hear more about activations later β€’ We will continue to assume a β€œthreshold” activation in this lecture
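
A quick sketch of the three activations listed on this slide; the stable formulation of the last one via logaddexp is an implementation choice, not something the slide specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def log1p_exp(z):
    # log(1 + e^z), computed as logaddexp(0, z) to avoid overflow for large z
    return np.logaddexp(0.0, z)

z = np.linspace(-5, 5, 5)
print(sigmoid(z), tanh(z), log1p_exp(z))
```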

  17. How to get probabilistic decisions? β€’ Activation: $z = \mathbf{w}^\top \mathbf{x}$ β€’ If $z = \mathbf{w}^\top \mathbf{x}$ is very positive β†’ want the probability to go to 1 β€’ If $z = \mathbf{w}^\top \mathbf{x}$ is very negative β†’ want the probability to go to 0 β€’ Sigmoid function: $\phi(z) = \frac{1}{1 + e^{-z}}$

  18. Best w? β€’ Maximum likelihood estimation: $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ with $P(y^{(i)} = +1 \mid x^{(i)}; w) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}^{(i)}}}$ and $P(y^{(i)} = -1 \mid x^{(i)}; w) = 1 - \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}^{(i)}}}$ = Logistic Regression
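
A sketch of this log likelihood for labels in {-1, +1}; it relies on the identity $P(y \mid x; w) = \sigma(y\, \mathbf{w}^\top \mathbf{x})$, which follows from the two cases above, and the logaddexp form is an added numerical-stability choice.

```python
import numpy as np

def log_likelihood(w, X, y):
    """Binary logistic regression log likelihood, labels y in {-1, +1}.

    P(y | x; w) = sigmoid(y * w.x), so
    ll(w) = sum_i log sigmoid(y_i * w.x_i) = -sum_i log(1 + exp(-y_i * w.x_i)).
    """
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))   # numerically stable form
```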

  19. Multiclass Logistic Regression β€’ Multi-class linear classification – A weight vector for each class: $\mathbf{w}_y$ – Score (activation) of a class $y$: $\mathbf{w}_y^\top \mathbf{x}$ – Prediction: the class with the highest score wins β€’ How to make the scores into probabilities? $z_1, z_2, z_3 \rightarrow \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$ (original activations β†’ softmax activations)
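
A small sketch of the softmax map from scores to probabilities; subtracting the maximum score is an added safeguard against overflow and does not change the result, since the shift cancels in the ratio.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                 # shift for numerical stability; ratios are unchanged
    e = np.exp(z)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))     # sums to 1, ordered like the original activations
```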

  20. Best w? β€’ Maximum likelihood estimation: $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ with $P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{\mathbf{w}_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{\mathbf{w}_y \cdot f(x^{(i)})}}$ = Multi-Class Logistic Regression

  21. Batch Gradient Ascent on the Log Likelihood Objective. $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w) = g(w)$ β€’ init $w$ β€’ for iter = 1, 2, …: $w \leftarrow w + \alpha \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$
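
A sketch of batch gradient ascent for the binary case with labels in {-1, +1}, using $\nabla_w \log \sigma(y\, \mathbf{w}^\top \mathbf{x}) = (1 - \sigma(y\, \mathbf{w}^\top \mathbf{x}))\, y\, \mathbf{x}$; the step size, iteration count, and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.1, iters=500):
    """Maximize sum_i log P(y_i | x_i; w) for binary labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # d/dw log sigmoid(y_i w.x_i) = (1 - sigmoid(y_i w.x_i)) * y_i * x_i
        grad = X.T @ ((1.0 - sigmoid(margins)) * y)
        w = w + alpha * grad          # full-batch ascent step
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = batch_gradient_ascent(X, y)
print(np.sign(X @ w))                 # should match y on this toy set
```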

  22. Logistic regression: multi-class. Gradient update for the weight vector of class $c$: $\mathbf{w}_c^{t+1} = \mathbf{w}_c^t - \eta\, \nabla_{\mathbf{w}_c} J(\mathbf{W}^t)$, where $\nabla_{\mathbf{w}_c} J(\mathbf{W}) = \sum_{i=1}^n \big(P(c \mid \mathbf{x}^{(i)}; \mathbf{W}) - y_c^{(i)}\big)\, \mathbf{x}^{(i)}$
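
A sketch of this multi-class update assuming one-hot targets and a weight matrix W with one row per class; the shapes and step size are illustrative.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_lr_step(W, X, Y, eta=0.1):
    """One gradient step. W: (C, d); X: (n, d); Y: (n, C) one-hot targets."""
    P = softmax_rows(X @ W.T)             # P[i, c] = P(c | x_i; W)
    grad = (P - Y).T @ X                  # sum_i (P(c | x_i) - y_c^(i)) x_i, per class c
    return W - eta * grad                 # descend on the cost J
```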

  23. Stochastic Gradient Ascent on the Log Likelihood Objective. $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ β€’ Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one β€’ init $w$ β€’ for iter = 1, 2, …: pick random $j$; $w \leftarrow w + \alpha\, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$
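
The same objective with a stochastic step (one randomly chosen example per update), sketched for the binary case using the per-example gradient from the batch version above; step size, iteration count, and seed are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ascent(X, y, alpha=0.1, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        j = rng.integers(len(y))                               # pick random j
        grad_j = (1.0 - sigmoid(y[j] * (X[j] @ w))) * y[j] * X[j]
        w = w + alpha * grad_j                                 # incorporate immediately
    return w
```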

  24. Mini-Batch Gradient Ascent on the Log Likelihood Objective. $\max_w ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$ β€’ Observation: the gradient over a small set of training examples (= a mini-batch) can also be computed, so might as well do that instead of using a single example β€’ init $w$ β€’ for iter = 1, 2, …: pick a random subset of training examples $J$; $w \leftarrow w + \alpha \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$
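
And a sketch of the mini-batch variant, which sums the gradient over a small random subset J each step; the batch size of 2 and other settings are illustrative (the subset draw assumes batch_size does not exceed the dataset size).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_ascent(X, y, alpha=0.1, iters=500, batch_size=2, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        J = rng.choice(len(y), size=batch_size, replace=False)   # random subset J
        margins = y[J] * (X[J] @ w)
        grad = X[J].T @ ((1.0 - sigmoid(margins)) * y[J])        # sum over j in J
        w = w + alpha * grad
    return w
```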

  25. Networks with hidden units β€’ Networks without hidden units are very limited in the input-output mappings they can learn to model β€’ A simple function such as XOR cannot be modeled with a single-layer network β€’ More layers of linear units do not help: it is still linear β€’ Fixed output non-linearities are not enough β€’ We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
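
A hand-weighted sketch of the XOR point: one hidden layer of threshold units computing OR and AND of the inputs, followed by an output unit that fires when OR is on and AND is off, realizes XOR, which no single threshold unit can. The weights here are chosen by hand for illustration, not learned.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2), thresholds folded into biases.
    h = step(np.array([[1, 1], [1, 1]]) @ x + np.array([-1, -2]))
    # Output: fires iff OR is on and AND is off, i.e. exactly one input is 1.
    return int(step(np.array([1, -1]) @ h + np.array([-1]))[0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table
```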

  26. Neural Networks

  27. The multi-layer perceptron

  28. Neural Networks Properties β€’ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy. β€’ Practical considerations – Can be seen as learning the features – Large number of neurons – Danger of overfitting (hence early stopping!)

  29. Universal Function Approximation Theorem* β€’ In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x). Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”; Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”; Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

  30. Universal Function Approximation Theorem* Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”; Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”; Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

  31. Expressiveness of neural networks β€’ All Boolean functions can be represented by a network with a single hidden layer – But it might require exponentially many (in the number of inputs) hidden units β€’ Continuous functions: – Any continuous function on a compact domain can be approximated to arbitrary accuracy by a network with one hidden layer [Cybenko 1989] – Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
