Learning Neural Networks

Neural networks can represent complex decision boundaries:
– Variable size: any boolean function can be represented.
– Hidden units can be interpreted as new features.
– Deterministic.
– Continuous parameters.

Learning algorithms for neural networks:
– Local search: the same algorithm as for sigmoid threshold units.
– Eager.
– Batch or online.
Neural Network Hypothesis Space

[Figure: a network with inputs x1..x4, hidden units a6, a7, a8 with weight
vectors W6, W7, W8, and an output unit ŷ with weight vector W9.]

Each unit a6, a7, a8, and ŷ computes a sigmoid function of its inputs:

    a6 = σ(W6 · X)    a7 = σ(W7 · X)    a8 = σ(W8 · X)    ŷ = σ(W9 · A)

where A = [1, a6, a7, a8] is called the vector of hidden unit activations.

Original motivation: a differentiable approximation to multi-layer LTUs.
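The forward computation above can be sketched in a few lines of Python. The weight-vector layout (a leading bias component paired with a constant 1 input) is an assumption for illustration; the slides do not fix a convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def forward(W6, W7, W8, W9, x):
    """Forward pass for the 4-input, 3-hidden-unit network above.

    Each weight vector's first component is treated as a bias, so the
    input vector is prepended with a constant 1.
    """
    X = [1.0] + list(x)                   # [1, x1, x2, x3, x4]
    a6 = sigmoid(dot(W6, X))
    a7 = sigmoid(dot(W7, X))
    a8 = sigmoid(dot(W8, X))
    A = [1.0, a6, a7, a8]                 # vector of hidden unit activations
    y_hat = sigmoid(dot(W9, A))
    return y_hat
```

With all weights zero, every sigmoid sees input 0 and outputs 0.5, which is a handy sanity check.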
Representational Power

Any boolean formula:
– Consider a formula in disjunctive normal form:

    (x1 ∧ ¬x2) ∨ (x2 ∧ x4) ∨ (¬x3 ∧ x5)

  Each AND can be represented by a hidden unit and the OR can be
  represented by the output unit. Arbitrary boolean functions require
  exponentially-many hidden units, however.

Bounded functions:
– Suppose we make the output linear: ŷ = W9 · A. It can be proved that
  any bounded continuous function can be approximated to arbitrary
  accuracy with enough hidden units.

Arbitrary functions:
– Any function can be approximated to arbitrary accuracy with two hidden
  layers of sigmoid units and a linear output unit.
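The AND-per-hidden-unit, OR-at-the-output construction can be made concrete with hand-chosen weights. The specific magnitudes (10, 20, and the biases) are illustrative assumptions, not unique: any sufficiently steep weights push each sigmoid toward a hard threshold.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dnf_net(x1, x2, x3, x4, x5):
    """Approximates (x1 AND NOT x2) OR (x2 AND x4) OR (NOT x3 AND x5).

    One hidden unit per AND term; the output unit computes the OR.
    Weight magnitudes are illustrative; larger values give a sharper
    approximation to the boolean formula.
    """
    h1 = sigmoid(10*x1 - 10*x2 - 5)    # x1 AND NOT x2
    h2 = sigmoid(10*x2 + 10*x4 - 15)   # x2 AND x4
    h3 = sigmoid(-10*x3 + 10*x5 - 5)   # NOT x3 AND x5
    return sigmoid(20*h1 + 20*h2 + 20*h3 - 10)  # OR of the three ANDs
```

Each AND unit outputs close to 1 only when its term is satisfied, and the output unit fires when any hidden unit is near 1.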
Fixed versus Variable Size

In principle, a network has a fixed number of parameters and therefore can
only represent a fixed hypothesis space (if the number of hidden units is
fixed).

However, we will initialize the weights to values near zero and use
gradient descent. The more steps of gradient descent we take, the more
functions can be "reached" from the starting weights.

So it turns out to be more accurate to treat networks as having a variable
hypothesis space that depends on the number of steps of gradient descent.
Backpropagation: Gradient Descent for Multi-Layer Networks

It is traditional to train neural networks to minimize the squared error.
This is really a mistake: they should be trained to maximize the log
likelihood instead. But we will study the MSE first.

    ŷ = σ(W9 · [1, σ(W6 · X), σ(W7 · X), σ(W8 · X)])

    Ji(W) = ½ (ŷi − yi)²

We must apply the chain rule many times to compute the gradient.

We will number the units from 0 to U and index them by u and v.
w_{v,u} will be the weight connecting unit u to unit v. (Note: this seems
backwards. It is the u-th input to node v.)
Derivation: Output Unit

Suppose w_{9,6} is a component of W9, the output weight vector, connecting
it from a6.

    ∂Ji(W)/∂w_{9,6} = ∂/∂w_{9,6} ½ (ŷi − yi)²
                    = ½ · 2 · (ŷi − yi) · ∂/∂w_{9,6} (σ(W9 · Ai) − yi)
                    = (ŷi − yi) · σ(W9 · Ai)(1 − σ(W9 · Ai)) · ∂/∂w_{9,6} (W9 · Ai)
                    = (ŷi − yi) ŷi (1 − ŷi) · a6
The Delta Rule

Define

    δ9 = (ŷi − yi) ŷi (1 − ŷi)

then

    ∂Ji(W)/∂w_{9,6} = (ŷi − yi) ŷi (1 − ŷi) · a6 = δ9 · a6
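The delta rule can be checked numerically: the analytic gradient δ9 · a6 should match a centered finite difference of Ji with respect to w_{9,6}. The activation vector A, weights W9, and target y below are made-up illustrative values, not from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Illustrative values (not from the slides).
A = [1.0, 0.2, 0.7, 0.4]        # A = [1, a6, a7, a8]
W9 = [0.1, -0.3, 0.5, 0.2]      # output weights; index 1 pairs with a6
y = 1.0

def J(W):
    """Squared error Ji for this single example."""
    return 0.5 * (sigmoid(dot(W, A)) - y) ** 2

# Analytic gradient for w_{9,6} via the delta rule.
y_hat = sigmoid(dot(W9, A))
delta9 = (y_hat - y) * y_hat * (1 - y_hat)
grad_analytic = delta9 * A[1]     # delta9 * a6

# Centered finite-difference check.
eps = 1e-6
W_plus = W9[:];  W_plus[1] += eps
W_minus = W9[:]; W_minus[1] -= eps
grad_numeric = (J(W_plus) - J(W_minus)) / (2 * eps)
```

The two gradients should agree to many decimal places; this kind of check is a standard way to debug backpropagation code.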
Derivation: Hidden Units

    ∂Ji(W)/∂w_{6,2} = (ŷi − yi) · σ(W9 · Ai)(1 − σ(W9 · Ai)) · ∂/∂w_{6,2} (W9 · Ai)
                    = δ9 · w_{9,6} · ∂/∂w_{6,2} σ(W6 · X)
                    = δ9 · w_{9,6} · σ(W6 · X)(1 − σ(W6 · X)) · ∂/∂w_{6,2} (W6 · X)
                    = δ9 · w_{9,6} · a6 (1 − a6) · x2

Define δ6 = δ9 · w_{9,6} · a6 (1 − a6) and rewrite as

    ∂Ji(W)/∂w_{6,2} = δ6 · x2
Networks with Multiple Output Units

[Figure: a network with inputs x1..x4, hidden units a6, a7, a8, and two
output units ŷ1 and ŷ2 (units 9 and 10), each with its own bias input 1.]

We get a separate contribution to the gradient from each output unit.

Hence, for input-to-hidden weights, we must sum up the contributions:

    δ6 = a6 (1 − a6) · Σ_{u=9..10} w_{u,6} δu
The Backpropagation Algorithm

Forward Pass. Compute a_u and ŷ_v for hidden units u and output units v.

Compute Errors. Compute ε_v = (ŷ_v − y_v) for each output unit v.

Compute Deltas. Compute δ_v = ε_v ŷ_v (1 − ŷ_v) for each output unit v,
then δ_u = a_u (1 − a_u) Σ_v w_{v,u} δ_v for each hidden unit u.

Compute Gradient.
– Compute ∂Ji/∂w_{u,j} = δ_u x_{ij} for input-to-hidden weights.
– Compute ∂Ji/∂w_{v,u} = δ_v a_{iu} for hidden-to-output weights.

Take Gradient Step.

    W := W − η ∇_W J(x_i)
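The five steps above can be sketched for a one-hidden-layer network with a single sigmoid output. This is a minimal sketch, assuming the bias-as-first-weight convention used earlier; `backprop_step` is a hypothetical helper name, not from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def backprop_step(W_hidden, W_out, x, y, eta):
    """One stochastic gradient step on example (x, y).

    W_hidden: list of weight vectors, one per hidden unit (bias first).
    W_out: output weight vector over [1, a_1, ..., a_H] (bias first).
    """
    # Forward pass: compute hidden activations a_u and the output y_hat.
    X = [1.0] + list(x)
    a = [sigmoid(dot(w, X)) for w in W_hidden]
    A = [1.0] + a
    y_hat = sigmoid(dot(W_out, A))
    # Error and output delta.
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden deltas: back-propagate delta_out through the output weights.
    delta_hidden = [a[u] * (1 - a[u]) * W_out[u + 1] * delta_out
                    for u in range(len(a))]
    # Gradient step: hidden-to-output, then input-to-hidden weights.
    W_out = [w - eta * delta_out * A[j] for j, w in enumerate(W_out)]
    W_hidden = [[w - eta * delta_hidden[u] * X[j]
                 for j, w in enumerate(W_hidden[u])]
                for u in range(len(W_hidden))]
    return W_hidden, W_out
```

One step along the negative gradient should reduce the squared error on the example it was computed from (for a reasonable step size η).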
Proper Initialization

Start in the "linear" regions:
– Keep all weights near zero, so that all sigmoid units are in their
  linear regions. This makes the whole net equivalent to one linear
  threshold unit, a very simple function.

Break symmetry:
– Ensure that each hidden unit has different input weights so that the
  hidden units move in different directions.
– Set each weight to a random number in the range

    [−1, +1] × 1/√(fan-in)

  where the "fan-in" of weight w_{v,u} is the number of inputs to unit v.
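Both rules can be sketched together: draw each weight uniformly from a small range scaled by 1/√(fan-in). Counting the bias as an input to each unit is an assumption for illustration, and `init_weights` is a hypothetical helper name.

```python
import math
import random

def init_weights(n_inputs, n_hidden, n_outputs, rng=random):
    """Initialize all weights to small random values in
    [-1, +1] * 1/sqrt(fan-in), breaking symmetry between hidden units.

    Sketch: the bias is counted as one input, so fan-in is n_inputs + 1
    for hidden units and n_hidden + 1 for output units.
    """
    def unit(fan_in):
        scale = 1.0 / math.sqrt(fan_in)
        return [rng.uniform(-1.0, 1.0) * scale for _ in range(fan_in)]
    W_hidden = [unit(n_inputs + 1) for _ in range(n_hidden)]
    W_out = [unit(n_hidden + 1) for _ in range(n_outputs)]
    return W_hidden, W_out
```

Because every weight is an independent random draw, no two hidden units start with identical input weights, so they receive different gradients and learn different features.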
Batch, Online, and Online with Momentum

Batch. Sum the gradients ∇_W J(x_i) for each example i. Then take a
gradient descent step.

Online. Take a gradient descent step with each ∇_W J(x_i) as it is
computed.

Momentum. Maintain an exponentially-weighted moving sum of recent
gradients:

    ΔW(t+1) := µ ΔW(t) + ∇_W J(x_i)
    W(t+1)  := W(t) − η ΔW(t+1)

Typical values of µ are in the range [0.7, 0.95].
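The momentum update pair can be sketched directly, treating the weights as a flat list. `momentum_step` is a hypothetical helper name; the defaults below just echo typical values.

```python
def momentum_step(W, dW, grad, mu=0.9, eta=0.1):
    """One momentum update (sketch).

    dW accumulates an exponentially-weighted sum of recent gradients:
        dW(t+1) = mu * dW(t) + grad
        W(t+1)  = W(t) - eta * dW(t+1)
    """
    dW = [mu * d + g for d, g in zip(dW, grad)]   # ΔW(t+1)
    W = [w - eta * d for w, d in zip(W, dW)]      # W(t+1)
    return W, dW
```

When successive gradients point the same way, the accumulated ΔW grows, so momentum speeds progress along consistent directions and damps oscillation along inconsistent ones.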