Deep Learning: Introduction to Neural Networks
Hamid Beigy
Sharif University of Technology
September 30, 2019
Table of contents
1. Brain
2. History of neural networks
3. Gradient-based learning
Brain
Functions of different parts of the brain. Figure: brain diagram with twelve numbered regions.
Brain network (figure).
Neuron. Figure: cell body (soma), nucleus, dendrites, axon, axonal arborization, synapses, and an axon from another cell.
History of neural networks
McCulloch and Pitts network (1943)
1. The first model of a neuron was proposed by McCulloch (a physiologist) and Pitts (a logician).
2. Inputs are binary.
3. The neuron has two types of inputs: excitatory inputs (shown by a) and inhibitory inputs (shown by b).
4. The output is binary: fires (1) or does not fire (0).
5. Until the inputs sum up to a certain threshold level, the output remains zero (see the sketch below).
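A minimal sketch of such a unit, assuming the standard formulation in which any active inhibitory input suppresses the output; the function and argument names are illustrative, not from the slides:

```python
# A McCulloch-Pitts unit: binary inputs, binary output, fixed threshold.
def mp_neuron(excitatory, inhibitory, threshold):
    """excitatory, inhibitory: lists of binary inputs (0 or 1)."""
    if any(inhibitory):                        # absolute inhibition (assumed standard MP rule)
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# With two excitatory inputs: threshold 2 gives AND, threshold 1 gives OR.
print(mp_neuron([1, 1], [], threshold=2))      # 1 (AND fires)
print(mp_neuron([1, 0], [], threshold=2))      # 0
print(mp_neuron([1, 0], [], threshold=1))      # 1 (OR fires)
print(mp_neuron([1, 1], [1], threshold=1))     # 0 (inhibitory input vetoes the output)
```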
McCulloch and Pitts network (logic functions). Figure: threshold units realizing AND (two excitatory inputs, threshold 2), OR (threshold 1), and NOT (threshold 0 with an inhibitory input).
Perceptron (Frank Rosenblatt (1958))
1. Problems with McCulloch-Pitts neurons:
   - Weights and thresholds are determined analytically (they cannot be learned).
   - It is very difficult to minimize the size of a network.
   - What about non-discrete and/or non-binary tasks?
2. Perceptron solution:
   - Weights and thresholds can be determined analytically or by a learning algorithm.
   - Continuous, bipolar, and multiple-valued versions exist.
   - Rosenblatt randomly connected the perceptrons and changed the weights in order to achieve learning.
   - Efficient minimization heuristics exist.
Perceptron (Frank Rosenblatt (1958))
Simplified mathematical model: the inputs combine linearly, and the unit uses threshold logic, firing if the combined input exceeds a threshold.
1. Let y be the correct output and f(x) the output of the network. The perceptron updates its weights as (Rosenblatt 1960, sketched below)
   $w_j^{(t+1)} \leftarrow w_j^{(t)} + \alpha\, x_j\, (y - f(x))$
2. The McCulloch and Pitts neuron is a better model of the electrochemical process inside the neuron than the perceptron.
3. But the perceptron is the basis and building block of modern neural networks.
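A small sketch of this update rule, assuming binary labels y in {0, 1}, a step activation, and a bias folded in as a constant input x_0 = 1; the data and hyperparameters are illustrative, not from the slides:

```python
import numpy as np

# Perceptron learning rule: update only via the prediction error (y - f(x)).
def train_perceptron(X, y, alpha=0.1, epochs=50):
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x_0 = 1 for the bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            f_x = 1 if w @ x_i >= 0 else 0          # step activation
            w += alpha * (y_i - f_x) * x_i          # w_j += alpha * x_j * (y - f(x))
    return w

# Learning AND, which is linearly separable:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([1 if w @ np.r_[1, x] >= 0 else 0 for x in X])   # expected [0, 0, 0, 1]
```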
Adaline (Bernard Widrow and Ted Hoff (1960))
1. The model is the same as the perceptron, but it uses a different learning algorithm.
2. A multilayer network of Adaline units is known as a Madaline.
Adaline learning (Bernard Widrow and Ted Hoff (1960))
1. Let y be the correct output and $f(x) = \sum_{j=0}^{n} w_j x_j$. Adaline updates its weights as
   $w_j^{(t+1)} \leftarrow w_j^{(t)} + \alpha\, x_j\, (y - f(x))$
2. Adaline converges to the least-squares solution, i.e., it minimizes the squared error $(y - f(x))^2$. This update rule is in fact the stochastic gradient descent update for linear regression (see the sketch below).
3. In the 1960s, there were many articles promising robots that could think.
4. It seems there was a general belief that the perceptron could solve any problem.
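A sketch of the Adaline (LMS) rule under the same conventions as the perceptron sketch above; the key difference is that the error y − f(x) is computed on the linear output, so each update is a stochastic-gradient step on the squared error. The data and learning rate are illustrative:

```python
import numpy as np

# Adaline / LMS: stochastic gradient descent on the squared error (y - w.x)^2.
def train_adaline(X, y, alpha=0.05, epochs=100):
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # x_0 = 1 for the bias weight w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            f_x = w @ x_i                            # linear output, no threshold
            w += alpha * (y_i - f_x) * x_i           # w_j += alpha * x_j * (y - f(x))
    return w

# Fit a noiseless line y = 2 + 3x; the weights should approach (2, 3).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 + 3 * X.ravel()
print(train_adaline(X, y))    # approximately [2. 3.]
```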
Minsky and Papert (1968)
1. Minsky and Papert published their book Perceptrons. The book shows that perceptrons can only solve linearly separable problems: they are not universal.
2. In particular, they showed that it is not possible for a single-layer perceptron to learn the XOR function. Figure: the XOR truth table over inputs X and Y has no linear separator.
3. After Perceptrons was published, researchers lost interest in perceptrons and neural networks.
Multi-layer Perceptron (Minsky and Papert (1968))
1. XOR can be computed by a multi-layer perceptron: the first layer is a "hidden" layer (a sketch with one standard choice of weights follows).
2. This construction was also originally suggested by Minsky and Papert (1968).
Figure: a two-layer network over inputs X and Y computing XOR through a hidden layer.
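The exact weights in the figure are not fully recoverable here, so the sketch below uses one standard choice: a hidden OR unit and a hidden AND unit, combined as "OR and not AND":

```python
# A two-layer threshold network for XOR (illustrative weights, not necessarily the figure's).
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x, y):
    h1 = step(1 * x + 1 * y - 0.5)       # hidden unit 1: OR(x, y)
    h2 = step(1 * x + 1 * y - 1.5)       # hidden unit 2: AND(x, y)
    return step(1 * h1 - 2 * h2 - 0.5)   # output: OR(x, y) and not AND(x, y)

print([xor_mlp(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```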
History
1. Optimization
   - In 1969, Bryson and Ho described backpropagation as a multi-stage dynamic system optimization method.
   - In 1972, Stephen Grossberg proposed networks capable of learning the XOR function.
   - In 1974, Paul Werbos, and later David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams (1986), reinvented backpropagation and applied it in the context of neural networks. Backpropagation allowed perceptrons to be trained in a multilayer configuration.
2. In the 1980s, the field of artificial neural network research experienced a resurgence.
3. In the 2000s, neural networks fell out of favor, partly due to the limitations of backpropagation.
4. Since 2010, we have been able to train much larger networks using modern computing power such as GPUs.
Gradient-based learning
Cost function
1. The goal of machine learning algorithms is to construct a model (hypothesis) that can be used to estimate y based on x.
2. Let the model be of the form $h(x) = w_0 + w_1 x$.
3. The goal is to choose the parameters so that h(x) is close to y on the training data (x, y).
4. We need a function of the parameters to minimize over our dataset. A function that is often used is the mean squared error (computed in the sketch below),
   $J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)^2$
5. How do we find the minimum value of the cost function?
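A short sketch of evaluating this cost for the linear hypothesis $h(x) = w_0 + w_1 x$ on a made-up dataset:

```python
import numpy as np

# Mean squared error cost for the linear hypothesis h(x) = w0 + w1 * x.
def cost(w0, w1, x, y):
    h = w0 + w1 * x                      # model predictions h(x_i)
    return np.mean((h - y) ** 2) / 2     # J(w) = (1 / 2m) * sum (h(x_i) - y_i)^2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])       # generated by y = 1 + 2x
print(cost(1.0, 2.0, x, y))              # 0.0 at the true parameters
print(cost(0.0, 0.0, x, y))              # 10.5, larger for a poor fit
```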
Gradient descent
1. Gradient descent is by far the most popular optimization strategy used in machine learning and deep learning at the moment.
2. The cost (error) is a function of the weights (parameters).
3. We want to reduce/minimize the error.
4. Gradient descent: move towards the error minimum.
5. Compute the gradient: it points in the direction of steepest increase of the error.
6. Adjust the weights in the opposite direction, towards lower error, as in the sketch below.
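A minimal sketch of this loop on an illustrative one-dimensional cost $J(w) = (w - 3)^2$, which is not a cost from the slides:

```python
# Generic gradient descent: repeatedly step opposite to the gradient.
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= alpha * grad(w)     # move against the gradient direction
    return w

grad = lambda w: 2 * (w - 3)     # dJ/dw for J(w) = (w - 3)^2
print(gradient_descent(grad, w0=0.0))   # approaches the minimum at w = 3
```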
Gradient descent (linear regression)
1. We have the following hypothesis, which we need to fit to the training data:
   $h(x) = w_0 + w_1 x$
2. We use a cost function such as the mean squared error:
   $J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)^2$
3. This cost function can be minimized using gradient descent (see the sketch below):
   $w_0^{(t+1)} = w_0^{(t)} - \alpha \frac{\partial J(w^{(t)})}{\partial w_0}$
   $w_1^{(t+1)} = w_1^{(t)} - \alpha \frac{\partial J(w^{(t)})}{\partial w_1}$
   where $\alpha$ is the step size (learning rate).
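A small sketch of batch gradient descent for this two-parameter model; the dataset is made up (generated from $y = 1 + 2x$) and the learning rate and number of steps are illustrative:

```python
import numpy as np

# Batch gradient descent for h(x) = w0 + w1*x with J(w) = (1/2m) * sum (h(x_i) - y_i)^2.
def fit_linear(x, y, alpha=0.05, steps=2000):
    m = len(x)
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        err = (w0 + w1 * x) - y           # h(x_i) - y_i for all i
        grad_w0 = err.sum() / m           # dJ/dw0
        grad_w1 = (err * x).sum() / m     # dJ/dw1
        w0 -= alpha * grad_w0             # simultaneous update of both weights
        w1 -= alpha * grad_w1
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x
print(fit_linear(x, y))    # approximately (1.0, 2.0)
```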
Gradient descent (effect of learning rate). Figure.
Gradient descent (landscape of cost function). Figure: 3D surface plots of a cost over the parameter plane.
Challenges with gradient descent
1. Local minima: a local minimum is a minimum within some neighborhood that need not be (but may be) a global minimum.
2. Saddle points: for non-convex functions, having the gradient equal to 0 is not good enough. Example: $f(x) = x_1^2 - x_2^2$ has zero gradient at $x = (0, 0)$, but that point is clearly not a local minimum, since $x = (0, \epsilon)$ has a smaller function value. The point $(0, 0)$ is called a saddle point of this function (checked numerically below).
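A quick numerical check of the saddle-point example:

```python
import numpy as np

# Saddle point of f(x1, x2) = x1^2 - x2^2 at the origin.
f = lambda x1, x2: x1**2 - x2**2
grad = lambda x1, x2: np.array([2 * x1, -2 * x2])

print(grad(0.0, 0.0))              # [ 0. -0.]  -> zero gradient at the origin
eps = 1e-3
print(f(0.0, 0.0), f(0.0, eps))    # 0.0 -1e-06 -> (0, eps) has a smaller value
```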