IAML: Artificial Neural Networks
Chris Williams and Victor Lavrenko
School of Informatics, Semester 1
Outline
◮ Why multilayer artificial neural networks (ANNs)?
◮ Representation Power of ANNs
◮ Training ANNs: backpropagation
◮ Learning Hidden Layer Representations
◮ Examples
◮ Recurrent Neural Networks
◮ W & F sec 6.3, multilayer perceptrons, backpropagation (details on pp 230-232 not required), radial basis function networks
Why we need multilayer networks
◮ Networks without hidden units are very limited in the input-output mappings they can represent
◮ More layers of linear units do not help: the overall mapping is still linear
◮ Fixed non-linearities φ(x) are problematic: which basis functions φ_j should we choose in
      f(x) = g( Σ_j w_j φ_j(x) ) ?
◮ We get more power from multiple layers of adaptive non-linear hidden units (see the sketch below)
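To make the contrast concrete, here is a small sketch (in numpy, with an invented Gaussian basis and layer sizes chosen purely for illustration) of a fixed-basis model versus a model whose hidden-unit features are themselves adjustable parameters:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)    # 50 one-dimensional inputs

# Fixed-basis model: the phi_j are Gaussian bumps at preset centres; only w is learned
centres = np.linspace(-3, 3, 10)
phi = np.exp(-0.5 * (x - centres) ** 2)      # shape (50, 10), fixed features
w = rng.normal(size=10)
f_fixed = phi @ w                            # f(x) = sum_j w_j * phi_j(x)

# Adaptive model: the hidden-unit weights V, b are also parameters to be learned
V = rng.normal(size=(1, 10)); b = rng.normal(size=10)
hidden = 1.0 / (1.0 + np.exp(-(x @ V + b)))  # logistic hidden units, learned features
f_adaptive = hidden @ w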
Artificial Neural Networks (ANNs)
◮ The field of neural networks grew out of simple models of neurons
◮ Research was done into what networks of these neurons could achieve
◮ Neural networks proved to be a reasonable modelling tool
◮ Which is funny really, as they never were very good models of neurons... or of neural networks
◮ But when understood in terms of learning from data, they proved to be powerful
An example network with 2 hidden layers
[Figure: a feedforward network with an input layer (x), hidden layer 1, hidden layer 2, and an output layer]
◮ There can be an arbitrary number of hidden layers
◮ Each unit in the first hidden layer computes a non-linear function of the input x
◮ Each unit in a higher hidden layer computes a non-linear function of the outputs of the layer below
◮ Common choices for the hidden-layer non-linearities are the logistic function g(z) = 1/(1 + e^(−z)) or the Gaussian function
◮ Logistic nonlinearity → multilayer perceptron (MLP)
◮ Gaussian nonlinearity → radial basis function (RBF) network, normally with only 1 hidden layer
◮ Output units compute a linear combination of the outputs of the final hidden layer and pass it through a transfer function g()
◮ g is the identity function for a regression task (cf linear regression)
◮ g is the logistic function for a two-class classification task (cf logistic regression); a forward pass for such a network is sketched below
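A minimal sketch of the forward computation described above, assuming one logistic hidden layer and an identity output for a regression task; the layer sizes and helper names are illustrative, not taken from the slides:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, task="regression"):
    """Forward pass of a one-hidden-layer MLP."""
    h = logistic(x @ W1 + b1)          # hidden units: non-linear function of the input
    a = h @ W2 + b2                    # output units: linear combination of hidden outputs
    if task == "regression":
        return a                       # g = identity
    return logistic(a)                 # g = logistic for two-class classification

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))            # 5 examples, 3 input features
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
y_hat = forward(x, W1, b1, W2, b2)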
Representation Power of ANNs
◮ Boolean functions:
    ◮ Every Boolean function can be represented by a network with a single hidden layer
    ◮ but it might require a number of hidden units that is exponential in the number of inputs (the XOR sketch below gives a small example)
◮ Continuous functions:
    ◮ Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
    ◮ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
◮ Neural networks are universal approximators
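For instance, XOR, a Boolean function that no network without hidden units can represent, can be computed with a single hidden layer. The hand-picked step-unit weights below are just one illustrative choice:

import numpy as np

def step(z):
    return (z > 0).astype(int)

# Hidden unit 1 computes OR, hidden unit 2 computes AND; the output computes OR AND NOT(AND) = XOR
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])     # output activation = OR - 2*AND
b2 = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
h = step(X @ W1 + b1)
y = step(h @ W2 + b2)
print(y)                       # [0 1 1 0] = XOR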
ANN predicting 1 of 10 vowel sounds based on the formants F1 and F2
[Figure from Mitchell (1997)]
Limitations of Representation Power Results
◮ The fact that a function is representable does not tell us how many hidden units would be required for its approximation
◮ Nor does it tell us whether the function is learnable (a search problem)
◮ Nor does it say anything about how much training data would be needed to learn the function
◮ In fact universal approximation is of only limited benefit: we still need an inductive bias
Training ANNs
◮ As in linear and logistic regression, we create an error function that measures the agreement between the target y(x) and the prediction f(x)
◮ Linear regression, squared error: E = Σ_{i=1}^n (y_i − f(x_i))²
◮ Logistic regression (0/1 labels): E = Σ_{i=1}^n [ y_i log f(x_i) + (1 − y_i) log(1 − f(x_i)) ]
◮ These are both related to the log likelihood of the data under the relevant model (both are computed in the sketch below)
◮ For linear and logistic regression the optimization problem for w had a unique optimum; this is no longer the case for ANNs (e.g. hidden-layer neurons can be permuted)
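A small sketch computing both error functions for given targets and predictions (the toy numbers are invented for illustration):

import numpy as np

def squared_error(y, f):
    """Sum-of-squares error for regression."""
    return np.sum((y - f) ** 2)

def log_likelihood(y, f, eps=1e-12):
    """Log likelihood for 0/1 labels y with predicted probabilities f."""
    f = np.clip(f, eps, 1 - eps)                 # avoid log(0)
    return np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

y_reg, f_reg = np.array([1.2, 0.7]), np.array([1.0, 0.9])
y_cls, f_cls = np.array([1, 0, 1]), np.array([0.8, 0.3, 0.6])
print(squared_error(y_reg, f_reg), log_likelihood(y_cls, f_cls))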
Backpropagation
◮ As discussed for logistic regression, we need the gradient of E wrt all the parameters w, i.e. g(w) = ∂E/∂w
◮ This is in fact an exercise in using the chain rule to compute derivatives; for ANNs this is given the name backpropagation
◮ We make use of the layered structure of the net to compute the derivatives, heading backwards from the output layer to the inputs (a minimal sketch is given below)
◮ Once you have g(w), you can use your favourite optimization routine to minimize E; see the discussion of gradient descent and other methods in the Logistic Regression slides
◮ It can make sense to use a regularization penalty (e.g. λ‖w‖²) to help control overfitting
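A minimal sketch of backpropagation for a one-hidden-layer regression network with squared error, applying the chain rule from the output layer back towards the inputs; the architecture and variable names are assumptions for illustration:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of E = sum((f(x) - y)^2) for a one-hidden-layer MLP."""
    # Forward pass
    h = logistic(x @ W1 + b1)            # hidden activations
    f = h @ W2 + b2                      # linear output (regression)
    # Backward pass: chain rule, heading from the output layer back to the inputs
    d_f = 2 * (f - y)                    # dE/df
    dW2, db2 = h.T @ d_f, d_f.sum(0)     # gradients for the output-layer weights
    d_h = (d_f @ W2.T) * h * (1 - h)     # through the logistic: g'(z) = g(z)(1 - g(z))
    dW1, db1 = x.T @ d_h, d_h.sum(0)     # gradients for the hidden-layer weights
    return dW1, db1, dW2, db2

rng = np.random.default_rng(2)
x, y = rng.normal(size=(8, 3)), rng.normal(size=(8, 1))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
grads = backprop(x, y, W1, b1, W2, b2)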
Batch vs online
◮ Batch learning: use all patterns in the training set, and update the weights after calculating ∂E/∂θ = Σ_i ∂E_i/∂θ
◮ On-line learning: adapt the weights after each pattern presentation, using ∂E_i/∂θ (both update schemes are sketched below)
◮ Batch: more powerful optimization methods can be used
◮ Batch: easier to analyze
◮ On-line: more feasible for huge or continually growing datasets
◮ On-line: may have the ability to jump over local optima
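A sketch contrasting the two update schemes; for brevity the gradient here is that of a linear regression squared error, standing in for whatever ∂E/∂θ backpropagation would return:

import numpy as np

def grad(theta, X, y):
    """Gradient of the squared error E = sum((y - X @ theta)^2) on the given examples."""
    return -2 * X.T @ (y - X @ theta)

rng = np.random.default_rng(3)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
theta, eta = np.zeros(5), 0.01

# Batch learning: one update per pass, using the gradient summed over all patterns
theta_batch = theta - eta * grad(theta, X, y)

# On-line learning: adapt the weights after each pattern presentation
theta_online = theta.copy()
for i in range(len(X)):
    theta_online -= eta * grad(theta_online, X[i:i+1], y[i:i+1])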
Convergence of Backpropagation
◮ Dealing with local minima: train multiple nets from different starting points, and then choose the best (or combine them in some way)
◮ Initialize weights near zero; therefore, initial networks are near-linear
◮ Increasingly non-linear functions become possible as training progresses
Training ANNs: Summary
◮ Optimize over the vector of all weights/biases in the network
◮ All methods considered find local optima
◮ Gradient descent is simple but slow
◮ In practice, second-order methods (conjugate gradients) are used for batch learning
◮ Overfitting can be a problem
Fitting this into the general structure for learning algorithms:
◮ Define the task: classification or regression, discriminative
◮ Decide on the model structure: ANN
◮ Decide on the score function: log likelihood
◮ Decide on the optimization/search method to optimize the score function: numerical optimization routine
Hypothesis space and Inductive Bias for ANNs
◮ Hypothesis space: if there are |w| weights and biases, H = { w | w ∈ ℝ^|w| }
◮ Inductive Bias: hard to characterize; it depends on the search procedure, on regularization, and on how weight space spans the space of representable functions
◮ Approximate statement: smooth interpolation between data points
Learning Hidden Layer Representations
◮ Backprop can develop intermediate representations of its inputs in the hidden layers
◮ These new features will capture the properties of the input instances that are most relevant to learning the target function
◮ This ability to automatically discover useful hidden-layer representations is a key feature of ANN learning
Example 1: Neural Net Language Models
Y. Bengio et al., JMLR 3, 1137-1155 (2003)
◮ Predict word w_t given the preceding words w_{t−1}, w_{t−2}, etc.
◮ A simple way is to estimate the trigram model
      p(w_t = c | w_{t−1} = b, w_{t−2} = a) = count(abc) / Σ_{c′} count(abc′)
  (a count-based sketch is given below)
◮ Can't use a bigger context due to sparse-data problems
◮ But this method uses no sharing across related words; we want a feature-based representation, so that e.g. cat and dog may share some features
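A toy sketch of the count-based trigram estimate above; the tiny corpus is invented purely for illustration:

from collections import Counter

corpus = "the cat sat on the mat the dog sat on the mat".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_trigram(c, b, a):
    """p(w_t = c | w_{t-1} = b, w_{t-2} = a) = count(abc) / sum_c' count(abc')."""
    context_total = sum(n for (x, y, _), n in trigrams.items() if (x, y) == (a, b))
    return trigrams[(a, b, c)] / context_total if context_total else 0.0

print(p_trigram("on", "sat", "cat"))   # count("cat sat on") / count("cat sat *") = 1.0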
[Figure: architecture of the neural network language model. Figure credit: Bengio et al., 2003]
◮ Learned distributed encoding of each context word
◮ These are transformed by a hidden layer, followed by
◮ a softmax distribution over all possible words
◮ Predictive performance is measured by perplexity (the geometric average of 1/p(w_t | context); see the sketch below)
◮ The neural network is about 24% better on the Brown corpus and 8% better on the AP corpus than the best n-gram results
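A short sketch of the perplexity measure, i.e. the geometric average of 1/p(w_t | context) over the test words (the probabilities are made-up numbers):

import numpy as np

def perplexity(probs):
    """Geometric mean of 1/p(w_t | context) over the test words."""
    probs = np.asarray(probs, dtype=float)
    return np.exp(-np.mean(np.log(probs)))

print(perplexity([0.1, 0.2, 0.05]))   # lower is better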
Example 2: Le Net
e.g. LeCun and Bengio, 1995
◮ Task is to recognize handwritten digits
◮ "Le Net" is a multilayer backprop net which has many hidden layers
◮ Alternation of convolutional feature layers and subsampling layers
◮ Final output is a softmax over the 10 classes
[Figure credit: LeCun et al., 1995]
◮ The convolutional approach allows the net to identify certain features even if they have been shifted in the image
◮ Subsampling affords a small amount of translational invariance at each stage (a minimal sketch of both operations is given below)
◮ Convolutional nets give the best performance on the MNIST dataset (best is now 0.39% error)
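A minimal sketch, in plain numpy, of the convolution-then-subsample idea: a kernel slid over the image detects a feature wherever it occurs, and average subsampling then blurs out small shifts. The kernel and image are invented; the real Le Net uses several stages of learned kernels:

import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D correlation: slide the kernel over the image, no padding."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def subsample(fmap, factor=2):
    """Average-pool by 'factor': gives a small amount of translational invariance."""
    H, W = fmap.shape
    fmap = fmap[:H - H % factor, :W - W % factor]
    return fmap.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(4)
img = rng.normal(size=(8, 8))
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])   # a simple vertical-edge detector
features = subsample(conv2d_valid(img, kernel))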
Recurrent Neural Networks
Connectivity does not have to be feedforward; there can be directed cycles. This can give rise to richer behaviour:
◮ The network can oscillate: good for motor control?
◮ It can converge to a point attractor: good for classification?
◮ It can behave chaotically: but this is usually a bad idea for information processing
◮ It can use its activities as hidden state, to remember things for a long time
[Figure: a recurrent network with units V1, V2 and weights w11, w12, w21, w22, shown unrolled over successive time steps V, V′, V′′]
◮ Recurrent networks can also be trained using backpropagation (a sketch of the recurrent state update is given below)
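A minimal sketch of how a recurrent network carries its activities forward as hidden state over a sequence; the sizes and logistic non-linearity are assumptions, since the slides do not specify an architecture:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_in, W_rec, b):
    """Run a simple recurrent net over a sequence, carrying hidden state h."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in xs:                                   # the directed cycle, unrolled over time
        h = logistic(W_in @ x + W_rec @ h + b)     # new state depends on input and old state
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(5)
xs = rng.normal(size=(6, 3))                       # sequence of 6 inputs, 3 features each
W_in, W_rec, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
hidden_states = rnn_forward(xs, W_in, W_rec, b)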
ANNs: Summary
◮ Artificial neural networks are a powerful nonlinear modelling tool for classification and regression
◮ Trained by optimization methods making use of the backpropagation algorithm to compute derivatives
◮ Local optima in the optimization are present, cf linear and logistic regression (and kernelized versions thereof, e.g. SVM)
◮ Ability to automatically discover useful hidden-layer representations