Neural Networks: Computation + Gradient Descent
LING572: Advanced Statistical Methods in NLP
February 27, 2020
Today's Outline
● Computation: the forward pass
  ● Functional form / matrix notation
  ● Parameters and hyperparameters
● Gradient descent
  ● Intro
  ● Stochastic gradient descent + mini-batches
Notation
● I will generally use plain variables (e.g. x, y, W) for vectors and matrices as well as scalars, relying on context
● ŷ: a "guess" at y; e.g. a model's output
● f(x), when x is a vector/matrix, means that f is applied element-wise
● θ: all parameters
● ŷ = f(x; θ) = f_θ(x): ŷ is a (parameterized) function of x with parameters θ
Feed-forward networks, aka multi-layer perceptrons (MLPs)
XOR Network

$a_{\text{and}} = \sigma(w_{\text{and,or}}\, a_{\text{or}} + w_{\text{and,nand}}\, a_{\text{nand}} + b_{\text{and}}) = \sigma\!\left(\begin{bmatrix} w_{\text{and,or}} & w_{\text{and,nand}} \end{bmatrix} \begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} + b_{\text{and}}\right)$
XOR Network

$a_{\text{and}} = \sigma(w_{\text{and,or}}\, a_{\text{or}} + w_{\text{and,nand}}\, a_{\text{nand}} + b_{\text{and}}) = \sigma\!\left(\begin{bmatrix} w_{\text{and,or}} & w_{\text{and,nand}} \end{bmatrix} \begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} + b_{\text{and}}\right)$

$a_{\text{or}} = \sigma(w_{\text{or,p}}\, a_p + w_{\text{or,q}}\, a_q + b_{\text{or}})$

$a_{\text{nand}} = \sigma(w_{\text{nand,p}}\, a_p + w_{\text{nand,q}}\, a_q + b_{\text{nand}})$
XOR Network

$a_{\text{and}} = \sigma(w_{\text{and,or}}\, a_{\text{or}} + w_{\text{and,nand}}\, a_{\text{nand}} + b_{\text{and}}) = \sigma\!\left(\begin{bmatrix} w_{\text{and,or}} & w_{\text{and,nand}} \end{bmatrix} \begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} + b_{\text{and}}\right)$

$\begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} = \sigma\!\left(\begin{bmatrix} w_{\text{or,p}} & w_{\text{or,q}} \\ w_{\text{nand,p}} & w_{\text{nand,q}} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{\text{or}} \\ b_{\text{nand}} \end{bmatrix}\right)$
XOR Network

$a_{\text{and}} = \sigma(w_{\text{and,or}}\, a_{\text{or}} + w_{\text{and,nand}}\, a_{\text{nand}} + b_{\text{and}}) = \sigma\!\left(\begin{bmatrix} w_{\text{and,or}} & w_{\text{and,nand}} \end{bmatrix} \begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} + b_{\text{and}}\right)$

$a_{\text{and}} = \sigma\!\left(\begin{bmatrix} w_{\text{and,or}} & w_{\text{and,nand}} \end{bmatrix}\, \sigma\!\left(\begin{bmatrix} w_{\text{or,p}} & w_{\text{or,q}} \\ w_{\text{nand,p}} & w_{\text{nand,q}} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{\text{or}} \\ b_{\text{nand}} \end{bmatrix}\right) + b_{\text{and}}\right)$
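As a concrete sketch of this composed form, here is the XOR network in numpy with hand-picked (not learned) weights; the specific values are illustrative assumptions, chosen so that the sigmoid units approximate OR, NAND, and AND gates.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden layer: rows are the OR and NAND units, columns are the inputs p, q.
W1 = np.array([[20.0, 20.0],      # w_or,p    w_or,q
               [-20.0, -20.0]])   # w_nand,p  w_nand,q
b1 = np.array([-10.0, 30.0])      # b_or, b_nand

# Output layer: AND over the two hidden activations.
W2 = np.array([[20.0, 20.0]])     # w_and,or  w_and,nand
b2 = np.array([-30.0])            # b_and

def xor(p, q):
    a_hidden = sigmoid(W1 @ np.array([p, q]) + b1)   # [a_or, a_nand]
    a_and = sigmoid(W2 @ a_hidden + b2)
    return a_and.item()

for p, q in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p, q, round(xor(p, q)))   # 0, 1, 1, 0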
Generalizing

$a_{\text{and}} = \sigma\!\left(\begin{bmatrix} w_{\text{and,or}} & w_{\text{and,nand}} \end{bmatrix}\, \sigma\!\left(\begin{bmatrix} w_{\text{or,p}} & w_{\text{or,q}} \\ w_{\text{nand,p}} & w_{\text{nand,q}} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{\text{or}} \\ b_{\text{nand}} \end{bmatrix}\right) + b_{\text{and}}\right)$

$\hat{y} = f_2(W_2 f_1(W_1 x + b_1) + b_2)$

$\hat{y} = f_n(W_n f_{n-1}(\cdots f_2(W_2 f_1(W_1 x + b_1) + b_2) \cdots) + b_n)$
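A minimal numpy sketch of this general form, assuming the layers are supplied as (W, b, f) triples; the names and example sizes below are illustrative, not from the slides.

import numpy as np

def forward(x, layers):
    """layers: list of (W, b, activation) triples for layers 1..n."""
    a = x
    for W, b, f in layers:
        a = f(W @ a + b)   # affine transformation, then element-wise non-linearity
    return a

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), rng.normal(size=3), np.tanh),
          (rng.normal(size=(1, 3)), rng.normal(size=1), sigmoid)]
print(forward(np.array([0.5, -1.0]), layers))   # a length-1 output vector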
Some terminology
● Our XOR network is a feed-forward neural network with one hidden layer
  ● Aka a multi-layer perceptron (MLP)
● Input nodes: 2; output nodes: 1
● Activation function: sigmoid
General MLP

[figure: a fully-connected network; source]

$w^1_{ij}$: weight to neuron $i$ in layer 1 from neuron $j$ in layer 0; these weights form the matrix $W^1$.
General MLP

$\hat{y} = f_n(W_n f_{n-1}(\cdots f_2(W_2 f_1(W_1 x + b_1) + b_2) \cdots) + b_n)$

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n_0} \end{bmatrix}$, shape $(n_0, 1)$; $\quad W^1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_0} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_1 0} & w^1_{n_1 1} & \cdots & w^1_{n_1 n_0} \end{bmatrix}$, shape $(n_1, n_0)$; $\quad b^1 = \begin{bmatrix} b^1_0 \\ b^1_1 \\ \vdots \\ b^1_{n_1} \end{bmatrix}$, shape $(n_1, 1)$

$n_0$: number of neurons in layer 0 (input); $n_1$: number of neurons in layer 1
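A quick shape check of the first layer, as a sketch with illustrative sizes:

import numpy as np

n0, n1 = 4, 3                                 # illustrative layer sizes
x = np.random.randn(n0, 1)                    # shape (n0, 1)
W1 = np.random.randn(n1, n0)                  # shape (n1, n0)
b1 = np.random.randn(n1, 1)                   # shape (n1, 1)
a1 = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))     # σ(W1 x + b1)
print(a1.shape)                               # (3, 1)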
Parameters of an MLP
● Weights and biases
● For each layer $l$: $n_l (n_{l-1} + 1)$ parameters, i.e. $n_l \times n_{l-1}$ weights and $n_l$ biases
● With $n$ hidden layers (considering the output as a hidden layer), the total is
  $\sum_{i=1}^{n} n_i (n_{i-1} + 1)$
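A small sanity check of the parameter-count formula; the helper name and layer sizes are illustrative.

def num_params(sizes):
    # sizes = [n_0, n_1, ..., n_n]: input size first, then each layer's size
    return sum(sizes[i] * (sizes[i - 1] + 1) for i in range(1, len(sizes)))

print(num_params([2, 2, 1]))   # XOR network: 2*(2+1) + 1*(2+1) = 9 parameters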
Hyper-parameters of an MLP
● Input size, output size
  ● Usually fixed by your problem / dataset
  ● Input: image size, vocab size; number of "raw" features in general
  ● Output: 1 for binary classification or simple regression, number of labels for classification, …
● Number of hidden layers
● For each hidden layer:
  ● Size
  ● Activation function
● Others: initialization, regularization (and associated values), learning rate / training, …
The Deep in Deep Learning
● The Universal Approximation Theorem says that one hidden layer suffices for arbitrarily-closely approximating a given function
  ● Empirical drawbacks: such a layer may need super-exponentially many neurons, and the right weights are hard to discover
● "Deep and narrow" >> "shallow and wide"
  ● In principle allows hierarchical features to be learned
  ● More well-behaved w/r/t optimization
[figure; source]
Activation Functions
● Note: non-linear activation functions are essential
● MLP: a linear transformation, followed by a point-wise non-linearity, repeated several times over
● Without the non-linearity, we would just have several linear transformations
  ● The composition of linear transformations is also linear!

$\hat{y} = f_n(W_n f_{n-1}(\cdots f_2(W_2 f_1(W_1 x + b_1) + b_2) \cdots) + b_n)$
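A small numerical illustration (not from the slides) that stacking two linear layers without a non-linearity collapses to a single linear layer: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2            # two "layers", no non-linearity
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)      # one equivalent linear layer
print(np.allclose(two_layers, one_layer))        # True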
Activation Functions: Hidden Layer

sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}} = \dfrac{e^x}{e^x + 1}$

tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$

● Problem: the derivative "saturates" (is nearly 0) everywhere except near the origin
● Use ReLU by default
● Generalizations: Leaky ReLU, ELU, Softplus, …
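The hidden-layer activations written out in numpy, as a sketch; the leaky-ReLU slope below is one common choice, not a fixed standard.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # σ(x) = 1 / (1 + e^{-x})

def tanh(x):
    return np.tanh(x)                       # equals 2σ(2x) - 1

def relu(x):
    return np.maximum(0.0, x)               # the usual default

def leaky_relu(x, alpha=0.01):              # one common generalization
    return np.where(x > 0, x, alpha * x)

print(relu(np.array([-2.0, 0.5])))          # [0.  0.5]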
Activation Functions: Output Layer
● Depends on the task!
● Regression (continuous output(s)): none!
  ● Just use the final linear transformation
● Binary classification: sigmoid
  ● Also for multi-label classification
● Multi-class classification: softmax
  $\text{softmax}(x)_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$
● Terminology: the inputs to a softmax are called logits
  ● [there are sometimes other uses of the term, so beware]
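A softmax sketch in numpy; subtracting the maximum logit is a standard numerical-stability trick and does not change the result, since softmax is shift-invariant.

import numpy as np

def softmax(x):
    z = x - np.max(x)        # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities; sums to 1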
Learning: (Stochastic) Gradient Descent
Gradient Descent: Basic Idea
● Treat NN training as an optimization problem
● $\ell(\hat{y}, y)$: loss function ("objective function")
  ● How "close" is the model's output to the true output?
  ● The local loss, averaged over training instances:
    $\mathcal{L}(\hat{Y}, Y) = \dfrac{1}{|Y|} \sum_i \ell(\hat{y}(x_i), y_i)$
  ● More later: depends on the particular task, among other things
● View the loss as a function of the model's parameters
● The gradient of the loss w/r/t the parameters tells which direction in parameter space to "walk" to make the loss smaller (i.e. to improve model outputs)
● Guaranteed to work in the linear case; can get stuck in local minima for NNs
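The loss-as-average pattern, sketched with squared error as the per-example loss ℓ; as noted above, the choice of ℓ is task-dependent.

import numpy as np

def mean_loss(y_hat, y):
    per_example = (y_hat - y) ** 2    # ℓ(ŷ_i, y_i), here squared error
    return per_example.mean()         # (1/|Y|) Σ_i ℓ(ŷ(x_i), y_i)

print(mean_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))   # 0.025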
Gradient Descent: Basic Idea

[figure: gradient descent illustration; source]
Derivatives
● The derivative of a function of one real variable measures how much the output changes with respect to a change in the input variable

$f(x) = x^2 + 35x + 12 \qquad \dfrac{df}{dx} = 2x + 35$

$f(x) = e^x \qquad \dfrac{df}{dx} = e^x$
Partial Derivatives
● A partial derivative of a function of several variables measures its derivative with respect to one of those variables, with the others held constant.

$f(x, y) = 10x^3y^2 + 5xy^3 + 4x + y$

$\dfrac{\partial f}{\partial x} = 30x^2y^2 + 5y^3 + 4 \qquad \dfrac{\partial f}{\partial y} = 20x^3y + 15xy^2 + 1$
Gradient
● The gradient of a function $f(x_1, x_2, \ldots, x_n)$ is a vector-valued function, returning all of the partial derivatives:
  $\nabla f = \left\langle \dfrac{\partial f}{\partial x_1}, \dfrac{\partial f}{\partial x_2}, \ldots, \dfrac{\partial f}{\partial x_n} \right\rangle$
● Example: $f(x, y) = 4x^2 + y^2$, so $\nabla f = \langle 8x, 2y \rangle$
● The gradient is perpendicular to the level curve at a point
● The gradient points in the direction of greatest rate of increase of $f$
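A finite-difference check (not from the slides) that ∇f = ⟨8x, 2y⟩ for f(x, y) = 4x² + y²; the step size h is an illustrative choice.

def f(x, y):
    return 4 * x**2 + y**2

def numeric_grad(f, x, y, h=1e-6):
    # central differences approximate each partial derivative
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

print(numeric_grad(f, 1.0, 1.0))   # ≈ (8.0, 2.0), matching ⟨8x, 2y⟩ at (1, 1)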
Gradient and Level Curves

[figure: level curves $f(x, y) = c$ for $f(x, y) = 4x^2 + y^2$, with gradient vectors $\nabla f = \langle 8x, 2y \rangle$ drawn at the points $(0, 5)$, $(1, 1)$, and $(1.25, 0)$]

Q: what are the actual gradients at those points?
Gradient Descent and Level Curves

[figure: gradient descent steps shown against level curves; source]
Gradient Descent Algorithm
● Initialize $\theta_0$
● Repeat until convergence:
  $\theta_{n+1} = \theta_n - \alpha \nabla\mathcal{L}(\hat{Y}(\theta_n), Y)$
● $\alpha$: the learning rate
  ● High learning rate: big steps; may bounce around and "overshoot" the target
  ● Low learning rate: small steps, smoother minimization of the loss, but can be slow
Gradient Descent: Minimal Example
● Task: predict a target/true value $y = 2$
● "Model": $\hat{y}(\theta) = \theta$
  ● A single parameter: the actual guess
● Loss: squared (Euclidean) distance
  $\mathcal{L}(\hat{y}(\theta), y) = (\hat{y} - y)^2 = (\theta - y)^2$
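The minimal example in code: ℒ(θ) = (θ − y)² with y = 2, so dℒ/dθ = 2(θ − y). The learning rate and starting point below are illustrative choices.

y = 2.0
theta = 0.0      # initial guess θ_0
alpha = 0.1      # learning rate

for step in range(50):
    grad = 2 * (theta - y)           # ∇ℒ(θ) for the squared-error loss
    theta = theta - alpha * grad     # θ_{n+1} = θ_n - α ∇ℒ
print(theta)                          # ≈ 2.0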
Gradient Descent: Minimal Example

[figure: gradient descent steps on the minimal example]
Stochastic Gradient Descent
● The above is called "batch" gradient descent
  ● Updates once per pass through the dataset
  ● Expensive and slow; does not scale well
● Stochastic gradient descent:
  ● Break the data into "mini-batches": small chunks of the data
  ● Compute gradients and update parameters for each batch
  ● Mini-batch of size 1 = a single example
  ● A noisy estimate of the true gradient, but works well in practice; more parameter updates
● Epoch: one pass through the whole training data
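One way to form mini-batches, as a sketch: shuffle, then slice into fixed-size chunks. The function name mirrors the pseudocode on the next slide, but the implementation itself is an illustrative assumption.

import random

def make_batches(data, batch_size):
    data = list(data)
    random.shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

batches = make_batches(range(10), batch_size=4)
print([len(b) for b in batches])   # e.g. [4, 4, 2]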
Stochastic Gradient Descent

initialize parameters / build model
for each epoch:
    data = shuffle(data)
    batches = make_batches(data)
    for each batch in batches:
        outputs = model(batch)
        loss = loss_fn(outputs, true_outputs)
        compute gradients   // e.g. loss.backward()
        update parameters
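A hedged PyTorch version of this loop; the model, data, batch size, and hyperparameters are placeholders for illustration, not the course's reference implementation.

import torch
import torch.nn as nn

# Placeholder model and data: a tiny MLP on the XOR inputs/targets.
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

for epoch in range(1000):
    perm = torch.randperm(len(X))          # shuffle(data)
    for i in range(0, len(X), 2):          # mini-batches of size 2
        idx = perm[i:i + 2]
        outputs = model(X[idx])
        loss = loss_fn(outputs, Y[idx])
        optimizer.zero_grad()              # clear old gradients
        loss.backward()                    # compute gradients
        optimizer.step()                   # update parameters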
Computing with Mini-batches
● Bad idea:

for each batch in batches:
    for each datum in batch:
        outputs = model(datum)
        loss = loss_fn(outputs, true_outputs)
        compute gradients   // e.g. loss.backward()
        update parameters
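The better alternative, sketched in numpy: stack the batch into one matrix and push it through the layer with a single matrix multiplication instead of looping over examples one at a time. Sizes are illustrative.

import numpy as np

n0, n1, batch_size = 4, 3, 32           # illustrative sizes
W1 = np.random.randn(n1, n0)
b1 = np.random.randn(n1, 1)

X = np.random.randn(n0, batch_size)     # each column is one input vector
A1 = 1.0 / (1.0 + np.exp(-(W1 @ X + b1)))   # b1 broadcasts over the batch
print(A1.shape)                         # (3, 32): one activation column per example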
Computing with a Single Input

$\hat{y} = f_n(W_n f_{n-1}(\cdots f_2(W_2 f_1(W_1 x + b_1) + b_2) \cdots) + b_n)$

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n_0} \end{bmatrix}$, shape $(n_0, 1)$; $\quad W^1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_0} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_1 0} & w^1_{n_1 1} & \cdots & w^1_{n_1 n_0} \end{bmatrix}$, shape $(n_1, n_0)$; $\quad b^1 = \begin{bmatrix} b^1_0 \\ b^1_1 \\ \vdots \\ b^1_{n_1} \end{bmatrix}$, shape $(n_1, 1)$

$n_0$: number of neurons in layer 0 (input); $n_1$: number of neurons in layer 1