Choosing cost function: Examples
• Regression problem – SSE: $E = \sum_{n=1}^{N} E^{(n)}$
  – One-dimensional output: $E^{(n)} = \frac{1}{2}\left(y^{(n)} - d^{(n)}\right)^2$
  – Multi-dimensional output: $E^{(n)} = \frac{1}{2}\sum_{k=1}^{K}\left(y_k^{(n)} - d_k^{(n)}\right)^2$
• Classification problem – Cross-entropy
  – Binary classification: $Loss^{(n)} = -d^{(n)}\log y^{(n)} - (1-d^{(n)})\log(1-y^{(n)})$; output layer uses the sigmoid activation function 31
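A minimal NumPy sketch of these two losses (variable names `y`, `d` follow the slide's notation; the small `eps` clip is an added safeguard against log(0), not something stated on the slide):

```python
import numpy as np

def sse_loss(y, d):
    """Sum-of-squared-errors loss: 1/2 * sum_k (y_k - d_k)^2 for one sample."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=float)
    return 0.5 * np.sum((y - d) ** 2)

def binary_cross_entropy(y, d, eps=1e-12):
    """Binary cross-entropy for a sigmoid output y in (0, 1) and target d in {0, 1}."""
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0); an added assumption
    return -(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

print(sse_loss([0.8, 0.1], [1.0, 0.0]))   # 0.025
print(binary_cross_entropy(0.8, 1.0))     # ~0.223
```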
Multi-class output: One-hot representations
• Consider a network that must distinguish if an input is a cat, a dog, a camel, a hat, or a flower
• For inputs of each of the five classes the desired output is:
  cat: $[1\ 0\ 0\ 0\ 0]^T$, dog: $[0\ 1\ 0\ 0\ 0]^T$, camel: $[0\ 0\ 1\ 0\ 0]^T$, hat: $[0\ 0\ 0\ 1\ 0]^T$, flower: $[0\ 0\ 0\ 0\ 1]^T$
• For an input of any class, the desired output is a five-dimensional vector with four zeros and a single 1 at the position of the class
• This is a one-hot vector 32
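A small illustration of building such one-hot targets (the class list and helper name are mine, for illustration only):

```python
import numpy as np

CLASSES = ["cat", "dog", "camel", "hat", "flower"]  # the order fixes the 1's position

def one_hot(label, classes=CLASSES):
    """Return the one-hot vector for a class label."""
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1.0
    return vec

print(one_hot("camel"))   # [0. 0. 1. 0. 0.]
```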
Multi-class networks
[Figure: network with an input layer, hidden layers, and an output layer]
• For a multi-class classifier with N classes, the one-hot representation will have N binary outputs
  – An N-dimensional binary vector
• The neural network's output too must ideally be binary (N−1 zeros and a single 1 in the right place)
• More realistically, it will be a probability vector
  – N probability values that sum to 1. 33
Vector activation example: Softmax
• Example: softmax vector activation (parameters are the weights and the bias)
[Figure: layer jointly computing softmax outputs $y_i$] 34
Vector Activations
[Figure: network with an input layer, hidden layers, and an output layer]
• We can also have neurons that have multiple coupled outputs: $[y_1, y_2, \ldots, y_L] = f(z_1, z_2, \ldots, z_L; W)$
  – The function $f(\cdot)$ operates on a set of inputs to produce a set of outputs
  – Modifying a single parameter in $W$ will affect all outputs 35
Multi-class classification: Output
[Figure: network with a softmax output layer]
• Softmax vector activation is often used at the output of multi-class classifier nets:
  $z_i = \sum_j w_{ji}\, y_j^{(L-1)}, \qquad y_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
• This can be viewed as the probability $y_i = P(\text{class} = i \mid \mathbf{x})$ 36
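A sketch of this softmax in NumPy; subtracting the maximum before exponentiating is an added numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(z):
    """Softmax: y_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()              # stability shift; does not change the result
    e = np.exp(z)
    return e / e.sum()

y = softmax([2.0, 1.0, 0.1])
print(y, y.sum())                # probabilities that sum to 1
```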
For multi-class classification
[Figure: network outputs $y_1, \ldots, y_K$ feeding a KL Div() block that produces the error $E$]
• Desired output $d$ is a one-hot vector $[0\ 0\ \ldots\ 1\ \ldots\ 0\ 0]$ with the 1 in the $c$-th position (for class $c$)
• Actual output will be a probability distribution $[y_1, y_2, \ldots, y_K]$
• The cross-entropy between the desired one-hot output and the actual output:
  $Div(Y, d) = -\sum_i d_i \log y_i = -\log y_c$
• Derivative:
  $\frac{\partial Div(Y, d)}{\partial y_i} = \begin{cases} -\frac{1}{y_c} & \text{for the } c\text{-th component} \\ 0 & \text{for the remaining components} \end{cases}$
  – The slope is negative w.r.t. $y_c$: increasing $y_c$ will reduce the divergence
  $\nabla_Y Div(Y, d) = \left[0\ \ 0\ \ \ldots\ \ -\tfrac{1}{y_c}\ \ \ldots\ \ 0\ \ 0\right]$ 37
For multi-class classification (continued)
• Same setup as the previous slide: $Div(Y, d) = -\log y_c$ and
  $\nabla_Y Div(Y, d) = \left[0\ \ 0\ \ \ldots\ \ -\tfrac{1}{y_c}\ \ \ldots\ \ 0\ \ 0\right]$
• The slope is negative w.r.t. $y_c$: increasing $y_c$ will reduce the divergence
• Note: even when $y = d$ the derivative is not 0 38
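A small NumPy check of this divergence and its gradient for a one-hot target (function names are mine):

```python
import numpy as np

def cross_entropy(y, d):
    """Div(Y, d) = -sum_i d_i log y_i, which equals -log y_c for a one-hot d."""
    return -np.sum(d * np.log(y))

def cross_entropy_grad(y, d):
    """dDiv/dy_i = -d_i / y_i: zero everywhere except -1/y_c at the hot position."""
    return -d / y

y = np.array([0.1, 0.7, 0.2])
d = np.array([0.0, 1.0, 0.0])          # the hot position is the second class
print(cross_entropy(y, d))              # -log 0.7 ~ 0.357
print(cross_entropy_grad(y, d))         # [0, -1/0.7 ~ -1.43, 0]
```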
Choosing cost function: Examples
• Regression problem – SSE: $E = \sum_{n=1}^{N} E^{(n)}$
  – One-dimensional output: $E^{(n)} = \frac{1}{2}\left(y^{(n)} - d^{(n)}\right)^2$
  – Multi-dimensional output: $E^{(n)} = \frac{1}{2}\sum_{k=1}^{K}\left(y_k^{(n)} - d_k^{(n)}\right)^2$
• Classification problem – Cross-entropy
  – Binary classification: $Loss^{(n)} = -d^{(n)}\log y^{(n)} - (1-d^{(n)})\log(1-y^{(n)})$; output layer uses the sigmoid $y = \frac{1}{1+e^{-z}}$
  – Multi-class classification: $Loss^{(n)} = -\log y_{c^{(n)}}$; output is found by a softmax layer, $y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ 40
Problem setup
• Given: the architecture of the network
• Training data: a set of input-output pairs $(\mathbf{x}^{(1)}, \mathbf{d}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{d}^{(2)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{d}^{(N)})$
• We need a loss function that shows how to penalize the obtained output $\mathbf{y} = f(\mathbf{x}; W)$ when the desired output is $\mathbf{d}$:
  $E(W) = \sum_{n=1}^{N} Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right) = \sum_{n=1}^{N} Loss\left(f(\mathbf{x}^{(n)}; W), \mathbf{d}^{(n)}\right)$
• Minimize $E$ w.r.t. $W$, which contains the weights $w_{i,j}^{[k]}$ and biases $b^{[k]}$ 41
How to adjust weights for multi-layer networks?
• We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
  – We need an efficient way of adapting all the weights, not just the last layer.
  – Learning the weights going into hidden units is equivalent to learning features.
  – This is difficult because nobody is telling us directly what the hidden units should do. 42
Find the weights by optimizing the cost โข Start from random weights and then adjust them iteratively to get lower cost. โข Update the weights according to the gradient of the cost function Source: http://3b1b.co 43
How does the network learn?
• Which changes to the weights improve the cost the most? $\nabla E$
• The magnitude of each element of $\nabla E$ shows how sensitive the cost is to that weight or bias.
Source: http://3b1b.co 44
Training multi-layer networks โข Back-propagation โ Training algorithm that is used to adjust weights in multi-layer networks (based on the training data) โ The back-propagation algorithm is based on gradient descent โ Use chain rule and dynamic programming to efficiently compute gradients 45
Training Neural Nets through Gradient Descent
Total training error: $E = \sum_{n=1}^{N} Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)$
• Gradient descent algorithm (assuming the bias is also represented as a weight, using the extended notation):
  – Initialize all weights and biases $w_{i,j}^{[k]}$
  – Do:
    • For every layer $k$, for all $i, j$, update: $w_{i,j}^{[k]} = w_{i,j}^{[k]} - \eta \frac{\partial E}{\partial w_{i,j}^{[k]}}$
  – Until $E$ has converged 46
The derivative
Total training error: $E = \sum_{n=1}^{N} Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)$
• Computing the derivative
Total derivative: $\frac{\partial E}{\partial w_{i,j}^{[k]}} = \sum_{n=1}^{N} \frac{\partial Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)}{\partial w_{i,j}^{[k]}}$ 47
Training by gradient descent
• Initialize all weights $w_{ij}^{[k]}$
• Do:
  – For all $i, j, k$, initialize $\frac{\partial E}{\partial w_{i,j}^{[k]}} = 0$
  – For all $n = 1 \ldots N$:
    • For every layer $k$, for all $i, j$:
      • Compute $\frac{\partial Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)}{\partial w_{i,j}^{[k]}}$
      • $\frac{\partial E}{\partial w_{i,j}^{[k]}} \mathrel{+}= \frac{\partial Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)}{\partial w_{i,j}^{[k]}}$
  – For every layer $k$, for all $i, j$: $w_{i,j}^{[k]} = w_{i,j}^{[k]} - \frac{\eta}{N} \frac{\partial E}{\partial w_{i,j}^{[k]}}$ 48
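A minimal sketch of this accumulate-then-update loop in NumPy for a single linear layer with squared-error loss (the model choice, data, and learning rate are illustrative assumptions; the deck's networks are deeper, but the structure of the loop is the same):

```python
import numpy as np

def batch_gradient_descent(X, D, eta=0.5, epochs=200):
    """Accumulate per-sample gradients over the whole set, then take one step."""
    N, d_in = X.shape
    _, d_out = D.shape
    W = np.random.randn(d_in, d_out) * 0.01
    for _ in range(epochs):
        grad = np.zeros_like(W)
        for n in range(N):                      # "for all n = 1..N"
            y = X[n] @ W                        # forward pass (linear model)
            grad += np.outer(X[n], y - D[n])    # d(1/2 ||y - d||^2)/dW for sample n
        W -= (eta / N) * grad                   # update with the averaged gradient
    return W

# Usage: learn W ~ [[2], [-1]] from noiseless data
X = np.random.randn(50, 2)
D = X @ np.array([[2.0], [-1.0]])
print(batch_gradient_descent(X, D))
```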
The derivative
Total training error: $E = \sum_{n=1}^{N} Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)$
Total derivative: $\frac{\partial E}{\partial w_{i,j}^{[k]}} = \sum_{n=1}^{N} \frac{\partial Loss\left(\mathbf{y}^{(n)}, \mathbf{d}^{(n)}\right)}{\partial w_{i,j}^{[k]}}$
• So we must first figure out how to compute the derivative of the divergences of individual training inputs 49
Calculus Refresher: Basic rules of calculus
For any differentiable function $y = f(x)$ with derivative $\frac{dy}{dx}$, the following must hold for sufficiently small $\Delta x$:
  $\Delta y \approx \frac{dy}{dx}\,\Delta x$
For any differentiable function $y = f(x_1, x_2, \ldots, x_M)$ with partial derivatives $\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_M}$, the following must hold for sufficiently small $\Delta x_1, \ldots, \Delta x_M$:
  $\Delta y \approx \frac{\partial y}{\partial x_1}\,\Delta x_1 + \frac{\partial y}{\partial x_2}\,\Delta x_2 + \cdots + \frac{\partial y}{\partial x_M}\,\Delta x_M$ 50
Calculus Refresher: Chain rule
For any nested function $y = f(g(x))$:
  $\frac{dy}{dx} = \frac{dy}{dg(x)} \cdot \frac{dg(x)}{dx}$
Check: we can confirm that $\Delta y \approx \frac{dy}{dg(x)}\,\Delta g(x) \approx \frac{dy}{dg(x)}\,\frac{dg(x)}{dx}\,\Delta x$ 51
Calculus Refresher: Distributed Chain rule
For $y = f\big(g_1(x), g_2(x), \ldots, g_M(x)\big)$:
  $\frac{dy}{dx} = \frac{\partial y}{\partial g_1(x)}\frac{dg_1(x)}{dx} + \frac{\partial y}{\partial g_2(x)}\frac{dg_2(x)}{dx} + \cdots + \frac{\partial y}{\partial g_M(x)}\frac{dg_M(x)}{dx}$
Check: $\Delta y \approx \sum_{i=1}^{M} \frac{\partial y}{\partial g_i(x)}\,\Delta g_i(x) \approx \sum_{i=1}^{M} \frac{\partial y}{\partial g_i(x)}\,\frac{dg_i(x)}{dx}\,\Delta x$ 52
Distributed Chain Rule: Influence Diagram
[Figure: $x$ feeds $g_1(x), g_2(x), \ldots, g_M(x)$, which all feed $y$]
• $x$ affects $y$ through each of $g_1(x), \ldots, g_M(x)$ 53
Distributed Chain Rule: Influence Diagram
[Figure: a perturbation of $x$ propagating through $g_1(x), \ldots, g_M(x)$ to $y$]
• Small perturbations in $x$ cause small perturbations in each of $g_1(x), \ldots, g_M(x)$, each of which individually additively perturbs $y$ 54
Simple chain rule
• $z = f(g(x))$, with $y = g(x)$
• $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$ 55
Multiple paths chain rule 56
Backpropagation: a simple example
[Figures, slides 57–62: a small computational graph evaluated forward and then differentiated backward node by node; the numeric values appeared only in the figures] 57–62
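The original figures are not recoverable from the text, so as a stand-in here is a tiny computational-graph example of the same flavor; the function $f(x, y, z) = (x + y)\,z$ and the input values are my own choice, not necessarily the ones used on the slides:

```python
# Forward pass for f(x, y, z) = (x + y) * z, then backward pass by the chain rule.
x, y, z = -2.0, 5.0, -4.0

# forward
q = x + y          # q = 3
f = q * z          # f = -12

# backward: df/df = 1, then walk the graph in reverse
df_dq = z          # d(q*z)/dq = z   -> -4
df_dz = q          # d(q*z)/dz = q   ->  3
df_dx = df_dq * 1  # dq/dx = 1       -> -4
df_dy = df_dq * 1  # dq/dy = 1       -> -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```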
How to propagate the gradients backward
• $z = f(x, y)$
[Figure: a node computing $z = f(x, y)$ inside a larger graph] 63
How to propagate the gradients backward 64
Returning to our problem
• How to compute $\frac{\partial Loss(\mathbf{y}, \mathbf{d})}{\partial w_{i,j}^{[k]}}$ 65
A first closer look at the network
[Figure: a tiny two-input network; each unit is drawn as a summation node followed by an activation $f(\cdot)$]
• Showing a tiny 2-input network for illustration
  – Actual network would have many more neurons and inputs
• Explicitly separating the weighted sum of inputs from the activation 66
A first closer look at the network
[Figure: the same tiny network with all weights $w_{i,j}^{(k)}$ and activations labeled]
• Showing a tiny 2-input network for illustration
  – Actual network would have many more neurons and inputs
• Expanded with all weights and activations shown
• The overall function is differentiable w.r.t. every weight, bias and input 67
Backpropagation: Notation
• $\mathbf{a}^{[0]} \equiv$ Input
• Output $\equiv \mathbf{a}^{[L]}$
[Figure: layer $k$ computes $\mathbf{z}^{[k]}$ from $\mathbf{a}^{[k-1]}$ and applies the activation, $\mathbf{a}^{[k]} = f(\mathbf{z}^{[k]})$] 68
Backpropagation: Last layer gradient
For squared error loss: $Loss = \frac{1}{2}\sum_j \left(a_j^{[L]} - d_j\right)^2$, so $\frac{\partial Loss}{\partial a_j^{[L]}} = a_j^{[L]} - d_j$
Output unit $j$: $a_j^{[L]} = f\big(z_j^{[L]}\big)$, with $z_j^{[L]} = \sum_i w_{ij}^{[L]}\, a_i^{[L-1]}$
$\frac{\partial Loss}{\partial w_{ij}^{[L]}} = \frac{\partial Loss}{\partial a_j^{[L]}} \cdot \frac{\partial a_j^{[L]}}{\partial w_{ij}^{[L]}}$, where $\frac{\partial a_j^{[L]}}{\partial w_{ij}^{[L]}} = f'\big(z_j^{[L]}\big)\,\frac{\partial z_j^{[L]}}{\partial w_{ij}^{[L]}} = f'\big(z_j^{[L]}\big)\, a_i^{[L-1]}$
$\Rightarrow \frac{\partial Loss}{\partial w_{ij}^{[L]}} = \frac{\partial Loss}{\partial a_j^{[L]}} \cdot f'\big(z_j^{[L]}\big)\cdot a_i^{[L-1]}$ 69
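A sketch of this last-layer gradient in NumPy, assuming a sigmoid activation at the output (the activation choice is mine, for concreteness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def last_layer_gradient(a_prev, W, d):
    """Gradient of 1/2 ||a_L - d||^2 w.r.t. the last layer's weights W (shape: in x out)."""
    z = a_prev @ W                      # z_j = sum_i w_ij * a_i^{[L-1]}
    a = sigmoid(z)                      # a_j^{[L]} = f(z_j)
    dLoss_da = a - d                    # dLoss/da_j
    f_prime = a * (1.0 - a)             # sigmoid'(z) = a (1 - a)
    # dLoss/dw_ij = dLoss/da_j * f'(z_j) * a_i^{[L-1]}
    return np.outer(a_prev, dLoss_da * f_prime)

a_prev = np.array([0.5, -1.0])
W = np.array([[0.1, 0.2], [0.3, -0.4]])
d = np.array([1.0, 0.0])
print(last_layer_gradient(a_prev, W, d))
```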
Activations and their derivatives
[Figure: plots of some popular activation functions and their derivatives]
• Some popular activation functions and their derivatives 70
Previous layers gradients
$a_j^{[k]} = f\big(z_j^{[k]}\big)$, with $z_j^{[k]} = \sum_i w_{ij}^{[k]}\, a_i^{[k-1]}$
$\frac{\partial Loss}{\partial w_{ij}^{[k]}} = \frac{\partial Loss}{\partial a_j^{[k]}} \cdot \frac{\partial a_j^{[k]}}{\partial w_{ij}^{[k]}}$, where $\frac{\partial a_j^{[k]}}{\partial w_{ij}^{[k]}} = \frac{\partial a_j^{[k]}}{\partial z_j^{[k]}} \times \frac{\partial z_j^{[k]}}{\partial w_{ij}^{[k]}} = f'\big(z_j^{[k]}\big)\, a_i^{[k-1]}$
$\frac{\partial Loss}{\partial a_i^{[k-1]}} = \sum_{j=1}^{N_k} \frac{\partial Loss}{\partial a_j^{[k]}} \times \frac{\partial a_j^{[k]}}{\partial z_j^{[k]}} \times \frac{\partial z_j^{[k]}}{\partial a_i^{[k-1]}} = \sum_{j=1}^{N_k} \frac{\partial Loss}{\partial a_j^{[k]}} \times f'\big(z_j^{[k]}\big) \times w_{ij}^{[k]}$ 71
Backpropagation:
$\frac{\partial Loss}{\partial w_{ij}^{[k]}} = \frac{\partial Loss}{\partial a_j^{[k]}} \times \frac{\partial a_j^{[k]}}{\partial w_{ij}^{[k]}} = \delta_j^{[k]} \times f'\big(z_j^{[k]}\big) \times a_i^{[k-1]}$, with $a_j^{[k]} = f\big(z_j^{[k]}\big)$ and $z_j^{[k]} = \sum_i w_{ij}^{[k]}\, a_i^{[k-1]}$
• $\delta_j^{[k]} = \frac{\partial Loss}{\partial a_j^{[k]}}$ is the sensitivity of the loss to $a_j^{[k]}$
• Sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation). We will compute $\boldsymbol{\delta}^{[k-1]}$ from $\boldsymbol{\delta}^{[k]}$:
  $\delta_i^{[k-1]} = \sum_{j=1}^{N_k} \delta_j^{[k]} \times w_{ij}^{[k]} \times f'\big(z_j^{[k]}\big)$ 72
Find and save $\boldsymbol{\delta}^{[L]}$
• Called error, computed recursively in backward manner
• For the final layer $k = L$: $\delta_j^{[L]} = \frac{\partial Loss}{\partial a_j^{[L]}}$ 73
Compute $\boldsymbol{\delta}^{[k-1]}$ from $\boldsymbol{\delta}^{[k]}$
• $\delta_j^{[k]} = \frac{\partial Loss}{\partial a_j^{[k]}}$ is the sensitivity of the loss to $a_j^{[k]}$, with $a_j^{[k]} = f\big(z_j^{[k]}\big)$ and $z_j^{[k]} = \sum_i w_{ij}^{[k]}\, a_i^{[k-1]}$
• Sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation):
  $\delta_i^{[k-1]} = \sum_{j=1}^{N_k} \delta_j^{[k]} \times w_{ij}^{[k]} \times f'\big(z_j^{[k]}\big)$ 74
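A sketch of this backward recursion for one layer in NumPy, with `W_k[i, j]` holding $w_{ij}^{[k]}$ (rows indexed by the previous layer's units); the sigmoid choice in the example is an assumption:

```python
import numpy as np

def backprop_delta(delta_k, W_k, z_k, f_prime):
    """delta_i^{[k-1]} = sum_j delta_j^{[k]} * w_ij^{[k]} * f'(z_j^{[k]})."""
    return W_k @ (delta_k * f_prime(z_k))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
f_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # sigmoid derivative

delta_k = np.array([0.2, -0.1])          # sensitivities at layer k
W_k = np.array([[0.5, -0.3],
                [0.8,  0.1],
                [-0.2, 0.4]])            # 3 units in layer k-1, 2 units in layer k
z_k = np.array([0.7, -1.2])
print(backprop_delta(delta_k, W_k, z_k, f_prime))   # delta for layer k-1 (3 values)
```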
Backpropagation Algorithm
• Initialize all weights to small random numbers.
• While not satisfied:
  • For each training example do:
    1. Feed the training example forward through the network and compute the outputs of all units in the forward step ($z$ and $a$) and the loss
    2. For each unit find its $\delta$ in the backward step
    3. Update each network weight as $w_{ij}^{[k]} \leftarrow w_{ij}^{[k]} - \eta \frac{\partial Loss}{\partial w_{ij}^{[k]}}$, where $\frac{\partial Loss}{\partial w_{ij}^{[k]}} = \delta_j^{[k]} \times a_i^{[k-1]} \times f'\big(z_j^{[k]}\big)$ 75
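Putting the pieces together, a minimal NumPy sketch of this algorithm for a network with one sigmoid hidden layer and a sigmoid output, trained example by example with squared-error loss; the layer sizes, learning rate, epoch count, and XOR data are illustrative assumptions, and biases are handled with the deck's extended notation (a constant-1 input):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, D, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1] + 1, n_hidden))   # +1 row: bias as a weight
    W2 = rng.normal(0, 0.5, (n_hidden + 1, D.shape[1]))
    for _ in range(epochs):
        for x, d in zip(X, D):
            # 1. forward step: compute z and a for every layer (inputs extended with 1)
            a0 = np.append(x, 1.0)
            z1 = a0 @ W1; a1 = np.append(sigmoid(z1), 1.0)
            z2 = a1 @ W2; a2 = sigmoid(z2)
            # 2. backward step: delta_j = dLoss/da_j for each unit
            delta2 = a2 - d                          # output layer, squared error
            g2 = delta2 * a2 * (1.0 - a2)            # delta * f'(z) at the output
            delta1 = (W2 @ g2)[:-1]                  # recursion; drop the bias entry
            g1 = delta1 * a1[:-1] * (1.0 - a1[:-1])
            # 3. weight updates: dLoss/dw_ij = delta_j * f'(z_j) * a_i (previous layer)
            W2 -= eta * np.outer(a1, g2)
            W1 -= eta * np.outer(a0, g1)
    return W1, W2

# Usage: XOR, a classic problem that needs the hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, D)
a1 = np.hstack([sigmoid(np.hstack([X, np.ones((4, 1))]) @ W1), np.ones((4, 1))])
print(sigmoid(a1 @ W2).round(2))   # moves toward [[0],[1],[1],[0]]; depends on the seed
```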
Another example
[Figures, slides 76–88: a second computational-graph example worked forward and then backward gate by gate; the numeric values appeared only in the figures] 76–88
Another example
[local gradient] × [upstream gradient]:
  x0: [2] × [0.2] = 0.4
  w0: [−1] × [0.2] = −0.2 89
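A numeric check of these two products for a sigmoid-of-weighted-sum graph, $f(\mathbf{w}, \mathbf{x}) = \sigma(w_0 x_0 + w_1 x_1 + w_2)$. The surviving text only shows the local gradients 2 and −1 and the upstream gradient 0.2; the full set of input values below is my assumption, chosen so that those numbers come out:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# assumed values (not stated in the surviving text)
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

z = w0 * x0 + w1 * x1 + w2          # 1.0
f = sigmoid(z)                      # ~0.73

upstream = f * (1.0 - f)            # df/dz = sigma(z)(1 - sigma(z)) ~ 0.2
grad_x0 = w0 * upstream             # local gradient dz/dx0 = w0 = 2   -> ~0.4
grad_w0 = x0 * upstream             # local gradient dz/dw0 = x0 = -1  -> ~-0.2
print(round(grad_x0, 2), round(grad_w0, 2))   # 0.39 and -0.2 (~0.4 and -0.2 as on the slide)
```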
Derivative of sigmoid function
$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \sigma(x)\big(1 - \sigma(x)\big)$
[Figures, slides 90–91: derivation steps] 90–91
Patterns in backward flow โข add gate: gradient distributor โข max gate: gradient router 92
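A quick numerical illustration of these two patterns (the values are arbitrary): the add gate passes the upstream gradient unchanged to both inputs, while the max gate routes all of it to the input that was larger.

```python
# add gate: z = x + y  ->  dz/dx = dz/dy = 1, so both inputs receive the upstream gradient
x, y, upstream = 3.0, -4.0, 2.0
grad_x_add, grad_y_add = upstream * 1.0, upstream * 1.0
print(grad_x_add, grad_y_add)          # 2.0 2.0  (distributor)

# max gate: z = max(x, y) -> the gradient flows only to the winning input
grad_x_max = upstream if x > y else 0.0
grad_y_max = upstream if y >= x else 0.0
print(grad_x_max, grad_y_max)          # 2.0 0.0  (router)
```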
Modularized implementation: forward / backward API
[Figures, slides 93–95: code sketches of a computational-graph implementation in which every gate/node exposes a forward pass and a backward pass] 93–95
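The code on those slides is not recoverable here; as a stand-in, this is a minimal sketch of what such a forward/backward API typically looks like, using a multiply gate as the example (the class and method names are my own):

```python
class MultiplyGate:
    """A graph node: forward() caches its inputs, backward() returns
    [local gradient] x [upstream gradient] for each input."""
    def forward(self, x, y):
        self.x, self.y = x, y      # cache the inputs needed by backward()
        return x * y

    def backward(self, dz):        # dz is the upstream gradient dL/dz
        dx = self.y * dz           # local gradient dz/dx = y
        dy = self.x * dz           # local gradient dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)        # -12.0
print(gate.backward(1.0))          # (-4.0, 3.0)
```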
Vector formulation
• For layered networks it is generally simpler to think of the process in terms of vector operations
  – Simpler arithmetic
  – Fast matrix libraries make operations much faster
• We can restate the entire process in vector terms
  – This is what is actually used in any real system 97
The Jacobian
• The derivative of a vector function w.r.t. its vector input is called a Jacobian
• It is the matrix of partial derivatives given below, for $\mathbf{y} = f(\mathbf{x})$ with $\mathbf{y} = [y_1, y_2, \ldots, y_K]^T$ and $\mathbf{x} = [x_1, x_2, \ldots, x_M]^T$:
  $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_M} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_M} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_K}{\partial x_1} & \frac{\partial y_K}{\partial x_2} & \cdots & \frac{\partial y_K}{\partial x_M} \end{bmatrix}$
• Using vector notation, check that $\Delta \mathbf{y} \approx \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\,\Delta \mathbf{x}$ 98
Matrix calculus
• Scalar-by-vector: $\frac{\partial z}{\partial \mathbf{y}} = \left[\frac{\partial z}{\partial y_1} \ \ldots\ \frac{\partial z}{\partial y_n}\right]$
• Vector-by-vector: $\frac{\partial \mathbf{z}}{\partial \mathbf{y}} = \begin{bmatrix} \frac{\partial z_1}{\partial y_1} & \cdots & \frac{\partial z_1}{\partial y_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_m}{\partial y_1} & \cdots & \frac{\partial z_m}{\partial y_n} \end{bmatrix}$
• Scalar-by-matrix: $\frac{\partial z}{\partial \mathbf{Y}} = \begin{bmatrix} \frac{\partial z}{\partial Y_{11}} & \cdots & \frac{\partial z}{\partial Y_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z}{\partial Y_{m1}} & \cdots & \frac{\partial z}{\partial Y_{mn}} \end{bmatrix}$
• Vector-by-matrix: $\frac{\partial z}{\partial Y_{ij}} = \frac{\partial z}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial Y_{ij}}$ 99
Vector-by-matrix gradients 100
Examples 101
Jacobians can describe the derivatives of neural activations w.r.t. their input
[Figure: a layer mapping $\mathbf{z}$ to $\mathbf{a}$ element by element]
$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \begin{bmatrix} \frac{\partial a_1}{\partial z_1} & 0 & \cdots & 0 \\ 0 & \frac{\partial a_2}{\partial z_2} & \cdots & 0 \\ \cdots & \cdots & \ddots & \cdots \\ 0 & 0 & \cdots & \frac{\partial a_N}{\partial z_N} \end{bmatrix}$
• For scalar activations
  – Number of outputs is identical to the number of inputs
• Jacobian is a diagonal matrix
  – Diagonal entries are individual derivatives of outputs w.r.t. inputs
  – Not showing the superscript "[k]" in equations for brevity 102
Jacobians can describe the derivatives of neural activations w.r.t. their input
$a_i = f(z_i)$
$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \begin{bmatrix} f'(z_1) & 0 & \cdots & 0 \\ 0 & f'(z_2) & \cdots & 0 \\ \cdots & \cdots & \ddots & \cdots \\ 0 & 0 & \cdots & f'(z_N) \end{bmatrix}$
• For scalar activations (shorthand notation):
  – Jacobian is a diagonal matrix
  – Diagonal entries are individual derivatives of outputs w.r.t. inputs 103
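A small NumPy sketch of this diagonal Jacobian for an elementwise sigmoid activation (the choice of sigmoid is illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def activation_jacobian(z):
    """Jacobian da/dz of an elementwise sigmoid: diagonal with entries f'(z_i)."""
    a = sigmoid(z)
    return np.diag(a * (1.0 - a))

z = np.array([0.0, 1.0, -2.0])
print(activation_jacobian(z))
# In practice the diagonal structure means the full matrix is never formed:
# multiplying by this Jacobian is just an elementwise product with f'(z).
```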