
Machine Learning: Chenhao Tan, University of Colorado Boulder



  1. Machine Learning: Chenhao Tan, University of Colorado Boulder. LECTURE 7. Slides adapted from Jordan Boyd-Graber, Chris Ketelsen. Machine Learning: Chenhao Tan | Boulder | 1 of 39

  2. Final projects • WSDM Cup • SemEval 2018

  3. Overview • Forward propagation recap • Back propagation: Chain rule, Back propagation, Full algorithm

  4. Forward propagation recap | Outline • Forward propagation recap • Back propagation: Chain rule, Back propagation, Full algorithm

  5. Forward propagation recap | Forward propagation algorithm Store the biases for layer l in b^l and the weight matrix in W^l. [Diagram: inputs x_1, ..., x_d feeding through layers (W^1, b^1), (W^2, b^2), (W^3, b^3) to outputs o_1, o_2]

  6. Forward propagation recap | Forward propagation algorithm Suppose your network has L layers. Make a prediction based on test point x:
     1: Initialize a^0 = x
     2: for l = 1 to L do
     3:   z^l = W^l a^{l-1} + b^l
     4:   a^l = g(z^l)
     5: end for
     6: The prediction ŷ is simply a^L
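The loop above can be sketched in NumPy. This is a minimal illustration, not the course's reference code: the activation g (tanh here) and the layer sizes are assumptions, since the slides keep both generic.

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Forward propagation for a fully connected network.

    weights, biases are lists holding W^1..W^L and b^1..b^L.
    g is the elementwise activation (tanh is an illustrative choice).
    """
    a = x                                 # step 1: a^0 = x
    for W, b in zip(weights, biases):     # steps 2-5: for l = 1 to L
        z = W @ a + b                     # z^l = W^l a^{l-1} + b^l
        a = g(z)                          # a^l = g(z^l)
    return a                              # step 6: the prediction y-hat = a^L

# toy usage: a 3 -> 4 -> 2 network with random parameters
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(2)]
y_hat = forward(rng.standard_normal(3), Ws, bs)
```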

  7. Forward propagation recap | Neural networks in a nutshell • Training data S_train = {(x, y)} • Network architecture (model) ŷ = f_w(x), with parameters W^l, b^l, l = 1, ..., L • Loss function (objective function) L(y, ŷ) • How do we learn the parameters?

  8. Forward propagation recap | Neural networks in a nutshell • Training data S_train = {(x, y)} • Network architecture (model) ŷ = f_w(x), with parameters W^l, b^l, l = 1, ..., L • Loss function (objective function) L(y, ŷ) • How do we learn the parameters? Stochastic gradient descent: W^l ← W^l − η ∂L(y, ŷ)/∂W^l
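The update rule on this slide is one line of code once the gradients are available. A minimal sketch, with the gradients assumed given (computing them is exactly what back propagation, the topic of this lecture, does):

```python
import numpy as np

def sgd_step(weights, grads, eta=0.1):
    """One stochastic-gradient-descent update per the slide:
    W^l <- W^l - eta * dL/dW^l.
    grads[l] is assumed to already hold dL/dW^l for the current
    training example; eta is the learning rate."""
    return [W - eta * G for W, G in zip(weights, grads)]

# toy usage: one layer, a hand-picked gradient
new_W = sgd_step([np.array([[1.0, 2.0]])], [np.array([[10.0, 20.0]])], eta=0.1)
```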

  9. Forward propagation recap | Challenge • Challenge: How the heck do we compute derivatives of the loss function with respect to the weights and biases? • Solution: Back propagation

  10. Back propagation | Outline • Forward propagation recap • Back propagation: Chain rule, Back propagation, Full algorithm

  11. Back propagation | Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. There are two forms of the Chain Rule.

  12. Back propagation | Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. There are two forms of the Chain Rule. Baby Chain Rule: d/dx f(g(x)) = f′(g(x)) g′(x) = (df/dg)(dg/dx)

  13. Back propagation | Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. There are two forms of the Chain Rule. Baby Chain Rule: d/dx f(g(x)) = f′(g(x)) g′(x) = (df/dg)(dg/dx) Example: d/dx sin(x^2) = cos(x^2) · 2x
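The example on this slide is easy to sanity-check numerically: a central finite difference approximates d/dx without the chain rule, and it should agree with cos(x^2) · 2x. The test point x = 1.3 is arbitrary.

```python
import math

def central_diff(f, x, h=1e-6):
    """Numeric derivative via central differences (a check, not part of the slides)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
analytic = math.cos(x ** 2) * 2 * x                    # chain rule: cos(x^2) * 2x
numeric = central_diff(lambda t: math.sin(t ** 2), x)  # derivative without the chain rule
```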

  14. Back propagation | Chain rule The Chain Rule Full-Grown Adult Chain Rule: [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)]

  15. Back propagation | Chain rule The Chain Rule Full-Grown Adult Chain Rule: [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Derivative of f with respect to x: ∂f/∂x. Similarly, ∂f/∂y and ∂f/∂z.

  16. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to r? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)]

  17. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to r? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] ∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r) + (∂f/∂z)(∂z/∂r)

  18. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)]

  19. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s)

  20. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s. ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s)

  21. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s. ∂f/∂s = yz · 0 + xz · r + xy · 1

  22. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s. ∂f/∂s = rs^2 · 0 + rs · r + r^2 s · 1

  23. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s. ∂f/∂s = 2 r^2 s

  24. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to s? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s. Check: f(r, s) = r · rs · s = r^2 s^2 ⇒ ∂f/∂s = 2 r^2 s ✓
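The worked example above can be verified the same way, with no symbolic algebra: a central finite difference in s should match the slide's answer 2 r^2 s. The values r = 2, s = 3 are arbitrary test points.

```python
def f(r, s):
    """The slide's example: f = x*y*z with x = r, y = r*s, z = s,
    which collapses to f(r, s) = r^2 s^2."""
    x, y, z = r, r * s, s
    return x * y * z

# central-difference check of the slide's answer df/ds = 2 r^2 s
r, s, h = 2.0, 3.0, 1e-6
numeric = (f(r, s + h) - f(r, s - h)) / (2 * h)
analytic = 2 * r ** 2 * s
```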

  25. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to u? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)]

  26. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to u? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] ∂f/∂u = (∂f/∂r)(∂r/∂u) + (∂f/∂s)(∂s/∂u)

  27. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to u? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Crux: If you know the derivative of the objective with respect to an intermediate value in the chain, you can eliminate everything in between.

  28. Back propagation | Chain rule The Chain Rule What is the derivative of f with respect to u? [Diagram: f(x, y, z) with x(r, s), y(r, s), z(r, s), and r(u, v), s(u, v)] Crux: If you know the derivative of the objective with respect to an intermediate value in the chain, you can eliminate everything in between. This is the cornerstone of the Back Propagation algorithm.

  29. Back propagation | Back propagation Back Propagation [Diagram: inputs x_1, ..., x_d feeding through layers (W^1, b^1), (W^2, b^2), (W^3, b^3), (W^4, b^4) to outputs o_1, o_2]

  30. Back propagation | Back propagation Back Propagation For the derivation, we'll consider a simplified network: [Diagram: a^0 → W^1 → z^1 | a^1 → W^2 → z^2 | a^2 → L(y, a^2)] We want to use back propagation to compute the partial derivatives of L with respect to the weights and biases: ∂L/∂w^l_{ij}, for l = 1, 2

  31. Back propagation | Back propagation Back Propagation For the derivation, we'll consider a simplified network: [Diagram: a^0 → W^1 → z^1 | a^1 → W^2 → z^2 | a^2 → L(y, a^2)] We need to choose an intermediate term that lives on the nodes and that we can easily compute derivatives with respect to. We could choose the a's, but we'll choose the z's because the math is easier.

  32. Back propagation | Back propagation Back Propagation For the derivation, we'll consider a simplified network: [Diagram: a^0 → W^1 → z^1 | a^1 → W^2 → z^2 | a^2 → L(y, a^2)] Define the derivative w.r.t. the z's by δ: δ^l_j = ∂L/∂z^l_j. Note that δ^l has the same size as z^l and a^l.

  33. Back propagation | Back propagation Back Propagation For the derivation, we'll consider a simplified network: [Diagram: a^0 → W^1 → z^1 | a^1 → W^2 → z^2 | a^2 → L(y, a^2)] Let's compute δ^L for output layer L: δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j)(da^L_j/dz^L_j)
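The output-layer δ can be written out concretely once a loss and activation are fixed. The slides keep both generic, so this sketch assumes squared-error loss L(y, a) = ½‖y − a‖² and a sigmoid activation; those two choices are illustrative, not the lecture's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_delta(y, z_L):
    """delta^L_j = (dL/da^L_j) * (da^L_j/dz^L_j), for the assumed
    choices L(y, a) = 0.5 * ||y - a||^2 and g = sigmoid
    (the slides leave both generic)."""
    a_L = sigmoid(z_L)
    dL_da = a_L - y              # dL/da^L for squared-error loss
    da_dz = a_L * (1.0 - a_L)    # sigmoid'(z) = a (1 - a)
    return dL_da * da_dz
```

A finite-difference check of ∂L/∂z^L (as in the chain-rule examples earlier) confirms the factored form agrees with differentiating the composed loss directly.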
