Neural Networks: Design

Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Output Distribution

For the XOR example, $a^{(2)} = \sigma\!\left(A^{(1)} w^{(2)}\right)$ and the prediction is $\hat{y} = 1\!\left(a^{(2)} > 0.5\right)$, worked out entry by entry on the slide.

But how to train $W^{(1)}$ and $w^{(2)}$ from examples?

Outline
1. The Basics
   Example: Learning the XOR
2. Training
   Back Propagation
3. Neuron Design
   Cost Function & Output Neurons
   Hidden Neurons
4. Architecture Design
   Architecture Tuning

Training an NN

Given examples $\mathbb{X} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, how do we learn the parameters $\Theta = \{W^{(1)}, \cdots, W^{(L)}\}$?

Most NNs are trained using maximum likelihood by default (assuming i.i.d. examples):
$$
\begin{aligned}
\arg\max_{\Theta} \log P(\mathbb{X} \mid \Theta)
&= \arg\min_{\Theta} -\log P(\mathbb{X} \mid \Theta) \\
&= \arg\min_{\Theta} \sum_i -\log P(x^{(i)}, y^{(i)} \mid \Theta) \\
&= \arg\min_{\Theta} \sum_i \left[ -\log P(y^{(i)} \mid x^{(i)}, \Theta) - \log P(x^{(i)} \mid \Theta) \right] \\
&= \arg\min_{\Theta} \sum_i -\log P(y^{(i)} \mid x^{(i)}, \Theta)
 = \arg\min_{\Theta} \sum_i C^{(i)}(\Theta),
\end{aligned}
$$
where the term $-\log P(x^{(i)} \mid \Theta)$ is dropped because a discriminative NN does not model $P(x)$, so that term does not depend on $\Theta$.

The minimizer $\hat{\Theta}$ is an unbiased estimator of the "true" $\Theta^{*}$, which is good for large $N$.
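The objective above is just a sum of per-example costs. Below is a minimal sketch of that structure for a toy one-layer binary classifier; the dataset, the model $P(y=1\mid x)=\sigma(w^{\top}x)$, and all names are assumptions made only for illustration, not the lecture's code.

```python
import numpy as np

# Sketch: the maximum-likelihood training objective as a sum of per-example
# costs C^(i)(Theta) = -log P(y^(i) | x^(i), Theta), for a toy linear model.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_cost(w, x, y):
    """C^(i)(w) = -log P(y^(i) | x^(i), w) for a Bernoulli output."""
    p = sigmoid(x @ w)                      # predicted P(y = 1 | x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_cost(w, X, Y):
    """The ML training objective: sum_i C^(i)(w)."""
    return sum(per_example_cost(w, x, y) for x, y in zip(X, Y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))             # 8 toy examples, 3 features
    Y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    w = rng.normal(size=3)
    print("sum_i C^(i)(w) =", total_cost(w, X, Y))
```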

Example: Binary Classification

$P(y = 1 \mid x) \sim \mathrm{Bernoulli}(\rho)$, where $x \in \mathbb{R}^{D}$ and $y \in \{0, 1\}$. The output $a^{(L)} = \hat{\rho} = \sigma(z^{(L)})$ is the predicted distribution.

The cost function $C^{(i)}(\Theta)$ can be written as
$$
\begin{aligned}
C^{(i)}(\Theta) &= -\log P(y^{(i)} \mid x^{(i)}; \Theta)
= -\log\!\left[ (a^{(L)})^{y^{(i)}} (1 - a^{(L)})^{1 - y^{(i)}} \right] \\
&= -\log\!\left[ \sigma(z^{(L)})^{y^{(i)}} \left(1 - \sigma(z^{(L)})\right)^{1 - y^{(i)}} \right]
= -\log \sigma\!\left( (2y^{(i)} - 1)\, z^{(L)} \right)
= \zeta\!\left( (1 - 2y^{(i)})\, z^{(L)} \right),
\end{aligned}
$$
where $\zeta(\cdot)$ is the softplus function.
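A small numeric check of the last identity above; the stable softplus via `np.logaddexp` is my implementation choice, not something stated on the slides.

```python
import numpy as np

# Verify: -log[ sigma(z)^y * (1 - sigma(z))^(1-y) ] = softplus((1 - 2y) * z),
# where softplus(t) = log(1 + exp(t)).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(t):
    return np.logaddexp(0.0, t)          # log(1 + e^t) without overflow

def naive_cost(z, y):
    p = sigmoid(z)
    return -np.log(p ** y * (1 - p) ** (1 - y))

def softplus_cost(z, y):
    return softplus((1 - 2 * y) * z)

for z in [-3.0, -0.5, 0.0, 2.0]:
    for y in [0, 1]:
        assert np.isclose(naive_cost(z, y), softplus_cost(z, y))
print("cost identity verified; e.g. C(z=2, y=0) =", softplus_cost(2.0, 0))
```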

Optimization Algorithm

Most NNs use SGD to solve the problem $\arg\min_{\Theta} \sum_i C^{(i)}(\Theta)$:
- Fast convergence in time [1]
- Supports (GPU-based) parallelism
- Supports online learning
- Easy to implement

(Mini-Batched) Stochastic Gradient Descent (SGD)
  Initialize $\Theta^{(0)}$ randomly;
  Repeat until convergence {
    Randomly partition the training set $\mathbb{X}$ into minibatches of size $M$;
    $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta \nabla_{\Theta} \sum_{i=1}^{M} C^{(i)}(\Theta^{(t)})$;
  }

How do we compute $\nabla_{\Theta} \sum_i C^{(i)}(\Theta^{(t)})$ efficiently? There could be a huge number of $W^{(k)}_{i,j}$'s in $\Theta$.
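A minimal sketch of the minibatched SGD loop above. The `grad_cost` callback stands for the gradient of the summed per-batch cost (e.g., computed by backprop), and the toy least-squares problem at the bottom is an assumption so the loop runs end to end.

```python
import numpy as np

def sgd(theta0, X, Y, grad_cost, eta=0.05, M=4, epochs=50, seed=0):
    """Minibatched SGD: Theta <- Theta - eta * grad of the minibatch cost."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(epochs):                     # "repeat until convergence"
        order = rng.permutation(len(X))         # random partition into minibatches
        for start in range(0, len(X), M):
            idx = order[start:start + M]
            theta -= eta * grad_cost(theta, X[idx], Y[idx])
    return theta

if __name__ == "__main__":
    # Toy least-squares problem: C^(i)(theta) = 0.5 * (x_i^T theta - y_i)^2
    rng = np.random.default_rng(1)
    X = rng.normal(size=(64, 3))
    true_theta = np.array([1.0, -2.0, 0.5])
    Y = X @ true_theta
    grad = lambda th, xb, yb: xb.T @ (xb @ th - yb)   # summed batch gradient
    print(sgd(np.zeros(3), X, Y, grad))               # approaches true_theta
```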

Outline
1. The Basics
   Example: Learning the XOR
2. Training
   Back Propagation
3. Neuron Design
   Cost Function & Output Neurons
   Hidden Neurons
4. Architecture Design
   Architecture Tuning

Back Propagation

$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta \nabla_{\Theta} \sum_{n=1}^{M} C^{(n)}(\Theta^{(t)})$$

We have $\nabla_{\Theta} \sum_n C^{(n)}(\Theta^{(t)}) = \sum_n \nabla_{\Theta} C^{(n)}(\Theta^{(t)})$.

Let $c^{(n)} = C^{(n)}(\Theta^{(t)})$; our goal is to evaluate $\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}}$ for all $i$, $j$, $k$, and $n$.

Back propagation (or simply backprop) is an efficient way to evaluate multiple partial derivatives at once, assuming the partial derivatives share some common evaluation steps.

By the chain rule, we have
$$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}.$$

Forward Pass

The second term: $\frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$.

When $k = 1$, we have $z^{(1)}_j = \sum_i W^{(1)}_{i,j} x^{(n)}_i$ and
$$\frac{\partial z^{(1)}_j}{\partial W^{(1)}_{i,j}} = x^{(n)}_i.$$

Otherwise ($k > 1$), we have $z^{(k)}_j = \sum_i W^{(k)}_{i,j} a^{(k-1)}_i$ and
$$\frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}} = a^{(k-1)}_i.$$

We can get the second terms of all $\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}}$'s starting from the shallowest layer.
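A quick finite-difference check of the fact above on an assumed toy network (tanh activations, random weights, all names mine): perturbing one weight $W^{(k)}_{i,j}$ changes $z^{(k)}_j$ at exactly the rate $a^{(k-1)}_i$.

```python
import numpy as np

def forward(x, Ws, act=np.tanh):
    """Return the lists of pre-activations z^(k) and activations a^(k)."""
    a, zs, As = x, [], [x]
    for W in Ws:
        z = W.T @ a          # z^(k) = W^(k)T a^(k-1)
        a = act(z)           # a^(k) = act(z^(k))
        zs.append(z)
        As.append(a)
    return zs, As

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
x = rng.normal(size=3)
zs, As = forward(x, Ws)

k, i, j, eps = 1, 2, 0, 1e-6          # perturb W^(2)_{3,1} (1-based slide indexing)
Ws_pert = [W.copy() for W in Ws]
Ws_pert[k][i, j] += eps
zs_pert, _ = forward(x, Ws_pert)

numeric = (zs_pert[k][j] - zs[k][j]) / eps
print(numeric, "vs", As[k][i])        # both equal a^(k-1)_i
```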

Backward Pass I

Conversely, we can get the first terms of all $\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}}$'s starting from the deepest layer.

Define the error signal $\delta^{(k)}_j$ as the first term $\frac{\partial c^{(n)}}{\partial z^{(k)}_j}$.

When $k = L$, the evaluation varies from task to task, depending on the definitions of the functions $\mathrm{act}^{(L)}$ and $C^{(n)}$. E.g., in binary classification we have
$$\delta^{(L)} = \frac{\partial c^{(n)}}{\partial z^{(L)}} = \frac{\partial\, \zeta\!\left((1 - 2y^{(n)}) z^{(L)}\right)}{\partial z^{(L)}} = \sigma\!\left((1 - 2y^{(n)}) z^{(L)}\right) \cdot (1 - 2y^{(n)}).$$

Backward Pass II

When $k < L$, we have
$$
\begin{aligned}
\delta^{(k)}_j = \frac{\partial c^{(n)}}{\partial z^{(k)}_j}
&= \frac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \frac{\partial a^{(k)}_j}{\partial z^{(k)}_j}
 = \frac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \mathrm{act}'(z^{(k)}_j) \\
&= \left( \sum_s \frac{\partial c^{(n)}}{\partial z^{(k+1)}_s} \cdot \frac{\partial z^{(k+1)}_s}{\partial a^{(k)}_j} \right) \cdot \mathrm{act}'(z^{(k)}_j)
 = \left( \sum_s \delta^{(k+1)}_s \cdot \frac{\partial \sum_i W^{(k+1)}_{i,s} a^{(k)}_i}{\partial a^{(k)}_j} \right) \cdot \mathrm{act}'(z^{(k)}_j) \\
&= \left( \sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s} \right) \cdot \mathrm{act}'(z^{(k)}_j).
\end{aligned}
$$

Theorem (Chain Rule): Let $g : \mathbb{R} \to \mathbb{R}^{d}$ and $f : \mathbb{R}^{d} \to \mathbb{R}$; then
$$(f \circ g)'(x) = f'(g(x))\, g'(x) = \nabla f(g(x))^{\top} \begin{bmatrix} g_1'(x) \\ \vdots \\ g_d'(x) \end{bmatrix}.$$

Backward Pass III

$$\delta^{(k)}_j = \left( \sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s} \right) \mathrm{act}'(z^{(k)}_j)$$

We can evaluate all $\delta^{(k)}_j$'s starting from the deepest layer. The information propagates along a new kind of feedforward network.

Backprop Algorithm (Minibatch Size M = 1)

Input: $(x^{(n)}, y^{(n)})$ and $\Theta^{(t)}$

Forward pass:
  $a^{(0)} \leftarrow [x^{(n)\top}\; 1]^{\top}$;
  for $k \leftarrow 1$ to $L$ do
    $z^{(k)} \leftarrow W^{(k)\top} a^{(k-1)}$;
    $a^{(k)} \leftarrow \mathrm{act}(z^{(k)})$;
  end

Backward pass:
  Compute the error signal $\delta^{(L)}$ (e.g., $(1 - 2y^{(n)})\, \sigma((1 - 2y^{(n)}) z^{(L)})$ in binary classification);
  for $k \leftarrow L - 1$ to $1$ do
    $\delta^{(k)} \leftarrow \mathrm{act}'(z^{(k)}) \odot (W^{(k+1)} \delta^{(k+1)})$;
  end

Return $\frac{\partial c^{(n)}}{\partial W^{(k)}} = a^{(k-1)} \otimes \delta^{(k)}$ for all $k$.
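A minimal NumPy sketch of the M = 1 algorithm above. It assumes tanh hidden units, a single sigmoid output (binary classification), and no bias augmentation; the finite-difference check at the end is my addition for sanity, not part of the slide's pseudocode.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws):
    L = len(Ws)
    # Forward pass: cache z^(k) and a^(k)
    a, As, Zs = x, [x], []
    for k, W in enumerate(Ws, start=1):
        z = W.T @ a
        a = sigmoid(z) if k == L else np.tanh(z)
        Zs.append(z)
        As.append(a)
    # Backward pass: error signals delta^(k)
    deltas = [None] * L
    deltas[L - 1] = (1 - 2 * y) * sigmoid((1 - 2 * y) * Zs[-1])   # delta^(L)
    for k in range(L - 2, -1, -1):
        act_prime = 1.0 - np.tanh(Zs[k]) ** 2                     # tanh'(z^(k))
        deltas[k] = act_prime * (Ws[k + 1] @ deltas[k + 1])
    # Gradients: dc/dW^(k) = a^(k-1) (outer product) delta^(k)
    return [np.outer(As[k], deltas[k]) for k in range(L)]

def cost(x, y, Ws):
    a = x
    for k, W in enumerate(Ws, start=1):
        z = W.T @ a
        a = sigmoid(z) if k == len(Ws) else np.tanh(z)
    return (-(y * np.log(a) + (1 - y) * np.log(1 - a))).item()    # -log P(y|x)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
x, y = rng.normal(size=3), 1
grads = backprop(x, y, Ws)

# Finite-difference check on one weight of W^(1)
eps, (i, j) = 1e-6, (2, 1)
Wp = [W.copy() for W in Ws]; Wp[0][i, j] += eps
print(grads[0][i, j], "vs", (cost(x, y, Wp) - cost(x, y, Ws)) / eps)
```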

Backprop Algorithm (Minibatch Size M > 1)

Input: $\{(x^{(n)}, y^{(n)})\}_{n=1}^{M}$ and $\Theta^{(t)}$

Forward pass:
  $A^{(0)} \leftarrow [a^{(0,1)} \cdots a^{(0,M)}]^{\top}$;
  for $k \leftarrow 1$ to $L$ do
    $Z^{(k)} \leftarrow A^{(k-1)} W^{(k)}$;
    $A^{(k)} \leftarrow \mathrm{act}(Z^{(k)})$;
  end

Backward pass:
  Compute the error signals $\Delta^{(L)} = [\delta^{(L,1)} \cdots \delta^{(L,M)}]^{\top}$;
  for $k \leftarrow L - 1$ to $1$ do
    $\Delta^{(k)} \leftarrow \mathrm{act}'(Z^{(k)}) \odot (\Delta^{(k+1)} W^{(k+1)\top})$;
  end

Return $\sum_{n=1}^{M} \frac{\partial c^{(n)}}{\partial W^{(k)}} = \sum_{n=1}^{M} a^{(k-1,n)} \otimes \delta^{(k,n)}$ for all $k$.

Speed up with GPUs? The matrix operations parallelize over a large width $D^{(k)}$ at each layer and a large batch size.
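A batched sketch following the M > 1 algorithm above: rows of $A^{(k)}$ are examples, $Z^{(k)} = A^{(k-1)} W^{(k)}$, and $\Delta^{(k)} = \mathrm{act}'(Z^{(k)}) \odot (\Delta^{(k+1)} W^{(k+1)\top})$. Tanh hidden units, a sigmoid output, and the toy data are assumptions for illustration.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def batched_backprop(X, Y, Ws):
    """X: (M, D), Y: (M, 1). Returns the summed gradients dC/dW^(k) for all k."""
    L = len(Ws)
    A, As, Zs = X, [X], []
    for k, W in enumerate(Ws, start=1):                 # forward pass
        Z = A @ W
        A = sigmoid(Z) if k == L else np.tanh(Z)
        Zs.append(Z)
        As.append(A)
    Deltas = [None] * L                                 # backward pass
    Deltas[L - 1] = (1 - 2 * Y) * sigmoid((1 - 2 * Y) * Zs[-1])
    for k in range(L - 2, -1, -1):
        Deltas[k] = (1.0 - np.tanh(Zs[k]) ** 2) * (Deltas[k + 1] @ Ws[k + 1].T)
    # sum_n a^(k-1,n) outer delta^(k,n) equals A^(k-1)T Delta^(k)
    return [As[k].T @ Deltas[k] for k in range(L)]

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
X = rng.normal(size=(8, 3))                             # minibatch of M = 8
Y = rng.integers(0, 2, size=(8, 1)).astype(float)
grads = batched_backprop(X, Y, Ws)
for g, W in zip(grads, Ws):
    assert g.shape == W.shape
print("summed gradient shapes:", [g.shape for g in grads])
```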

Outline
1. The Basics
   Example: Learning the XOR
2. Training
   Back Propagation
3. Neuron Design
   Cost Function & Output Neurons
   Hidden Neurons
4. Architecture Design
   Architecture Tuning

Neuron Design

The design of modern neurons is largely influenced by how an NN is trained.

Maximum likelihood principle:
$$\arg\max_{\Theta} \log P(\mathbb{X} \mid \Theta) = \arg\min_{\Theta} \sum_i -\log P(y^{(i)} \mid x^{(i)}, \Theta)$$
- A universal cost function
- Different output units for different $P(y \mid x)$

Gradient-based optimization: during SGD, the gradient
$$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
should be sufficiently large before we get a satisfactory NN.

Outline
1. The Basics
   Example: Learning the XOR
2. Training
   Back Propagation
3. Neuron Design
   Cost Function & Output Neurons
   Hidden Neurons
4. Architecture Design
   Architecture Tuning

Negative Log Likelihood and Cross Entropy

The cost function of most NNs:
$$\arg\max_{\Theta} \log P(\mathbb{X} \mid \Theta) = \arg\min_{\Theta} \sum_i -\log P(y^{(i)} \mid x^{(i)}, \Theta)$$

For NNs that output an entire distribution $\hat{P}(y \mid x)$, the problem can be equivalently described as minimizing the cross entropy (or KL divergence) from $\hat{P}$ to the empirical distribution of the data:
$$\arg\min_{\hat{P}} -\mathbb{E}_{(x, y) \sim \mathrm{Empirical}(\mathbb{X})}\left[\log \hat{P}(y \mid x)\right]$$

This provides a consistent way to define output units.
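A tiny numeric check of that equivalence on an assumed toy discrete dataset and an arbitrary model $\hat{P}(y=1\mid x)$ (both made up here): the average per-example negative log-likelihood equals the expectation of $-\log\hat{P}(y\mid x)$ under the empirical distribution of the pairs.

```python
import numpy as np
from collections import Counter

# Toy discrete dataset of (x, y) pairs and an arbitrary model P_hat(y = 1 | x)
data = [(0, 1), (0, 1), (0, 0), (1, 1), (1, 1), (1, 1), (1, 0), (0, 1)]
p_hat = {0: 0.6, 1: 0.8}                       # assumed P_hat(y = 1 | x)

def nll(x, y):
    p1 = p_hat[x]
    return -np.log(p1 if y == 1 else 1 - p1)

avg_nll = np.mean([nll(x, y) for x, y in data])

counts = Counter(data)                          # empirical distribution over (x, y)
cross_entropy = sum((c / len(data)) * nll(x, y) for (x, y), c in counts.items())

print(avg_nll, cross_entropy)                   # identical up to float error
assert np.isclose(avg_nll, cross_entropy)
```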

Sigmoid Units for Bernoulli Output Distributions

In binary classification, we assume $P(y = 1 \mid x) \sim \mathrm{Bernoulli}(\rho)$, with $y \in \{0, 1\}$ and $\rho \in (0, 1)$.

Sigmoid output unit:
$$a^{(L)} = \hat{\rho} = \sigma(z^{(L)}) = \frac{\exp(z^{(L)})}{\exp(z^{(L)}) + 1}$$

$$\delta^{(L)} = \frac{\partial c^{(n)}}{\partial z^{(L)}} = \frac{\partial\, {-\log \hat{P}(y^{(n)} \mid x^{(n)}; \Theta)}}{\partial z^{(L)}} = (1 - 2y^{(n)})\, \sigma\!\left((1 - 2y^{(n)}) z^{(L)}\right)$$

This is close to 0 only when $y^{(n)} = 1$ and $z^{(L)}$ is large and positive, or $y^{(n)} = 0$ and $z^{(L)}$ is large and negative. In other words, the loss $c^{(n)}$ saturates (becomes flat) only when $\hat{\rho}$ is "correct".
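A quick numeric look (my own, not from the slides) at that error signal: it vanishes only when the prediction is already "correct", so a confidently wrong prediction still receives a large gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_L(z, y):
    """delta^(L) = (1 - 2y) * sigmoid((1 - 2y) * z) for a sigmoid output unit."""
    return (1 - 2 * y) * sigmoid((1 - 2 * y) * z)

for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(f"z = {z:+.1f}  delta(y=1) = {delta_L(z, 1):+.4f}  "
          f"delta(y=0) = {delta_L(z, 0):+.4f}")
# delta(y=1) -> 0 only as z -> +inf; delta(y=0) -> 0 only as z -> -inf.
```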

Softmax Units for Categorical Output Distributions I

In multiclass classification, we can assume $P(y \mid x) \sim \mathrm{Categorical}(\rho)$, where $y, \rho \in \mathbb{R}^{K}$ and $\mathbf{1}^{\top}\rho = 1$.

Softmax units:
$$a^{(L)}_j = \hat{\rho}_j = \mathrm{softmax}(z^{(L)})_j = \frac{\exp(z^{(L)}_j)}{\sum_{i=1}^{K} \exp(z^{(L)}_i)}$$

Actually, to define a Categorical distribution we only need $\rho_1, \cdots, \rho_{K-1}$ ($\rho_K = 1 - \sum_{i=1}^{K-1} \rho_i$ can be discarded). We can alternatively define $K - 1$ output units (discarding $a^{(L)}_K = \hat{\rho}_K$):
$$a^{(L)}_j = \hat{\rho}_j = \frac{\exp(z^{(L)}_j)}{\sum_{i=1}^{K-1} \exp(z^{(L)}_i) + 1},$$
which is a direct generalization of $\sigma$ in binary classification. In practice, the two versions make little difference.
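A short sketch of the two parameterizations above; the max-subtraction trick for numerical stability is my implementation choice, not something the slides specify.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))            # subtracting max(z) leaves the result unchanged
    return e / e.sum()

def softmax_k_minus_1(z_first):
    """K-1 logits; the K-th logit is implicitly fixed to 0."""
    e = np.exp(z_first)
    rho_first = e / (e.sum() + 1.0)
    return np.append(rho_first, 1.0 - rho_first.sum())

z = np.array([2.0, -1.0, 0.5])
print(softmax(z))                                  # full K-unit version
print(softmax_k_minus_1(z[:-1] - z[-1]))           # same distribution, K-1 units
# With K = 2 the K-1 version reduces to the sigmoid: exp(z) / (exp(z) + 1).
print(softmax_k_minus_1(np.array([1.3]))[0], 1 / (1 + np.exp(-1.3)))
```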

Softmax Units for Categorical Output Distributions II

Now we have
$$\delta^{(L)}_j = \frac{\partial c^{(n)}}{\partial z^{(L)}_j} = \frac{\partial\, {-\log \hat{P}(y^{(n)} \mid x^{(n)}; \Theta)}}{\partial z^{(L)}_j} = \frac{\partial\, {-\log \prod_i \hat{\rho}_i^{\,\mathbb{1}(y^{(n)} = i)}}}{\partial z^{(L)}_j}.$$

If $y^{(n)} = j$, then
$$\delta^{(L)}_j = -\frac{\partial \log \hat{\rho}_j}{\partial z^{(L)}_j} = -\frac{1}{\hat{\rho}_j}\left(\hat{\rho}_j - \hat{\rho}_j^{\,2}\right) = \hat{\rho}_j - 1.$$
$\delta^{(L)}_j$ is close to 0 only when $\hat{\rho}_j$ is "correct", i.e., when $z^{(L)}_j$ dominates among all $z^{(L)}_i$'s.

If $y^{(n)} = i \neq j$, then
$$\delta^{(L)}_j = -\frac{\partial \log \hat{\rho}_i}{\partial z^{(L)}_j} = -\frac{1}{\hat{\rho}_i}\left(-\hat{\rho}_i \hat{\rho}_j\right) = \hat{\rho}_j,$$
which again is close to 0 only when $\hat{\rho}_j$ is "correct".
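A numeric spot-check (using finite differences, my own addition) of the result above: for a softmax output with a one-hot target, $\delta^{(L)}_j = \hat{\rho}_j - 1$ at the true class and $\hat{\rho}_j$ elsewhere, i.e., $\delta^{(L)} = \hat{\rho} - y$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nll(z, label):
    """-log P_hat(y = label | x) for softmax outputs."""
    return -np.log(softmax(z)[label])

z = np.array([1.5, -0.3, 0.2, 0.8])
label = 2
rho = softmax(z)

analytic = rho.copy()
analytic[label] -= 1.0                      # rho_hat - one_hot(label)

eps = 1e-6
numeric = np.array([
    (nll(z + eps * np.eye(len(z))[j], label) - nll(z, label)) / eps
    for j in range(len(z))
])
print(analytic)
print(numeric)                              # matches up to O(eps)
```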

Linear Units for Gaussian Means

An NN can also output just one conditional statistic of $y$ given $x$. For example, we can assume $P(y \mid x) \sim \mathcal{N}(\mu, \Sigma)$ for regression. How do we design the output neurons if we want to predict the mean $\hat{\mu}$?

Linear units: $a^{(L)} = \hat{\mu} = z^{(L)}$.
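A short sketch of why linear output units pair naturally with a Gaussian likelihood. Fixing $\Sigma = I$ is my assumption (the slide only states $P(y \mid x) \sim \mathcal{N}(\mu, \Sigma)$); under it, $-\log P(y \mid x)$ equals $\tfrac{1}{2}\|y - \hat{\mu}\|^2$ up to a constant, so the output error signal is $\delta^{(L)} = \hat{\mu} - y$.

```python
import numpy as np

def neg_log_gaussian(y, mu_hat):
    """-log N(y; mu_hat, I), dropping the constant (D/2) * log(2*pi)."""
    return 0.5 * np.sum((y - mu_hat) ** 2)

y = np.array([1.0, -2.0])
mu_hat = np.array([0.5, -1.0])              # a^(L) = z^(L) (linear output unit)

analytic_delta = mu_hat - y                 # gradient of the cost w.r.t. z^(L)

eps = 1e-6
numeric_delta = np.array([
    (neg_log_gaussian(y, mu_hat + eps * np.eye(2)[j]) - neg_log_gaussian(y, mu_hat)) / eps
    for j in range(2)
])
print(analytic_delta, numeric_delta)        # agree up to O(eps)
```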
