  1. Feedforward neural nets CSE 250B

  2. Outline 1 Architecture 2 Expressivity 3 Learning

  3. The architecture
     [figure: a layered network with input x at the bottom, hidden layers h^(1), h^(2), ..., h^(ℓ) above it, and output y at the top]

  4. The value at a hidden unit
     [figure: a hidden unit h with incoming edges from z_1, z_2, ..., z_m]
     How is h computed from z_1, ..., z_m?

  5. The value at a hidden unit
     [figure: a hidden unit h with incoming edges from z_1, z_2, ..., z_m]
     How is h computed from z_1, ..., z_m?
     • h = σ(w_1 z_1 + w_2 z_2 + ··· + w_m z_m + b)
     • σ(·) is a nonlinear activation function, e.g. the "rectified linear" function σ(u) = u if u ≥ 0, and 0 otherwise.
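
     To make this concrete, here is a minimal sketch of the computation at one hidden unit in plain Python with NumPy (the library and the variable names z, w, b are my choices, not part of the slides):

        import numpy as np

        def relu(u):
            # "rectified linear" activation: u if u >= 0, else 0
            return np.maximum(u, 0.0)

        def hidden_unit(z, w, b):
            # value of one hidden unit: sigma(w_1 z_1 + ... + w_m z_m + b)
            return relu(np.dot(w, z) + b)

        # toy example with m = 3 parent nodes
        z = np.array([1.0, -2.0, 0.5])
        w = np.array([0.3, 0.1, -0.4])
        b = 0.2
        print(hidden_unit(z, w, b))   # relu(-0.1 + 0.2) = 0.1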

  6. Common activation functions
     • Threshold function or Heaviside step function: σ(z) = 1 if z ≥ 0, and 0 otherwise
     • Sigmoid: σ(z) = 1 / (1 + e^{−z})
     • Hyperbolic tangent: σ(z) = tanh(z)
     • ReLU (rectified linear unit): σ(z) = max(0, z)
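
     For reference, the four activation functions above written out in NumPy (a sketch; the function names are my own):

        import numpy as np

        def threshold(z):
            # Heaviside step: 1 if z >= 0, else 0
            return np.where(z >= 0, 1.0, 0.0)

        def sigmoid(z):
            # 1 / (1 + e^{-z})
            return 1.0 / (1.0 + np.exp(-z))

        def tanh(z):
            return np.tanh(z)

        def relu(z):
            # max(0, z)
            return np.maximum(z, 0.0)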

  7. Why do we need nonlinear activation functions?
     [figure: the layered architecture x → h^(1) → h^(2) → ··· → h^(ℓ) → y]
     (Without them, the composition of the layers would collapse into a single linear function of x.)

  8. The output layer
     Classification with k labels: want k probabilities summing to 1.
     [figure: output units y_1, y_2, ..., y_k, each connected to the units z_1, z_2, ..., z_m of the last hidden layer]

  9. The output layer
     Classification with k labels: want k probabilities summing to 1.
     [figure: output units y_1, y_2, ..., y_k, each connected to the units z_1, z_2, ..., z_m of the last hidden layer]
     • y_1, ..., y_k are linear functions of the parent nodes z_i.
     • Get probabilities using softmax: Pr(label j) = e^{y_j} / (e^{y_1} + ··· + e^{y_k}).
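
     A short sketch of the softmax computation (the helper name and the shift by the maximum, a standard numerical-stability trick, are my additions):

        import numpy as np

        def softmax(y):
            # y = (y_1, ..., y_k): linear outputs of the last layer
            # returns Pr(label j) = e^{y_j} / (e^{y_1} + ... + e^{y_k})
            e = np.exp(y - np.max(y))   # subtracting max(y) avoids overflow and doesn't change the result
            return e / e.sum()

        probs = softmax(np.array([2.0, 1.0, 0.1]))
        print(probs, probs.sum())   # k = 3 probabilities summing to 1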

  10. The complexity
      [figure: the layered architecture x → h^(1) → h^(2) → ··· → h^(ℓ) → y]

  11. Outline 1 Architecture 2 Expressivity 3 Learning

  12. Approximation capability
      Let f : R^d → R be any continuous function. There is a neural net with a single hidden layer that approximates f arbitrarily well.

  13. Approximation capability
      Let f : R^d → R be any continuous function. There is a neural net with a single hidden layer that approximates f arbitrarily well.
      • The hidden layer may need a lot of nodes.
      • For certain classes of functions:
        • either one hidden layer of enormous size,
        • or multiple hidden layers of moderate size.
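
      This can be checked empirically. The following small experiment is hypothetical (the target function sin(2x), the widths, and the use of the Adam optimizer are my own choices): a net with one hidden layer of ReLU units is fit to a 1-D continuous function, and the approximation error typically shrinks as the hidden layer grows.

         import torch

         torch.manual_seed(0)
         x = torch.linspace(-3, 3, 400).unsqueeze(1)
         f = torch.sin(2 * x)                        # a continuous target f : R -> R

         for H in [4, 32, 256]:                      # widen the single hidden layer
             net = torch.nn.Sequential(
                 torch.nn.Linear(1, H), torch.nn.ReLU(), torch.nn.Linear(H, 1))
             opt = torch.optim.Adam(net.parameters(), lr=0.01)
             for _ in range(2000):
                 opt.zero_grad()
                 loss = ((net(x) - f) ** 2).mean()   # mean squared error to the target
                 loss.backward()
                 opt.step()
             print(H, float((net(x) - f).abs().max()))   # worst-case error over the grid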

  14. Stone-Weierstrass theorem I
      If f : [a, b] → R is continuous, then there is a sequence of polynomials P_n such that P_n has degree n and
      sup_{x ∈ [a,b]} |P_n(x) − f(x)| → 0 as n → ∞.

  15. Stone-Weierstrass theorem II
      Let K ⊂ R^d be some bounded set. Suppose there is a collection of functions A such that:
      • A is an algebra: closed under addition, scalar multiplication, and multiplication.
      • A does not vanish on K: for any x ∈ K, there is some h ∈ A with h(x) ≠ 0.
      • A separates points in K: for any x ≠ y in K, there is some h ∈ A with h(x) ≠ h(y).
      Then for any continuous function f : K → R and any ε > 0, there is some h ∈ A with
      sup_{x ∈ K} |f(x) − h(x)| ≤ ε.

  16. Example: exponentiated linear functions
      For domain K = R^d, let A be all linear combinations of
      { e^{w·x + b} : w ∈ R^d, b ∈ R }.
      1. It is an algebra.
      2. It does not vanish.
      3. It separates points.

  17. Variation: RBF kernels
      For domain K = R^d and any σ > 0, let A be all linear combinations of
      { e^{−‖x − u‖² / σ²} : u ∈ R^d }.
      Any continuous function is approximated arbitrarily well by A.

  18. A class of activation functions
      For domain K = R^d, let A be all linear combinations of
      { σ(w·x + b) : w ∈ R^d, b ∈ R },
      where σ : R → R is continuous and non-decreasing with σ(z) → 1 as z → ∞ and σ(z) → 0 as z → −∞.
      This also satisfies the conditions of the approximation result.

  19. Outline 1 Architecture 2 Expressivity 3 Learning

  20. Learning a net: the loss function
      Classification problem with k labels.
      • Parameters of the entire net: W
      • For any input x, the net computes probabilities of labels: Pr_W(label = j | x)

  21. Learning a net: the loss function
      Classification problem with k labels.
      • Parameters of the entire net: W
      • For any input x, the net computes probabilities of labels: Pr_W(label = j | x)
      • Given a data set (x^(1), y^(1)), ..., (x^(n), y^(n)), the loss function is
        L(W) = − Σ_{i=1}^{n} ln Pr_W(y^(i) | x^(i))
        (also called cross-entropy).
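
      A minimal sketch of this loss, assuming the net's outputs have already been passed through softmax so that probs[i, j] = Pr_W(label j | x^(i)) (the array names are hypothetical):

         import numpy as np

         def cross_entropy(probs, labels):
             # probs: n x k array of predicted probabilities
             # labels: length-n array of true labels y^(i) in {0, ..., k-1}
             # L(W) = - sum_i ln Pr_W(y^(i) | x^(i))
             n = len(labels)
             return -np.sum(np.log(probs[np.arange(n), labels]))

         probs = np.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1]])
         labels = np.array([0, 1])
         print(cross_entropy(probs, labels))   # -(ln 0.7 + ln 0.8) ≈ 0.58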

  22. Nature of the loss function
      [figure: two plots of a loss function L(w) against a parameter w]

  23. Variants of gradient descent
      Initialize W and then repeatedly update.
      1. Gradient descent: each update involves the entire training set.
      2. Stochastic gradient descent: each update involves a single data point.
      3. Mini-batch stochastic gradient descent: each update involves a modest, fixed number of data points.
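
      A sketch of the mini-batch variant in PyTorch, in the spirit of the snippet at the end of these slides (the function name, batch size, learning rate, and dataset tensors X, Y are assumptions, not from the slides):

         import torch

         def train_minibatch_sgd(model, lossfn, X, Y, eta=0.1, batch_size=32, epochs=10):
             n = X.shape[0]
             for _ in range(epochs):
                 perm = torch.randperm(n)                  # shuffle the data each epoch
                 for start in range(0, n, batch_size):
                     idx = perm[start:start + batch_size]  # one mini-batch
                     loss = lossfn(model(X[idx]), Y[idx])
                     model.zero_grad()
                     loss.backward()
                     with torch.no_grad():                 # plain gradient step on each parameter
                         for param in model.parameters():
                             param -= eta * param.grad

      Setting batch_size = n recovers full gradient descent, and batch_size = 1 recovers plain stochastic gradient descent.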

  24. Derivative of the loss function
      Update for a specific parameter: the derivative of the loss function with respect to that parameter.
      [figure: the layered architecture x → h^(1) → h^(2) → ··· → h^(ℓ) → y]

  25. Chain rule
      1. Suppose h(x) = g(f(x)), where x ∈ R and f, g : R → R. Then h′(x) = g′(f(x)) f′(x).

  26. Chain rule
      1. Suppose h(x) = g(f(x)), where x ∈ R and f, g : R → R. Then h′(x) = g′(f(x)) f′(x).
      2. Suppose z is a function of y, which is a function of x (x → y → z). Then dz/dx = (dz/dy)(dy/dx).
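
      A tiny numeric check of version 2, with the functions y = x² and z = sin(y) chosen arbitrarily for illustration; the derivative from PyTorch's autograd matches the one computed by hand with the chain rule.

         import torch

         x = torch.tensor(1.5, requires_grad=True)
         y = x ** 2              # y as a function of x
         z = torch.sin(y)        # z as a function of y
         z.backward()

         print(x.grad.item())                   # dz/dx from autograd
         print((torch.cos(y) * 2 * x).item())   # by hand: (dz/dy)(dy/dx) = cos(y) · 2x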

  27. A single chain of nodes
      A neural net with one node per hidden layer: x = h_0 → h_1 → h_2 → h_3 → ··· → h_ℓ
      For a specific input x,
      • h_i = σ(w_i h_{i−1} + b_i)
      • the loss L can be gleaned from h_ℓ.

  28. A single chain of nodes
      A neural net with one node per hidden layer: x = h_0 → h_1 → h_2 → h_3 → ··· → h_ℓ
      For a specific input x,
      • h_i = σ(w_i h_{i−1} + b_i)
      • the loss L can be gleaned from h_ℓ.
      To compute dL/dw_i we just need dL/dh_i:
      dL/dw_i = (dL/dh_i)(dh_i/dw_i) = (dL/dh_i) σ′(w_i h_{i−1} + b_i) h_{i−1}

  29. Backpropagation
      • On a single forward pass, compute all the h_i.
      • On a single backward pass, compute dL/dh_ℓ, ..., dL/dh_1.
      [figure: the chain x = h_0 → h_1 → h_2 → h_3 → ··· → h_ℓ]

  30. Backpropagation
      • On a single forward pass, compute all the h_i.
      • On a single backward pass, compute dL/dh_ℓ, ..., dL/dh_1.
      [figure: the chain x = h_0 → h_1 → h_2 → h_3 → ··· → h_ℓ]
      From h_{i+1} = σ(w_{i+1} h_i + b_{i+1}), we have
      dL/dh_i = (dL/dh_{i+1})(dh_{i+1}/dh_i) = (dL/dh_{i+1}) σ′(w_{i+1} h_i + b_{i+1}) w_{i+1}
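
      Putting the last two slides together, here is a sketch of the forward and backward passes for this single chain of nodes. The squared-error loss at the end, the ReLU choice for σ, and the variable names are my own for illustration.

         def sigma(u):
             # ReLU activation
             return max(u, 0.0)

         def sigma_prime(u):
             # derivative of ReLU (taken to be 0 at u = 0)
             return 1.0 if u > 0 else 0.0

         def forward(x, w, b):
             # h[0] = x, h[i] = sigma(w_i h_{i-1} + b_i) for i = 1, ..., l
             h = [x]
             for wi, bi in zip(w, b):
                 h.append(sigma(wi * h[-1] + bi))
             return h

         def backward(h, w, b, target):
             # example loss from h_l: L = (h_l - target)^2, so dL/dh_l = 2 (h_l - target)
             l = len(w)
             dL_dh = 2.0 * (h[l] - target)
             dL_dw = [0.0] * l
             for i in range(l, 0, -1):                    # backward pass: i = l, ..., 1
                 u = w[i - 1] * h[i - 1] + b[i - 1]       # pre-activation of h_i
                 dL_dw[i - 1] = dL_dh * sigma_prime(u) * h[i - 1]   # dL/dw_i  (slide 28)
                 dL_dh = dL_dh * sigma_prime(u) * w[i - 1]          # dL/dh_{i-1}  (slide 30)
             return dL_dw

         w, b = [1.0, 0.5, 0.7], [0.1, 0.0, -0.2]
         h = forward(0.5, w, b)
         print(backward(h, w, b, target=1.0))   # [dL/dw_1, dL/dw_2, dL/dw_3]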

  31. Two-dimensional examples
      What kind of net to use for this data?

  32. Two-dimensional examples
      What kind of net to use for this data?
      • Input layer: 2 nodes
      • One hidden layer: H nodes
      • Output layer: 1 node
      • Input → hidden: linear functions, ReLU activation
      • Hidden → output: linear function, sigmoid activation

  33. Example 1 How many hidden units should we use?

  34. Example 1 H = 2

  35. Example 1 H = 2

  36. Example 2 How many hidden units should we use?

  37. Example 2 H = 4

  38. Example 2 H = 4

  39. Example 2 H = 4

  40. Example 2 H = 4

  41. Example 2 H = 8: overparametrized

  42. Example 3 How many hidden units should we use?

  43. Example 3 H = 4

  44. Example 3 H = 8

  45. Example 3 H = 16

  46. Example 3 H = 16

  47. Example 3 H = 16

  48. Example 3 H = 32

  49. Example 3 H = 32

  50. Example 3 H = 32

  51. Example 3 H = 64

  52. Example 3 H = 64

  53. Example 3 H = 64

  54. PyTorch snippet
      Declaring and initializing the network:

         import torch

         d, H = 2, 8
         model = torch.nn.Sequential(
             torch.nn.Linear(d, H),
             torch.nn.ReLU(),
             torch.nn.Linear(H, 1),
             torch.nn.Sigmoid())
         lossfn = torch.nn.BCELoss()

      A gradient step:

         ypred = model(x)
         loss = lossfn(ypred, y)
         model.zero_grad()
         loss.backward()
         with torch.no_grad():
             for param in model.parameters():
                 param -= eta * param.grad
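
      For completeness, one way to wrap the gradient step above in a full training loop; this is a sketch that assumes x and y are tensors of shape (n, 2) and (n, 1) holding the two-dimensional data, with an arbitrary learning rate and number of steps.

         eta = 0.1
         for step in range(5000):
             ypred = model(x)
             loss = lossfn(ypred, y)
             model.zero_grad()
             loss.backward()
             with torch.no_grad():
                 for param in model.parameters():
                     param -= eta * param.grad
             if step % 1000 == 0:
                 print(step, loss.item())   # monitor the cross-entropy as training proceeds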
