  1. Deep networks CS 446

  2. The ERM perspective
     These lectures will follow an ERM perspective on deep networks:
     ◮ Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
     ◮ Pick a loss/risk. (We will almost always use cross-entropy!)
     ◮ Pick an optimizer. (We will mostly treat this as a black box!)
     The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.

  3. 1. Linear networks.

  4. Iterated linear predictors
     The most basic view of a neural network is an iterated linear predictor.
     ◮ 1 layer: $x \mapsto W_1 x + b_1$.
     ◮ 2 layers: $x \mapsto W_2(W_1 x + b_1) + b_2$.
     ◮ 3 layers: $x \mapsto W_3\bigl(W_2(W_1 x + b_1) + b_2\bigr) + b_3$.
     ◮ $L$ layers: $x \mapsto W_L\bigl(\cdots(W_1 x + b_1)\cdots\bigr) + b_L$.
     Alternatively, this is a composition of linear predictors: $x \mapsto (f_L \circ f_{L-1} \circ \cdots \circ f_1)(x)$, where $f_i(z) = W_i z + b_i$ is an affine function.
     Note: "layer" terminology is ambiguous; we'll revisit it.
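The iterated-linear form translates directly into code. Below is a minimal sketch (assuming NumPy; the layer widths in `dims` are made up for illustration) of the map $x \mapsto W_L(\cdots(W_1 x + b_1)\cdots) + b_L$.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]  # hypothetical widths: d_0 (input), d_1, d_2, d_3 (output)
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

def iterated_linear(x, Ws, bs):
    """Compute x -> W_L (... (W_1 x + b_1) ...) + b_L."""
    z = x
    for W, b in zip(Ws, bs):
        z = W @ z + b  # f_i(z) = W_i z + b_i
    return z

x = rng.standard_normal(dims[0])
print(iterated_linear(x, Ws, bs))  # a vector in R^{d_L}
```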

  5. Wait a minute. . .
     Note that
     $$W_L\bigl(\cdots(W_1 x + b_1)\cdots\bigr) + b_L = (W_L \cdots W_1)\,x + \bigl(b_L + W_L b_{L-1} + \cdots + W_L \cdots W_2\, b_1\bigr) = w^{\mathsf T}\begin{bmatrix} x \\ 1 \end{bmatrix},$$
     where $w \in \mathbb{R}^{d+1}$ has $w_{1:d}^{\mathsf T} = W_L \cdots W_1$ and $w_{d+1} = b_L + W_L b_{L-1} + \cdots + W_L \cdots W_2\, b_1$.
     Oops, this is just a linear predictor.
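A quick numerical check of the collapse (again a NumPy sketch with made-up widths): multiplying out the $W_i$ and accumulating the biases reproduces the iterated predictor exactly, so stacking linear layers buys nothing.

```python
import numpy as np

rng = np.random.default_rng(1)
dims = [4, 8, 8, 1]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

# Collapse: W = W_L ... W_1 and b = b_L + W_L b_{L-1} + ... + W_L ... W_2 b_1.
W, b = Ws[0], bs[0]
for Wi, bi in zip(Ws[1:], bs[1:]):
    W, b = Wi @ W, Wi @ b + bi

# Compare against the iterated form on a random input.
x = rng.standard_normal(dims[0])
z = x
for Wi, bi in zip(Ws, bs):
    z = Wi @ z + bi
print(np.allclose(z, W @ x + b))  # True: just one affine predictor
```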

  6. 2. Activations/nonlinearities.

  8. Iterated logistic regression
     Recall that logistic regression can be interpreted as a probability model:
     $$\Pr[Y = 1 \mid X = x] = \frac{1}{1 + \exp(-w^{\mathsf T} x)} =: \sigma_s(w^{\mathsf T} x),$$
     where $\sigma_s$ is the logistic or sigmoid function. [Figure: the sigmoid curve, rising from 0 to 1 over the interval $[-6, 6]$.]
     Now suppose $\sigma_s$ is applied coordinate-wise, and consider $x \mapsto (f_L \circ \cdots \circ f_1)(x)$ where $f_i(z) = \sigma_s(W_i z + b_i)$.
     Don't worry, we'll slow down next slide; for now, iterated logistic regression gave our first deep network!
     Remark: we can view intermediate layers as features to subsequent layers.
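In code, the only change from the iterated linear predictor is a sigmoid after every affine map. A NumPy sketch (widths again made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_network(x, Ws, bs):
    """Compute (f_L o ... o f_1)(x) with f_i(z) = sigmoid(W_i z + b_i)."""
    z = x
    for W, b in zip(Ws, bs):
        z = sigmoid(W @ z + b)
    return z

rng = np.random.default_rng(2)
dims = [4, 8, 8, 1]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
print(sigmoid_network(rng.standard_normal(dims[0]), Ws, bs))  # entries lie in (0, 1)
```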

  9. Basic deep networks
     A self-contained expression is
     $$x \mapsto \sigma_L\Bigl(W_L\,\sigma_{L-1}\bigl(\cdots W_2\,\sigma_1(W_1 x + b_1) + b_2 \cdots\bigr) + b_L\Bigr),$$
     with equivalent "functional form" $x \mapsto (f_L \circ \cdots \circ f_1)(x)$ where $f_i(z) = \sigma_i(W_i z + b_i)$.
     Some further details (many more to come!):
     ◮ $(W_i)_{i=1}^L$ with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ are the weights, and $(b_i)_{i=1}^L$ are the biases.
     ◮ $(\sigma_i)_{i=1}^L$ with $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ are called nonlinearities, or activations, or transfer functions, or link functions.
     ◮ This is only the basic setup; many things can and will change, please ask many questions!
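The general form differs only in allowing a different $\sigma_i$ per layer. A NumPy sketch (the widths and the particular activations are illustrative choices, not fixed by the slides); note the shape bookkeeping: $W_i$ is $d_i \times d_{i-1}$, so $\sigma_i(W_i z + b_i)$ maps $\mathbb{R}^{d_{i-1}}$ to $\mathbb{R}^{d_i}$.

```python
import numpy as np

def network(x, Ws, bs, sigmas):
    """Compute sigma_L(W_L ... sigma_1(W_1 x + b_1) ... + b_L)."""
    z = x
    for W, b, sigma in zip(Ws, bs, sigmas):
        z = sigma(W @ z + b)  # f_i(z) = sigma_i(W_i z + b_i)
    return z

relu = lambda z: np.maximum(0.0, z)
identity = lambda z: z

rng = np.random.default_rng(3)
dims = [4, 16, 16, 3]  # d_0, d_1, d_2, d_3
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
sigmas = [relu, relu, identity]  # identity at the last layer (raw scores)
print(network(rng.standard_normal(dims[0]), Ws, bs, sigmas).shape)  # (3,)
```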

  10. Choices of activation
     Basic form:
     $$x \mapsto \sigma_L\Bigl(W_L\,\sigma_{L-1}\bigl(\cdots W_2\,\sigma_1(W_1 x + b_1) + b_2 \cdots\bigr) + b_L\Bigr).$$
     Choices of activation (univariate, applied coordinate-wise):
     ◮ Indicator/step/Heaviside/threshold $z \mapsto \mathbf{1}[z \ge 0]$. This was the original choice (1940s!).
     ◮ Sigmoid $\sigma_s(z) := \frac{1}{1 + \exp(-z)}$. This was popular roughly 1970s-2005?
     ◮ Hyperbolic tangent $z \mapsto \tanh(z)$. Similar to the sigmoid, used during the same interval.
     ◮ Rectified Linear Unit (ReLU) $\sigma_r(z) = \max\{0, z\}$. It (and slight variants, e.g., Leaky ReLU, ELU, ...) is the dominant choice now; popularized in the "ImageNet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
     ◮ Identity $z \mapsto z$; we'll often use this as the last layer when we use the cross-entropy loss.
     ◮ NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.
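For concreteness, the coordinate-wise choices above as one-liners (a NumPy sketch):

```python
import numpy as np

step = lambda z: (z >= 0).astype(float)        # indicator / Heaviside / threshold
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigma_s
tanh = np.tanh                                 # hyperbolic tangent
relu = lambda z: np.maximum(0.0, z)            # sigma_r
identity = lambda z: z

z = np.linspace(-3.0, 3.0, 7)
for name, f in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("identity", identity)]:
    print(f"{name:9s}", np.round(f(z), 3))
```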

  11. "Architectures" and "models"
     Basic form:
     $$x \mapsto \sigma_L\Bigl(W_L\,\sigma_{L-1}\bigl(\cdots W_2\,\sigma_1(W_1 x + b_1) + b_2 \cdots\bigr) + b_L\Bigr).$$
     $((W_i, b_i))_{i=1}^L$, the weights and biases, are the parameters. Let's roll them into $\mathcal{W} := ((W_i, b_i))_{i=1}^L$, and consider the network as a two-parameter function $F_{\mathcal{W}}(x) = F(x; \mathcal{W})$.
     ◮ The model or class of functions is $\{F_{\mathcal{W}} : \text{all possible } \mathcal{W}\}$. $F$ (both arguments unset) is also called an architecture.
     ◮ When we fit/train/optimize, typically we leave the architecture fixed and vary $\mathcal{W}$ to minimize risk. (More on this in a moment.)

  12. ERM recipe for basic networks
     Standard ERM recipe:
     ◮ First we pick a class of functions/predictors; for deep networks, that means an architecture $F(\cdot\,;\cdot)$.
     ◮ Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:
     $$\begin{aligned}
     \operatorname*{arg\,min}_{\mathcal W}\ \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ce}}\bigl(y_i,\, F(x_i; \mathcal W)\bigr)
     &= \operatorname*{arg\,min}_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\ b_1 \in \mathbb{R}^{d_1}\\ \vdots\\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\ b_L \in \mathbb{R}^{d_L}}}\ \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ce}}\bigl(y_i,\, F(x_i; ((W_i, b_i))_{i=1}^L)\bigr)\\
     &= \operatorname*{arg\,min}_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\ b_1 \in \mathbb{R}^{d_1}\\ \vdots\\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\ b_L \in \mathbb{R}^{d_L}}}\ \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ce}}\bigl(y_i,\, \sigma_L(\cdots \sigma_1(W_1 x_i + b_1)\cdots)\bigr).
     \end{aligned}$$
     ◮ Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.
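The slides treat the optimizer as a black box; as one concrete (and purely illustrative) instantiation of the recipe, here is a sketch using PyTorch with a synthetic dataset, a small ReLU architecture, the cross-entropy loss, and plain full-batch gradient descent. The data, widths, step size, and iteration count are all made-up choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 4)                    # synthetic inputs, d = 4
y = (X[:, 0] * X[:, 1] > 0).long()         # synthetic binary labels

# Architecture F(.;.): fixed widths/activations; the parameters W are what we vary.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()            # ell_ce on the raw scores (identity last layer)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # full-batch => plain gradient descent

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)            # empirical risk (1/n) sum_i ell_ce(y_i, F(x_i; W))
    loss.backward()
    opt.step()

print("final training risk:", loss.item())
```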

  13. Remark on affine expansion
     Note: we are writing
     $$x \mapsto \sigma_L\bigl(\cdots W_2\,\sigma_1(W_1 x + b_1) + b_2 \cdots\bigr),$$
     rather than
     $$x \mapsto \sigma_L\Bigl(\cdots W_2\,\sigma_1\Bigl(W_1 \begin{bmatrix} x \\ 1 \end{bmatrix}\Bigr)\cdots\Bigr).$$
     ◮ The first form seems natural: with the "iterated linear prediction" perspective, it is natural to append a 1 at every layer.
     ◮ The second form is sufficient: with ReLU, $\sigma_r(1) = 1$, so the constant can be passed forward; similar (but more complicated) options exist for other activations.
     ◮ Why do we do it? It seems to make the optimization better behaved; this is currently not well understood.
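A small check of the two bookkeeping styles at a single layer (NumPy sketch; the shapes are arbitrary): folding the bias column into the weight matrix and appending a constant 1 to the input gives the same affine map.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((3, 5))
b = rng.standard_normal(3)
x = rng.standard_normal(5)

W_aug = np.hstack([W, b[:, None]])   # [ W | b ]
x_aug = np.append(x, 1.0)            # [ x ; 1 ]
print(np.allclose(W @ x + b, W_aug @ x_aug))  # True
```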

  17. Which architecture?
     How do we choose an architecture?
     ◮ How did we choose $k$ in $k$-nn?
     ◮ Split the data into training and validation sets, train different architectures, evaluate them on the validation set, and choose the architecture with the lowest validation error.
     ◮ As with other methods, this is a proxy for minimizing test error.
     Note.
     ◮ For many standard tasks (e.g., classification of standard vision datasets), people know good architectures.
     ◮ For new problems and new domains, things are absolutely not settled.
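A sketch of that recipe in code (PyTorch again, on a synthetic dataset; the candidate hidden widths stand in for "different architectures" and are entirely made up): train each candidate on the training split and keep the one with the lowest validation error.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(600, 4)
y = (X[:, 0] * X[:, 1] > 0).long()
X_tr, y_tr, X_va, y_va = X[:400], y[:400], X[400:], y[400:]   # train/validation split

def train(width, steps=300):
    model = nn.Sequential(nn.Linear(4, width), nn.ReLU(), nn.Linear(width, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    return model

def error(model, X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) != y).float().mean().item()

widths = [4, 16, 64]                                          # candidate architectures
best = min(widths, key=lambda w: error(train(w), X_va, y_va))
print("architecture (hidden width) with lowest validation error:", best)
```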

  18. 3. What we have gained: representation power

  19. Sometimes, linear just isn't enough
     [Figure: decision boundaries of a linear predictor and of a ReLU network on the same two-class dataset.]
     Linear predictor $x \mapsto w^{\mathsf T}\begin{bmatrix} x \\ 1 \end{bmatrix}$: some blue points misclassified.
     ReLU network $x \mapsto W_2\,\sigma_r(W_1 x + b_1) + b_2$: 0 misclassifications!

  22. Classical example: XOR
     Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)
     Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.
     Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
     ◮ If the hyperplane splits the blue points, it's incorrect on one of them.
     ◮ If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.
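By contrast, one hidden ReLU layer handles XOR. A NumPy sketch with hand-picked weights (one of many possible choices), in the form $x \mapsto W_2\,\sigma_r(W_1 x + b_1) + b_2$ from the earlier slide:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Hand-picked parameters: the hidden layer computes relu(x1 + x2) and relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = -0.5

def f(x):
    return W2 @ relu(W1 @ x + b1) + b2   # x -> W_2 sigma_r(W_1 x + b_1) + b_2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])           # XOR: opposite corners share a label
preds = np.sign([f(x) for x in X])
print(preds, "mistakes:", int((preds != y).sum()))  # 0 mistakes
```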

  23. One layer was not enough. How about two?
     Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, ...).
     Given any continuous function $f : \mathbb{R}^d \to \mathbb{R}$ and any $\epsilon > 0$, there exist parameters $(W_1, b_1, W_2)$ so that
     $$\sup_{x \in [0,1]^d} \bigl|\, f(x) - W_2\,\sigma(W_1 x + b_1) \,\bigr| \le \epsilon,$$
     as long as $\sigma$ is "reasonable" (e.g., ReLU or sigmoid or threshold).
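A concrete, low-tech illustration of the theorem in one dimension (NumPy sketch; the target $f(x) = \sin(2\pi x)$ and the grid size are my own choices): take hidden ReLU units $\sigma_r(x - t_j)$ at grid breakpoints $t_j$, and choose the output weights so that the network is the piecewise-linear interpolant of $f$; refining the grid drives the uniform error below any $\epsilon$.

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)          # a continuous target on [0, 1], with f(0) = 0
k = 50                                       # number of hidden ReLU units
t = np.linspace(0.0, 1.0, k + 1)             # breakpoints t_0 < ... < t_k
slopes = np.diff(f(t)) / np.diff(t)          # interpolant's slope on each piece

W1 = np.ones(k)                              # hidden unit j computes relu(x - t_j)
b1 = -t[:-1]
W2 = np.concatenate([[slopes[0]], np.diff(slopes)])  # slope changes at the breakpoints

def net(x):
    # x -> W_2 relu(W_1 x + b_1); no output bias is needed here since f(0) = 0
    return W2 @ np.maximum(0.0, W1 * x + b1)

xs = np.linspace(0.0, 1.0, 2001)
sup_err = max(abs(net(x) - f(x)) for x in xs)
print("sup |f - net| ~", round(sup_err, 4))  # shrinks as k grows
```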
