(Lec3-4.) Empirical risk minimization in matrix notation
Define the n × d matrix A and the n × 1 column vector b by
A := \frac{1}{\sqrt{n}} \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.
We can write the empirical risk as
\hat{R}(w) = \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T w)^2 = \|Aw - b\|_2^2.
A necessary condition for w to be a minimizer of \hat{R}: \nabla \hat{R}(w) = 0, i.e., w is a critical point of \hat{R}. This translates to
(A^T A) w = A^T b,
a system of linear equations called the normal equations.
In an upcoming lecture we'll prove every critical point of \hat{R} is a minimizer of \hat{R}.
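As a concrete illustration, here is a minimal numpy sketch (my own, not from the slides) that builds A and b as defined above and recovers a least-squares solution; the synthetic data and the use of np.linalg.lstsq are illustrative choices.

```python
import numpy as np

def ols_via_normal_equations(X, y):
    """Least squares via the normal equations (A^T A) w = A^T b.

    X: (n, d) array with rows x_i, y: (n,) array of labels.
    The 1/sqrt(n) scaling cancels in the normal equations, but we keep it
    to match the slide's definition of A and b.
    """
    n = X.shape[0]
    A = X / np.sqrt(n)
    b = y / np.sqrt(n)
    # lstsq minimizes ||Aw - b||^2 and handles rank-deficient A.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)
w_hat = ols_via_normal_equations(X, y)
assert np.allclose(X.T @ X @ w_hat, X.T @ y)  # normal equations hold
```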
(Lec3-4.) Full (factorization) SVD (new slide)
Given M ∈ R^{n×d}, let M = U S V^T denote the singular value decomposition (SVD), where
◮ U ∈ R^{n×n} is orthonormal, thus U^T U = U U^T = I,
◮ V ∈ R^{d×d} is orthonormal, thus V^T V = V V^T = I,
◮ S ∈ R^{n×d} has the singular values s_1 ≥ s_2 ≥ ⋯ ≥ s_{min{n,d}} along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of M.
Some facts:
◮ The SVD is not unique when the singular values are not distinct; e.g., we can write I = U I U^T where U is any orthonormal matrix.
◮ The pseudoinverse S^+ ∈ R^{d×n} of S is obtained by starting with S^T and taking the reciprocal of each positive entry.
◮ The pseudoinverse of M is M^+ = V S^+ U^T.
◮ If M^{-1} exists, then M^{-1} = M^+.
(Lec3-4.) Thin (decomposition) SVD (new slide)
Given M ∈ R^{n×d}, (s, u, v) are a singular value with corresponding left and right singular vectors if
M v = s u   and   M^T u = s v.
The thin SVD of M is M = \sum_{i=1}^r s_i u_i v_i^T, where r is the rank of M, and
◮ the left singular vectors (u_1, …, u_r) are orthonormal (but we might have r < min{n, d}!) and span the column space of M,
◮ the right singular vectors (v_1, …, v_r) are orthonormal (but we might have r < min{n, d}!) and span the row space of M,
◮ the singular values satisfy s_1 ≥ ⋯ ≥ s_r > 0.
Some facts:
◮ The pseudoinverse is M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^T.
◮ (u_i)_{i=1}^r span the column space of M.
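The following numpy sketch (illustrative, not from the slides) computes a thin SVD of a rank-deficient matrix and checks the pseudoinverse formula M^+ = \sum_{i=1}^r (1/s_i) v_i u_i^T against np.linalg.pinv; the matrix sizes and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a 5x3 matrix of rank 2, so that r < min{n, d}.
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 3))

# full_matrices=False gives U (5x3), s (3,), Vt (3x3).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = np.sum(s > 1e-10)                     # numerical rank; here r == 2

# Keep only the r positive singular values and their vectors.
Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T

# M = sum_i s_i u_i v_i^T  and  M^+ = sum_i (1/s_i) v_i u_i^T.
M_rebuilt = (Ur * sr) @ Vr.T
M_pinv = (Vr / sr) @ Ur.T

assert np.allclose(M_rebuilt, M)
assert np.allclose(M_pinv, np.linalg.pinv(M))
```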
(Lec3-4.) SVD and least squares
Recall: we'd like to find w such that A^T A w = A^T b.
If w = A^+ b, then
A^T A w = \Big( \sum_{i=1}^r s_i v_i u_i^T \Big) \Big( \sum_{i=1}^r s_i u_i v_i^T \Big) \Big( \sum_{i=1}^r \frac{1}{s_i} v_i u_i^T \Big) b = \Big( \sum_{i=1}^r s_i v_i u_i^T \Big) \Big( \sum_{i=1}^r u_i u_i^T \Big) b = A^T b.
Henceforth, define ŵ_ols := A^+ b as the OLS solution. (OLS = "ordinary least squares".)
Note: in general, A A^+ = \sum_{i=1}^r u_i u_i^T ≠ I.
(Lec3-4.) Normal equations imply optimality
Consider w with A^T A w = A^T y, and any w'; then
\|Aw' - y\|^2 = \|Aw' - Aw + Aw - y\|^2 = \|Aw' - Aw\|^2 + 2 (Aw' - Aw)^T (Aw - y) + \|Aw - y\|^2.
Since (Aw' - Aw)^T (Aw - y) = (w' - w)^T (A^T A w - A^T y) = 0, then
\|Aw' - y\|^2 = \|Aw' - Aw\|^2 + \|Aw - y\|^2 ≥ \|Aw - y\|^2.
This means w is optimal.
Moreover, writing A = \sum_{i=1}^r s_i u_i v_i^T,
\|Aw' - Aw\|^2 = (w' - w)^T (A^T A) (w' - w) = (w' - w)^T \Big( \sum_{i=1}^r s_i^2 v_i v_i^T \Big) (w' - w),
so w' is also optimal iff w' - w is in the (right) nullspace of A.
(We'll revisit all this with convexity later.)
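A quick numerical illustration of the last point (my own sketch, not from the slides): perturbing a solution of the normal equations along the nullspace of A leaves the risk unchanged, while a generic perturbation never decreases it.

```python
import numpy as np

rng = np.random.default_rng(2)
# Wide system: n = 4, d = 6, so A has a nontrivial nullspace.
A = rng.normal(size=(4, 6))
y = rng.normal(size=4)

w = np.linalg.pinv(A) @ y                      # OLS (minimum-norm) solution

# Nullspace basis: right singular vectors whose singular value is zero.
U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)
N = Vt[r:, :].T                                # columns span {v : Av = 0}

risk = lambda v: np.linalg.norm(A @ v - y) ** 2
w_null = w + N @ rng.normal(size=N.shape[1])   # shift along the nullspace
w_other = w + rng.normal(size=6)               # generic shift

assert np.isclose(risk(w_null), risk(w))       # still optimal
assert risk(w_other) >= risk(w) - 1e-12        # never beats the OLS risk
```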
(Lec3-4.) Regularized ERM
Combine the two concerns: for a given λ ≥ 0, find a minimizer of
\hat{R}(w) + λ \|w\|_2^2
over w ∈ R^d.
Fact: if λ > 0, then the solution is always unique (even if n < d)!
◮ This is called ridge regression. (λ = 0 is ERM / ordinary least squares.) Explicit solution: (A^T A + λ I)^{-1} A^T b.
◮ The parameter λ controls how much attention is paid to the regularizer \|w\|_2^2 relative to the data-fitting term \hat{R}(w).
◮ Choose λ using cross-validation.
Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ_1.
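A minimal ridge-regression sketch in numpy (illustrative; the function and variable names are my own), implementing the explicit solution (A^T A + λI)^{-1} A^T b:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: minimize ||Aw - b||^2 + lam * ||w||^2,
    with A = X / sqrt(n) and b = y / sqrt(n) as on the earlier slide."""
    n, d = X.shape
    A, b = X / np.sqrt(n), y / np.sqrt(n)
    # (A^T A + lam I) is positive definite for lam > 0, so `solve` is safe.
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Even when n < d (here 5 samples, 20 features) the solution is unique.
rng = np.random.default_rng(3)
X, y = rng.normal(size=(5, 20)), rng.normal(size=5)
w = ridge(X, y, lam=0.1)
```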
(Lec5-6.) Geometry of linear classifiers
A hyperplane in R^d is a linear subspace of dimension d - 1.
◮ A hyperplane in R^2 is a line.
◮ A hyperplane in R^3 is a plane.
◮ As a linear subspace, a hyperplane always contains the origin.
A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d. The hyperplane with normal vector w is the set of points orthogonal to w:
H = { x ∈ R^d : x^T w = 0 }.
Given w and its corresponding H: H splits R^d into the points labeled positive, { x : w^T x > 0 }, and those labeled negative, { x : w^T x < 0 }.
[Figure: a hyperplane H in R^2 (a line through the origin) with its normal vector w.]
(Lec5-6.) Classification with a hyperplane
[Figure: a point x at angle θ from w, with its projection onto span{w}.]
The projection of x onto span{w} (a line) has coordinate \|x\|_2 · cos(θ), where
cos(θ) = \frac{x^T w}{\|w\|_2 \|x\|_2}.
(The distance to the hyperplane is \|x\|_2 · |cos(θ)|.)
The decision boundary is the hyperplane (oriented by w):
x^T w > 0 ⟺ \|x\|_2 · cos(θ) > 0 ⟺ x is on the same side of H as w.
What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?
(Lec5-6.) Linear separability
Is it always possible to find w with sign(w^T x_i) = y_i? Is it always possible to find a hyperplane separating the data? (Appending a 1 means it need not go through the origin.)
[Figure: two scatter plots; left: linearly separable data; right: data that is not linearly separable.]
(Lec5-6.) Cauchy-Schwarz (new slide)
Cauchy-Schwarz inequality. |a^T b| ≤ \|a\| · \|b\|.
Proof. If \|a\| = \|b\|, then
0 ≤ \|a - b\|^2 = \|a\|^2 - 2 a^T b + \|b\|^2 = 2 \|a\| · \|b\| - 2 a^T b,
which rearranges to give a^T b ≤ \|a\| · \|b\|.
For the case \|a\| < \|b\| (the case a = 0 is trivial), apply the preceding to the equal-norm pair
\Big( \sqrt{\tfrac{\|b\|}{\|a\|}} \, a, \; \sqrt{\tfrac{\|a\|}{\|b\|}} \, b \Big).
For the absolute value, apply the preceding to (a, -b). □
(Lec5-6.) Logistic loss
Let's state our classification goal with a generic margin loss ℓ:
\hat{R}_ℓ(w) = \frac{1}{n} \sum_{i=1}^n ℓ(y_i w^T x_i);
the key properties we want:
◮ ℓ is continuous;
◮ ℓ(z) ≥ c · 1[z ≤ 0] = c · ℓ_zo(z) for some c > 0 and any z ∈ R, which implies \hat{R}_ℓ(w) ≥ c \hat{R}_zo(w);
◮ ℓ'(0) < 0 (pushes stuff from wrong to right).
Examples.
◮ Squared loss, written in margin form: ℓ_ls(z) := (1 - z)^2; note ℓ_ls(y ŷ) = (1 - y ŷ)^2 = y^2 (1 - y ŷ)^2 = (y - ŷ)^2 (since y ∈ {-1, +1}, so y^2 = 1).
◮ Logistic loss: ℓ_log(z) = ln(1 + exp(-z)).
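To make the margin-loss view concrete, here is a small numpy sketch (my own, not from the slides) of the zero-one, squared, and logistic losses as functions of the margin z = y w^T x; it checks the lower-bound property ℓ_log(z) ≥ c · 1[z ≤ 0] with c = ln 2.

```python
import numpy as np

def loss_zero_one(z):            # 1[z <= 0]
    return (z <= 0).astype(float)

def loss_squared(z):             # (1 - z)^2, squared loss in margin form
    return (1.0 - z) ** 2

def loss_logistic(z):            # ln(1 + exp(-z)), computed stably
    return np.logaddexp(0.0, -z)

z = np.linspace(-3, 3, 601)
c = np.log(2)                    # the logistic loss at z = 0
assert np.all(loss_logistic(z) >= c * loss_zero_one(z) - 1e-12)
```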
(Lec5-6.) Squared and logistic losses on linearly separable data I
[Figure: contour plots of the learned predictors on linearly separable data; left: logistic loss; right: squared loss.]
(Lec5-6.) Squared and logistic losses on linearly separable data II
[Figure: contour plots of the learned predictors on a second linearly separable dataset; left: logistic loss; right: squared loss.]
(Lec5-6.) Logistic risk and separation
If there exists a perfect linear separator, empirical logistic risk minimization should find it.
Theorem. If there exists \bar{w} with y_i \bar{w}^T x_i > 0 for all i, then every w with
\hat{R}_log(w) < \frac{\ln 2}{2n} + \inf_v \hat{R}_log(v)
also satisfies y_i w^T x_i > 0 for all i.
Proof. Omitted.
(Lec5-6.) Least squares and logistic ERM
Least squares:
◮ Take the gradient of \|Aw - b\|^2, set it to 0; obtain the normal equations A^T A w = A^T b.
◮ One choice is the minimum norm solution A^+ b.
Logistic loss:
◮ Take the gradient of \hat{R}_log(w) = \frac{1}{n} \sum_{i=1}^n \ln(1 + \exp(-y_i w^T x_i)) and set it to 0 ???
Remark. Is A^+ b a "closed form expression"?
(Lec5-6.) Gradient descent
Given a function F : R^d → R, gradient descent is the iteration
w_{i+1} := w_i - η_i ∇_w F(w_i),
where w_0 is given, and η_i is a learning rate / step size.
[Figure: gradient descent iterates over the contour lines of a function.]
Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.
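A minimal gradient-descent sketch (illustrative code, not from the slides) applied to the least-squares objective \|Aw - b\|^2; the constant step size and iteration count are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.2, steps=2000):
    """Run w_{i+1} = w_i - eta * grad(w_i) for a fixed number of steps."""
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

rng = np.random.default_rng(4)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
n = X.shape[0]
A, b = X / np.sqrt(n), y / np.sqrt(n)

# Gradient of F(w) = ||Aw - b||^2 is 2 A^T (Aw - b).
grad_ls = lambda w: 2 * A.T @ (A @ w - b)
w_gd = gradient_descent(grad_ls, np.zeros(3))

# Agrees with the pseudoinverse solution from the earlier slides.
assert np.allclose(w_gd, np.linalg.pinv(A) @ b, atol=1e-6)
```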
(Lec5-6.) Multiclass?
All our methods so far handle multiclass:
◮ k-nn and decision tree: plurality label.
◮ Least squares: \arg\min_{W ∈ R^{d×k}} \|AW - B\|_F^2 with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in R^d.
How about linear classifiers?
◮ At prediction time, x ↦ \arg\max_y f(x)_y.
◮ As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].
What is a good loss function?
(Lec5-6.) Cross-entropy
Given two probability vectors p, q ∈ Δ_k = { p ∈ R^k_{≥0} : \sum_i p_i = 1 },
H(p, q) = - \sum_{i=1}^k p_i \ln q_i    (cross-entropy).
◮ If p = q, then H(p, q) = H(p) (entropy); indeed
H(p, q) = - \sum_{i=1}^k p_i \ln p_i + \sum_{i=1}^k p_i \ln \frac{p_i}{q_i} = H(p) + KL(p, q),
where the first term is the entropy and the second the KL divergence. Since KL ≥ 0, and moreover KL = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.
◮ Choose the encoding ỹ = e_y for y ∈ {1, …, k}, and ŷ ∝ exp(f(x)) with f : R^d → R^k; then
ℓ_ce(ỹ, f(x)) = H(ỹ, ŷ) = - \sum_{i=1}^k ỹ_i \ln \frac{\exp(f(x)_i)}{\sum_{j=1}^k \exp(f(x)_j)} = - \ln \frac{\exp(f(x)_y)}{\sum_{j=1}^k \exp(f(x)_j)} = - f(x)_y + \ln \sum_{j=1}^k \exp(f(x)_j).
(In pytorch, use torch.nn.CrossEntropyLoss()(f(x), y).)
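A small sketch (my own) checking the identity ℓ_ce(ỹ, f(x)) = -f(x)_y + ln \sum_j exp(f(x)_j) against pytorch's torch.nn.CrossEntropyLoss, which takes raw scores and bakes the softmax into the loss as described above:

```python
import torch

torch.manual_seed(0)
b, k = 4, 3
logits = torch.randn(b, k)            # f(x) for a minibatch of 4 examples
y = torch.tensor([0, 2, 1, 2])        # class labels in {0, ..., k-1}

# Manual cross-entropy: -f(x)_y + ln sum_j exp(f(x)_j), averaged over the batch.
manual = (-logits[torch.arange(b), y] + torch.logsumexp(logits, dim=1)).mean()

builtin = torch.nn.CrossEntropyLoss()(logits, y)
assert torch.isclose(manual, builtin)
```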
(Lec5-6.) Cross-entropy, classification, and margins
The zero-one loss for classification is
ℓ_zo(y, f(x)) = 1[ y ≠ \arg\max_j f(x)_j ].
In the multiclass case, we can define the margin as f(x)_y - \max_{j ≠ y} f(x)_j, interpreted as "the distance by which f is correct". (Can be negative!)
Since \ln \sum_j \exp(z_j) ≈ \max_j z_j, cross-entropy satisfies
ℓ_ce(ỹ, f(x)) = - f(x)_y + \ln \sum_j \exp(f(x)_j) ≈ - f(x)_y + \max_j f(x)_j,
thus minimizing cross-entropy maximizes margins.
(Lec7-8.) The ERM perspective
These lectures will follow an ERM perspective on deep networks:
◮ Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
◮ Pick a loss/risk. (We will almost always use cross-entropy!)
◮ Pick an optimizer. (We will mostly treat this as a black box!)
The goal is low test error, whereas the above only gives low training error; we will briefly discuss this gap as well.
(Lec7-8.) Basic deep networks
A self-contained expression is
x ↦ σ_L( W_L σ_{L-1}( ⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯ ) + b_L ),
with equivalent "functional form" x ↦ (f_L ∘ ⋯ ∘ f_1)(x), where f_i(z) = σ_i(W_i z + b_i).
Some further details (many more to come!):
◮ (W_i)_{i=1}^L with W_i ∈ R^{d_i × d_{i-1}} are the weights, and (b_i)_{i=1}^L are the biases.
◮ (σ_i)_{i=1}^L with σ_i : R^{d_i} → R^{d_i} are called nonlinearities, or activations, or transfer functions, or link functions.
◮ This is only the basic setup; many things can and will change, please ask many questions!
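A minimal numpy sketch (my own; the layer sizes are arbitrary) of the forward pass x ↦ σ_L(W_L ⋯ σ_1(W_1 x + b_1) ⋯ + b_L), using ReLU hidden activations and an identity last layer:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, activations):
    """Compute sigma_L(W_L ... sigma_1(W_1 x + b_1) ... + b_L)."""
    z = x
    for W, b, sigma in zip(weights, biases, activations):
        z = sigma(W @ z + b)
    return z

# A 3-layer network R^4 -> R^8 -> R^8 -> R^2 (sizes chosen arbitrarily).
rng = np.random.default_rng(5)
dims = [4, 8, 8, 2]
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
activations = [relu, relu, lambda z: z]   # identity last layer (for cross-entropy)

x = rng.normal(size=4)
print(forward(x, weights, biases, activations))   # 2 output scores
```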
(Lec7-8.) Choices of activation
Basic form:
x ↦ σ_L( W_L σ_{L-1}( ⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯ ) + b_L ).
Choices of activation (univariate, coordinate-wise):
◮ Indicator/step/Heaviside/threshold z ↦ 1[z ≥ 0]. This was the original choice (1940s!).
◮ Sigmoid σ_s(z) := 1 / (1 + exp(-z)). This was popular roughly 1970s - 2005?
◮ Hyperbolic tangent z ↦ tanh(z). Similar to sigmoid, used during the same interval.
◮ Rectified Linear Unit (ReLU) σ_r(z) = max{0, z}. It (and slight variants, e.g., Leaky ReLU, ELU, …) are the dominant choice now; popularized in the "Imagenet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
◮ Identity z ↦ z; we'll often use this as the last layer when we use cross-entropy loss.
◮ NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.
(Lec7-8.) “Architectures” and “models”
Basic form:
x ↦ σ_L( W_L σ_{L-1}( ⋯ W_2 σ_1(W_1 x + b_1) + b_2 ⋯ ) + b_L ).
((W_i, b_i))_{i=1}^L, the weights and biases, are the parameters. Let's roll them into W := ((W_i, b_i))_{i=1}^L, and consider the network as a two-argument function F_W(x) = F(x; W).
◮ The model or class of functions is { F_W : all possible W }. F (both arguments unset) is also called an architecture.
◮ When we fit/train/optimize, typically we leave the architecture fixed and vary W to minimize risk. (More on this in a moment.)
(Lec7-8.) ERM recipe for basic networks
Standard ERM recipe:
◮ First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).
◮ Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:
\arg\min_{W} \frac{1}{n} \sum_{i=1}^n ℓ_ce(y_i, F(x_i; W))
= \arg\min_{W_1 ∈ R^{d_1 × d}, b_1 ∈ R^{d_1}, …, W_L ∈ R^{d_L × d_{L-1}}, b_L ∈ R^{d_L}} \frac{1}{n} \sum_{i=1}^n ℓ_ce(y_i, F(x_i; ((W_i, b_i))_{i=1}^L))
= \arg\min_{W_1 ∈ R^{d_1 × d}, b_1 ∈ R^{d_1}, …, W_L ∈ R^{d_L × d_{L-1}}, b_L ∈ R^{d_L}} \frac{1}{n} \sum_{i=1}^n ℓ_ce(y_i, σ_L(⋯ σ_1(W_1 x_i + b_1) ⋯)).
◮ Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.
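To tie the three steps together, here is a minimal pytorch training-loop sketch (illustrative only; the architecture sizes, learning rate, step count, and synthetic data are arbitrary choices, not anything prescribed by the slides):

```python
import torch

torch.manual_seed(0)

# Step 1: pick an architecture F(.; W): a small 2-hidden-layer ReLU network.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 3),            # identity last layer: raw scores
)

# Step 2: pick a loss: cross-entropy (softmax is baked in).
loss_fn = torch.nn.CrossEntropyLoss()

# Step 3: pick an optimizer: full-batch gradient descent here.
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Synthetic data: 64 examples in R^4 with labels in {0, 1, 2}.
X = torch.randn(64, 4)
y = torch.randint(0, 3, (64,))

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)        # empirical risk on the full sample
    loss.backward()                    # gradients via backpropagation
    opt.step()                         # w <- w - eta * gradient
```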
(Lec7-8.) Sometimes, linear just isn't enough
[Figure: two decision boundaries on the same dataset; left: a linear predictor; right: a ReLU network.]
Linear predictor: x ↦ w^T [x; 1]. Some blue points misclassified.
ReLU network: x ↦ W_2 σ_r(W_1 x + b_1) + b_2. 0 misclassifications!
(Lec7-8.) Classical example: XOR
Classical "XOR problem" (Minsky-Papert '69). (Check wikipedia for "AI Winter".)
Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.
Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
◮ If it splits the blue points, it's incorrect on one of them.
◮ If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.
(Lec7-8.) One layer was not enough. How about two?
Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, …).
Given any continuous function f : R^d → R and any ε > 0, there exist parameters (W_1, b_1, W_2) so that
\sup_{x ∈ [0,1]^d} | f(x) - W_2 σ(W_1 x + b_1) | ≤ ε,
as long as σ is "reasonable" (e.g., ReLU or sigmoid or threshold).
Remarks.
◮ Together with the XOR example, this justifies using nonlinearities.
◮ Does not justify (very) deep networks.
◮ Only says these networks exist, not that we can optimize for them!
(Lec7-8.) General graph-based view
Classical graph-based perspective.
◮ The network is a directed acyclic graph; sources are inputs, sinks are outputs, and intermediate nodes compute z ↦ σ(w^T z + b) (each with its own (σ, w, b)).
◮ Nodes at distance 1 from the inputs are the first layer, those at distance 2 the second layer, and so on.
"Modern" graph-based perspective.
◮ Edges in the graph can be multivariate, meaning vectors or general tensors, not just scalars.
◮ Edges will often "skip" layers; "layer" is therefore ambiguous.
◮ Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters.
(Lec7-8.) 2-D convolution in deep networks (pictures)
[Figure: animated illustrations of 2-D convolutions with various paddings and strides.]
(Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)
(Lec7-8.) Softmax
Replace the vector input z with z' ∝ e^z, meaning
z ↦ \Big( \frac{e^{z_1}}{\sum_j e^{z_j}}, …, \frac{e^{z_k}}{\sum_j e^{z_j}} \Big).
◮ Converts the input into a probability vector; useful for interpreting the network output as Pr[Y = y | X = x].
◮ We have baked it into our cross-entropy definition; last lectures' networks with cross-entropy training had an implicit softmax.
◮ If some coordinate j of z dominates the others, then the softmax is close to e_j.
(Lec7-8.) Max pooling
[Figure: max pooling with a 3×3 window (stride 1) over a 5×5 input, producing a 3×3 output.]
(Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)
◮ Often used together with convolution layers; shrinks/downsamples the input.
◮ Another variant is average pooling.
◮ Implementation: torch.nn.MaxPool2d.
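A small pytorch sketch (my own) reproducing the figure's example, a 3×3 max-pooling window with stride 1 over the 5×5 input shown above:

```python
import torch

# The 5x5 input from the figure above.
x = torch.tensor([[3, 3, 2, 1, 0],
                  [0, 0, 1, 3, 1],
                  [3, 1, 2, 2, 3],
                  [2, 0, 0, 2, 2],
                  [2, 0, 0, 0, 1]], dtype=torch.float32)

# MaxPool2d expects (batch, channels, height, width).
pool = torch.nn.MaxPool2d(kernel_size=3, stride=1)
out = pool(x.reshape(1, 1, 5, 5)).reshape(3, 3)
print(out)
# tensor([[3., 3., 3.],
#         [3., 3., 3.],
#         [3., 2., 3.]])
```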
(Lec9-10.) Multivariate network single-example gradients
Define G_j(W) = σ_j(W_j ⋯ σ_1(W_1 x + b_1) ⋯). The multivariate chain rule tells us
∇_W F(W x) = J^T x^T,
where J ∈ R^{l×k} is the Jacobian matrix of F : R^k → R^l at W x, the matrix of all coordinate-wise derivatives. Then
\frac{dG_L}{dW_L} = J_L^T G_{L-1}(W)^T, \qquad \frac{dG_L}{db_L} = J_L^T,
⋮
\frac{dG_L}{dW_j} = (J_L W_L J_{L-1} W_{L-1} ⋯ J_j)^T G_{j-1}(W)^T, \qquad \frac{dG_L}{db_j} = (J_L W_L J_{L-1} W_{L-1} ⋯ J_j)^T,
with J_j the Jacobian of σ_j at W_j G_{j-1}(W) + b_j. For example, when σ_j applies a coordinate-wise σ : R → R,
J_j = diag\Big( σ'\big((W_j G_{j-1}(W) + b_j)_1\big), …, σ'\big((W_j G_{j-1}(W) + b_j)_{d_j}\big) \Big).
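For a concrete sanity check (my own sketch, not from the slides), pytorch's autograd composes exactly these Jacobian products. The code below uses a tiny two-layer network with a scalar output, tanh hidden activation, and identity output activation, and checks the formula dG_L/dW_j = (J_L W_L ⋯ J_j)^T G_{j-1}(W)^T for j = 1, taking G_0(W) = x.

```python
import torch

torch.manual_seed(1)
d, d1, d2 = 3, 4, 1                       # scalar output, so gradients are plain matrices
x = torch.randn(d)
W1, b1 = torch.randn(d1, d, requires_grad=True), torch.randn(d1, requires_grad=True)
W2, b2 = torch.randn(d2, d1, requires_grad=True), torch.randn(d2, requires_grad=True)

# G_1 = tanh(W1 x + b1),  G_2 = W2 G_1 + b2  (identity last activation).
G1 = torch.tanh(W1 @ x + b1)
G2 = W2 @ G1 + b2
G2.sum().backward()

# Manual formula: dG_2/dW_1 = (J_2 W_2 J_1)^T x^T with
# J_2 = I (identity activation) and J_1 = diag(tanh'(W1 x + b1)) = diag(1 - G1^2).
J1 = torch.diag(1 - G1.detach() ** 2)
v = W2.detach() @ J1                      # the row vector J_2 W_2 J_1, shape (1, d1)
manual_grad_W1 = v.T @ x.reshape(1, d)    # (d1, 1) @ (1, d) = (d1, d)

assert torch.allclose(W1.grad, manual_grad_W1, atol=1e-6)
```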
(Lec9-10.) Initialization
Recall
\frac{dG_L}{dW_j} = (J_L W_L J_{L-1} W_{L-1} ⋯ J_j)^T G_{j-1}(W)^T.
◮ What if we set W = 0? What if σ = σ_r is a ReLU?
◮ What if we set two rows of W_j (two nodes) identically?
◮ Resolving this issue is called symmetry breaking.
◮ Standard linear/dense layer initializations:
  N(0, 2/d_in) ("He et al."),
  N(0, 2/(d_in + d_out)) (Glorot/Xavier),
  U(-1/√d_in, 1/√d_in) (torch default).
  (Convolution layers are adjusted to have similar distributions.)
Random initialization is emerging as a key story in deep networks!
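A small numpy sketch (my own) of the three sampling distributions listed above for a dense layer with d_in inputs and d_out outputs; it illustrates only the distributions stated on the slide, not any particular library's full default behavior.

```python
import numpy as np

def init_he(d_in, d_out, rng):
    # N(0, 2/d_in): variance 2/d_in, i.e., standard deviation sqrt(2/d_in).
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

def init_xavier(d_in, d_out, rng):
    # N(0, 2/(d_in + d_out)).
    return rng.normal(0.0, np.sqrt(2.0 / (d_in + d_out)), size=(d_out, d_in))

def init_uniform_torch_style(d_in, d_out, rng):
    # U(-1/sqrt(d_in), 1/sqrt(d_in)), as in the torch default mentioned above.
    bound = 1.0 / np.sqrt(d_in)
    return rng.uniform(-bound, bound, size=(d_out, d_in))

rng = np.random.default_rng(6)
W = init_he(256, 128, rng)
print(W.std())   # roughly sqrt(2/256) ~ 0.088
```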
(Lec9-10; SGD slide.) Minibatches
We used the linearity of gradients to write
∇_w \hat{R}(w) = \frac{1}{n} \sum_{i=1}^n ∇_w ℓ(F(x_i; w), y_i).
What happens if we replace ((x_i, y_i))_{i=1}^n with a minibatch ((x'_i, y'_i))_{i=1}^b?
◮ Random minibatch ⟹ the two gradients are equal in expectation.
◮ Most torch layers take minibatch input:
  ◮ torch.nn.Linear has input shape (b, d), output (b, d').
  ◮ torch.nn.Conv2d has input shape (b, c, h, w), output (b, c', h', w').
◮ This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as on GPU and CPU).
◮ Setting the batch size is black magic and depends on many things (prediction problem, GPU characteristics, …).
(Lec9-10.) Convex sets
A set S is convex if, for every pair of points {x, x'} in S, the line segment between x and x' is also contained in S. ({x, x'} ⊆ S ⟹ [x, x'] ⊆ S.)
[Figure: four example sets, labeled convex, not convex, convex, convex.]
Examples:
◮ All of R^d.
◮ The empty set.
◮ Half-spaces: { x ∈ R^d : a^T x ≤ b }.
◮ Intersections of convex sets.
◮ Polyhedra: { x ∈ R^d : Ax ≤ b } = \bigcap_{i=1}^m { x ∈ R^d : a_i^T x ≤ b_i }.
◮ Convex hulls: conv(S) := { \sum_{i=1}^k α_i x_i : k ∈ N, x_i ∈ S, α_i ≥ 0, \sum_{i=1}^k α_i = 1 }. (Infinite convex hulls: intersection of all convex supersets.)
(Lec9-10.) Convex functions from convex sets
The epigraph of a function f is the area above the curve:
epi(f) := { (x, y) ∈ R^{d+1} : y ≥ f(x) }.
A function is convex if its epigraph is convex.
[Figure: a non-convex f and a convex f, with their epigraphs shaded.]
(Lec9-10.) Convex functions (standard definition)
A function f : R^d → R is convex if for any x, x' ∈ R^d and α ∈ [0, 1],
f((1 - α) x + α x') ≤ (1 - α) · f(x) + α · f(x').
[Figure: a non-convex f and a convex f, with the chord between x and x'.]