Part 2: Generalized output representations and structure
Dale Schuurmans, University of Alberta
Output transformation
Output transformation

What if the targets y are special? E.g. what if
• y is nonnegative: y ≥ 0
• y is a probability: y ∈ [0, 1]
• y is a class indicator: y ∈ {±1}

Would like the predictions ŷ to respect the same constraints.
Cannot do this with linear predictors, so consider a new extension.

Nonlinear output transformation: a function f such that range(f) = Y

Notation and terminology
• ẑ = x′w is the "pre-prediction"
• ŷ = f(ẑ) is the "post-prediction"
Nonlinear output transformation: Examples

• Exponential: if y ≥ 0, use ŷ = f(ẑ) = exp(ẑ)   [plot: exp(x)]
• Sigmoid: if y ∈ [0, 1], use ŷ = f(ẑ) = 1/(1 + exp(−ẑ))   [plot: 1/(1+exp(−x))]
• Sign: if y ∈ {±1}, use ŷ = f(ẑ) = sign(ẑ)   [plot: sign(x)]
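As a small illustration (my own sketch, not from the slides), the three transfer functions applied to pre-predictions ẑ = Xw:

```python
import numpy as np

def pre_predict(X, w):
    """Pre-prediction z_hat = X w for a linear model."""
    return X @ w

# Output transformations f with range(f) matching the target set Y
def f_exp(z):      # y >= 0
    return np.exp(z)

def f_sigmoid(z):  # y in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def f_sign(z):     # y in {-1, +1}
    return np.sign(z)

X = np.array([[1.0, 2.0], [0.5, -1.0]])
w = np.array([0.3, -0.2])
z_hat = pre_predict(X, w)
print(f_exp(z_hat), f_sigmoid(z_hat), f_sign(z_hat))
```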
Nonlinear output transformation: Risk

Combining an arbitrary f with L can create local minima.
E.g. L(ŷ; y) = (ŷ − y)²,  f(ẑ) = σ(ẑ) = (1 + exp(−ẑ))⁻¹

The objective Σᵢ (σ(Xᵢ:w) − yᵢ)² is not convex in w.

Consider one training example   [plot: error E vs. wx, from (Auer et al., NIPS-95)]
Local minima can combine.
Nonlinear output transformation

Possible to create exponentially many local minima:
t training examples can create (t/n)ⁿ local minima in n dimensions — locate t/n training examples along each dimension.

[surface plot of the error over (log w1, log w2), from (Auer et al., NIPS-95)]
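A quick numerical illustration (my own, not from the slides): for a single example with x = 1 and y = 1, the sigmoid-plus-squared-error objective g(w) = (σ(w) − 1)² violates the midpoint inequality for some weight pairs, so it cannot be convex in w.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(w, x=1.0, y=1.0):
    """Squared error of a sigmoid output on one example: (sigma(w*x) - y)^2."""
    return (sigmoid(w * x) - y) ** 2

w1, w2 = -4.0, 0.0
mid = g(0.5 * (w1 + w2))          # value at the midpoint
chord = 0.5 * (g(w1) + g(w2))     # average of the endpoint values
print(mid, chord)                 # mid > chord, so g is not convex in w
```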
Important idea: matching loss

Assume f is continuous, differentiable, and strictly increasing.
Want to define L(ŷ; y) so that L(f(ẑ); y) is convex in ẑ.

Define the matching loss by
  L(f(ẑ); f(z)) = ∫_z^ẑ (f(θ) − f(z)) dθ
                = F(θ)|_z^ẑ − f(z) θ|_z^ẑ
                = F(ẑ) − F(z) − f(z)(ẑ − z)
where F′(z) = f(z); this defines a Bregman divergence.
Important idea: matching loss

Properties
• F″(z) = f′(z) > 0 since f is strictly increasing
  ⇒ F strictly convex
  ⇒ F(ẑ) ≥ F(z) + f(z)(ẑ − z)  (a convex function lies above its tangent)
  ⇒ L(f(ẑ); f(z)) ≥ 0, and L(f(ẑ); f(z)) = 0 iff ẑ = z
Matching loss: examples

Identity transfer: f(z) = z, F(z) = z²/2, y = f(z) = z
  Get squared error: L(ŷ; y) = (ŷ − y)²/2

Exponential transfer: f(z) = eᶻ, F(z) = eᶻ, y = f(z) = eᶻ
  Get unnormalized entropy error: L(ŷ; y) = y ln(y/ŷ) + ŷ − y

Sigmoid transfer: f(z) = σ(z) = 1/(1 + e⁻ᶻ), F(z) = ln(1 + eᶻ), y = f(z) = σ(z)
  Get cross entropy error: L(ŷ; y) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))
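A minimal sketch (my own; names are placeholders) checking the Bregman form of the matching loss against the closed-form cross entropy for the sigmoid transfer:

```python
import numpy as np

def matching_loss(F, f, z_hat, z):
    """Bregman form of the matching loss: F(z_hat) - F(z) - f(z)(z_hat - z)."""
    return F(z_hat) - F(z) - f(z) * (z_hat - z)

# Sigmoid transfer: f = sigma, F(z) = ln(1 + e^z)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
F_sig = lambda z: np.log1p(np.exp(z))

z_hat, z = 0.7, -1.2
y_hat, y = sigma(z_hat), sigma(z)

bregman = matching_loss(F_sig, sigma, z_hat, z)
cross_entropy = y * np.log(y / y_hat) + (1 - y) * np.log((1 - y) / (1 - y_hat))
print(bregman, cross_entropy)   # the two values agree
```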
Matching loss

Given a suitable f, we can derive a matching loss that ensures convexity of L(f(Xw); y).

Retain everything from before
• efficient training
• basis expansions
• L₂² regularization → kernels
• L₁ regularization → sparsity
Major problem remains: Classification

If, say, y ∈ {±1} is a class indicator, use ŷ = sign(ẑ)   [plot: sign(x)]

sign is not continuous, differentiable, or strictly increasing
⇒ cannot use the matching loss construction

Misclassification error
  L(ŷ; y) = 1(ŷ ≠ y) = 0 if ŷ = y, 1 if ŷ ≠ y
Classification
Classification

Consider the geometry of linear classifiers: ŷ = sign(x′w)
  decision boundary {x : x′w = 0}, with normal vector w

Linear classifiers with offset: ŷ = sign(x′w − b)
  decision boundary {x : x′w − b = 0}
  the boundary passes through u = (b/‖w‖₂²) w, since u′w = b, i.e. u′w − b = 0
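A small sketch (my own) of a linear classifier with offset, checking that u = (b/‖w‖₂²) w lies on the decision boundary:

```python
import numpy as np

def predict(X, w, b):
    """Linear classifier with offset: y_hat = sign(x'w - b)."""
    return np.sign(X @ w - b)

w = np.array([2.0, -1.0])
b = 3.0
u = (b / np.dot(w, w)) * w     # point on the boundary closest to the origin
print(np.dot(u, w) - b)        # ~0, so u lies on {x : x'w - b = 0}

X = np.array([[3.0, 1.0], [0.0, 0.0], [1.0, 4.0]])
print(predict(X, w, b))        # [+1, -1, -1]
```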
Classification

Question: Given training data X, y ∈ {±1}ᵗ, can the minimum misclassification error w be computed efficiently?

Answer: Depends
Classification

Good news: Yes, if the data is linearly separable

Linear program
  min_{w, b, ξ} 1′ξ  subject to  Δ(y)(Xw − 1b) ≥ 1 − ξ,  ξ ≥ 0

Returns ξ = 0 if the data is linearly separable
Returns some ξᵢ > 0 if the data is not linearly separable
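A sketch of this separability LP using scipy.optimize.linprog (my own encoding of the constraints; the formulation itself is the one on the slide):

```python
import numpy as np
from scipy.optimize import linprog

def separability_lp(X, y):
    """LP from the slide: min 1'xi  s.t.  diag(y)(Xw - 1b) >= 1 - xi, xi >= 0.
    Returns (w, b, xi); xi == 0 everywhere iff the data is linearly separable."""
    t, n = X.shape
    # variable vector: [w (n), b (1), xi (t)]
    c = np.concatenate([np.zeros(n + 1), np.ones(t)])
    # rewrite the constraint as  -y_i * X_i: w + y_i * b - xi_i <= -1
    A_ub = np.hstack([-y[:, None] * X, y[:, None], -np.eye(t)])
    b_ub = -np.ones(t)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * t
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b, xi = res.x[:n], res.x[n], res.x[n + 1:]
    return w, b, xi

# Separable toy data: positive class above the line x2 = x1, negative below
X = np.array([[0.0, 1.0], [1.0, 2.0], [0.0, -1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, xi = separability_lp(X, y)
print(np.round(xi, 6))   # all zeros => linearly separable
```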
Classification

Bad news: No, if the data is not linearly separable

NP-hard to solve
  min_w Σᵢ 1(sign(Xᵢ:w − b) ≠ yᵢ)
in general; NP-hard even to approximate (Höffgen et al., 1995)
How to bypass the intractability of learning linear classifiers?

Two standard approaches
1. Use a matching loss to approximate sign (e.g. tanh transfer)   [plot: tanh(x)]
2. Use a surrogate loss for training, sign for test
Approximating classification with a surrogate loss

Idea: Use a different loss L̃ for training than the loss L used for testing

Example: Train on L̃(ŷ; y) = (ŷ − y)², even though we test on L(ŷ; y) = 1(ŷ ≠ y)

Obvious weakness: regression losses like least squares penalize predictions that are "too correct" (e.g. x′w = 5 with y = +1 incurs a large squared error even though sign(x′w) = y)
Tailored surrogate losses for classification

Margin losses: for a given target y and pre-prediction ẑ

Definition: the prediction margin is m = ẑy

Note
  if ẑy = m > 0 then sign(ẑ) = y: zero misclassification error
  if ẑy = m ≤ 0 then sign(ẑ) ≠ y: misclassification error 1

Definition: a margin loss is a decreasing (nonincreasing) function of the margin
Margin losses

Exponential margin loss: L̃(ẑ; y) = e^(−ẑy)   [plot]
Binomial deviance: L̃(ẑ; y) = ln(1 + e^(−ẑy))   [plot]
Margin losses

Hinge loss (support vector machines): L̃(ẑ; y) = (1 − ẑy)₊ = max(0, 1 − ẑy)   [plot]
Robust hinge loss (intractable training): L̃(ẑ; y) = 1 − tanh(ẑy)   [plot]
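For reference, a short sketch (my own) evaluating these margin losses as functions of the margin m = ẑy:

```python
import numpy as np

# Margin losses as functions of the margin m = z_hat * y
exponential   = lambda m: np.exp(-m)
binomial_dev  = lambda m: np.log1p(np.exp(-m))
hinge         = lambda m: np.maximum(0.0, 1.0 - m)
robust_hinge  = lambda m: 1.0 - np.tanh(m)
misclassification = lambda m: (m <= 0).astype(float)

m = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, loss in [("exp", exponential), ("logistic", binomial_dev),
                   ("hinge", hinge), ("robust hinge", robust_hinge),
                   ("0/1", misclassification)]:
    print(name, np.round(loss(m), 3))
# hinge and exponential upper-bound the 0/1 loss pointwise;
# all of the surrogates decrease with the margin
```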
Margin losses

Note: a convex margin loss can provide efficient upper-bound minimization for the misclassification error

Retain all previous extensions
• efficient training
• basis expansion
• L₂² regularization → kernels
• L₁ regularization → sparsity
Multivariate prediction
Multivariate prediction

What if the prediction targets y′ are vectors?

For linear predictors, use a weight matrix W
Given an input x′ (1×n), predict a vector ŷ′ = x′W (1×k), where W is n×k
On training data, get the prediction matrix Ŷ = XW (t×k, from t×n times n×k)

W:ⱼ is the weight vector for the jth output column
Wᵢ: is the vector of weights applied to the ith feature
Try to approximate the target matrix Y
Multivariate linear prediction

Need to define a loss function between vectors, e.g. L(ŷ; y) = Σ_ℓ (ŷ_ℓ − y_ℓ)²

Given X, Y, compute
  min_W Σᵢ₌₁ᵗ L(Xᵢ:W; Yᵢ:) = min_W L(XW; Y)
Note: using the shorthand L(XW; Y) = Σᵢ₌₁ᵗ L(Xᵢ:W; Yᵢ:)

Feature expansion X ↦ Φ
• Doesn't change anything, can still solve the same way as before
• Will just use X and Φ interchangeably from now on
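For the squared-error loss the multivariate problem separates over output columns and has a closed-form least-squares solution; a minimal sketch (my own, with made-up data):

```python
import numpy as np

# Multivariate least squares for the squared-error loss:
#   min_W ||X W - Y||_F^2, solved for all output columns in one call
rng = np.random.default_rng(0)
t, n, k = 50, 4, 3
X = rng.normal(size=(t, n))
W_true = rng.normal(size=(n, k))
Y = X @ W_true + 0.01 * rng.normal(size=(t, k))

W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (n, k)
print(np.max(np.abs(W_hat - W_true)))           # small recovery error
```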
Multivariate prediction

Can recover all previous developments
• efficient training
• feature expansion
• L₂² regularization → kernels
• L₁ regularization → sparsity
• output transformations
• matching loss
• classification — surrogate margin loss
L₂² regularization — kernels

  min_W L(XW; Y) + (β/2) tr(W′W)

Still get the representer theorem: the solution satisfies W* = X′A* for some A*

Therefore still get kernels
  min_W L(XW; Y) + (β/2) tr(W′W)
    = min_A L(XX′A; Y) + (β/2) tr(A′XX′A)
    = min_A L(KA; Y) + (β/2) tr(A′KA)

Note: we are actually regularizing using a matrix norm
  Frobenius norm: ‖W‖²_F = Σᵢⱼ W²ᵢⱼ = tr(W′W),  ‖W‖_F = √(Σᵢⱼ W²ᵢⱼ) = √tr(W′W)
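As a concrete instance (my own sketch, assuming the squared-error loss L(KA; Y) = ½‖KA − Y‖²_F and the β/2 scaling above), the dual solution is A* = (K + βI)⁻¹Y, and W* = X′A* gives identical predictions:

```python
import numpy as np

def kernel_ridge_multivariate(K, Y, beta):
    """Dual solution of  min_A 0.5*||K A - Y||_F^2 + (beta/2) tr(A' K A):
    the gradient K(KA - Y) + beta*K*A vanishes at A = (K + beta I)^{-1} Y."""
    t = K.shape[0]
    return np.linalg.solve(K + beta * np.eye(t), Y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(30, 2))
K = X @ X.T                            # linear kernel
A = kernel_ridge_multivariate(K, Y, beta=0.1)
W = X.T @ A                            # representer theorem: W* = X'A*
print(np.max(np.abs(X @ W - K @ A)))   # identical predictions, ~0
```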
Brief background: Recall matrix trace

Definition: for a square matrix A, tr(A) = Σᵢ Aᵢᵢ

Properties
• tr(A) = tr(A′)
• tr(aA) = a tr(A)
• tr(A + B) = tr(A) + tr(B)
• tr(A′B) = tr(B′A) = Σᵢⱼ AᵢⱼBᵢⱼ
• tr(A′A) = tr(AA′) = Σᵢⱼ A²ᵢⱼ
• tr(ABC) = tr(CAB) = tr(BCA)
• (d/dW) tr(C′W) = C
• (d/dW) tr(W′AW) = (A + A′)W
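A quick numerical check (my own) of the cyclic and inner-product trace identities on random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 3))

# cyclic property: tr(ABC) = tr(CAB) = tr(BCA)
print(np.trace(A @ B @ C), np.trace(C @ A @ B), np.trace(B @ C @ A))

# inner-product form: tr(A'D) = sum_ij A_ij D_ij (A and D of the same shape)
D = rng.normal(size=(3, 4))
print(np.trace(A.T @ D), np.sum(A * D))
```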
L₁ regularization — sparsity?

We want sparsity in the rows of W, not the columns (that is, we want feature selection, not output selection).
To achieve our goal we need to select the right regularizer.

Consider the following matrix norms
• L₁ norm: ‖W‖₁ = maxⱼ Σᵢ |Wᵢⱼ|
• L∞ norm: ‖W‖∞ = maxᵢ Σⱼ |Wᵢⱼ|
• L₂ norm: ‖W‖₂ = σ_max(W) (maximum singular value)
• trace norm: ‖W‖_tr = Σⱼ σⱼ(W) (sum of singular values)
• 2,1 block norm: ‖W‖₂,₁ = Σᵢ ‖Wᵢ:‖₂
• Frobenius norm: ‖W‖_F = √(Σᵢⱼ W²ᵢⱼ) = √(Σⱼ σⱼ(W)²)

Which, if any, of these yield the desired sparsity structure?
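A small sketch (my own) computing each of these norms on a row-sparse example W, to make the definitions concrete:

```python
import numpy as np

def matrix_norms(W):
    sv = np.linalg.svd(W, compute_uv=False)      # singular values, descending
    return {
        "L1 (max column abs sum)":   np.max(np.sum(np.abs(W), axis=0)),
        "Linf (max row abs sum)":    np.max(np.sum(np.abs(W), axis=1)),
        "L2 (max singular value)":   sv[0],
        "trace (sum of sing. vals)": np.sum(sv),
        "2,1 block (sum of row L2)": np.sum(np.linalg.norm(W, axis=1)),
        "Frobenius":                 np.linalg.norm(W, "fro"),
    }

# Row-sparse W: only the first two features are used by any output
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0],
              [0.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])
for name, val in matrix_norms(W).items():
    print(f"{name}: {val:.3f}")
```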