Decision Rule-based Algorithm for Ordinal Classification based on Rank Loss Minimization
Krzysztof Dembczyński¹,², Wojciech Kotłowski¹,³
¹ Institute of Computing Science, Poznań University of Technology
² KEBI, Philipps-Universität Marburg
³ Centrum Wiskunde & Informatica, Amsterdam
PL-09, Bled, September 11, 2009
1 Ordinal Classification 2 RankRules 3 Conclusions
Ordinal classification consists in predicting a label taken from a finite and ordered set for an object described by some attributes. This problem shares some characteristics of multi-class classification and regression, but:
• the order between class labels cannot be neglected,
• the scale of the decision attribute is not cardinal.
Recommender system predicting a rating of a movie for a given user.
Email filtering into ordered groups such as: important, normal, later, or spam.
Notation:
• K – number of classes
• y – actual label
• x – attributes
• ŷ – predicted label
• F(x) – prediction function
• f(x) – ranking or utility function
• θ = (θ_0, …, θ_K) – thresholds
• L(·) – loss function
• ⟦·⟧ – Boolean test (1 if the predicate holds, 0 otherwise)
• {y_i, x_i}, i = 1, …, N – training examples
Ordinal Classification:
• Since y is discrete, it obeys a multinomial distribution for a given x: p_k(x) = Pr(y = k | x), k = 1, …, K.
• The optimal prediction is given by:
  $$y^* = F^*(x) = \arg\min_{F(x)} \sum_{k=1}^{K} p_k(x)\, L(k, F(x)),$$
  where L(y, ŷ) = (l_{y,ŷ})_{K×K} is the loss function defined as a matrix with V-shaped rows and zeros on the diagonal, e.g. for K = 3:
  $$L(y, \hat{y}) = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix}.$$
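As a quick illustration of the decision rule above, a minimal Python sketch (the class distribution and the loss matrix are made-up example values) that picks the label minimizing the expected loss:

```python
import numpy as np

# Absolute-error loss matrix for K = 3 classes: L[y, y_hat] = |y - y_hat|
K = 3
L = np.abs(np.subtract.outer(np.arange(K), np.arange(K)))

# Hypothetical class distribution p_k(x) for a single object x
p = np.array([0.2, 0.3, 0.5])

# Expected loss of each candidate prediction, and its minimizer
expected_loss = p @ L                        # expected_loss[k] = sum_j p_j * L[j, k]
y_star = int(np.argmin(expected_loss)) + 1   # labels are 1..K
print(expected_loss, y_star)                 # [1.3 0.7 0.7] 2
```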
Ordinal Classification:
• A natural choice of the loss matrix is the absolute-error loss, for which l_{y,ŷ} = |y − ŷ|.
• The optimal prediction in this case is the median of the class distribution:
  $$F^*(x) = \mathrm{median}_{p_k(x)}(y).$$
• The median does not depend on the distance between class labels, so the scale of the decision attribute does not matter; only the order of the labels is taken into account.
Two Approaches to Ordinal Classification:
• Threshold Loss Minimization (SVOR, ORBoost-All, MMMF),
• Rank Loss Minimization (RankSVM, RankBoost).
In both approaches, one assumes the existence of:
• a ranking (or utility) function f(x), and
• consecutive thresholds θ = (θ_0, …, θ_K) on the range of the ranking function,
and the final prediction is given by:
$$F(x) = \sum_{k=1}^{K} k \,[\![\, f(x) \in [\theta_{k-1}, \theta_k) \,]\!].$$
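A small sketch of this thresholding step, with hypothetical scores and interior thresholds (θ_0 = −∞ and θ_K = ∞ are implicit):

```python
import numpy as np

def predict_labels(scores, thresholds):
    """Map ranking scores f(x) to labels 1..K using interior thresholds theta_1..theta_{K-1}."""
    # np.searchsorted counts how many thresholds lie at or below each score,
    # which is exactly k-1 for f(x) in [theta_{k-1}, theta_k).
    return np.searchsorted(thresholds, scores, side="right") + 1

scores = np.array([-4.0, -2.0, 0.5, 2.7])      # hypothetical f(x) values
thresholds = np.array([-3.5, -1.2, 1.2])       # theta_1 < theta_2 < theta_3 (K = 4)
print(predict_labels(scores, thresholds))      # [1 2 3 4]
```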
Threshold Loss Minimization:
• The threshold loss function is defined by:
  $$L(y, f(x), \theta) = \sum_{k=1}^{K-1} [\![\, y_k (f(x) - \theta_k) \le 0 \,]\!],$$
  where y_k = 1 if y > k, and y_k = −1 otherwise.
(Figure: the thresholds θ_0 = −∞ < θ_1 < θ_2 < … < θ_{K−1} < θ_K = ∞ placed on the real line of f(x).)
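A minimal sketch of the threshold loss for a single object, using made-up values; y_k is the binary relabeling defined above:

```python
import numpy as np

def threshold_loss(y, f_x, thresholds):
    """0/1 threshold loss: count interior thresholds theta_k placed on the wrong side of f(x)."""
    k = np.arange(1, len(thresholds) + 1)    # k = 1, ..., K-1
    y_k = np.where(y > k, 1.0, -1.0)         # binary relabeling per threshold
    return int(np.sum(y_k * (f_x - thresholds) <= 0))

# Hypothetical example: K = 4 classes, true label y = 2, score f(x) = 0.5
print(threshold_loss(y=2, f_x=0.5, thresholds=np.array([-3.5, -1.2, 1.2])))  # 1: f(x) lies above theta_2 although y = 2
```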
Rank Loss Minimization:
• The rank loss function is defined over pairs of objects:
  $$L(y_{\circ\bullet}, f(x_\circ), f(x_\bullet)) = [\![\, y_{\circ\bullet} (f(x_\circ) - f(x_\bullet)) \le 0 \,]\!], \quad \text{where } y_{\circ\bullet} = \mathrm{sgn}(y_\circ - y_\bullet).$$
• Thresholds are computed afterwards with respect to a given loss matrix.
(Illustration: for objects sorted so that y_{i_1} > y_{i_2} > y_{i_3} > … > y_{i_{N−1}} > y_{i_N}, the ranking function gives f(x_{i_1}) > f(x_{i_3}) > f(x_{i_2}) > … > f(x_{i_{N−1}}) > f(x_{i_N}), i.e. the pair (x_{i_2}, x_{i_3}) is misordered.)
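The rank loss over all comparable pairs can be computed directly, as in this naive O(N²) sketch with hypothetical labels and scores:

```python
import numpy as np

def rank_loss(y, f):
    """0/1 rank loss: number of comparable pairs ordered incorrectly (or tied) by f."""
    loss = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(y[i] - y[j])
            if s != 0 and s * (f[i] - f[j]) <= 0:   # comparable pair ranked wrongly
                loss += 1
    return loss

y = np.array([3, 3, 2, 1])             # hypothetical class labels
f = np.array([2.1, 0.4, 1.0, -0.5])    # hypothetical ranking scores
print(rank_loss(y, f))                 # 1: the pair (x_2, x_3) is misordered
```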
Comparison of the two approaches:
Threshold loss:
• An object is compared to the thresholds instead of to all other training objects.
• A weighted threshold loss can approximate any loss matrix.
Rank loss:
• Minimization of the rank loss on the training set has quadratic complexity with respect to the number of objects; however, in the case of K ordered classes, the algorithm can work in linear time.
• Rank loss minimization is closely related to maximization of the AUC criterion.
1 Ordinal Classification 2 RankRules 3 Conclusions
RankRules:
• The ranking function is an ensemble of decision rules:
  $$f(x) = \sum_{m=1}^{M} r_m(x),$$
  where r_m(x) = α_m Φ_m(x) is a decision rule defined by a response α_m ∈ ℝ and an axis-parallel region in attribute space, Φ_m(x) ∈ {0, 1}.
• A decision rule can be seen as a logical pattern: if [condition] then [decision].
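A schematic sketch of what such a rule ensemble looks like as code; the Rule class and its interval conditions are illustrative choices, not the paper's data structures:

```python
import numpy as np

class Rule:
    """A decision rule: response alpha on an axis-parallel region, 0 elsewhere."""
    def __init__(self, conditions, alpha):
        self.conditions = conditions   # list of (attribute index, low, high) interval tests
        self.alpha = alpha

    def covers(self, x):
        return all(lo <= x[a] <= hi for a, lo, hi in self.conditions)

    def __call__(self, x):
        return self.alpha if self.covers(x) else 0.0

def ensemble_score(rules, x):
    """Ranking function f(x) = sum of rule responses."""
    return sum(r(x) for r in rules)

# Hypothetical two-rule ensemble on 2-dimensional objects
rules = [Rule([(0, 0.5, np.inf)], alpha=0.8), Rule([(1, -np.inf, 2.0)], alpha=-0.3)]
print(ensemble_score(rules, np.array([1.0, 3.0])))   # 0.8 (only the first rule fires)
```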
RankRules:
• RankRules follows the rank loss minimization approach.
• We use the boosting approach to learn the ensemble.
• The 0/1 rank loss is upper-bounded by the exponential function of the (pairwise) margin: L(y, f) = exp(−yf).
• This is a convex function, which makes the minimization process easier to cope with.
• Due to the modularity of the exponential function, minimization of the rank loss can be performed in a fast way.
RankRules:
• In the m-th iteration, the rule is computed by:
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{y_{ij} > 0} w_{ij}\, e^{-\alpha (\Phi_m(x_i) - \Phi_m(x_j))},$$
  where f_{m−1} is the rule ensemble after m − 1 iterations, and
  $$w_{ij} = e^{-(f_{m-1}(x_i) - f_{m-1}(x_j))}$$
  can be treated as weights associated with pairs of training examples.
• The overall loss changes only for pairs in which one example is covered by the rule and the other is not (Φ_m(x_i) ≠ Φ_m(x_j)).
RankRules:
• Thresholds are computed by:
  $$\theta = \arg\min_{\theta} \sum_{i=1}^{N} \sum_{k=1}^{K-1} e^{-y_{ik}(f(x_i) - \theta_k)},$$
  subject to θ_0 = −∞ ≤ θ_1 ≤ … ≤ θ_{K−1} ≤ θ_K = ∞.
• The problem has a closed-form solution:
  $$\theta_k = \frac{1}{2}\log \frac{\sum_{i=1}^{N} [\![\, y_{ik} < 0 \,]\!]\, e^{f(x_i)}}{\sum_{i=1}^{N} [\![\, y_{ik} > 0 \,]\!]\, e^{-f(x_i)}}, \qquad k = 1, \ldots, K-1.$$
• The monotonicity condition is satisfied by this solution, as proved by Lin and Li (2007).
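A minimal sketch of the closed-form threshold computation, assuming the exponential surrogate and the reconstructed formula above; labels and scores are hypothetical:

```python
import numpy as np

def compute_thresholds(y, f, K):
    """Closed-form thresholds minimizing the exponential threshold loss for a fixed f."""
    thresholds = np.empty(K - 1)
    for k in range(1, K):
        above = y > k            # examples with y_ik = +1
        below = ~above           # examples with y_ik = -1
        thresholds[k - 1] = 0.5 * np.log(np.sum(np.exp(f[below])) /
                                         np.sum(np.exp(-f[above])))
    return thresholds

y = np.array([1, 1, 2, 3, 3])                 # hypothetical labels, K = 3
f = np.array([-2.0, -1.5, 0.3, 1.8, 2.4])     # hypothetical ranking scores
print(compute_thresholds(y, f, K=3))          # approx [-0.51  0.95], increasing
```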
Single Rule Generation:
• The m-th rule is obtained by solving:
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{y_{ij} > 0} w_{ij}\, e^{-\alpha (\Phi_m(x_i) - \Phi_m(x_j))}.$$
• For a given Φ_m, the problem of finding α_m has a closed-form solution:
  $$\alpha_m = \frac{1}{2}\ln \frac{\sum_{y_{ij} > 0 \,\wedge\, \Phi_m(x_i) > \Phi_m(x_j)} w_{ij}}{\sum_{y_{ij} > 0 \,\wedge\, \Phi_m(x_i) < \Phi_m(x_j)} w_{ij}}.$$
• The challenge is to find Φ_m by deriving an impurity measure L(Φ_m) in such a way that the optimization problem no longer depends on α_m.
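A naive sketch of the closed-form response α_m for a fixed region; the data and the coverage pattern are hypothetical, and the example is chosen so that neither sum is empty:

```python
import numpy as np

def rule_response(y, f_prev, covers):
    """Closed-form response alpha for a fixed region Phi, from the pairwise exponential loss.

    y -- class labels; f_prev -- current ensemble scores f_{m-1}(x_i);
    covers -- boolean array with covers[i] = (Phi_m(x_i) == 1).
    """
    num = den = 0.0
    n = len(y)
    for i in range(n):
        for j in range(n):
            if y[i] > y[j]:                                # pair with y_ij > 0
                w_ij = np.exp(-(f_prev[i] - f_prev[j]))    # pair weight
                if covers[i] and not covers[j]:
                    num += w_ij                            # pair separated correctly by the rule
                elif covers[j] and not covers[i]:
                    den += w_ij                            # pair separated the wrong way
    return 0.5 * np.log(num / den)

y = np.array([1, 2, 3, 3])                       # hypothetical labels
f_prev = np.zeros(4)                             # empty ensemble: all pair weights equal 1
covers = np.array([False, True, True, False])    # hypothetical region
print(rule_response(y, f_prev, covers))          # 0.5 * ln(2) ~ 0.347
```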
Boosting Approaches and Impurity Measures:
• Simultaneous minimization: finds the closed-form solution for Φ (Confidence-rated AdaBoost, SLIPPER, RankBoost).
• Gradient descent: relies on approximation of the loss function up to the first order (AdaBoost, AnyBoost).
• Gradient boosting: minimizes the squared error between rule outputs and the negative gradient of the loss function (Gradient Boosting Machine, MART).
• Constant-step minimization: restricts α ∈ {−β, β}, with β being a fixed parameter.
Boosting Approaches and Impurity Measures:
• Each of the boosting approaches yields a different impurity measure, representing a different trade-off between misclassification and coverage of the rule.
• Gradient descent produces the most general rules in comparison to the other techniques.
• Gradient descent corresponds to a 1/2 trade-off between misclassification and coverage of the rule.
• Constant-step minimization generalizes the gradient descent technique to obtain different trade-offs between misclassification and coverage, namely ℓ ∈ [0, 0.5), with β = ln((1 − ℓ)/ℓ).
(Figure: Rule Coverage (artificial data): number of covered training examples per rule, over 1000 rules, for RR SM-Exp, RR CS-Exp (β = 0.1, 0.2, 0.5), RR GD-Exp, and RR GB-Exp, all with ν = 0.1, ζ = 0.25.)
Fast Implementation:
• We rewrite the minimization problem of complexity O(N²),
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{y_{ij} > 0} w_{ij}\, e^{-\alpha (\Phi_m(x_i) - \Phi_m(x_j))},$$
  into a problem that can be solved in O(KN).
• We use the fact that
  $$w_{ij} = e^{-(f_{m-1}(x_i) - f_{m-1}(x_j))} = e^{-f_{m-1}(x_i)}\, e^{f_{m-1}(x_j)} = w_i\, w_j^-,$$
  and introduce the notation:
  $$W_k = \sum_{y_i = k \,\wedge\, \Phi(x_i)=1} w_i^-, \qquad W_k^0 = \sum_{y_i = k \,\wedge\, \Phi(x_i)=0} w_i^-.$$
Fast Implementation:
• The minimization problem can be rewritten as
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{i=1}^{N} w_i\, e^{-\alpha \Phi_m(x_i)} \sum_{j:\, y_i > y_j} w_j^-\, e^{\alpha \Phi_m(x_j)},$$
  where the inner sum can be given by:
  $$\sum_{j:\, y_i > y_j} w_j^-\, e^{\alpha \Phi_m(x_j)} = e^{\alpha} \sum_{y_i > k} W_k + \sum_{y_i > k} W_k^0.$$
• The values W_k and W_k^0, k = 1, …, K, can be easily computed and updated in each iteration.
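To see the speed-up concretely, a sketch that evaluates the pairwise objective for a fixed candidate rule in two ways: the naive O(N²) double sum and the O(KN) decomposition using the per-class aggregates W_k and W_k^0. Data are random and hypothetical; both functions return the same value (up to rounding):

```python
import numpy as np

def pairwise_loss_naive(y, f_prev, covers, alpha):
    """O(N^2): pairwise exponential loss after adding a rule with coverage `covers` and response alpha."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                loss += np.exp(-(f_prev[i] - f_prev[j]) - alpha * (covers[i] - covers[j]))
    return loss

def pairwise_loss_fast(y, f_prev, covers, alpha, K):
    """O(KN): same value via per-class aggregates W_k (covered) and W0_k (uncovered)."""
    w_plus = np.exp(-f_prev)      # w_i
    w_minus = np.exp(f_prev)      # w_i^-
    W = np.zeros(K + 1)           # W[k]  = sum of w^- over covered examples of class k
    W0 = np.zeros(K + 1)          # W0[k] = sum of w^- over uncovered examples of class k
    for yi, wm, c in zip(y, w_minus, covers):
        (W if c else W0)[yi] += wm
    cumW, cumW0 = np.cumsum(W), np.cumsum(W0)      # cumulative sums over classes <= k
    loss = 0.0
    for yi, wp, c in zip(y, w_plus, covers):
        inner = np.exp(alpha) * cumW[yi - 1] + cumW0[yi - 1]   # classes strictly below y_i
        loss += wp * np.exp(-alpha * c) * inner
    return loss

rng = np.random.default_rng(0)
y = rng.integers(1, 4, size=8)                   # labels in {1, 2, 3}, so K = 3
f_prev = rng.normal(size=8)
covers = (rng.random(8) > 0.5).astype(int)       # candidate region membership
print(pairwise_loss_naive(y, f_prev, covers, alpha=0.4))
print(pairwise_loss_fast(y, f_prev, covers, alpha=0.4, K=3))   # same value
```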
(Figure: Fast Implementation: training time vs. number of training instances (up to 10000) for RR SM-Exp with ν = 0.1 and ζ = 1 vs. ζ = 0.5.)
Regularization:
• The rule is shrunk (multiplied) by the amount ν ∈ (0, 1] before being added to the rules already present in the ensemble:
  $$f_m(x) = f_{m-1}(x) + \nu \cdot r_m(x).$$
• The procedure for finding Φ_m works on a fraction ζ of the original data, drawn without replacement.
• The value of α_m is calculated on all training examples; this usually decreases |α_m| and plays the role of regularization.
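Putting the pieces together, a schematic toy reconstruction of the outer boosting loop with shrinkage ν and subsampling ζ; `find_region` is a deliberately simplified single-attribute search standing in for the impurity-based search of the paper, and `pair_alpha` repeats the closed-form response with an added smoothing constant (an assumption). None of this is the authors' implementation:

```python
import numpy as np

def pair_alpha(y, f, covers):
    """Closed-form rule response (see the earlier sketch), smoothed to avoid division by zero."""
    num = den = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                w = np.exp(-(f[i] - f[j]))
                if covers[i] and not covers[j]:
                    num += w
                elif covers[j] and not covers[i]:
                    den += w
    return 0.5 * np.log((num + 1.0) / (den + 1.0))   # +1 smoothing is an assumption

def find_region(X, y, f):
    """Toy stand-in for the impurity-based search: best single-attribute half-space x[a] >= t."""
    best, best_loss = (0, -np.inf), np.inf
    for a in range(X.shape[1]):
        for t in np.unique(X[:, a]):
            covers = X[:, a] >= t
            g = f + pair_alpha(y, f, covers) * covers
            loss = sum(np.exp(-(g[i] - g[j]))
                       for i in range(len(y)) for j in range(len(y)) if y[i] > y[j])
            if loss < best_loss:
                best, best_loss = (a, t), loss
    return best

def fit_rank_rules(X, y, M=20, nu=0.1, zeta=0.5, seed=0):
    """Schematic RankRules-style loop with shrinkage nu and subsampling zeta."""
    rng = np.random.default_rng(seed)
    n = len(y)
    f = np.zeros(n)                                    # ensemble scores on the training set
    rules = []
    for _ in range(M):
        sub = rng.choice(n, size=max(2, int(zeta * n)), replace=False)
        a, t = find_region(X[sub], y[sub], f[sub])     # region searched on a data fraction
        covers = X[:, a] >= t
        alpha = pair_alpha(y, f, covers)               # response computed on ALL examples
        f += nu * alpha * covers                       # shrunk rule added to the ensemble
        rules.append((a, t, nu * alpha))
    return rules, f

X = np.array([[0.1], [0.4], [0.6], [0.9], [1.3], [1.7]])
y = np.array([1, 1, 2, 2, 3, 3])
rules, f = fit_rank_rules(X, y)
print(np.round(f, 2))    # scores should roughly increase with the class label
```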