Decision Rule-based Algorithm for Ordinal Classification based on Rank Loss Minimization
Krzysztof Dembczyński¹,², Wojciech Kotłowski¹,³
¹ Institute of Computing Science, Poznań University of Technology
² KEBI, Philipps-Universität Marburg
³ Centrum Wiskunde & Informatica, Amsterdam
PL-09, Bled, September 11, 2009
1 Ordinal Classification 2 RankRules 3 Conclusions
Ordinal classification consists in predicting a label taken from a finite and ordered set for an object described by some attributes. This problem shares some characteristics of multi-class classification and regression, but:
• the order between class labels cannot be neglected,
• the scale of the decision attribute is not cardinal.
Recommender system predicting a rating of a movie for a given user.
Email filtering into ordered groups such as: important, normal, later, or spam.
Notation:
• K – number of classes
• y – actual label
• x – attributes
• ŷ – predicted label
• F(x) – prediction function
• f(x) – ranking or utility function
• θ = (θ_0, …, θ_K) – thresholds
• L(·) – loss function
• ⟦·⟧ – Boolean test (1 if the predicate holds, 0 otherwise)
• {y_i, x_i}, i = 1, …, N – training examples
Ordinal Classification:
• Since y is discrete, it obeys a multinomial distribution for a given x: p_k(x) = Pr(y = k | x), k = 1, …, K.
• The optimal prediction is given by:
  $$y^* = F^*(x) = \arg\min_{F(x)} \sum_{k=1}^{K} p_k(x)\, L(k, F(x)),$$
  where L(y, ŷ) = (l_{y,ŷ})_{K×K} is the loss function defined as a matrix with V-shaped rows and zeros on the diagonal, e.g. for K = 3:
  $$L(y, \hat{y}) = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix}.$$
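As a quick illustration of the decision rule above, a minimal Python sketch (the class distribution and the loss matrix are made-up example values) that picks the label minimizing the expected loss:

```python
import numpy as np

# Absolute-error loss matrix for K = 3 classes: L[y, y_hat] = |y - y_hat|
K = 3
L = np.abs(np.subtract.outer(np.arange(K), np.arange(K)))

# Hypothetical class distribution p_k(x) for a single object x
p = np.array([0.2, 0.3, 0.5])

# Expected loss of each candidate prediction, and its minimizer
expected_loss = p @ L                        # expected_loss[k] = sum_j p_j * L[j, k]
y_star = int(np.argmin(expected_loss)) + 1   # labels are 1..K
print(expected_loss, y_star)                 # [1.3 0.7 0.7] 2
```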
Ordinal Classification:
• A natural choice of the loss matrix is the absolute-error loss, for which l_{y,ŷ} = |y − ŷ|.
• The optimal prediction in this case is the median of the class distribution:
  $$F^*(x) = \mathrm{median}_{p_k(x)}(y).$$
• The median does not depend on the distance between class labels, so the scale of the decision attribute does not matter; only the order of the labels is taken into account.
Two Approaches to Ordinal Classification:
• Threshold Loss Minimization (SVOR, ORBoost-All, MMMF),
• Rank Loss Minimization (RankSVM, RankBoost).
In both approaches, one assumes the existence of:
• a ranking (or utility) function f(x), and
• consecutive thresholds θ = (θ_0, …, θ_K) on the range of the ranking function,
and the final prediction is given by:
$$F(x) = \sum_{k=1}^{K} k \,[\![\, f(x) \in [\theta_{k-1}, \theta_k) \,]\!].$$
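A small sketch of this thresholding step, with hypothetical scores and interior thresholds (θ_0 = −∞ and θ_K = ∞ are implicit):

```python
import numpy as np

def predict_labels(scores, thresholds):
    """Map ranking scores f(x) to labels 1..K using interior thresholds theta_1..theta_{K-1}."""
    # np.searchsorted counts how many thresholds lie at or below each score,
    # which is exactly k-1 for f(x) in [theta_{k-1}, theta_k).
    return np.searchsorted(thresholds, scores, side="right") + 1

scores = np.array([-4.0, -2.0, 0.5, 2.7])      # hypothetical f(x) values
thresholds = np.array([-3.5, -1.2, 1.2])       # theta_1 < theta_2 < theta_3 (K = 4)
print(predict_labels(scores, thresholds))      # [1 2 3 4]
```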
Threshold Loss Minimization:
• The threshold loss function is defined by:
  $$L(y, f(x), \theta) = \sum_{k=1}^{K-1} [\![\, y_k (f(x) - \theta_k) \le 0 \,]\!],$$
  where y_k = 1 if y > k, and y_k = −1 otherwise.
(Figure: the thresholds θ_0 = −∞ < θ_1 < θ_2 < … < θ_{K−1} < θ_K = ∞ placed on the real line of f(x).)
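A minimal sketch of the threshold loss for a single object, using made-up values; y_k is the binary relabeling defined above:

```python
import numpy as np

def threshold_loss(y, f_x, thresholds):
    """0/1 threshold loss: count interior thresholds theta_k placed on the wrong side of f(x)."""
    k = np.arange(1, len(thresholds) + 1)    # k = 1, ..., K-1
    y_k = np.where(y > k, 1.0, -1.0)         # binary relabeling per threshold
    return int(np.sum(y_k * (f_x - thresholds) <= 0))

# Hypothetical example: K = 4 classes, true label y = 2, score f(x) = 0.5
print(threshold_loss(y=2, f_x=0.5, thresholds=np.array([-3.5, -1.2, 1.2])))  # 1: f(x) lies above theta_2 although y = 2
```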
Rank Loss Minimization:
• The rank loss function is defined over pairs of objects:
  $$L(y_{\circ\bullet}, f(x_\circ), f(x_\bullet)) = [\![\, y_{\circ\bullet} (f(x_\circ) - f(x_\bullet)) \le 0 \,]\!], \quad \text{where } y_{\circ\bullet} = \mathrm{sgn}(y_\circ - y_\bullet).$$
• Thresholds are computed afterwards with respect to a given loss matrix.
(Illustration: for objects sorted so that y_{i_1} > y_{i_2} > y_{i_3} > … > y_{i_{N−1}} > y_{i_N}, the ranking function gives f(x_{i_1}) > f(x_{i_3}) > f(x_{i_2}) > … > f(x_{i_{N−1}}) > f(x_{i_N}), i.e. the pair (x_{i_2}, x_{i_3}) is misordered.)
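The rank loss over all comparable pairs can be computed directly, as in this naive O(N²) sketch with hypothetical labels and scores:

```python
import numpy as np

def rank_loss(y, f):
    """0/1 rank loss: number of comparable pairs ordered incorrectly (or tied) by f."""
    loss = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(y[i] - y[j])
            if s != 0 and s * (f[i] - f[j]) <= 0:   # comparable pair ranked wrongly
                loss += 1
    return loss

y = np.array([3, 3, 2, 1])             # hypothetical class labels
f = np.array([2.1, 0.4, 1.0, -0.5])    # hypothetical ranking scores
print(rank_loss(y, f))                 # 1: the pair (x_2, x_3) is misordered
```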
Comparison of the two approaches:
Threshold loss:
• An object is compared to the thresholds instead of to all other training objects.
• A weighted threshold loss can approximate any loss matrix.
Rank loss:
• Minimization of the rank loss on the training set has quadratic complexity with respect to the number of objects; however, in the case of K ordered classes, the algorithm can work in linear time.
• Rank loss minimization is closely related to maximization of the AUC criterion.
1 Ordinal Classification 2 RankRules 3 Conclusions
RankRules:
• The ranking function is an ensemble of decision rules:
  $$f(x) = \sum_{m=1}^{M} r_m(x),$$
  where r_m(x) = α_m Φ_m(x) is a decision rule defined by a response α_m ∈ ℝ and an axis-parallel region in attribute space, Φ_m(x) ∈ {0, 1}.
• A decision rule can be seen as a logical pattern: if [condition] then [decision].
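A schematic sketch of what such a rule ensemble looks like as code; the Rule class and its interval conditions are illustrative choices, not the paper's data structures:

```python
import numpy as np

class Rule:
    """A decision rule: response alpha on an axis-parallel region, 0 elsewhere."""
    def __init__(self, conditions, alpha):
        self.conditions = conditions   # list of (attribute index, low, high) interval tests
        self.alpha = alpha

    def covers(self, x):
        return all(lo <= x[a] <= hi for a, lo, hi in self.conditions)

    def __call__(self, x):
        return self.alpha if self.covers(x) else 0.0

def ensemble_score(rules, x):
    """Ranking function f(x) = sum of rule responses."""
    return sum(r(x) for r in rules)

# Hypothetical two-rule ensemble on 2-dimensional objects
rules = [Rule([(0, 0.5, np.inf)], alpha=0.8), Rule([(1, -np.inf, 2.0)], alpha=-0.3)]
print(ensemble_score(rules, np.array([1.0, 3.0])))   # 0.8 (only the first rule fires)
```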
RankRules:
• RankRules follows the rank loss minimization approach.
• We use the boosting approach to learn the ensemble.
• The 0/1 rank loss is upper-bounded by the exponential function of the (pairwise) margin: L(y, f) = exp(−yf).
• This is a convex function, which makes the minimization process easier to cope with.
• Due to the modularity of the exponential function, minimization of the rank loss can be performed in a fast way.
RankRules:
• In the m-th iteration, the rule is computed by:
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{y_{ij} > 0} w_{ij}\, e^{-\alpha (\Phi_m(x_i) - \Phi_m(x_j))},$$
  where f_{m−1} is the rule ensemble after m − 1 iterations, and
  $$w_{ij} = e^{-(f_{m-1}(x_i) - f_{m-1}(x_j))}$$
  can be treated as weights associated with pairs of training examples.
• The overall loss changes only for pairs in which one example is covered by the rule and the other is not (Φ_m(x_i) ≠ Φ_m(x_j)).
RankRules:
• Thresholds are computed by:
  $$\theta = \arg\min_{\theta} \sum_{i=1}^{N} \sum_{k=1}^{K-1} e^{-y_{ik}(f(x_i) - \theta_k)},$$
  subject to θ_0 = −∞ ≤ θ_1 ≤ … ≤ θ_{K−1} ≤ θ_K = ∞.
• The problem has a closed-form solution:
  $$\theta_k = \frac{1}{2}\log \frac{\sum_{i=1}^{N} [\![\, y_{ik} < 0 \,]\!]\, e^{f(x_i)}}{\sum_{i=1}^{N} [\![\, y_{ik} > 0 \,]\!]\, e^{-f(x_i)}}, \qquad k = 1, \ldots, K-1.$$
• The monotonicity condition is satisfied by this solution, as proved by Lin and Li (2007).
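A minimal sketch of the closed-form threshold computation, assuming the exponential surrogate and the reconstructed formula above; labels and scores are hypothetical:

```python
import numpy as np

def compute_thresholds(y, f, K):
    """Closed-form thresholds minimizing the exponential threshold loss for a fixed f."""
    thresholds = np.empty(K - 1)
    for k in range(1, K):
        above = y > k            # examples with y_ik = +1
        below = ~above           # examples with y_ik = -1
        thresholds[k - 1] = 0.5 * np.log(np.sum(np.exp(f[below])) /
                                         np.sum(np.exp(-f[above])))
    return thresholds

y = np.array([1, 1, 2, 3, 3])                 # hypothetical labels, K = 3
f = np.array([-2.0, -1.5, 0.3, 1.8, 2.4])     # hypothetical ranking scores
print(compute_thresholds(y, f, K=3))          # approx [-0.51  0.95], increasing
```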
Single Rule Generation:
• The m-th rule is obtained by solving:
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{y_{ij} > 0} w_{ij}\, e^{-\alpha (\Phi_m(x_i) - \Phi_m(x_j))}.$$
• For a given Φ_m, the problem of finding α_m has a closed-form solution:
  $$\alpha_m = \frac{1}{2}\ln \frac{\sum_{y_{ij} > 0 \,\wedge\, \Phi_m(x_i) > \Phi_m(x_j)} w_{ij}}{\sum_{y_{ij} > 0 \,\wedge\, \Phi_m(x_i) < \Phi_m(x_j)} w_{ij}}.$$
• The challenge is to find Φ_m by deriving an impurity measure L(Φ_m) in such a way that the optimization problem no longer depends on α_m.
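A naive sketch of the closed-form response α_m for a fixed region; the data and the coverage pattern are hypothetical, and the example is chosen so that neither sum is empty:

```python
import numpy as np

def rule_response(y, f_prev, covers):
    """Closed-form response alpha for a fixed region Phi, from the pairwise exponential loss.

    y -- class labels; f_prev -- current ensemble scores f_{m-1}(x_i);
    covers -- boolean array with covers[i] = (Phi_m(x_i) == 1).
    """
    num = den = 0.0
    n = len(y)
    for i in range(n):
        for j in range(n):
            if y[i] > y[j]:                                # pair with y_ij > 0
                w_ij = np.exp(-(f_prev[i] - f_prev[j]))    # pair weight
                if covers[i] and not covers[j]:
                    num += w_ij                            # pair separated correctly by the rule
                elif covers[j] and not covers[i]:
                    den += w_ij                            # pair separated the wrong way
    return 0.5 * np.log(num / den)

y = np.array([1, 2, 3, 3])                       # hypothetical labels
f_prev = np.zeros(4)                             # empty ensemble: all pair weights equal 1
covers = np.array([False, True, True, False])    # hypothetical region
print(rule_response(y, f_prev, covers))          # 0.5 * ln(2) ~ 0.347
```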
Boosting Approaches and Impurity Measures:
• Simultaneous minimization: finds the closed-form solution for Φ (Confidence-rated AdaBoost, SLIPPER, RankBoost).
• Gradient descent: relies on approximation of the loss function up to the first order (AdaBoost, AnyBoost).
• Gradient boosting: minimizes the squared error between rule outputs and the negative gradient of the loss function (Gradient Boosting Machine, MART).
• Constant-step minimization: restricts α ∈ {−β, β}, with β being a fixed parameter.
Boosting Approaches and Impurity Measures:
• Each of the boosting approaches yields a different impurity measure, representing a different trade-off between misclassification and coverage of the rule.
• Gradient descent produces the most general rules in comparison to the other techniques.
• Gradient descent corresponds to a 1/2 trade-off between misclassification and coverage of the rule.
• Constant-step minimization generalizes the gradient descent technique to obtain different trade-offs between misclassification and coverage, namely ℓ ∈ [0, 0.5), with β = ln((1 − ℓ)/ℓ).
(Figure: Rule Coverage (artificial data): number of covered training examples per rule, over 1000 rules, for RR SM-Exp, RR CS-Exp (β = 0.1, 0.2, 0.5), RR GD-Exp, and RR GB-Exp, all with ν = 0.1, ζ = 0.25.)
Fast Implementation:
• We rewrite the minimization problem of complexity O(N²),
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{y_{ij} > 0} w_{ij}\, e^{-\alpha (\Phi_m(x_i) - \Phi_m(x_j))},$$
  into a problem that can be solved in O(KN).
• We use the fact that
  $$w_{ij} = e^{-(f_{m-1}(x_i) - f_{m-1}(x_j))} = e^{-f_{m-1}(x_i)}\, e^{f_{m-1}(x_j)} = w_i\, w_j^-,$$
  and introduce the notation:
  $$W_k = \sum_{y_i = k \,\wedge\, \Phi(x_i)=1} w_i^-, \qquad W_k^0 = \sum_{y_i = k \,\wedge\, \Phi(x_i)=0} w_i^-.$$
Fast Implementation:
• The minimization problem can be rewritten as
  $$r_m = \arg\min_{\Phi, \alpha} \sum_{i=1}^{N} w_i\, e^{-\alpha \Phi_m(x_i)} \sum_{j:\, y_i > y_j} w_j^-\, e^{\alpha \Phi_m(x_j)},$$
  where the inner sum can be given by:
  $$\sum_{j:\, y_i > y_j} w_j^-\, e^{\alpha \Phi_m(x_j)} = e^{\alpha} \sum_{y_i > k} W_k + \sum_{y_i > k} W_k^0.$$
• The values W_k and W_k^0, k = 1, …, K, can be easily computed and updated in each iteration.
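To see the speed-up concretely, a sketch that evaluates the pairwise objective for a fixed candidate rule in two ways: the naive O(N²) double sum and the O(KN) decomposition using the per-class aggregates W_k and W_k^0. Data are random and hypothetical; both functions return the same value (up to rounding):

```python
import numpy as np

def pairwise_loss_naive(y, f_prev, covers, alpha):
    """O(N^2): pairwise exponential loss after adding a rule with coverage `covers` and response alpha."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                loss += np.exp(-(f_prev[i] - f_prev[j]) - alpha * (covers[i] - covers[j]))
    return loss

def pairwise_loss_fast(y, f_prev, covers, alpha, K):
    """O(KN): same value via per-class aggregates W_k (covered) and W0_k (uncovered)."""
    w_plus = np.exp(-f_prev)      # w_i
    w_minus = np.exp(f_prev)      # w_i^-
    W = np.zeros(K + 1)           # W[k]  = sum of w^- over covered examples of class k
    W0 = np.zeros(K + 1)          # W0[k] = sum of w^- over uncovered examples of class k
    for yi, wm, c in zip(y, w_minus, covers):
        (W if c else W0)[yi] += wm
    cumW, cumW0 = np.cumsum(W), np.cumsum(W0)      # cumulative sums over classes <= k
    loss = 0.0
    for yi, wp, c in zip(y, w_plus, covers):
        inner = np.exp(alpha) * cumW[yi - 1] + cumW0[yi - 1]   # classes strictly below y_i
        loss += wp * np.exp(-alpha * c) * inner
    return loss

rng = np.random.default_rng(0)
y = rng.integers(1, 4, size=8)                   # labels in {1, 2, 3}, so K = 3
f_prev = rng.normal(size=8)
covers = (rng.random(8) > 0.5).astype(int)       # candidate region membership
print(pairwise_loss_naive(y, f_prev, covers, alpha=0.4))
print(pairwise_loss_fast(y, f_prev, covers, alpha=0.4, K=3))   # same value
```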
(Figure: Fast Implementation: training time vs. number of training instances (up to 10000) for RR SM-Exp with ν = 0.1 and ζ = 1 vs. ζ = 0.5.)
Regularization:
• The rule is shrunk (multiplied) by the amount ν ∈ (0, 1] before being added to the rules already present in the ensemble:
  $$f_m(x) = f_{m-1}(x) + \nu \cdot r_m(x).$$
• The procedure for finding Φ_m works on a fraction ζ of the original data, drawn without replacement.
• The value of α_m is calculated on all training examples; this usually decreases |α_m| and plays the role of regularization.
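Putting the pieces together, a schematic toy reconstruction of the outer boosting loop with shrinkage ν and subsampling ζ; `find_region` is a deliberately simplified single-attribute search standing in for the impurity-based search of the paper, and `pair_alpha` repeats the closed-form response with an added smoothing constant (an assumption). None of this is the authors' implementation:

```python
import numpy as np

def pair_alpha(y, f, covers):
    """Closed-form rule response (see the earlier sketch), smoothed to avoid division by zero."""
    num = den = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                w = np.exp(-(f[i] - f[j]))
                if covers[i] and not covers[j]:
                    num += w
                elif covers[j] and not covers[i]:
                    den += w
    return 0.5 * np.log((num + 1.0) / (den + 1.0))   # +1 smoothing is an assumption

def find_region(X, y, f):
    """Toy stand-in for the impurity-based search: best single-attribute half-space x[a] >= t."""
    best, best_loss = (0, -np.inf), np.inf
    for a in range(X.shape[1]):
        for t in np.unique(X[:, a]):
            covers = X[:, a] >= t
            g = f + pair_alpha(y, f, covers) * covers
            loss = sum(np.exp(-(g[i] - g[j]))
                       for i in range(len(y)) for j in range(len(y)) if y[i] > y[j])
            if loss < best_loss:
                best, best_loss = (a, t), loss
    return best

def fit_rank_rules(X, y, M=20, nu=0.1, zeta=0.5, seed=0):
    """Schematic RankRules-style loop with shrinkage nu and subsampling zeta."""
    rng = np.random.default_rng(seed)
    n = len(y)
    f = np.zeros(n)                                    # ensemble scores on the training set
    rules = []
    for _ in range(M):
        sub = rng.choice(n, size=max(2, int(zeta * n)), replace=False)
        a, t = find_region(X[sub], y[sub], f[sub])     # region searched on a data fraction
        covers = X[:, a] >= t
        alpha = pair_alpha(y, f, covers)               # response computed on ALL examples
        f += nu * alpha * covers                       # shrunk rule added to the ensemble
        rules.append((a, t, nu * alpha))
    return rules, f

X = np.array([[0.1], [0.4], [0.6], [0.9], [1.3], [1.7]])
y = np.array([1, 1, 2, 2, 3, 3])
rules, f = fit_rank_rules(X, y)
print(np.round(f, 2))    # scores should roughly increase with the class label
```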