Advances in Credit Scoring: Combining Performance and Interpretation in Kernel Discriminant Analysis
Caterina Liberati, DEMS, Università degli Studi di Milano-Bicocca, Milan, Italy
caterina.liberati@unimib.it
November 10th 2017
Outline
1 Motivation
2 Kernel-Induced Feature Space
3 Our Proposal
4 Examples
Motivation
Credit Scoring: Performance vs Interpretation

Learning Task with Standard Techniques
The objective of quantitative Credit Scoring (CS) is to develop accurate models that distinguish between good and bad applicants (Baesens et al., 2003). CS is therefore a supervised classification problem, traditionally addressed with linear discriminant analysis (Mays, 2004; Duda et al., 2000), logistic regression and their variations (Wiginton, 1980; Hosmer and Lemeshow, 1989; Back et al., 1996).

Modeling CS with Machine Learning Algorithms
A variety of techniques have been applied to CS modeling: Neural Networks (Malhotra and Malhotra, 2003; West, 2000), Decision Trees (Huang et al., 2006), and k-Nearest Neighbor classifiers (Henley and Hand, 1996; Piramuthu, 1999). Comparisons with standard data-mining tools have highlighted the superiority of these algorithms over the standard classification tools.
Motivation
Credit Scoring: Performance vs Interpretation

Kernel-based Discriminants
Significant theoretical advances in Machine Learning have produced a new category of algorithms based on the work of Vapnik (1995-1998). He pointed out that learning can be simpler if one uses low-complexity classifiers in a high-dimensional space $\mathcal{F}$. Kernel mappings make it possible to project the data implicitly into the Feature Space $\mathcal{F}$ through the inner product operator. Due to their flexibility and remarkably good performance, the popularity of such algorithms grew quickly.

Performance vs Interpretation
Kernel-based classifiers are able to capture non-linearities in the data; at the same time, they are unable to provide an explanation, or a comprehensible justification, for the solutions they reach (Barakat and Bradley, 2010).
Kernel-Induced Feature Space
Complex Classification Tasks

[Scatter plot of two interleaved classes, marked x and o, in the $(x_1, x_2)$ plane.]
Figure: Examples of complex data structures.
Kernel-Induced Feature Space
Do we need Kernels?

The complexity of the target function to be learned depends on the way it is represented, and the difficulty of the learning task can vary accordingly (figure from Schölkopf and Smola (2002)).

$\phi : \mathbb{R}^2 \to \mathbb{R}^3$
$(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$

$(\phi(\mathbf{x}) \cdot \phi(\mathbf{z})) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)\,(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = ((x_1, x_2)(z_1, z_2)^T)^2 = (\mathbf{x} \cdot \mathbf{z})^2$
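As a quick numerical illustration of the identity above, the following minimal sketch (toy vectors and NumPy only; not taken from the slides) checks that the explicit map $\phi$ and the kernel $(\mathbf{x} \cdot \mathbf{z})^2$ give the same inner product.

```python
# Minimal sketch: explicit degree-2 feature map vs the kernel trick.
import numpy as np

def phi(x):
    """Explicit map R^2 -> R^3 from the slide: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)   # inner product computed in the feature space F
rhs = (x @ z) ** 2      # kernel evaluated directly in the input space
print(lhs, rhs)         # both equal 1.0 here, since x . z = 1
```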
Kernel-Induced Feature Space
Making Kernels

A kernel converts a non-linear problem into a linear one by projecting the data onto a high-dimensional Feature Space $\mathcal{F}$ without knowing the mapping function explicitly.

$k : \mathcal{X}^2 \to \mathbb{R}$, which for all pattern sets $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$, with $\mathcal{X} \subset \mathbb{R}^p$, gives rise to positive semi-definite matrices $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.

If Mercer's theorem is satisfied (Mercer, 1909), the kernel $k$ corresponds to mapping the data into a possibly high-dimensional dot product space $\mathcal{F}$ by a (usually nonlinear) map $\phi : \mathbb{R}^p \to \mathcal{F}$ and taking the dot product there (Vapnik, 1995), i.e.

$k(\mathbf{x}, \mathbf{z}) = (\phi(\mathbf{x}) \cdot \phi(\mathbf{z}))$    (1)

In that case, the kernel $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ induces a Reproducing Kernel Hilbert Space (RKHS).
Kernel-Induced Feature Space
Advantages of Learning with Kernels

Among other properties, an RKHS satisfies the Cauchy-Schwarz inequality:

$K(\mathbf{x}, \mathbf{z})^2 \leq K(\mathbf{x}, \mathbf{x}) \cdot K(\mathbf{z}, \mathbf{z}) \quad \forall\, \mathbf{x}, \mathbf{z} \in \mathcal{X}$    (2)

The Cauchy-Schwarz inequality allows us to view $K$ as a measure of similarity between inputs: if $\mathbf{x}, \mathbf{z} \in \mathcal{X}$ are similar, then $K(\mathbf{x}, \mathbf{z})$ will be closer to 1, while if they are dissimilar, $K(\mathbf{x}, \mathbf{z})$ will be closer to 0. The kernel matrix thus encodes the similarities among instances.

The freedom to choose the mapping $k$ enables us to design a large variety of learning algorithms. If the map is chosen suitably, complex relations can be simplified and easily detected.
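To make the Mercer condition and the similarity reading of $K$ concrete, here is a small sketch (toy data and a Gaussian kernel, both my own assumptions rather than material from the slides) that builds a Gram matrix and checks its positive semi-definiteness.

```python
# Illustrative sketch: Gaussian (RBF) Gram matrix K_ij = k(x_i, x_j)
# and a numerical check of the Mercer (positive semi-definiteness) condition.
import numpy as np

def rbf_kernel(X, Z, c=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 c^2)) for all row pairs."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * c ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))   # five toy points in R^2
K = rbf_kernel(X, X)

print(np.linalg.eigvalsh(K) >= -1e-10)   # all True: K is positive semi-definite
print(K[0, 0], K[0, 1])                  # k(x, x) = 1; off-diagonal entries lie in (0, 1]
                                         # and act as similarities: close points -> near 1
```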
Kernel-Induced Feature Space
Kernel Discriminant Analysis

Assume that we are given the input data set $I_{XY} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ of training vectors $\mathbf{x}_i \in \mathcal{X}$ and the corresponding class labels $y_i \in \mathcal{Y} = \{1, 2\}$. The class separability in the direction of the weights $\omega \in \mathcal{F}$ is obtained by maximizing the Rayleigh coefficient (Baudat and Anouar, 2000):

$J(\omega) = \dfrac{\omega' S_B^{\phi}\, \omega}{\omega' S_W^{\phi}\, \omega}$    (3)

From the theory of reproducing kernels, the solution $\omega \in \mathcal{F}$ must lie in the span of all the training samples in $\mathcal{F}$, so $\omega$ can be written as a linear expansion of the training samples:

$\omega = \sum_{i=1}^{n} \alpha_i\, \phi(\mathbf{x}_i)$    (4)
Kernel-Induced Feature Space
Kernel Discriminant Analysis

As already shown by Mika et al. (2003), $S_B^{\phi}$ and $S_W^{\phi}$ can easily be rewritten as

$\omega' S_B^{\phi}\, \omega = \alpha' M \alpha$    (5)

where $M = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)'$ and $(\mathbf{m}_g)_i = \frac{1}{n_g} \sum_{k=1}^{n_g} k(\mathbf{x}_i, \mathbf{x}_k^g)$, $g = 1, 2$;

$\omega' S_W^{\phi}\, \omega = \alpha' N \alpha$    (6)

where $N = \sum_{g=1}^{2} K_g (I - L_g) K_g'$, with
$K_g$ a kernel matrix with generic element $(i, k)$ equal to $k(\mathbf{x}_i, \mathbf{x}_k^g)$,
$I$ the identity matrix,
$L_g$ a matrix with all entries equal to $n_g^{-1}$.
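The matrices $M$ and $N$ can be assembled directly from the kernel matrix. The sketch below is my own illustration (helper names and NumPy conventions are assumptions; only the formulas come from the slide) for a binary labelling $y$.

```python
# Sketch of the between- and within-class matrices M and N of eqs. (5)-(6),
# given a precomputed n x n kernel matrix K and binary labels y (two classes).
import numpy as np

def kda_matrices(K, y):
    n = K.shape[0]
    M_parts, N = [], np.zeros((n, n))
    for g in np.unique(y):                      # assumes exactly two class labels
        K_g = K[:, y == g]                      # n x n_g block, entries k(x_i, x_k^g)
        n_g = K_g.shape[1]
        m_g = K_g.mean(axis=1)                  # (m_g)_i = (1/n_g) sum_k k(x_i, x_k^g)
        M_parts.append(m_g)
        L_g = np.full((n_g, n_g), 1.0 / n_g)    # matrix with all entries 1/n_g
        N += K_g @ (np.eye(n_g) - L_g) @ K_g.T
    d = M_parts[0] - M_parts[1]
    M = np.outer(d, d)                          # (m_1 - m_2)(m_1 - m_2)'
    return M, N, M_parts
```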
Kernel-Induced Feature Space
Kernel Discriminant Analysis

These results allow us to boil down the optimization problem of eq. (3) to finding the class-separability directions $\alpha$ of the following maximization criterion:

$J(\alpha) = \dfrac{\alpha' M \alpha}{\alpha' N \alpha}$    (7)

This problem can be solved by finding the leading eigenvectors of $N^{-1} M$. Since the proposed setting is ill-posed, because $N$ is at most of rank $n-1$, we employ a regularization method. The classifier is:

$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b$    (8)

$b = \frac{1}{2}\, \alpha' (\mathbf{m}_1 + \mathbf{m}_2)$    (9)
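Continuing the previous sketch, a regularized eigen-solution of criterion (7) and the classifier (8)-(9) could look as follows. This is only one possible realization: the ridge term $\mu I$ is a common but not unique choice of regularizer, and sign/threshold conventions for the bias vary across KFD references.

```python
# Sketch: regularized solution of J(alpha) and the resulting kernel discriminant.
# Assumes kda_matrices(...) from the previous sketch.
import numpy as np

def kda_fit(K, y, mu=1e-3):
    M, N, (m1, m2) = kda_matrices(K, y)
    N_reg = N + mu * np.eye(K.shape[0])          # N is at most rank n-1, so regularize
    # leading eigenvector of N^{-1} M maximizes J(alpha) = alpha'M alpha / alpha'N alpha
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N_reg, M))
    alpha = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    b = 0.5 * alpha @ (m1 + m2)                  # bias as written on the slide;
                                                 # sign conventions differ in the literature
    return alpha, b

def kda_score(alpha, b, k_new):
    """f(x) = sum_i alpha_i k(x_i, x) + b, with k_new the vector of k(x_i, x)."""
    return k_new @ alpha + b
```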
Kernel-Induced Feature Space
KDA into SVM formulation

The linear classifier can be recast in the SVM formulation as an LS-SVM. Consider a binary classification model in the Reproducing Kernel Hilbert Space:

$f(\mathbf{x}) = \omega' \phi(\mathbf{x}) + b$    (10)

where $\omega$ is the weight vector in the RKHS and $b \in \mathbb{R}$ is the bias term. The discriminant function of LS-SVM classifiers (Suykens and Vandewalle, 1999) is constructed by minimizing:

$\min_{\omega, e}\; J(\omega, e) = \frac{1}{2}\, \omega' \omega + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^2$    (11)

subject to: $y_i = \omega' \phi(\mathbf{x}_i) + b + e_i, \quad i = 1, 2, \ldots, n$
Kernel-Induced Feature Space
KDA into SVM formulation

The Lagrangian of problem (11) is:

$L(\omega, b, e; \alpha) = J(\omega, e) - \sum_{i=1}^{n} \alpha_i \left( \omega' \phi(\mathbf{x}_i) + b + e_i - y_i \right)$    (12)

where the $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers, which can be positive or negative in this formulation. The conditions for optimality yield:

$\dfrac{\partial L}{\partial \omega} = 0 \;\Rightarrow\; \omega = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)$
$\dfrac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i = 0$
$\dfrac{\partial L}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma e_i$
$\dfrac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; \omega' \phi(\mathbf{x}_i) + b + e_i - y_i = 0, \quad \forall\, i = 1, 2, \ldots, n$    (13)

The solution is found by solving the resulting system of linear equations (Kuhn and Tucker, 1951). The fitting function, namely the output of the LS-SVM, is:

$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b$    (14)
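Eliminating $\omega$ and $e$ from the optimality conditions leaves a single linear system in $(b, \alpha)$. The sketch below shows one common way to write and solve that system for the constraint form used in (11); it is an illustration under those assumptions, not the authors' code, and the exact system changes if the classification variant of the constraints is used instead.

```python
# Sketch of the LS-SVM solution: a linear system in (b, alpha) replaces the QP of
# a standard SVM. K is the n x n kernel matrix, y the targets, gamma as in eq. (11).
import numpy as np

def lssvm_fit(K, y, gamma=1.0):
    n = K.shape[0]
    ones = np.ones(n)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = ones                       # encodes the constraint sum_i alpha_i = 0
    A[1:, 0] = ones
    A[1:, 1:] = K + np.eye(n) / gamma     # from alpha_i = gamma * e_i
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)         # one linear solve instead of a QP
    return sol[1:], sol[0]                # alpha, b

def lssvm_predict(alpha, b, k_new):
    """f(x) = sum_i alpha_i k(x_i, x) + b."""
    return k_new @ alpha + b
```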
Kernel-Induced Feature Space
LS-SVM vs SVM

LS-SVM vs SVM
The major drawback of the SVM lies in its estimation procedure, which is based on constrained optimization programming (Wang and Hu, 2015); the computational burden therefore becomes particularly heavy for large-scale problems. In such cases the LS-SVM is preferred, because its solution is obtained by solving a linear set of equations (Suykens and Vandewalle, 1999).

KDA vs SVM
SVMs do not handle multi-class problems directly when the data present more than 2 groups, unless one resorts to one-against-all (OAA) or one-against-one (OAO) schemes.
Kernel-Induced Feature Space
Kernel Settings

The most common kernel mappings:

Kernel                  Mapping k(x, z)
Cauchy                  $1 / \left(1 + \|x - z\|^2 / c\right)$
Laplace                 $\exp\left(-\|x - z\| / c\right)$
Multi-quadric           $\sqrt{\|x - z\|^2 + c^2}$
Polynomial (degree 2)   $(x \cdot z)^2$
Gaussian (RBF)          $\exp\left(-\|x - z\|^2 / (2 c^2)\right)$
Sigmoidal (SIG)         $\tanh[c\,(x \cdot z) + 1]$

Tuning: the kernel parameter is set through some grid-search algorithm (see the sketch below).
REG: regularization methods to overcome the singularity of $S_W^{\phi}$ (Friedman, 1989; Mika, 1999).
SEL: model selection criteria for choosing the best kernel function (Error Rate, AUC, Information Criteria).
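A minimal sketch of the kernel mappings in the table and of a plain grid search over the tuning parameter $c$. The parameterizations follow the table above; the callback fit_and_score is a hypothetical placeholder assumed to return a validation error rate, not part of any published package.

```python
# Sketch: kernel mappings from the table and a simple grid search over c.
import numpy as np

def sq_dist(x, z):
    return np.sum((x - z) ** 2)

KERNELS = {
    "cauchy":       lambda x, z, c: 1.0 / (1.0 + sq_dist(x, z) / c),
    "laplace":      lambda x, z, c: np.exp(-np.sqrt(sq_dist(x, z)) / c),
    "multiquadric": lambda x, z, c: np.sqrt(sq_dist(x, z) + c ** 2),
    "poly2":        lambda x, z, c: (x @ z) ** 2,     # degree-2 polynomial has no c here
    "rbf":          lambda x, z, c: np.exp(-sq_dist(x, z) / (2.0 * c ** 2)),
    "sigmoid":      lambda x, z, c: np.tanh(c * (x @ z) + 1.0),
}

def grid_search_c(fit_and_score, kernel_name, grid=(0.1, 0.5, 1.0, 5.0, 10.0)):
    """fit_and_score(kernel_fn, c) is assumed to return a validation error rate."""
    scores = {c: fit_and_score(KERNELS[kernel_name], c) for c in grid}
    return min(scores, key=scores.get)    # c with the lowest validation error
```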
Our Proposal
An operative strategy

Our goal is NOT to derive a new classification model that discriminates better than others previously published.

I. Selection of the best Kernel Discriminant function
(a) Compute the kernel matrix using the original variables as inputs
(b) Perform Kernel Discriminant Analysis (KDA) with different kernel maps
(c) Select the best kernel discriminant $f(\mathbf{x})$ via the minimum misclassification error rate or the maximum AUC (a sketch of this selection loop follows below)
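One possible shape for the selection loop in steps (a)-(c), with hypothetical fit and score helpers wrapping the KDA sketches from the earlier slides; the error rate is used here, but the AUC could be plugged in instead.

```python
# Sketch of step (c): pick the kernel with the lowest validation error rate.
# candidates, fit and score are hypothetical helpers, not the authors' code.
import numpy as np

def select_best_kernel(X_train, y_train, X_val, y_val, candidates, fit, score):
    """candidates: dict name -> (kernel_fn, c); fit/score wrap the KDA sketch above."""
    results = {}
    for name, (kernel_fn, c) in candidates.items():
        model = fit(X_train, y_train, kernel_fn, c)   # steps (a) + (b): kernel matrix, KDA
        y_hat = score(model, X_val)                   # predicted class labels
        results[name] = np.mean(y_hat != y_val)       # misclassification error rate
    best = min(results, key=results.get)              # step (c): best kernel discriminant
    return best, results
```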