Clustering Large Credit Client Data Sets for Classification with SVM

Ralf Stecking, University of Oldenburg, Department of Economics
Klaus B. Schebesch, University "Vasile Goldiş" Arad, Faculty of Economics

Credit Scoring and Credit Control XI Conference, University of Edinburgh, 26.08.2009

Stecking and Schebesch (CRC 2009), Clustering Large Credit Data Sets, 26.08.2009, 1 / 30
Overview

- Motivation
- Kernels and clustering
- Preliminary evaluation of cluster based SVM
- Multiple validation of cluster based SVM
- Credit scoring and data clustering
- Symbolic representation of credit client clusters
- Symbolic SVM model building and evaluation
- Conclusions and outlook
Motivation

In past work we used a medium-sized empirical credit scoring data set with N = 658 credit clients, each with m = 40 input features, in order to analyze different aspects of model building with regard to out-of-sample classification performance. As base model we used SVM, alongside other statistical learning methods like LDA, CART, and LogReg. Combinations of the outputs of these base models were also investigated.

Gaining access to a data set with N ≈ 140,000 clients, m = 23 features per credit client, and extremely asymmetric class distributions precludes case-by-case training of models like SVM. Further increasing N and "fusing" credit client information from different sources will worsen the situation.
Kernels and relations between pairs of cases (1)

A training set {y_i | x_i}, i = 1, ..., N, may contain labeled credit clients (e.g. y_i ∈ {−1, 1}) or unlabeled ones (y_i = 0 for all i, say). A kernel function k_ij(x_i, x_j) ≥ 0 describes a metric relation (inverted distance, etc.) between any two training feature vectors x_i, x_j, with i, j ∈ {1, ..., N}. The implied numerical matrix K_ij is usually meant to be symmetric.

Individualized parameters for pairs of clients ij can impose conditions which may act
- classwise (e.g. correct for asymmetric costs),
- casewise (e.g. correct for case importance), or
- interaction-wise (i.e. 2-interactions).
Kernels and relations between pairs of cases (2)

An instantiation of k_ij(x_i, x_j) would be the adaptation of the very powerful RBF kernel for nonlinear SVM:

    k_ij(x_i, x_j) = r_ij · exp(−s_ij ||x_i − x_j||²),

with r_ij ∈ {0, 1} a grouping relation and s_ij ≥ 0 the interaction sensitivity. By identically permuting the index sets of i and j, i.e. via i_k and j_k, the matrix (K_{i_k j_k}) resulting from the kernel may be block-diagonal, indicating a cluster structure, which in turn may or may not depend on the labeling {y_{i_k}}. The most popular RBF SVM simply uses r_ij = 1 for all i, j and s_ij = s > 0, a constant, for all i, j.
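The generalized kernel above can be sketched numerically. The following is a minimal NumPy illustration (ours, not the authors' implementation): it builds K_ij = r_ij · exp(−s_ij ||x_i − x_j||²) on toy data, recovers the standard RBF kernel with r_ij = 1 and constant s_ij, and shows how a two-group relation r_ij makes the kernel matrix block-diagonal.

```python
import numpy as np

def pairwise_rbf(X, r, s):
    """Generalized RBF kernel: K[i, j] = r[i, j] * exp(-s[i, j] * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return r * np.exp(-s * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # 6 toy "clients", 3 features
N = len(X)

# The most popular case: r_ij = 1 and s_ij = s > 0 constant (standard RBF).
K_std = pairwise_rbf(X, np.ones((N, N)), 0.5 * np.ones((N, N)))

# A grouping relation with two blocks {0,1,2} and {3,4,5}: entries across
# groups are zeroed, so the (already suitably ordered) matrix is block-diagonal.
g = np.array([0, 0, 0, 1, 1, 1])
r = (g[:, None] == g[None, :]).astype(float)
K_block = pairwise_rbf(X, r, 0.5 * np.ones((N, N)))
```

The standard RBF matrix is symmetric with a unit diagonal; the grouped variant keeps only within-block entries.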
Kernels and relations between pairs of cases (3)

Being able to set (at least part of) the r_ij and s_ij parameters would convey domain knowledge into the problem formulation. Do we have such knowledge? Treating r_ij and s_ij as "slow" variables to be optimized along with the "fast" variables, i.e. the SVM duals and slacks, is hardly tractable (for empirically reasonable N). In order to approximate this task, we split the associated simultaneous problem into two consecutive tasks which are routinely tractable:

1. Cluster the data set by a fast standard method (with and without label prepartitioning).
2. Apply the SVM kernel to the resulting cluster prototypes.
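The two consecutive tasks can be sketched as follows, assuming scikit-learn (KMeans, SVC) and synthetic two-class data in place of the real credit set; the cluster count and all parameters here are illustrative, not those used in the talk.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in for a large labeled credit data set (two classes).
X_pos = rng.normal(loc=1.0, size=(500, 4))
X_neg = rng.normal(loc=-1.0, size=(500, 4))

# Task 1: cluster each class separately (label prepartitioning).
c = 20
proto_pos = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X_pos).cluster_centers_
proto_neg = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X_neg).cluster_centers_

# Task 2: apply the SVM (RBF kernel) to the 2c labeled cluster prototypes.
X_proto = np.vstack([proto_pos, proto_neg])
y_proto = np.array([1] * c + [-1] * c)
svm = SVC(kernel="rbf", gamma="scale").fit(X_proto, y_proto)

# Evaluate on fresh cases drawn from the same distributions.
X_test = np.vstack([rng.normal(loc=1.0, size=(100, 4)),
                    rng.normal(loc=-1.0, size=(100, 4))])
y_test = np.array([1] * 100 + [-1] * 100)
acc = svm.score(X_test, y_test)
```

Training touches only 2c = 40 points instead of all 1,000 cases, which is the point of the split when N runs into the hundreds of thousands.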
What we do for this presentation

[Diagram: an N × m labeled data set (y|X) with kernel k[X(i), X(j)] is clustered into C cluster representatives with m+d features (the d additional features are derived variables); the SVM is then trained on the C representatives.]
Some issues concerning cluster formation

Is there any cluster structure in the data? A clustering algorithm will issue "clusters" for every 1 ≤ c ≤ N! Supposing there is some (empirical) clustering: are the cluster shapes compact (spheroidal), elongated, or of mixed shape in high dimensions, and are they well separated?

The clustering method can be one of the following:
- completely unsupervised, ...
- constrained to some degree ("soft" penalty terms, using balancing and correlation arguments, constraints of "must-be" or "cannot-be" type, ..., clustering members of predefined classes only, ...)

How to cluster labeled credit client data? Cluster the entire training data, or cluster the members of predefined classes only?
Different models trained on cluster prototypes

[Figure: misclassification rate (in percent, roughly 22 to 36) of cluster-based RBF SVM with 50, 60, 80, 100, 110 and 120 clusters over 200 restart clusterings, sorted by misclassification rate, compared against case-by-case trained SVM with linear, 2-polynomial, RBF, Coulomb and FTE RBF kernels.]
ROC curves of SVM with different kernels ...

[Figure: ROC curves (true positive rate vs. false positive rate) of case-by-case trained SVM models.]

    kernel              AUC    MXE
    linear (cyan)       0.833  0.494
    polynomial (blue)   0.846  0.505
    rbf (green)         0.856  0.497
    coulomb (red)       0.858  0.527
    fte > rbf (black)   0.875  0.478
... and ROC curves of cluster based RBF SVM

[Figure: ROC curves of the SVM models, with the cluster based RBF SVM added.]

    kernel              AUC    MXE
    linear (black)      0.833  0.494
    polynomial (blue)   0.846  0.505
    rbf (green)         0.856  0.497
    coulomb (red)       0.858  0.527
    fte > rbf (dash)    0.875  0.478
Validation of cluster based SVM on a large credit client data set

It is not sufficient to validate the SVM models trained on a given set of cluster representatives! Improved validation includes the clustering itself, the outcome of which may differ when using different holdout sets:

1. Step A: Divide the training set into positive and negative cases, T = P ∪ N.
2. Step B: Subdivide both P and N of a large training set (of > 100,000 cases, say) into n (approximately equally sized) non-overlapping segments [P_1, P_2, ..., P_i, ..., P_n] and [N_1, N_2, ..., N_i, ..., N_n], with the smallest segment containing at least 30 cases, say.
3. Step C_i: Cluster both sets [P_1, ..., P_{i−1}, P_{i+1}, ..., P_n] and [N_1, ..., N_{i−1}, N_{i+1}, ..., N_n], obtaining 2c cluster representatives. Train an SVM on these labeled 2c points.
4. Step D_i: Validate the i-th SVM on the held-out segment [P_i, N_i] only.
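Steps A through D can be sketched as a block-leaveout loop, again assuming scikit-learn and small synthetic data; the function and all names are ours, not from the talk (Step A is implicit in passing the positive and negative cases separately).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def block_leaveout_auc(X_pos, X_neg, n_segments=5, c=10, seed=0):
    """Segment each class, cluster the retained segments, train an SVM
    on the 2c prototypes, and validate on the left-out segment."""
    rng = np.random.default_rng(seed)
    seg_p = np.array_split(rng.permutation(len(X_pos)), n_segments)  # Step B
    seg_n = np.array_split(rng.permutation(len(X_neg)), n_segments)
    aucs = []
    for i in range(n_segments):
        keep_p = np.concatenate([s for j, s in enumerate(seg_p) if j != i])
        keep_n = np.concatenate([s for j, s in enumerate(seg_n) if j != i])
        # Step C_i: cluster the retained positive and negative cases.
        cp = KMeans(n_clusters=c, n_init=5, random_state=seed).fit(X_pos[keep_p]).cluster_centers_
        cn = KMeans(n_clusters=c, n_init=5, random_state=seed).fit(X_neg[keep_n]).cluster_centers_
        svm = SVC(kernel="rbf", gamma="scale").fit(
            np.vstack([cp, cn]), np.array([1] * c + [-1] * c))
        # Step D_i: validate only on the held-out segment [P_i, N_i].
        X_val = np.vstack([X_pos[seg_p[i]], X_neg[seg_n[i]]])
        y_val = np.array([1] * len(seg_p[i]) + [-1] * len(seg_n[i]))
        aucs.append(roc_auc_score(y_val, svm.decision_function(X_val)))
    return aucs

rng = np.random.default_rng(3)
aucs = block_leaveout_auc(rng.normal(loc=0.8, size=(300, 4)),
                          rng.normal(loc=-0.8, size=(300, 4)))
```

Each pass re-runs the clustering, so the validation covers the variability of the cluster representatives as well as the SVM.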
ROC curves for SVM with 50, 100, 200 and 400 clusters

[Figure: ROC curves from the block-leaveout validation for SVM trained on 50, 100, 200 and 400 cluster representatives.]
Sorted AUC for SVM with 50, 100, 200 and 400 clusters

[Figure: four panels showing area under ROC (AUC, sorted) and MXE over 100 block-leaveout computations, for SVM trained on 50, 100, 200 and 400 positive/negative cluster centers. Recoverable panel legends:
    50 pos / 50 neg:    mean AUC = 0.6446, mean MXE = 0.6002
    100 pos / 100 neg:  mean AUC = 0.6735, mean MXE = 0.6270]
Sliced ROC for SVM with 50, 100, 200 and 400 clusters

[Figure: sliced ROC curves for the four cluster numbers.]
Validation of SVM on 15 different cluster numbers

[Figure: area under ROC (AUC, roughly 0.62 to 0.67) for SVM trained on cluster centers from two k-means start series, plus the output combination, computed on the holdout sets of both series; the number of cluster centres ranges from 50 to 400.]
Example of ROC validation and output combinations

[Figure: validation ROC curves of SVM on [nee115_100 + r1_nee115_100]:
    AUC(1) = 0.661921, AUC(2) = 0.6470367, AUC(1+2) = 0.671793]
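One common form of output combination is averaging the decision values of the two runs before computing the ROC; whether the talk uses this exact rule is our assumption. A minimal sketch with synthetic scores (the gain from combining here is illustrative, not the talk's result):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=400)
# Stand-ins for decision values of two SVMs trained on cluster centers
# from two independent k-means start series: noisy views of the label.
s1 = y + rng.normal(scale=1.2, size=400)
s2 = y + rng.normal(scale=1.2, size=400)

auc1 = roc_auc_score(y, s1)
auc2 = roc_auc_score(y, s2)
auc_comb = roc_auc_score(y, 0.5 * (s1 + s2))  # simple score averaging
```

With independent noise in the two score vectors, averaging reduces the noise, so the combined AUC typically exceeds either single AUC, mirroring the AUC(1+2) pattern above.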