Improving Electric Fraud Detection Using Class Imbalance Strategies
Eng. Federico Decia, Eng. Matías Di Martino, Eng. Juan Molinelli, Prof. Alicia Fernández
Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
Outline: Problem description · Data · Imbalance problem · Strategy proposed · Results and conclusions

Introduction
Problem description
Nontechnical losses represent a very high cost to power supply companies.

Background
Research has been conducted in different countries to tackle this problem:
Ramos et al., 2010 ← Brazil
Nagi and Mohamad, 2010 ← Malaysia
Muniz et al., 2009 ← Rio de Janeiro, Brazil

Uruguay
In Uruguay the national electric power company (henceforth called UTE) faces the problem by manually monitoring a group of customers.
Difficulties
Large number of customers (in the capital city alone there are 500,000).
Wide variety of scams and ways to alter consumption meters.
Other factors:
Fraud history.
Building address and dimensions.
Meter type.
Objective
Develop an automatic tool that, based on manually labeled data, detects new suspicious customers.
Data Description
DATASET 1
1504 industrial profiles (October 2004 - September 2009).
Each profile is represented by the customer's monthly consumption.
Labels for each customer are provided by UTE.
Used for training and theoretical evaluation.

DATASET 2
3300 industrial profiles (January 2008 - January 2011).
Each profile is represented by the customer's monthly consumption.
Used for on-field evaluation.
The class imbalance problem
In fraud detection one cannot assume that the number of people who commit fraud equals the number who do not; usually there are far fewer elements in the fraudulent class.

Strategies
Under-sampling
One-Class SVM and Cost-Sensitive SVM
Recall, Precision and F-value
Recall_p = TP / (TP + FN)
Recall_n = TN / (TN + FP)
Precision = TP / (TP + FP)
F-value = (1 + β²) · Recall_p · Precision / (β² · Recall_p + Precision)

                  Labeled as Positive | Labeled as Negative
Actual Positive | TP (True Positive)  | FN (False Negative)
Actual Negative | FP (False Positive) | TN (True Negative)
Table: Confusion matrix
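The metrics above can be computed directly from the confusion-matrix counts. A minimal sketch (the counts below are made up for illustration, not taken from the paper's results):

```python
def fraud_metrics(tp, fn, fp, tn, beta=1.0):
    """Per-class recall, precision, and the F-value defined above."""
    recall_p = tp / (tp + fn)    # recall on the fraud (positive) class
    recall_n = tn / (tn + fp)    # recall on the normal (negative) class
    precision = tp / (tp + fp)
    f_value = (1 + beta**2) * recall_p * precision / (beta**2 * recall_p + precision)
    return recall_p, recall_n, precision, f_value

# Hypothetical counts: 30 frauds caught, 10 missed, 20 false alarms, 940 correct normals.
rp, rn, pr, f = fraud_metrics(tp=30, fn=10, fp=20, tn=940)
```

Note that with β = 1 the F-value reduces to the usual harmonic mean of Recall_p and Precision; β > 1 weights recall on the fraud class more heavily.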
Strategy proposed
Block Diagram
The system input corresponds to the last three years of the monthly consumption curve of each customer.
Figure: Block Diagram
Features
Ratio between the consumption of the last 3, 6 and 12 months and the average consumption.
Norm of the difference between the expected consumption and the actual consumption.
Figure: Consumption is estimated as the consumption of the same month of the previous year multiplied by the ratio between the mean consumption of each year.
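A sketch of this feature, assuming monthly consumption is stored oldest-to-newest in a NumPy array (function and variable names are ours, not from the paper):

```python
import numpy as np

def expected_consumption_feature(consumption):
    """Norm of (expected - actual) consumption over the last 12 months.

    Expected value for month m of the last year = consumption of the same
    month of the previous year, scaled by the ratio between the two years'
    mean consumptions.
    """
    last_year = consumption[-12:]
    prev_year = consumption[-24:-12]
    ratio = last_year.mean() / prev_year.mean()
    expected = prev_year * ratio
    return np.linalg.norm(expected - last_year)

# A perfectly regular consumer: expectation matches reality, feature is 0.
flat = np.full(24, 100.0)
feat = expected_consumption_feature(flat)
```

A customer whose last year deviates from the scaled previous year (e.g. a sudden drop in some months) yields a large norm, which is the suspicious pattern this feature targets.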
Difference between Fourier and wavelet coefficients from the last and previous years.
Difference between the coefficients of the polynomial fit.
Figure: Difference in the coefficients of the polynomial that best fits the consumption curve.
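One way to compute this feature with a least-squares fit; the polynomial degree and the year-over-year comparison are our assumptions, since the slide does not fix them:

```python
import numpy as np

def poly_coefficient_difference(consumption, degree=2):
    """Difference between the polynomial coefficients fitted to the
    last 12 months and to the previous 12 months of consumption."""
    t = np.arange(12)
    coeffs_prev = np.polyfit(t, consumption[-24:-12], degree)
    coeffs_last = np.polyfit(t, consumption[-12:], degree)
    return coeffs_last - coeffs_prev

# Two identical flat years: the fitted coefficients coincide.
diff = poly_coefficient_difference(np.full(24, 50.0))
```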
Distance to the mean customer.
Figure: We compare each consumption curve with the mean curve.
Variance.
Figure: Changes in the variance value.
Figure: Comparison with the normal variance value.
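These two features can be sketched as follows; the exact normalization and the year-over-year variance ratio are our assumptions:

```python
import numpy as np

def distance_to_mean_customer(curve, mean_curve):
    """Euclidean distance between a customer's consumption curve
    and the mean curve over all customers."""
    return np.linalg.norm(curve - mean_curve)

def variance_change(curve):
    """Ratio between the variance of the last year and the previous one;
    a sharp drop in variance can indicate a tampered meter."""
    return curve[-12:].var() / curve[-24:-12].var()

# Oscillating previous year, then a suspiciously flat last year.
curve = np.concatenate([np.tile([90.0, 110.0], 6), np.full(12, 100.0)])
vc = variance_change(curve)                                   # variance collapsed
d = distance_to_mean_customer(curve[-12:], np.full(12, 100.0))
```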
Global characteristics.
Figure: Slope of the straight line that best fits the consumption curve.
Figure: Module of the first five Fourier coefficients.
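Both global characteristics can be sketched with NumPy (window length and use of the real FFT are our choices):

```python
import numpy as np

def global_features(consumption):
    """Slope of the best-fit straight line, plus the moduli of the
    first five Fourier coefficients of the consumption curve."""
    t = np.arange(len(consumption))
    slope = np.polyfit(t, consumption, 1)[0]      # degree-1 fit, keep the slope
    fourier_mod = np.abs(np.fft.rfft(consumption))[:5]
    return slope, fourier_mod

# Linearly increasing consumption over 36 months: slope is exactly 1.
slope, mods = global_features(np.arange(36, dtype=float))
```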
Feature Selection

Evaluation methods used:
Filter → CfsSubsetEval (Weka)
Wrapper aiming to improve the F-value of the different classifiers considered.

Search methods used:
Exhaustive search (only for CfsSubsetEval)
Best First (for all other methods)
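The wrapper approach can be sketched as a greedy forward search that adds, at each step, the feature whose inclusion most improves the score. The scoring function below is a toy stand-in for the cross-validated F-value of a real classifier:

```python
def forward_wrapper_selection(features, score):
    """Greedy forward selection (a simplified Best-First search).

    features: candidate feature names.
    score: callable mapping a feature subset to a quality value
           (e.g. a cross-validated F-value).
    """
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        best_f = None
        for f in [f for f in features if f not in selected]:
            s = score(selected + [f])
            if s > best:
                best, best_f, improved = s, f, True
        if improved:
            selected.append(best_f)
    return selected, best

# Toy scorer: only 'ratio' and 'variance' help; the rest are noise.
toy = lambda subset: len(set(subset) & {'ratio', 'variance'})
sel, sc = forward_wrapper_selection(['ratio', 'fourier', 'variance', 'slope'], toy)
```

A full Best-First search also keeps a queue of partial subsets and allows backtracking; the greedy version above is the simplest member of that family.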
Classifiers
Classifiers used
The classifiers considered to tackle this problem were:
One-Class Support Vector Machine (O-SVM)
Cost-Sensitive Support Vector Machine (CS-SVM)
Optimum Path Forest (OPF)
C4.5 decision tree
SVM Parameters
Compromise parameter: O-SVM → ν; CS-SVM → C.
Kernel: Gaussian → γ.
Optimal parameters were found using 10-fold cross validation; performance was measured by the F-value.
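A sketch of this parameter search for the CS-SVM case with scikit-learn; the grid values, the class weight, and the synthetic data are our assumptions (the paper tunes ν, C and γ on the real datasets with 10 folds):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 90 normal customers, 10 fraudulent ones.
X = np.vstack([rng.normal(0, 1, (90, 4)), rng.normal(3, 1, (10, 4))])
y = np.array([0] * 90 + [1] * 10)

# CS-SVM: Gaussian kernel, higher misclassification cost on the minority
# class, (C, gamma) chosen by cross validation with the F-value as score.
grid = GridSearchCV(
    SVC(kernel='rbf', class_weight={1: 9}),
    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
    scoring='f1',
    cv=5,
)
grid.fit(X, y)
```

For the O-SVM case one would instead fit `sklearn.svm.OneClassSVM` on the normal class only and sweep ν and γ; the grid-search pattern is the same.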
Optimum Path Forest
Euclidean distance was used as the distance function.
Raw input vectors were used (instead of the features proposed here).
The majority class is under-sampled during the training step to improve the final performance (in the F-value sense).
C4.5
The fourth classifier used is the decision tree proposed by Ross Quinlan: C4.5.
As with OPF, we under-sample the majority class during the training step to improve the final performance (in the F-value sense).
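The random under-sampling used before training OPF and C4.5 can be sketched as follows; the slides do not specify the sampling scheme beyond "under-sample the majority class", so the target ratio here is our assumption:

```python
import numpy as np

def undersample_majority(X, y, ratio=1.0, seed=0):
    """Keep every minority sample and a random subset of the majority
    class, so that majority count ≈ ratio × minority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = rng.choice(np.flatnonzero(y == majority),
                         size=int(ratio * len(min_idx)), replace=False)
    keep = np.concatenate([min_idx, maj_idx])
    return X[keep], y[keep]

# 10 fraudulent vs 90 normal profiles -> balanced 10 vs 10 training set.
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.array([1] * 10 + [0] * 90)
Xb, yb = undersample_majority(X, y)
```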
Combination
Why?
Improve final performance.
More robust and general solution.

Decision rule
g_p(x) = λ_os^p · d_os^p + λ_cs^p · d_cs^p + λ_OPF^p · d_OPF^p + λ_Tree^p · d_Tree^p   (1)
g_n(x) = λ_os^n · d_os^n + λ_cs^n · d_cs^n + λ_OPF^n · d_OPF^n + λ_Tree^n · d_Tree^n   (2)
d_j^i(x) = 1 if classifier j labels the sample as class i, and 0 otherwise.
g_p(x) > g_n(x) → x labeled as positive
g_p(x) < g_n(x) → x labeled as negative
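A minimal sketch of this decision rule; the votes and weight values below are hypothetical, chosen only to exercise the rule:

```python
def weighted_vote(votes, weights):
    """Combine classifier decisions by weighted majority vote.

    votes[j]: label ('p' or 'n') given by classifier j.
    weights[j]: (lambda_p, lambda_n) pair for classifier j.
    Returns 'p' if g_p(x) > g_n(x), else 'n'.
    """
    g_p = sum(w_p for v, (w_p, w_n) in zip(votes, weights) if v == 'p')
    g_n = sum(w_n for v, (w_p, w_n) in zip(votes, weights) if v == 'n')
    return 'p' if g_p > g_n else 'n'

# Hypothetical case: O-SVM and OPF vote fraud; CS-SVM and C4.5 vote normal.
votes = ['p', 'n', 'p', 'n']
weights = [(2.0, 2.0), (1.0, 1.0), (1.5, 1.5), (0.5, 0.5)]
label = weighted_vote(votes, weights)
```

Here g_p = 2.0 + 1.5 = 3.5 and g_n = 1.0 + 0.5 = 1.5, so the combined decision is fraud even though the vote count is tied: the per-classifier weights break the tie.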
Weights
Traditional case (Kuncheva, 2004), weighted majority vote rule:
Hypothesis: independence.
Objective: maximize overall accuracy.
λ_j = log( Accuracy_j / (1 − Accuracy_j) )
Accuracy_j: ratio of correctly classified samples for the j-th classifier.
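The log-odds weights follow directly from each classifier's accuracy (the accuracies below are made up for illustration):

```python
import math

def classifier_weight(accuracy):
    """Weighted-majority-vote weight (Kuncheva, 2004):
    lambda_j = log(accuracy_j / (1 - accuracy_j))."""
    return math.log(accuracy / (1 - accuracy))

# A 90%-accurate classifier weighs more than a 75% one; a coin flip gets 0.
lams = [classifier_weight(a) for a in (0.9, 0.75, 0.5)]
```

Note the weight is 0 at 50% accuracy (the classifier carries no information) and grows without bound as accuracy approaches 1, which is why the independence hypothesis matters: correlated near-perfect classifiers would otherwise dominate the vote.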