Logistic regression for probability of default
CREDIT RISK MODELING IN PYTHON
Michael Crabtree, Data Scientist, Ford Motor Company

Probability of default
The likelihood that someone will default on a loan is the probability of default.
A probability value between 0 and 1, like 0.86.
loan_status of 1 is a default, or 0 for non-default.

    Probability of default   Interpretation             Predicted loan status
    0.4                      Unlikely to default        0
    0.90                     Very likely to default     1
    0.1                      Very unlikely to default   0

Predicting probabilities
Probabilities of default are an outcome of machine learning.
Models learn from data in columns (features).
Classification models predict one of two classes (default, non-default).
Two most common models:
    Logistic regression
    Decision tree

Logistic regression
Similar to linear regression, but only produces values between 0 and 1.

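A minimal sketch of why the output stays between 0 and 1: logistic regression passes a linear score through the logistic (sigmoid) function. The function name sigmoid and the sample scores are illustrative assumptions, not part of the course material.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear model's raw score can be any real number ...
raw_scores = [-4.0, -1.0, 0.0, 1.0, 4.0]
# ... while the logistic transform always lands inside (0, 1).
probabilities = [sigmoid(z) for z in raw_scores]
```

A score of 0 maps to exactly 0.5, and larger scores map to larger probabilities.
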
Training a logistic regression
Logistic regression is available within the scikit-learn package:
    import numpy as np
    from sklearn.linear_model import LogisticRegression
Instantiated like a function call, with or without parameters:
    clf_logistic = LogisticRegression(solver='lbfgs')
Uses the method .fit() to train:
    clf_logistic.fit(training_columns, np.ravel(training_labels))
Training columns: all of the columns in our data except loan_status
Labels: loan_status (0, 1)

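The training call can be tried end to end on made-up data. This is only a sketch: the toy columns, the rule that generates loan_status, and the random seed are all assumptions standing in for the course's loan data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the course's loan data set
rng = np.random.default_rng(0)
training_columns = pd.DataFrame({
    "loan_int_rate": rng.uniform(5, 20, size=200),
    "person_emp_length": rng.integers(0, 30, size=200),
})
# Made-up rule: higher interest rates default more often
noisy_rate = training_columns["loan_int_rate"] + rng.normal(0, 3, size=200)
training_labels = pd.DataFrame({"loan_status": (noisy_rate > 13).astype(int)})

clf_logistic = LogisticRegression(solver="lbfgs")
# np.ravel flattens the single-column DataFrame into the 1-D array .fit() expects
clf_logistic.fit(training_columns, np.ravel(training_labels))
```

After fitting, clf_logistic.coef_ holds one coefficient per training column.
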
Training and testing
The entire data set is usually split into two parts.

    Data Subset   Usage                                         Portion
    Train         Learn from the data to generate predictions   60%
    Test          Test learning on new unseen data              40%

Creating the training and test sets
Separate the data into training columns and labels:
    X = cr_loan.drop('loan_status', axis=1)
    y = cr_loan[['loan_status']]
Use the train_test_split() function from scikit-learn:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
test_size: proportion of the data held out for the test set (.4 is 40%)
random_state: a random seed value for reproducibility

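A small sketch of the split, assuming a hypothetical 10-row frame in place of cr_loan, shows how test_size=.4 divides the rows 60/40 and keeps features and labels aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-in for cr_loan
cr_loan = pd.DataFrame({
    "person_income": np.arange(10) * 1000 + 30000,
    "loan_int_rate": np.linspace(6.0, 15.0, 10),
    "loan_status": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

X = cr_loan.drop("loan_status", axis=1)  # features only
y = cr_loan[["loan_status"]]             # labels only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123
)
```

With 10 rows, 6 go to training and 4 to testing, and each X row keeps the same index as its y row.
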
Predicting the probability of default

Logistic regression coefficients
    # Model Intercept
    array([-3.30582292e-10])
    # Coefficients for ['loan_int_rate','person_emp_length','person_income']
    array([[ 1.28517496e-09, -2.27622202e-09, -2.17211991e-05]])
    # Calculating probability of default
    int_coef_sum = -3.3e-10 + (1.29e-09 * loan_int_rate) + (-2.28e-09 * person_emp_length) + (-2.17e-05 * person_income)
    prob_default = 1 / (1 + np.exp(-int_coef_sum))
    prob_nondefault = 1 - (1 / (1 + np.exp(-int_coef_sum)))

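The same calculation can be checked against a fitted model. The sketch below assumes hypothetical toy data and a default binary LogisticRegression; it reads intercept_ and coef_ from the model and rebuilds the default probability by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features, binary label
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression(solver="lbfgs").fit(X, y)

# Intercept plus the sum of coefficient * feature value, for every row
int_coef_sum = clf.intercept_[0] + X @ clf.coef_[0]
prob_default = 1 / (1 + np.exp(-int_coef_sum))
prob_nondefault = 1 - prob_default
```

For this setup the hand-built prob_default should match the second column of clf.predict_proba(X).
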
Interpreting coefficients
    # Intercept
    intercept = -1.02
    # Coefficient for employment length
    person_emp_length_coef = -0.056
For every 1 year increase in person_emp_length, the log-odds of default fall by 0.056, so the person is less likely to default.

    intercept   person_emp_length   value * coef   probability of default
    -1.02       10                  (10 * -0.06)   .17
    -1.02       11                  (11 * -0.06)   .16
    -1.02       12                  (12 * -0.06)   .15

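The table's probabilities can be reproduced with a few lines of arithmetic. This sketch assumes, as the table does, that the coefficient is rounded to -0.06; the helper name prob_of_default is illustrative.

```python
import math

intercept = -1.02
coef = -0.06  # the table rounds the fitted -0.056 to two decimals

def prob_of_default(emp_length):
    # Linear score first, then the logistic transform
    z = intercept + emp_length * coef
    return 1 / (1 + math.exp(-z))

table = {years: round(prob_of_default(years), 2) for years in (10, 11, 12)}
```

Each extra year of employment lowers the score by 0.06, which is why the probability drifts down from .17 to .15.
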
Using non-numeric columns
Numeric: loan_int_rate, person_emp_length, person_income
Non-numeric: cr_loan_clean['loan_intent']
    EDUCATION
    MEDICAL
    VENTURE
    PERSONAL
    DEBTCONSOLIDATION
    HOMEIMPROVEMENT
Non-numeric columns will cause errors with machine learning models in Python unless processed.

One-hot encoding
Represent a string with a number: a 0 or 1 in a new column named column_VALUE.

Get dummies
Utilize the get_dummies() function within pandas:
    import pandas as pd
    # Separate the numeric columns
    cred_num = cr_loan.select_dtypes(exclude=['object'])
    # Separate non-numeric columns
    cred_cat = cr_loan.select_dtypes(include=['object'])
    # One-hot encode the non-numeric columns only
    cred_cat_onehot = pd.get_dummies(cred_cat)
    # Union the numeric columns with the one-hot encoded columns
    cr_loan = pd.concat([cred_num, cred_cat_onehot], axis=1)

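A sketch of the same steps on a hypothetical three-row frame shows the column_VALUE naming in practice. The data and the name cr_loan_prep are assumptions; the text column is stored explicitly as object dtype so the include=['object'] filter matches it whatever string dtype pandas uses by default.

```python
import pandas as pd

# Hypothetical three-row stand-in for cr_loan
cr_loan = pd.DataFrame({
    "person_income": [52000, 48000, 75000],
    "loan_intent": pd.Series(["EDUCATION", "MEDICAL", "EDUCATION"], dtype="object"),
})

cred_num = cr_loan.select_dtypes(exclude=["object"])   # numeric columns
cred_cat = cr_loan.select_dtypes(include=["object"])   # non-numeric columns
cred_cat_onehot = pd.get_dummies(cred_cat)             # one column per category
cr_loan_prep = pd.concat([cred_num, cred_cat_onehot], axis=1)
```

The single loan_intent column becomes loan_intent_EDUCATION and loan_intent_MEDICAL, each marking which rows held that value.
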
Predicting the future, probably
Use the .predict_proba() method within scikit-learn:
    # Train the model
    clf_logistic.fit(X_train, np.ravel(y_train))
    # Predict using the model
    clf_logistic.predict_proba(X_test)
Creates an array of probabilities of default:
    # Probabilities: [[non-default, default]]
    array([[0.55, 0.45]])

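A short sketch on made-up data illustrates the shape of the .predict_proba() output: one row per loan, one column per class, with each row summing to 1. The random features and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 80 training rows, 5 test rows, 3 features
rng = np.random.default_rng(2)
X_train = rng.normal(size=(80, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=80) > 0).astype(int)
X_test = rng.normal(size=(5, 3))

clf_logistic = LogisticRegression(solver="lbfgs")
clf_logistic.fit(X_train, np.ravel(y_train))

preds = clf_logistic.predict_proba(X_test)
# Column order follows clf_logistic.classes_, here [0, 1],
# so the second column is the probability of default
prob_default = preds[:, 1]
```
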
Credit model performance

Model accuracy scoring
Calculate accuracy with the .score() method from scikit-learn:
    # Check the accuracy against the test data
    clf_logistic1.score(X_test,y_test)
    0.81
81% of values for loan_status predicted correctly.

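For scikit-learn classifiers, .score() returns mean accuracy, which is just the share of predictions that match the true labels. The sketch below uses hypothetical labels and predictions to show the equivalence.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for 10 loans
y_test = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0])

# Fraction of positions where prediction and truth agree
manual_accuracy = float(np.mean(y_test == y_pred))
library_accuracy = accuracy_score(y_test, y_pred)
```

Eight of the ten predictions match, so both values come out at 0.8.
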
ROC curve charts
Receiver Operating Characteristic curve
Plots true positive rate (sensitivity) against false positive rate (fall-out):
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve
    fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
    plt.plot(fallout, sensitivity, color = 'darkorange')

Analyzing ROC charts
Area Under Curve (AUC): the area under the ROC curve. Random prediction is the diagonal line with an AUC of 0.5, so the area between the curve and that diagonal shows the model's lift over random.

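A four-loan sketch makes the ROC points and the AUC concrete. The labels and probabilities are hypothetical, and roc_auc_score is used here as one way to compute the area; the slides themselves only show roc_curve.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted default probabilities
y_test = np.array([0, 0, 1, 1])
prob_default = np.array([0.1, 0.4, 0.35, 0.8])

# One (fall-out, sensitivity) point per candidate threshold
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
auc = roc_auc_score(y_test, prob_default)
```

Here the AUC is 0.75, halfway between random prediction (0.5) and a perfect ranking (1.0).
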
Default thresholds
Threshold: the probability cutoff above which a loan is labeled a default.

Setting the threshold
Relabel loans based on our threshold of 0.5:
    preds = clf_logistic.predict_proba(X_test)
    preds_df = pd.DataFrame(preds[:,1], columns = ['prob_default'])
    preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.5 else 0)

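The effect of moving the threshold can be sketched with a hand-written predict_proba-style array. The probabilities and the helper relabel are assumptions; the helper is a vectorized equivalent of the .apply(lambda ...) call.

```python
import numpy as np
import pandas as pd

# Hypothetical predict_proba output: columns are [non-default, default]
preds = np.array([[0.55, 0.45], [0.10, 0.90], [0.70, 0.30], [0.35, 0.65]])
preds_df = pd.DataFrame(preds[:, 1], columns=["prob_default"])

def relabel(prob_default, threshold):
    # 1 (default) when the probability exceeds the threshold, else 0
    return (prob_default > threshold).astype(int)

preds_df["loan_status"] = relabel(preds_df["prob_default"], 0.5)
status_at_040 = relabel(preds_df["prob_default"], 0.4)
```

Lowering the threshold from 0.5 to 0.4 flips the first loan (probability 0.45) from non-default to default.
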
Credit classification reports
classification_report() within scikit-learn:
    from sklearn.metrics import classification_report
    classification_report(y_test, preds_df['loan_status'], target_names=target_names)

Selecting classification metrics
Select and store specific components from the classification_report().
Use the precision_recall_fscore_support() function from scikit-learn:
    from sklearn.metrics import precision_recall_fscore_support
    # Returns (precision, recall, fscore, support); [1] is recall, [1][1] is recall for class 1 (default)
    precision_recall_fscore_support(y_test,preds_df['loan_status'])[1][1]

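Unpacking the return value by name can make the [1][1] indexing easier to read. This sketch uses hypothetical labels and predictions for ten loans.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true labels and thresholded predictions
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# Each returned array has one entry per class: index 0 = non-default, 1 = default
precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_pred)
default_recall = recall[1]
```

Only 2 of the 5 true defaults are caught, so default_recall is 0.4.
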
Model discrimination and impact

Confusion matrices
Show the number of correct and incorrect predictions for each loan_status.

Default recall for loan status
Default recall (or sensitivity) is the proportion of true defaults that the model correctly predicts as defaults.

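A sketch with hypothetical labels ties the confusion matrix to default recall: recall is the true-default count divided by all actual defaults. The use of confusion_matrix and recall_score here is an illustrative choice.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and thresholded predictions
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[true non-default, false default], [missed default, true default]]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

# Share of actual defaults that the model caught
default_recall = tp / (tp + fn)
```
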
Recall portfolio impact
Classification report - Underperforming Logistic Regression model
Number of true defaults: 50,000

    Loan Amount   Defaults Predicted / Not Predicted   Estimated Loss on Defaults
    $50           .04 / .96                            (50000 x .96) x 50 = $2,400,000

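The loss estimate is simple arithmetic and can be wrapped in a small helper to compare recall levels. The function name estimated_loss and the 0.50 comparison value are assumptions for illustration; the first call uses the slide's numbers.

```python
def estimated_loss(n_true_defaults, default_recall, loan_amount):
    """Loss from defaults the model failed to predict."""
    missed_defaults = n_true_defaults * (1 - default_recall)
    return missed_defaults * loan_amount

# Figures from the slide: 50,000 true defaults, recall of .04, $50 per loan
loss_low_recall = estimated_loss(50_000, 0.04, 50)
# Hypothetical comparison: the same portfolio with recall of .50
loss_mid_recall = estimated_loss(50_000, 0.50, 50)
```

Under these assumptions, raising default recall from .04 to .50 cuts the estimated loss from $2,400,000 to $1,250,000.
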
Recall, precision, and accuracy
Difficult to maximize all of them because there is a trade-off.

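The trade-off can be seen by sweeping the threshold over one set of predictions. The labels, probabilities, and thresholds below are hypothetical, and the helper metrics_at is an illustrative name.

```python
import numpy as np

# Hypothetical true labels and predicted default probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
prob_default = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])

def metrics_at(threshold):
    """Return (default recall, default precision, accuracy) at a threshold."""
    y_pred = (prob_default > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = float(np.mean(y_pred == y_true))
    return recall, precision, accuracy

results = {t: metrics_at(t) for t in (0.25, 0.50, 0.65)}
```

On this toy data, lowering the threshold raises default recall while precision falls, and no single threshold gives the best value of all three metrics.
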