Gradient boosted trees with XGBoost

  1. Gradient boosted trees with XGBoost
Credit Risk Modeling in Python
Michael Crabtree, Data Scientist, Ford Motor Company

  2. Decision trees
Create predictions similar to logistic regression
Not structured like a regression

  3. Decision trees for loan status
Simple decision tree for predicting loan_status (probability of default); a rough code stand-in for the tree diagram follows
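
The original slide shows this as a tree diagram. As a rough stand-in only, here is a minimal scikit-learn sketch (not from the slides); it assumes the course's prepared X_train, y_train, and X_test splits.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A shallow tree, comparable to the simple diagram on the slide
tree = DecisionTreeClassifier(max_depth = 2)
tree.fit(X_train, np.ravel(y_train))

# The second column of predict_proba() is the probability of default (loan_status = 1)
print(tree.predict_proba(X_test)[:, 1])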

  4. Decision tree impact

Loan   True loan status   Pred. loan status   Loan payoff value   Selling value   Gain/Loss
1      0                  1                   $1,500              $250            -$1,250
2      0                  1                   $1,200              $250            -$950

  5. A forest of trees
XGBoost uses many simplistic trees (an ensemble)
Each tree will be slightly better than a coin toss

  6. Creating and training trees
Part of the xgboost Python package, called xgb here
Trains with .fit() just like the logistic regression model

import numpy as np
import xgboost as xgb
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model
clf_logistic = LogisticRegression()
# Train the logistic regression
clf_logistic.fit(X_train, np.ravel(y_train))

# Create a gradient boosted tree model
clf_gbt = xgb.XGBClassifier()
# Train the gradient boosted tree
clf_gbt.fit(X_train, np.ravel(y_train))

  7. Default predictions with XGBoost
Predicts with both .predict() and .predict_proba()
.predict_proba() produces a value between 0 and 1
.predict() produces a 1 or 0 for loan_status

# Predict probabilities of default
gbt_preds_prob = clf_gbt.predict_proba(X_test)
# Predict loan_status as a 1 or 0
gbt_preds = clf_gbt.predict(X_test)

# gbt_preds_prob
array([[0.059, 0.940],
       [0.121, 0.879]])
# gbt_preds
array([1, 1, 0...])

  8. Hyperparameters of gradient boosted trees
Hyperparameters: model parameters (settings) that cannot be learned from data
Some common hyperparameters for gradient boosted trees:
learning_rate: smaller values make each step more conservative
max_depth: sets how deep each tree can go; larger means more complex

xgb.XGBClassifier(learning_rate = 0.2, max_depth = 4)

  9. Let's practice!

  10. Column selection for credit risk
Credit Risk Modeling in Python
Michael Crabtree, Data Scientist, Ford Motor Company

  11. Choosing specific columns
We've been using all columns for predictions

# Select a few specific columns
X_multi = cr_loan_prep[['loan_int_rate','person_emp_length']]
# Select all data except loan_status
X = cr_loan_prep.drop('loan_status', axis = 1)

How you can tell how important each column is:
Logistic regression: column coefficients
Gradient boosted trees: ?

  12. Column importances
Use the .get_booster() and .get_score() methods
Weight: the number of times the column appears in all trees

# Train the model
clf_gbt.fit(X_train, np.ravel(y_train))
# Print the feature importances
clf_gbt.get_booster().get_score(importance_type = 'weight')

{'person_home_ownership_RENT': 1, 'person_home_ownership_OWN': 2}

  13. Column importance interpretation

# Column importances from importance_type = 'weight'
{'person_home_ownership_RENT': 1, 'person_home_ownership_OWN': 2}

With 'weight', these values mean person_home_ownership_RENT appears once across all trees, and person_home_ownership_OWN appears twice

  14. Plotting column importances
Use the plot_importance() function

import matplotlib.pyplot as plt
xgb.plot_importance(clf_gbt, importance_type = 'weight')
plt.show()

# Importance values behind the plot
{'person_income': 315, 'loan_int_rate': 195, 'loan_percent_income': 146}

  15. Choosing training columns
Column importance is sometimes used to decide which columns to use for training
Different column sets affect the performance of the models

Model columns                                           Importances   Model accuracy   Default recall
loan_int_rate, person_emp_length                        (100, 100)    0.81             0.67
loan_int_rate, person_emp_length, loan_percent_income   (98, 70, 5)   0.84             0.52
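
A minimal sketch of how such a comparison could be run (not from the slides; it assumes the course's X_train, X_test, y_train, y_test splits and the column subsets from the table above):

import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score, recall_score

for cols in [['loan_int_rate', 'person_emp_length'],
             ['loan_int_rate', 'person_emp_length', 'loan_percent_income']]:
    # Train a model on this column subset only
    model = xgb.XGBClassifier()
    model.fit(X_train[cols], np.ravel(y_train))
    preds = model.predict(X_test[cols])
    # Accuracy over all loans; recall on the default class (loan_status = 1)
    print(cols, accuracy_score(np.ravel(y_test), preds),
          recall_score(np.ravel(y_test), preds))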

  16. F1 scoring for models
Thinking about accuracy and recall for different column groups is time consuming
The F1 score is a single metric that balances precision and recall
Shows up as a part of the classification_report()
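
A minimal sketch of reading the F1 score from scikit-learn's classification_report (assuming y_test and the gbt_preds predictions from earlier; the target_names labels are illustrative):

from sklearn.metrics import classification_report

# The f1-score column reports F1 for each class
print(classification_report(y_test, gbt_preds,
                            target_names = ['Non-Default', 'Default']))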

  17. Let's practice!

  18. Cross validation for credit models
Credit Risk Modeling in Python
Michael Crabtree, Data Scientist, Ford Motor Company

  19. Cross validation basics
Used to train and test the model in a way that simulates using the model on new data
Segments training data into different pieces to estimate future performance
Uses DMatrix, an internal structure optimized for XGBoost
Early stopping tells cross validation to stop after a scoring metric has not improved for a number of iterations

  20. How cross validation works
Trains on portions of the training data (called folds) and tests against the unused portion
Final testing against the actual test set

https://scikit-learn.org/stable/modules/cross_validation.html

  21. Setting up cross validation within XGBoost

# Set the number of folds
n_folds = 2
# Set early stopping number
early_stop = 5
# Set any specific parameters for cross validation
params = {'objective': 'binary:logistic',
          'seed': 99,
          'eval_metric': 'auc'}

'objective': 'binary:logistic' is used to specify classification for loan_status
'eval_metric': 'auc' tells XGBoost to score the model's performance on AUC

  22. Using cross validation within XGBoost

# Restructure the train data for xgboost
DTrain = xgb.DMatrix(X_train, label = y_train)
# Perform cross validation
xgb.cv(params, DTrain, num_boost_round = 5, nfold = n_folds,
       early_stopping_rounds = early_stop)

DMatrix() creates a special object for xgboost optimized for training

  23. The results of cross validation
Creates a data frame of the values from the cross validation, as sketched below
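
The original slide displays the resulting data frame. A minimal sketch of capturing and inspecting it (the column names follow xgb.cv's naming when 'eval_metric' is 'auc'):

cv_results = xgb.cv(params, DTrain, num_boost_round = 5, nfold = n_folds,
                    early_stopping_rounds = early_stop)
# One row per boosting round, with mean and std of train and test AUC:
# train-auc-mean, train-auc-std, test-auc-mean, test-auc-std
print(cv_results)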

  24. Cross validation scoring
Uses cross validation and scoring metrics with the cross_val_score() function in scikit-learn

# Import the module
from sklearn.model_selection import cross_val_score
# Create a gbt model
gbt = xgb.XGBClassifier(learning_rate = 0.4, max_depth = 10)
# Use cross validation and accuracy scores 5 consecutive times
cross_val_score(gbt, X_train, y_train, cv = 5)

array([0.92748092, 0.92575308, 0.93975392, 0.93378608, 0.93336163])

  25. Let's practice!

  26. Class imbalance in loan data
Credit Risk Modeling in Python
Michael Crabtree, Data Scientist, Ford Motor Company

  27. Not enough defaults in the data
The values of loan_status are the classes
Non-default: 0
Default: 1

y_train['loan_status'].value_counts()

loan_status   Training data count   Percentage of total
0             13,798                78%
1             3,877                 22%
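
A small aside (not on the slide): pandas can produce the percentage column directly with normalize = True:

# Class counts: 13,798 non-defaults and 3,877 defaults
y_train['loan_status'].value_counts()
# Class fractions: roughly 0.78 and 0.22
y_train['loan_status'].value_counts(normalize = True)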

  28. Model loss function
Gradient boosted trees in xgboost use a loss function of log-loss
The goal is to minimize this value

True loan status   Predicted probability   Log loss
1                  0.1                     2.3
0                  0.9                     2.3

An inaccurately predicted default has more negative financial impact
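
A quick check of the table's values, assuming the standard log-loss formula for a single prediction, -(y log(p) + (1 - y) log(1 - p)), which is not spelled out on the slide:

import numpy as np

# True default (y = 1) with predicted default probability p = 0.1: -log(0.1) is about 2.3
print(-np.log(0.1))
# True non-default (y = 0) with predicted default probability p = 0.9: -log(1 - 0.9) is about 2.3
print(-np.log(1 - 0.9))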

  29. The cost of imbalance
A false negative (a default predicted as non-default) is much more costly

Person   Loan amount   Potential profit   Predicted status   Actual status   Losses
A        $1,000        $10                Default            Non-Default     -$10
B        $1,000        $10                Non-Default        Default         -$1,000

Log-loss for the model is the same for both, but our actual losses are not

  30. Causes of imbalance
Data problems:
Credit data was not sampled correctly
Data storage problems
Business processes:
Measures already in place to not accept probable defaults
Probable defaults are quickly sold to other firms
Behavioral factors:
Normally, people do not default on their loans
The less often they default, the higher their credit rating

  31. Dealing with class imbalance
Several ways to deal with class imbalance in data:

Method                    Pros                            Cons
Gather more data          Increases number of defaults    Percentage of defaults may not change
Penalize models           Increases recall for defaults   Model requires more tuning and maintenance
Sample data differently   Least technical adjustment      Fewer defaults in data
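
As a sketch of the "penalize models" row (not shown on the slides), XGBoost's scale_pos_weight parameter weights the default class more heavily; the class counts used here are the ones computed two slides later:

import numpy as np
import xgboost as xgb

# Weight defaults by the non-default/default ratio (about 13798 / 3877)
ratio = count_nondefault / count_default
clf_gbt_weighted = xgb.XGBClassifier(scale_pos_weight = ratio)
clf_gbt_weighted.fit(X_train, np.ravel(y_train))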

  32. Undersampling strategy
Combine a smaller random sample of non-defaults with the defaults

  33. Combining the split data sets
Test and training set must be put back together
Create two new sets based on actual loan_status

import pandas as pd

# Concat the training sets
X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)
# Get the counts of defaults and non-defaults
count_nondefault, count_default = X_y_train['loan_status'].value_counts()
# Separate non-defaults and defaults
nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults = X_y_train[X_y_train['loan_status'] == 1]
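
The sampling step itself would follow from here; a minimal sketch of how the undersampling might be completed (nondefaults_under and X_y_train_under are illustrative names, not from the slides):

# Randomly sample non-defaults down to the number of defaults
nondefaults_under = nondefaults.sample(count_default)
# Recombine into a balanced training set
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop = True),
                             defaults.reset_index(drop = True)], axis = 0)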
