This task required building and evaluating several machine learning models to predict credit risk, using the kind of data typically produced by peer-to-peer lending services.
Credit risk is an inherently imbalanced classification problem (the number of good loans is much larger than the number of at-risk loans), so I employed different techniques for training and evaluating models with imbalanced classes.
I used the imbalanced-learn and scikit-learn libraries to build and evaluate models with two techniques, each contained in its own code file: resampling and ensemble learning.
I first completed a simple logistic regression. I then generated the balanced accuracy score, confusion matrix, and classification report, as shown below:
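A minimal sketch of this step, assuming the loan data has already been split into `X_train`, `X_test`, `y_train`, and `y_test` (placeholder names, not necessarily those used in the notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.metrics import classification_report_imbalanced

# Fit a plain logistic regression on the (still imbalanced) training data
model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_train, y_train)

# Score the held-out test set
y_pred = model.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```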
In this section, I compared two oversampling algorithms to determine which one performs better.
These algorithms were:
- the naive random oversampling algorithm
- the SMOTE algorithm
I then generated the balanced accuracy score, confusion matrix and classification report for each.
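The oversampling step can be sketched as follows, under the same assumptions about pre-split training and test data; only the resampler differs between the two algorithms:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.metrics import classification_report_imbalanced

for sampler in (RandomOverSampler(random_state=1), SMOTE(random_state=1)):
    # Oversample the minority class in the training data only; the test set stays untouched
    X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

    model = LogisticRegression(solver="lbfgs", random_state=1)
    model.fit(X_resampled, y_resampled)

    y_pred = model.predict(X_test)
    print(type(sampler).__name__)
    print(balanced_accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report_imbalanced(y_test, y_pred))
```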
See naive random oversampling algorithm results below:
See SMOTE algorithm results below:
In this section, I tested an undersampling algorithm to determine whether it outperforms the oversampling algorithms above.
I undersampled the data using the Cluster Centroids algorithm.
I then generated the balanced accuracy score, confusion matrix and classification report.
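The undersampling step follows the same pattern as the oversampling sketch above; only the resampling lines change:

```python
from imblearn.under_sampling import ClusterCentroids

# Shrink the majority class by replacing it with synthetic cluster centroids
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
# The logistic regression fit and metric reporting are the same as in the earlier sketch
```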
See Cluster Centroids algorithm results below:
In this section, I tested a combined over- and undersampling algorithm to determine whether it outperforms the other sampling algorithms above.
I resampled the data using the SMOTEENN algorithm.
I then generated the balanced accuracy score, confusion matrix and classification report.
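Likewise, a sketch of the combination sampling step, which again only swaps the resampler:

```python
from imblearn.combine import SMOTEENN

# SMOTEENN oversamples with SMOTE, then cleans noisy points with Edited Nearest Neighbours
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
# Model fitting and evaluation proceed as in the earlier sketches
```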
See the SMOTEENN algorithm results below:
Model | Balanced Accuracy Score |
---|---|
Simple logistic regression | 0.9892813049736127 |
Naive random oversampling | 0.9946414201183431 |
SMOTE oversampling | 0.9946680739911509 |
Cluster Centroids undersampling | 0.9932813049736127 |
SMOTEENN combination sampling | 0.9946680739911509 |
SMOTE oversampling and combination sampling with SMOTEENN had the best balanced accuracy scores.
Model | High-Risk Recall | Low-Risk Recall | Average |
---|---|---|---|
Simple logistic regression | 0.98 | 0.99 | 0.99 |
Naive random oversampling | 1.00 | 0.99 | 0.99 |
SMOTE oversampling | 1.00 | 0.99 | 0.99 |
Cluster Centroids undersampling | 0.99 | 0.99 | 0.99 |
SMOTEENN combination sampling | 1.00 | 0.99 | 0.99 |
Naive random oversampling, SMOTE oversampling, and combination sampling with SMOTEENN had the best high-risk recall scores.
Model | High-Risk Geometric Mean | Low-Risk Geometric Mean | Average |
---|---|---|---|
Simple logistic regression | 0.99 | 0.99 | 0.99 |
Naive random oversampling | 0.99 | 0.99 | 0.99 |
SMOTE oversampling | 0.99 | 0.99 | 0.99 |
Cluster Centroids undersampling | 0.99 | 0.99 | 0.99 |
SMOTEENN combination sampling | 0.99 | 0.99 | 0.99 |
Every model had the same geometric mean score.
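For reference, the geometric mean reported in these tables can also be computed directly with imbalanced-learn; a sketch, assuming `y_test` and `y_pred` from any of the models above:

```python
from imblearn.metrics import geometric_mean_score

# Geometric mean of the per-class recalls (for binary labels, sqrt(sensitivity * specificity))
print(geometric_mean_score(y_test, y_pred))
```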
In this section, I trained and compared two different ensemble classifiers to predict loan risk, and evaluated each model.
These were:
- Balanced Random Forest Classifier
- Easy Ensemble Classifier
I generated the balanced accuracy score, confusion matrix and classification report for each.
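A sketch of the ensemble step, under the same data assumptions as earlier; both classifiers balance the classes internally, so no separate resampling step is needed:

```python
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.metrics import classification_report_imbalanced

for clf in (BalancedRandomForestClassifier(n_estimators=100, random_state=1),
            EasyEnsembleClassifier(n_estimators=100, random_state=1)):
    # Each ensemble resamples internally while fitting
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print(type(clf).__name__)
    print(balanced_accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report_imbalanced(y_test, y_pred))
```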
See the Balanced Random Forest Classifier results below:
See the Easy Ensemble Classifier results below:
Model | Balanced Accuracy Score |
---|---|
Balanced Random Forest Classifier | 0.795829959187949 |
Easy Ensemble Classifier | 0.9263912558266958 |
Easy Ensemble Classifier had the best balanced accuracy score.
Model | High-Risk Recall | Low-Risk Recall | Average |
---|---|---|---|
Balanced Random Forest Classifier | 0.71 | 0.88 | 0.88 |
Easy Ensemble Classifier | 0.91 | 0.94 | 0.94 |
Easy Ensemble Classifier had the best recall score.
Model | High-Risk Geometric Mean | Low-Risk Geometric Mean | Average |
---|---|---|---|
Balanced Random Forest Classifier | 0.79 | 0.79 | 0.79 |
Easy Ensemble Classifier | 0.93 | 0.93 | 0.93 |
Easy Ensemble Classifier had the best geometric mean score.
The top three features by importance were as follows (see the extraction sketch after this list):
- total_rec_prncp
- total_rec_int
- total_pymnt_in
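A minimal sketch of how the ranking can be produced, assuming a fitted Balanced Random Forest model named `brf_model` and a feature DataFrame named `X` (both names are placeholders):

```python
# Pair each feature name with its importance from the fitted forest,
# then sort from most to least important
importances = sorted(zip(brf_model.feature_importances_, X.columns), reverse=True)
for importance, feature in importances[:3]:
    print(f"{feature}: {importance:.4f}")
```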