This project applied six machine learning models to credit risk, an inherently imbalanced classification problem that requires distinguishing a large number of good loans from a much smaller number of risky loans. The data is drawn from LendingClub, a peer-to-peer lending company, and shows the class imbalance typical of loan data. Although the imbalance cannot be overcome completely, each algorithm offers some benefit through its own way of resampling the data.
- Naive Random Oversampling: This algorithm produced a balanced accuracy score of 66%, with precision of 99% and recall of 60%.
- SMOTE Oversampling: This option produced results similar to Naive Random Oversampling, with a balanced accuracy score of 66%, precision of 99% and recall of 69%.
- ClusterCentroids Undersampling: This method again produced a comparable balanced accuracy score of 66% and precision of 99%, but a much lower recall of 40%.
- SMOTEENN Combination Sampling: This method produced a lower balanced accuracy score of 54%, with precision of 99% and mediocre recall of 58%.
- Balanced Random Forest Classifier: This algorithm saw a higher balanced accuracy score of 79%, with precision of 99% and high recall of 87%.
- Easy Ensemble AdaBoost Classifier: This approach had the best overall results, with the highest balanced accuracy score of 93%, precision of 99%, and the highest recall of 94%.
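
To make the idea of resampling concrete, the first approach in the list, naive random oversampling, can be sketched in plain Python. This is a from-scratch illustration and not the library code the project ran; the function name `random_oversample`, the toy feature rows, and the `low_risk` / `high_risk` labels are all assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows of the original data that belong to this class.
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 low-risk loans and 2 high-risk loans.
X = [[i] for i in range(10)]
y = ["low_risk"] * 8 + ["high_risk"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

Resampling of this kind is normally applied to the training split only, so that the test split keeps its original imbalance and the reported metrics reflect realistic conditions.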
All six algorithms were hampered by the severe imbalance between the number of good loans and the number of risky ones. This shows in the very low precision for the high-risk class across all models, which the 99% precision figures above (presumably averaged over both classes) do not reveal. Among the resampling methods, Cluster Centroids had the lowest recall (40%) and SMOTEENN the lowest balanced accuracy (54%); the two ensemble approaches scored best. Given this dataset, the Easy Ensemble AdaBoost Classifier is the clear recommendation, as it had the highest balanced accuracy (93%) and recall (94%).
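
The arithmetic below is a small sketch of why precision for the high-risk class can be very low under severe imbalance, even when recall and class-weighted precision look strong. The confusion-matrix counts are hypothetical and are not taken from the LendingClub results; the helper name `high_risk_metrics` is likewise an assumption, as is the reading that the 99% figures are class-weighted averages.

```python
def high_risk_metrics(tp, fn, fp, tn):
    """Metrics with high-risk loans as the positive class."""
    recall = tp / (tp + fn)            # share of risky loans that were caught
    specificity = tn / (tn + fp)       # share of good loans that were cleared
    precision = tp / (tp + fp)         # share of flagged loans that were risky
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, balanced_accuracy


# Hypothetical test set: 10,000 good loans and 50 risky loans,
# with 90% of each class classified correctly.
tp, fn = 45, 5
tn, fp = 9000, 1000

precision, recall, bal_acc = high_risk_metrics(tp, fn, fp, tn)

# Precision for the low-risk class, and the support-weighted average
# of the two per-class precisions.
low_risk_precision = tn / (tn + fn)
n_low, n_high = tn + fp, tp + fn
weighted_precision = (
    n_low * low_risk_precision + n_high * precision
) / (n_low + n_high)

print(f"high-risk precision: {precision:.3f}")
print(f"high-risk recall:    {recall:.3f}")
print(f"balanced accuracy:   {bal_acc:.3f}")
print(f"weighted precision:  {weighted_precision:.3f}")
```

With these counts, the 1,000 misclassified good loans far outnumber the 45 correctly flagged risky ones, so most flagged loans are false alarms, while the weighted average stays high because the low-risk class dominates it.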