Real-world machine learning project focused on loan default prediction, model benchmarking, explainability, and cost-sensitive decision making using large-scale lending data.
Built as a research-style pipeline aligned with government R&D and industry data science workflows.
Open and run instantly in browser: https://colab.research.google.com/drive/1mMedWu2dLTOzfZq0OKNsfTpcge6vBF5W?usp=sharing
    credit-risk-ml-benchmark/
    │
    │   Credit Risk Prediction Report.pdf
    │   credit risk prediction.ipynb
    │   requirements.txt
    │
    └── images/
            corr.png
            Credit History vs Default.png
            DTI Vs Default.png
            Interest_Rate_vs_Default.png
            Loan Amount vs Default.png
            ROC Curve Comparision.png
            ROC LOG.png
            SHAP.png
            Target Default Value Counts.png
Goal: Predict probability of loan default using real lending data and evaluate models from both statistical and business perspectives.
Dataset:
- 270k+ loan records
- 73 engineered features
- Binary target: Default vs Fully Paid
Models compared:
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost
Results:
- Best ROC-AUC: 0.75 (XGBoost)
- Best business model (cost-sensitive): Logistic Regression
Key takeaway:
The statistically best model is not always the best business model.
Financial institutions must estimate the probability that a borrower will default.
Cost asymmetry:
- Missing a defaulter → very expensive
- Rejecting a safe borrower → less expensive
This project answers:
- Which model predicts default best?
- Which model minimizes financial risk?
- Which features drive predictions?
- How do we move from ML metrics → real decisions?
After cleaning and feature engineering:
- Rows: 270,385
- Features: 73
- Class distribution:
  - Fully Paid → 79.8%
  - Default → 20.2%
Realistic class imbalance → real industry scenario.
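One common way to handle this kind of imbalance during training is class reweighting. A minimal sketch with synthetic data standing in for the real lending features (the ~80/20 split mirrors the distribution above; the data itself is illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: ~20% positive labels (1 = Default), as in the dataset.
y = (rng.random(1000) < 0.202).astype(int)
X = rng.normal(size=(1000, 3))

# class_weight="balanced" reweights each class inversely to its frequency,
# so the minority Default class is not drowned out during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

print(f"Default rate in sample: {y.mean():.1%}")
```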
Key insights:
- Higher interest → higher risk
- Short credit history → higher default probability
- Higher DTI → increased risk
- Longer loan term → increased risk
| Model | ROC-AUC |
|---|---|
| XGBoost | 0.750 |
| Logistic Regression | 0.743 |
| Random Forest | 0.734 |
| Decision Tree | 0.729 |
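The benchmarking loop behind a table like this can be sketched as follows. This is a minimal version on synthetic imbalanced data, not the project's actual pipeline; XGBoost is omitted here to keep the sketch to scikit-learn only, but it plugs into the same loop:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset (~80/20) standing in for the lending data.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Rank models by ROC-AUC on held-out probabilities, as in the table above.
results = {}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    results[name] = roc_auc_score(y_te, proba)
    print(f"{name}: ROC-AUC = {results[name]:.3f}")
```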
Models were evaluated at a risk-aware decision threshold instead of the default 0.5.
| Model | Accuracy | Recall (Default) | Precision (Default) |
|---|---|---|---|
| Logistic Regression | 0.68 | 0.68 | 0.35 |
| Decision Tree | 0.76 | 0.45 | 0.41 |
| Random Forest | 0.77 | 0.44 | 0.43 |
| XGBoost | 0.77 | 0.46 | 0.44 |
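Mechanically, threshold tuning just means classifying on predicted probabilities with a cutoff other than 0.5. A toy sketch (the 0.3 threshold and the probabilities below are illustrative; the notebook tunes the threshold on validation data):

```python
import numpy as np

# Toy predicted default probabilities and true labels (1 = Default).
proba = np.array([0.10, 0.35, 0.60, 0.25, 0.80, 0.45])
y_true = np.array([0, 1, 1, 0, 1, 0])

# Lowering the threshold below 0.5 flags more borrowers as risky,
# trading precision for higher recall on the Default class.
threshold = 0.3
y_pred = (proba >= threshold).astype(int)

recall = (y_pred[y_true == 1] == 1).mean()
print(f"predictions = {y_pred.tolist()}, recall(Default) = {recall:.2f}")
```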
Assumption: Missing a defaulter costs 5× more than rejecting a safe borrower.
Result: Logistic Regression minimized total financial loss.
Industry lesson: Accuracy ≠ Best deployment model.
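Under the 5× cost assumption, model selection reduces to comparing total expected loss from the confusion matrix. A minimal sketch with toy labels (the cost ratio comes from the report; the predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions (1 = Default).
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Cost asymmetry from the report: a missed defaulter (FN) costs 5x
# a wrongly rejected safe borrower (FP).
FN_COST, FP_COST = 5, 1
total_cost = fn * FN_COST + fp * FP_COST
print(f"FN={fn}, FP={fp}, total cost={total_cost}")
```

The model with the lowest total cost wins, even if another model has higher accuracy or ROC-AUC.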
Top risk drivers:
- Loan grade
- Interest rate
- Loan term
- FICO score
- Credit history length
- Mortgage accounts
Interpretability is essential for:
- Financial regulation
- Risk audits
- Responsible AI
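The notebook uses SHAP for this analysis. As a lighter dependency-free stand-in, the same "which features drive predictions" question can be sketched with scikit-learn's permutation importance; the data is synthetic and the feature names below are hypothetical labels, not the project's actual columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature names echoing the risk drivers listed above.
feature_names = ["int_rate", "loan_grade", "term",
                 "fico", "credit_hist_yrs", "mort_acc"]

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the score drop:
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f}")
```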
- Python
- Pandas / NumPy
- Scikit-learn
- XGBoost
- Matplotlib / Seaborn
- SHAP
Clone the repo and install dependencies:

    pip install -r requirements.txt
    jupyter notebook

Open `credit risk prediction.ipynb` and run all cells to reproduce the results.
Most ML projects stop at accuracy.
This project goes further:
- Realistic dataset scale
- End-to-end ML pipeline
- Model benchmarking
- Threshold tuning
- Cost-sensitive evaluation
- Explainable AI
- Business-driven model selection
Designed to reflect real industry and government research workflows.