Real-world machine learning project focused on loan default prediction, model benchmarking, explainability, and cost-sensitive decision making using large-scale lending data.
Built as a research-style pipeline aligned with government R&D and industry data science workflows.
Open and run instantly in browser: https://colab.research.google.com/drive/1mMedWu2dLTOzfZq0OKNsfTpcge6vBF5W?usp=sharing
    credit-risk-ml-benchmark/
    │
    │   Credit Risk Prediction Report.pdf
    │   credit risk prediction.ipynb
    │   requirements.txt
    │
    └── images/
            corr.png
            Credit History vs Default.png
            DTI Vs Default.png
            Interest_Rate_vs_Default.png
            Loan Amount vs Default.png
            ROC Curve Comparision.png
            ROC LOG.png
            SHAP.png
            Target Default Value Counts.png
Goal: Predict probability of loan default using real lending data and evaluate models from both statistical and business perspectives.
Dataset:
- 270k+ loan records
- 73 engineered features
- Binary target: Default vs Fully Paid
Models compared:
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost
Results:
- Best ROC-AUC: 0.75 (XGBoost)
- Best business model (cost-sensitive): Logistic Regression
Key takeaway:
The statistically best model is not always the best business model.
Financial institutions must estimate the probability that a borrower will default.
Cost asymmetry:
- Missing a defaulter → very expensive
- Rejecting a safe borrower → less expensive
This project answers:
- Which model predicts default best?
- Which model minimizes financial risk?
- Which features drive predictions?
- How do we move from ML metrics → real decisions?
After cleaning and feature engineering:
- Rows: 270,385
- Features: 73
- Class distribution:
  - Fully Paid → 79.8%
  - Default → 20.2%
Realistic class imbalance → real industry scenario.
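One common way to handle this kind of imbalance during training is class reweighting. A minimal sketch with synthetic data standing in for the real lending features (the ~80/20 split mirrors the distribution above; the data itself is illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: ~20% positive labels (1 = Default), as in the dataset.
y = (rng.random(1000) < 0.202).astype(int)
X = rng.normal(size=(1000, 3))

# class_weight="balanced" reweights each class inversely to its frequency,
# so the minority Default class is not drowned out during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

print(f"Default rate in sample: {y.mean():.1%}")
```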
Key insights:
- Higher interest → higher risk
- Short credit history → higher default probability
- Higher DTI → increased risk
- Longer loan term → increased risk
| Model | ROC-AUC |
|---|---|
| XGBoost | 0.750 |
| Logistic Regression | 0.743 |
| Random Forest | 0.734 |
| Decision Tree | 0.729 |
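The benchmarking loop behind a table like this can be sketched as follows. This is a minimal version on synthetic imbalanced data, not the project's actual pipeline; XGBoost is omitted here to keep the sketch to scikit-learn only, but it plugs into the same loop:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset (~80/20) standing in for the lending data.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Rank models by ROC-AUC on held-out probabilities, as in the table above.
results = {}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    results[name] = roc_auc_score(y_te, proba)
    print(f"{name}: ROC-AUC = {results[name]:.3f}")
```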
Models were evaluated at a risk-aware decision threshold instead of the default 0.5.
| Model | Accuracy | Recall (Default) | Precision (Default) |
|---|---|---|---|
| Logistic Regression | 0.68 | 0.68 | 0.35 |
| Decision Tree | 0.76 | 0.45 | 0.41 |
| Random Forest | 0.77 | 0.44 | 0.43 |
| XGBoost | 0.77 | 0.46 | 0.44 |
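Mechanically, threshold tuning just means classifying on predicted probabilities with a cutoff other than 0.5. A toy sketch (the 0.3 threshold and the probabilities below are illustrative; the notebook tunes the threshold on validation data):

```python
import numpy as np

# Toy predicted default probabilities and true labels (1 = Default).
proba = np.array([0.10, 0.35, 0.60, 0.25, 0.80, 0.45])
y_true = np.array([0, 1, 1, 0, 1, 0])

# Lowering the threshold below 0.5 flags more borrowers as risky,
# trading precision for higher recall on the Default class.
threshold = 0.3
y_pred = (proba >= threshold).astype(int)

recall = (y_pred[y_true == 1] == 1).mean()
print(f"predictions = {y_pred.tolist()}, recall(Default) = {recall:.2f}")
```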
Assumption: Missing a defaulter costs 5× more than rejecting a safe borrower.
Result: Logistic Regression minimized total financial loss.
Industry lesson: Accuracy ≠ Best deployment model.
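Under the 5× cost assumption, model selection reduces to comparing total expected loss from the confusion matrix. A minimal sketch with toy labels (the cost ratio comes from the report; the predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions (1 = Default).
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Cost asymmetry from the report: a missed defaulter (FN) costs 5x
# a wrongly rejected safe borrower (FP).
FN_COST, FP_COST = 5, 1
total_cost = fn * FN_COST + fp * FP_COST
print(f"FN={fn}, FP={fp}, total cost={total_cost}")
```

The model with the lowest total cost wins, even if another model has higher accuracy or ROC-AUC.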
Top risk drivers:
- Loan grade
- Interest rate
- Loan term
- FICO score
- Credit history length
- Mortgage accounts
Interpretability is essential for:
- Financial regulation
- Risk audits
- Responsible AI
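The notebook uses SHAP for this analysis. As a lighter dependency-free stand-in, the same "which features drive predictions" question can be sketched with scikit-learn's permutation importance; the data is synthetic and the feature names below are hypothetical labels, not the project's actual columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature names echoing the risk drivers listed above.
feature_names = ["int_rate", "loan_grade", "term",
                 "fico", "credit_hist_yrs", "mort_acc"]

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the score drop:
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f}")
```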
- Python
- Pandas / NumPy
- Scikit-learn
- XGBoost
- Matplotlib / Seaborn
- SHAP
Clone the repo and install dependencies:

    pip install -r requirements.txt
    jupyter notebook

Open `credit risk prediction.ipynb` and run all cells to reproduce the results.
Most ML projects stop at accuracy.
This project goes further:
- Realistic dataset scale
- End-to-end ML pipeline
- Model benchmarking
- Threshold tuning
- Cost-sensitive evaluation
- Explainable AI
- Business-driven model selection
Designed to reflect real industry and government research workflows.