sai-kumar-dev/Credit-Risk-ML-Benchmark

Credit Risk Prediction — End-to-End ML Benchmark

Real-world machine learning project focused on loan default prediction, model benchmarking, explainability, and cost-sensitive decision making using large-scale lending data.

Built as a research-style pipeline aligned with government R&D and industry data science workflows.


Run the Notebook (Colab)

Open and run instantly in browser: https://colab.research.google.com/drive/1mMedWu2dLTOzfZq0OKNsfTpcge6vBF5W?usp=sharing


Project Structure

credit-risk-ml-benchmark/
│
│  Credit Risk Prediction Report.pdf
│  credit risk prediction.ipynb
│  requirements.txt
│
└── images/
    corr.png
    Credit History vs Default.png
    DTI Vs Default.png
    Interest_Rate_vs_Default.png
    Loan Amount vs Default.png
    ROC Curve Comparision.png
    ROC LOG.png
    SHAP.png
    Target Default Value Counts.png

Executive Summary

Goal: Predict the probability of loan default from real lending data and evaluate the models from both a statistical and a business perspective.

Dataset:

  • 270k+ loan records
  • 73 engineered features
  • Binary target: Default vs Fully Paid

Models compared:

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • XGBoost

Best ROC-AUC: 0.75 (XGBoost)

Best Business Model (Cost-Sensitive): Logistic Regression

Key takeaway:

The statistically best model is not always the best business model.


Problem Statement

Financial institutions must estimate the probability that a borrower will default.

Cost asymmetry:

  • Missing a defaulter → very expensive
  • Rejecting a safe borrower → less expensive

This project answers:

  • Which model predicts default best?
  • Which model minimizes financial risk?
  • Which features drive predictions?
  • How do we move from ML metrics → real decisions?

Dataset Overview

After cleaning and feature engineering:

  • Rows: 270,385

  • Features: 73

  • Class distribution:

    • Fully Paid → 79.8%
    • Default → 20.2%

The roughly 4:1 class imbalance mirrors what lenders see in practice, so accuracy on its own says little about how well defaulters are caught.
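To make the effect of this imbalance concrete, here is a minimal sketch of a trivial baseline that labels every loan "Fully Paid". It only uses the row count and class shares quoted above; the class counts are approximations derived from the rounded percentages, not figures taken from the notebook.

```python
# Row count and default share as reported in this README (after cleaning).
N_ROWS = 270_385
DEFAULT_SHARE = 0.202

# Approximate class counts implied by the rounded percentages (an assumption).
n_default = round(N_ROWS * DEFAULT_SHARE)
n_paid = N_ROWS - n_default

# Trivial baseline: predict "Fully Paid" for every loan.
baseline_accuracy = n_paid / N_ROWS        # right on every fully paid loan
baseline_recall_default = 0 / n_default    # catches no defaulters at all

print(f"approx. defaults:          {n_default:,}")
print(f"baseline accuracy:         {baseline_accuracy:.3f}")
print(f"baseline recall (Default): {baseline_recall_default:.3f}")
```

A model has to beat roughly 0.80 accuracy before accuracy tells you anything, which is why the sections below lean on ROC-AUC, recall, and cost instead.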


Exploratory Data Analysis

Target Distribution

[Figure: images/Target Default Value Counts.png]

Interest Rate vs Default

[Figure: images/Interest_Rate_vs_Default.png]

Loan Amount vs Default

[Figure: images/Loan Amount vs Default.png]

Debt-to-Income vs Default

[Figure: images/DTI Vs Default.png]

Credit History vs Default

[Figure: images/Credit History vs Default.png]

Correlation Overview

[Figure: images/corr.png]

Key insights:

  • Higher interest → higher risk
  • Short credit history → higher default probability
  • Higher DTI → increased risk
  • Longer loan term → increased risk

Model Benchmarking

Model                 ROC-AUC
XGBoost               0.750
Logistic Regression   0.743
Random Forest         0.734
Decision Tree         0.729
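For readers new to the metric: ROC-AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen fully paid loan. The sketch below computes it that way in plain Python and ranks two models by it. The function name `roc_auc`, the model names, and the toy scores are all hypothetical; the notebook itself presumably relies on scikit-learn for this.

```python
from itertools import product

def roc_auc(y_true, scores):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = Default, 0 = Fully Paid.
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
model_scores = {
    "model_a": [0.10, 0.30, 0.80, 0.20, 0.60, 0.40, 0.35, 0.05],
    "model_b": [0.20, 0.50, 0.70, 0.10, 0.40, 0.60, 0.30, 0.15],
}

# Rank the models from best to worst ROC-AUC.
aucs = {name: roc_auc(y_true, s) for name, s in model_scores.items()}
for name, auc in sorted(aucs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ROC-AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for illustration; on 270k+ records a sort-based implementation is the practical choice.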

ROC Curve Comparison

[Figure: images/ROC Curve Comparision.png]


Threshold Tuning (Real-World Evaluation)

Models were evaluated at a risk-aware decision threshold instead of the default 0.5.

Model                 Accuracy   Recall (Default)   Precision (Default)
Logistic Regression   0.68       0.68               0.35
Decision Tree         0.76       0.45               0.41
Random Forest         0.77       0.44               0.43
XGBoost               0.77       0.46               0.44
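As a rough illustration of what moving the threshold does, the sketch below computes the three columns of the table at two cut-offs. The helper `metrics_at_threshold`, the toy labels and probabilities, and the 0.3 cut-off are assumptions made for the example; this README does not state which threshold the notebook actually used.

```python
def metrics_at_threshold(y_true, proba, threshold):
    """Accuracy, recall and precision for the Default class (label 1)."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, yp in pairs if y == 1 and yp == 1)
    fp = sum(1 for y, yp in pairs if y == 0 and yp == 1)
    fn = sum(1 for y, yp in pairs if y == 1 and yp == 0)
    tn = sum(1 for y, yp in pairs if y == 0 and yp == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical toy data: 1 = Default, 0 = Fully Paid (20% defaults).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45, 0.55, 0.40, 0.70]

# Compare the default cut-off with a lower, more risk-averse one.
for threshold in (0.5, 0.3):
    m = metrics_at_threshold(y_true, proba, threshold)
    print(threshold, {k: round(v, 2) for k, v in m.items()})
```

On this toy data, lowering the cut-off from 0.5 to 0.3 raises recall on defaulters while accuracy and precision fall, the same trade-off visible in the Logistic Regression row of the table.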

Cost-Sensitive Analysis

Assumption: Missing a defaulter costs 5× as much as rejecting a safe borrower.

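Result: Logistic Regression minimized total financial loss. Its recall on defaulters (0.68, versus 0.44 to 0.46 for the tree-based models) avoids enough expensive misses to outweigh its lower precision.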
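The result can be sanity-checked from the numbers already published in this README. The sketch below rebuilds approximate false-negative and false-positive counts per 1,000 loans from each model's reported recall and precision and the 20.2% default rate, then applies the 5:1 cost ratio. The helper `cost_per_1000` and the unit cost of 1 for a rejected safe borrower are assumptions, and because the inputs are rounded the totals are only indicative.

```python
DEFAULT_RATE = 0.202  # share of defaults reported in this README
COST_FN = 5.0         # missed defaulter, relative cost (README assumption)
COST_FP = 1.0         # rejected safe borrower, unit cost (assumed scale)

# (recall, precision) for the Default class, from the threshold-tuning table.
reported = {
    "Logistic Regression": (0.68, 0.35),
    "Decision Tree": (0.45, 0.41),
    "Random Forest": (0.44, 0.43),
    "XGBoost": (0.46, 0.44),
}

def cost_per_1000(recall, precision, default_rate=DEFAULT_RATE):
    """Approximate misclassification cost per 1,000 loans."""
    positives = 1000 * default_rate
    tp = recall * positives
    fn = positives - tp
    fp = tp * (1 / precision - 1)  # rearranged from precision = tp / (tp + fp)
    return COST_FN * fn + COST_FP * fp

costs = {name: cost_per_1000(r, p) for name, (r, p) in reported.items()}
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {cost:7.1f}")
```

Under these assumptions Logistic Regression comes out cheapest, in line with the stated result. As a side note, when predicted probabilities are well calibrated, a 5:1 cost ratio implies flagging a loan once its default probability exceeds 1 / (1 + 5) ≈ 0.17, which is one way to motivate a threshold well below 0.5.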

Industry lesson: The most accurate model is not necessarily the best one to deploy.


Explainable AI

Feature Importance (SHAP)

[Figure: images/SHAP.png]

Top risk drivers:

  • Loan grade
  • Interest rate
  • Loan term
  • FICO score
  • Credit history length
  • Mortgage accounts

Interpretability is essential for:

  • Financial regulation
  • Risk audits
  • Responsible AI

Tech Stack

  • Python
  • Pandas / NumPy
  • Scikit-learn
  • XGBoost
  • Matplotlib / Seaborn
  • SHAP

Run Locally

Clone the repository, install the dependencies, and start Jupyter:

pip install -r requirements.txt
jupyter notebook

Open:

credit risk prediction.ipynb

Run all cells to reproduce results.


Why This Project Matters

Most ML projects stop at accuracy.

This project goes further:

  • Realistic dataset scale
  • End-to-end ML pipeline
  • Model benchmarking
  • Threshold tuning
  • Cost-sensitive evaluation
  • Explainable AI
  • Business-driven model selection

Designed to reflect real industry and government research workflows.
