Predicting corporate financial defaults using machine learning and analyzing Indian equity portfolio risk through statistical return and volatility modeling.
- Project Overview
- Business Problem
- Dataset
- Methodology
- Key Results
- Business Impact
- Skills
- Key Learnings
- Future Improvements
- Repository Structure
- Author
This project is a comprehensive, dual-part financial analytics solution designed to address two critical challenges in modern finance:
- Part A — Building a machine learning-powered Financial Health Assessment Tool to predict whether a company will default (negative net worth) in the following year, using balance sheet metrics spanning 4,256 companies and 51 financial features.
- Part B — Conducting a rigorous Market Risk Analysis on a portfolio of five Indian stocks over 8 years (418 weeks) to assess risk-return tradeoffs and support data-driven investment decisions.
Together, these analyses form an end-to-end framework for financial risk intelligence — from credit risk modeling to equity portfolio management.
👉 Open the notebook to explore the full analysis
In the modern financial landscape, businesses and investors urgently need reliable mechanisms to assess the creditworthiness of companies before making investment or lending decisions. Traditional manual assessments are slow, subjective, and prone to error.
A group of venture capitalists commissioned the development of a Financial Health Assessment Tool — an automated system powered by machine learning to:
- Identify companies at risk of defaulting on their financial obligations (net worth turning negative in the next fiscal year)
- Evaluate credit risk exposure through liquidity ratios, debt-to-equity ratios, and profitability metrics
- Enable proactive risk mitigation strategies before financial distress occurs
Stakeholders: Venture capitalists, credit analysts, institutional investors, risk management teams
Decision Impact: Informed go/no-go investment decisions, portfolio risk hedging, early warning systems for financial distress
Investors face market risk arising from asset price fluctuations driven by economic events, geopolitical developments, and shifting investor sentiment. Quantifying this risk using historical data is essential for building resilient portfolios.
The objective is to analyze a portfolio of five Indian stocks using statistical risk-return modeling to guide allocation decisions and optimize risk-adjusted returns.
Stakeholders: Individual investors, portfolio managers, financial advisors
Decision Impact: Portfolio allocation strategy, risk categorization of stocks, stop-loss implementation, rebalancing triggers
| Attribute | Details |
|---|---|
| Source | Balance sheet financial metrics of Indian companies |
| Records | 4,256 companies |
| Features | 51 columns (all numeric — float64 / int64) |
| Target Variable | Derived: default = 1 if Networth Next Year < 0, else 0 |
| Class Distribution | Non-Defaulters: 79% (3,352) · Defaulters: 21% (904) |
| Missing Values | 17,778 missing entries across multiple columns |
Key Features Include: Total Assets, Net Worth, Total Income, Profit After Tax (PAT), PBDITA, PBT, Borrowings, Debt-to-Equity Ratio, Current Ratio, Quick Ratio, Shareholders' Funds, Cumulative Retained Profits, TOL/TNW, EPS, Adjusted EPS, and 35+ additional financial indicators.
| Attribute | Details |
|---|---|
| Source | Weekly closing prices of five Indian stocks |
| Records | 418 weeks (~8 years) |
| Features | Date + 5 stock columns |
| Stocks Covered | ITC Limited, Bharti Airtel, Tata Motors, DLF Limited, Yes Bank |
| Missing Values | None |
| Data Type | 5 numerical (int64), 1 datetime (object → converted) |
The dataset was loaded and inspected for structure, data types, and completeness. The dataset contains 4,256 records across 51 columns — all numeric except the index. The target variable default was engineered from Networth_Next_Year: companies with negative net worth next year were labeled as defaulters (1), and the rest as non-defaulters (0), yielding a class imbalance of 79:21.
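The target engineering described above can be sketched in a few lines of pandas. This is a minimal illustration on made-up values, not the notebook's code; only the column name `Networth_Next_Year` and the labeling rule come from the text.

```python
import pandas as pd

# Toy values for illustration only; the real data has 4,256 rows.
df = pd.DataFrame({"Networth_Next_Year": [120.5, -3.2, 0.0, -45.0, 18.7]})

# default = 1 when next year's net worth is negative, else 0
df["default"] = (df["Networth_Next_Year"] < 0).astype(int)

# Share of each class, comparable to the 79:21 split reported for the full data
class_share = df["default"].value_counts(normalize=True)
print(class_share)
```

Note that a net worth of exactly zero is labeled as a non-defaulter under this rule, since the condition is a strict inequality.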
Univariate Analysis: Box plots were generated for all 51 numerical features, revealing significant right-skewness across most financial variables including Total Assets, Net Worth, Total Income, and Borrowings. Negative values were observed in profitability ratios (PBT%, PAT%, Cash Profit%), indicating loss-making companies. Histograms of key financial variables confirmed that most companies operate at lower income levels, while a small subset reports extreme values — potential outlier candidates.
Bivariate Analysis: A full correlation heatmap was constructed across all numerical variables, followed by a focused heatmap on key financial variables. Notable correlations identified:
- Net Worth & Total Assets: 0.89 — asset-rich companies maintain stronger net worth
- Total Income & Profit After Tax: 0.78 — higher revenue correlates with higher profitability
- Debt-to-Equity Ratio vs. Default: +0.42 — higher leverage increases default likelihood
- Net Worth vs. Default: −0.65 — strong negative indicator of financial instability
Outlier Treatment: The IQR method was applied to each column; values flagged as outliers were replaced with NaN rather than deleting rows, which preserves the row count and avoids data loss.
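A minimal sketch of IQR-based masking is shown here. The 1.5 × IQR fence, the helper name `mask_iqr_outliers`, and the toy columns are assumptions for illustration; the text does not state which multiplier the notebook used.

```python
import numpy as np
import pandas as pd

def mask_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with NaN, column by column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # where() keeps values that satisfy the condition and sets the rest to NaN
    return frame.where((frame >= lower) & (frame <= upper))

# Toy data: one extreme value in Total_Assets, none in Borrowings
toy = pd.DataFrame({
    "Total_Assets": [10.0, 12.0, 11.0, 13.0, 500.0],
    "Borrowings": [1.0, 2.0, 1.5, 2.5, 2.0],
})
masked = mask_iqr_outliers(toy)
print(masked)
```

The frame keeps its original shape, so the masked cells can be handled later by the same imputation step as the genuinely missing values.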
Missing Value Analysis: Columns with more than 30% missing values were identified and dropped: PE_on_BSE (67%), Investments (51%), Other_income (46%), Contingent_liabilities (42%), Deferred_tax_liability (42%), Income_from_financial_services (38%), and Change_in_stock (31%).
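The 30% rule can be expressed as a short filter on the per-column missing share. The sketch below uses a toy frame with invented values; only the threshold and the `PE_on_BSE` column name are taken from the text.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PE_on_BSE": [np.nan, np.nan, 12.0, np.nan, np.nan],  # 80% missing
    "Net_worth": [5.0, np.nan, 7.0, 8.0, 9.0],             # 20% missing
    "Sales": [1.0, 2.0, 3.0, 4.0, 5.0],                    # complete
})

# Fraction of missing entries per column
missing_share = toy.isna().mean()

# Drop every column whose missing share exceeds 30%
to_drop = missing_share[missing_share > 0.30].index.tolist()
reduced = toy.drop(columns=to_drop)
print(to_drop)
```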
Missing Value Imputation: Remaining missing values were imputed using K-Nearest Neighbour (KNN) Imputation to preserve the statistical relationships between features — a critical decision since dropping rows would have eliminated over 65% of actual defaulters.
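As an illustration of the imputation step, the sketch below uses scikit-learn's `KNNImputer` on a toy frame. The choice of `n_neighbors=2` and the data are assumptions; the text does not say which neighbour count the notebook used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with one missing Net_worth value (row 2)
toy = pd.DataFrame({
    "Net_worth": [10.0, 11.0, np.nan, 50.0, 52.0],
    "Total_assets": [20.0, 22.0, 21.0, 100.0, 104.0],
})

# Each missing cell is filled with the mean of that feature over the
# nearest rows, where distance is measured on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(imputed)
```

Row 2 is closest to rows 0 and 1 on `Total_assets`, so its `Net_worth` is filled from those two small companies rather than from the overall column mean.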
Train-Test Split & Scaling: Data was split 70:30 (train:test). Features were standardized using StandardScaler (zero mean, unit variance) to prepare for logistic regression modeling.
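A minimal sketch of the split-and-scale step follows. The synthetic data, the `stratify=y` option, and fitting the scaler on the training portion only are assumptions that reflect common practice; the text confirms only the 70:30 ratio and the use of `StandardScaler`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 rows, 3 features, 21% positives
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.array([1] * 21 + [0] * 79)

# 70:30 split; stratify keeps the class ratio similar in both parts (assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply it to both parts,
# so that no information from the test set leaks into the scaling.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```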
Two classification models were built with class-weighting to handle the target imbalance:
- Logistic Regression (via `statsmodels` `Logit`): interpretable baseline model
- Random Forest Classifier: ensemble model for capturing non-linear patterns
Multicollinearity Treatment (VIF): Variance Inflation Factors were computed; features with VIF > 5 were identified and iteratively removed to produce reliable coefficients. Highly collinear features included Total Assets (VIF: ∞), Total Liabilities (VIF: ∞), Sales (92.34), Total Income (89.03), and Total Expenses (52.13).
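The iterative removal can be sketched with plain NumPy, using the definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. The helper names `vif` and `drop_high_vif` and the synthetic data are hypothetical; the notebook may instead rely on `variance_inflation_factor` from `statsmodels.stats.outliers_influence`.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        # Perfect collinearity gives R^2 = 1, reported as an infinite VIF
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def drop_high_vif(X, names, threshold=5.0):
    """Remove the highest-VIF feature repeatedly until all VIFs <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Synthetic example: Total_income is almost a copy of Sales
rng = np.random.default_rng(0)
sales = rng.normal(100.0, 20.0, size=200)
total_income = sales + rng.normal(0.0, 1.0, size=200)
debt_equity = rng.normal(1.5, 0.5, size=200)
X = np.column_stack([sales, total_income, debt_equity])

X_kept, kept = drop_high_vif(X, ["Sales", "Total_income", "Debt_equity"])
print(kept)
```

Removing one feature at a time matters because VIFs are recomputed after each removal: dropping one of two near-duplicate columns usually brings the other back below the threshold.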
ROC Curve & Threshold Optimization: The ROC curve was plotted to determine the optimal classification threshold for the logistic regression model. The resulting AUC of 0.59 confirmed that logistic regression had weak discriminatory ability — only marginally better than random classification.
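One common way to pick a threshold from the ROC curve is to maximize Youden's J statistic (TPR − FPR). The text does not say which criterion the notebook used, so treat the sketch below, with its invented labels and scores, as one plausible approach.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented labels and predicted probabilities for illustration
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.15, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

# Youden's J: the threshold where TPR exceeds FPR by the widest margin
best = int(np.argmax(tpr - fpr))
best_threshold = thresholds[best]

# Classify using the selected threshold instead of the default 0.5
y_pred = (y_prob >= best_threshold).astype(int)
print(round(auc, 3), best_threshold)
```

On an imbalanced target, the selected threshold often sits well below 0.5, which trades some precision for higher recall on the minority class.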
Hyperparameter Tuning (RandomizedSearchCV): The Random Forest model was tuned using RandomizedSearchCV, improving test recall from 5% (baseline) to 30% (tuned) while reducing overfitting.
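A small sketch of a randomized search over Random Forest hyperparameters is given below. The search space, `scoring="recall"`, `class_weight="balanced"`, and the synthetic data are assumptions; the text does not list the grid or the scoring metric that was actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data (about 79:21) standing in for the real features
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.79, 0.21], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical search space; shallower trees and larger leaves limit overfitting
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    scoring="recall",  # optimize for catching defaulters
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_
print(search.best_params_)
```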
Models were evaluated using Accuracy, Recall, Precision, F1-Score, and AUC-ROC on both training and test sets.
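For reference, the threshold-based metrics can be computed with scikit-learn as sketched here. The labels and predictions are invented to show how accuracy can look acceptable while recall stays low.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented example: 4 actual defaulters, only 1 is caught, plus 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(accuracy, recall, precision, f1)
```

Running the same calls on both the training and test predictions gives the train-test gap used to judge overfitting.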
Weekly stock price data for five Indian companies was loaded covering 418 weeks. The Date column was converted from object type to proper datetime format. No missing or duplicate values were found.
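The loading checks can be sketched as follows. The inline CSV, its column names, its prices, and the `%d-%m-%Y` date format are all made up for illustration; the real file is `Market_Risk_Data.csv`, and its exact layout is not shown in this document.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real price file
csv_text = """Date,Stock_A,Stock_B
28-03-2016,217,746
04-04-2016,218,763
11-04-2016,215,772
"""
prices = pd.read_csv(io.StringIO(csv_text))

# Date is read as object (string); convert it to a proper datetime column
prices["Date"] = pd.to_datetime(prices["Date"], format="%d-%m-%Y")

# Completeness checks: missing cells and fully duplicated rows
n_missing = int(prices.isna().sum().sum())
n_duplicates = int(prices.duplicated().sum())
print(prices.dtypes, n_missing, n_duplicates)
```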
Descriptive statistics were computed for each stock. Bharti Airtel showed the highest average price (₹528.26), while Yes Bank had the lowest median price (₹30) — indicating a significant long-term price decline. Bharti Airtel and Tata Motors exhibited the highest standard deviations, while ITC Limited and Bharti Airtel were comparatively stable.
A time-series line chart was plotted for all five stocks from 2016 to 2024. Key observations: Bharti Airtel and Tata Motors showed strong upward trends post-2020; Yes Bank exhibited a dramatic collapse and sustained low prices; Tata Motors and DLF Limited showed recovery trajectories post-COVID dip.
Weekly percentage returns were computed for each stock using .pct_change(). Mean returns and standard deviations were calculated and tabulated:
- Highest mean returns: DLF Limited (0.004863), followed by Bharti Airtel (0.003271)
- Negative mean return: Yes Bank (−0.004737)
- Highest volatility: Yes Bank (std: 0.093879)
- Most stable: ITC Limited (std: 0.035904)
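
The return calculation behind these figures can be sketched with `.pct_change()` on a toy price table. The stock names and prices here are placeholders, not the project's data.

```python
import pandas as pd

# Placeholder weekly closing prices
prices = pd.DataFrame({
    "Stock_A": [100.0, 110.0, 99.0, 108.9],
    "Stock_B": [50.0, 50.0, 50.0, 50.0],
})

# Weekly simple return: (P_t - P_{t-1}) / P_{t-1}; the first row has no prior price
returns = prices.pct_change().dropna()

# Mean return and volatility (sample standard deviation) per stock
summary = pd.DataFrame({
    "mean_return": returns.mean(),
    "volatility": returns.std(),
})
print(summary)
```

`Stock_A` alternates between +10% and −10% weeks, so it shows a small positive mean with high volatility, while the flat `Stock_B` has zero for both.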
A Net Return vs. Volatility scatter plot was constructed to visually map each stock's risk-return position, clearly separating defensive stocks from aggressive and underperforming ones.
| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Logistic Regression (Baseline) | 0.79 | 0.01 | 0.60 | 0.02 |
| Tuned Logistic Regression | 0.79 | 0.01 | 0.67 | 0.01 |
| Random Forest (Baseline) | 0.67 | 0.05 | 0.08 | 0.06 |
| Tuned Random Forest ✅ | 0.72 | 0.30 | 0.32 | 0.31 |
Final Model Selected: Tuned Random Forest — Best balance of recall (30%) and precision (32%) on the test set with minimal overfitting.
Key Insight: The dataset's class imbalance (79:21) made recall the critical performance metric — missing an actual defaulter is far costlier than a false alarm in financial risk contexts.
| Stock | Mean Weekly Return | Std Deviation | Investor Profile |
|---|---|---|---|
| ITC Limited | 0.001634 | 0.035904 | Risk-Averse |
| Bharti Airtel | 0.003271 | 0.038728 | Risk-Averse / Moderate |
| Tata Motors | 0.002234 | 0.060484 | Growth-Oriented |
| DLF Limited | 0.004863 | 0.057785 | Growth-Oriented |
| Yes Bank | −0.004737 | 0.093879 | High-Risk / Speculative |
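
As a worked example on the figures in the table above, the sketch below computes a simple return-per-unit-of-risk ratio and annualized equivalents. The ratio assumes a zero risk-free rate, and the annualization (×52 for the mean, ×√52 for the standard deviation) assumes independent weekly returns; neither calculation is stated to be part of the original notebook.

```python
import numpy as np
import pandas as pd

# Mean weekly return and standard deviation, copied from the table above
stats = pd.DataFrame(
    {
        "mean_weekly": [0.001634, 0.003271, 0.002234, 0.004863, -0.004737],
        "std_weekly": [0.035904, 0.038728, 0.060484, 0.057785, 0.093879],
    },
    index=["ITC Limited", "Bharti Airtel", "Tata Motors", "DLF Limited", "Yes Bank"],
)

# Return earned per unit of volatility (Sharpe-like, risk-free rate taken as 0)
stats["return_per_risk"] = stats["mean_weekly"] / stats["std_weekly"]

# Rough annualization under the independence assumption
stats["annual_return"] = stats["mean_weekly"] * 52
stats["annual_vol"] = stats["std_weekly"] * np.sqrt(52)

ranked = stats.sort_values("return_per_risk", ascending=False)
print(ranked[["return_per_risk", "annual_return", "annual_vol"]].round(4))
```

On this measure Bharti Airtel and DLF Limited come out nearly level at the top, and Yes Bank is the only stock with a negative ratio.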
- Automate Credit Screening: Deploy the Tuned Random Forest model as an automated early-warning system to flag high-risk companies before lending or investment decisions are made.
- Prioritize High-Risk Features: Focus credit review processes on companies with high debt-to-equity ratios, negative PAT, and declining net worth — the strongest predictors of default.
- Dynamic Monitoring: Implement quarterly re-scoring of portfolio companies using updated financial statements to track drift toward distress.
- Capital Allocation Optimization: Use default probability scores to weight portfolio positions — reducing exposure to companies with high predicted default likelihood.
- Debt Restructuring Triggers: Use model outputs to proactively initiate debt renegotiation conversations with at-risk borrowers before default occurs.
- Stable Core Holdings: Allocate the majority of conservative portfolios to ITC Limited and Bharti Airtel — lower volatility, consistent positive returns.
- Growth Allocation: Include DLF Limited and Tata Motors for return-seeking investors with appropriate risk tolerance and a long investment horizon.
- Avoid or Limit Yes Bank Exposure: Negative mean return and extreme volatility make Yes Bank unsuitable for risk-averse strategies; treat only as a speculative, minimal position.
- Stop-Loss Implementation: Apply stop-loss rules for high-volatility holdings (Yes Bank, Tata Motors) to cap downside risk.
- Periodic Rebalancing: Reassess and rebalance portfolio allocations quarterly using updated return and volatility metrics.
- Class imbalance is a silent model killer — in financial default prediction, accuracy alone is deeply misleading; recall must be the primary optimization target to avoid missing actual defaulters
- KNN Imputation over row-dropping — naive row deletion would have eliminated 65%+ of defaulters; KNN imputation preserved data integrity while handling missing values responsibly
- VIF-based feature selection matters beyond just p-values: multicollinearity inflates coefficient standard errors, and with them p-values, which masks the true significance of predictors in logistic regression; iterative VIF removal was essential for reliable inference
- Hyperparameter tuning reduces overfitting — the baseline Random Forest had a 56-point recall gap between train (61%) and test (5%); tuning reduced this to a 17-point gap, dramatically improving generalization
- Risk and return must always be evaluated together — a stock's return in isolation is meaningless; the Sharpe-like analysis combining mean returns and standard deviation revealed that Yes Bank's negative returns with extreme volatility made it categorically different from all other portfolio assets
- Address Class Imbalance with SMOTE / ADASYN: Apply oversampling techniques on the minority class (defaulters) during training to further improve recall, while checking that the gain does not come at the cost of precision
- Ensemble Stacking — Combine Logistic Regression, Random Forest, and Gradient Boosting (XGBoost/LightGBM) in a stacked ensemble to capture both linear and non-linear patterns for improved default prediction
- Time-Series Credit Scoring — Incorporate multi-year financial data per company to model trend-based deterioration rather than a single-year snapshot
- Value at Risk (VaR) Modeling — Extend the market risk analysis with Monte Carlo simulation and Historical VaR to provide confidence-interval-based loss estimates for the equity portfolio
- Interactive Dashboard Deployment — Build a Streamlit or Dash web application to allow investors and analysts to dynamically input financial metrics and receive real-time default probability scores and portfolio risk assessments
finance-retail-analytics-using-python/
│
├── data/
│ ├── Comp_Fin_Data.csv # Corporate financial dataset (Part A)
│ └── Market_Risk_Data.csv # Stock price dataset (Part B)
│
├── notebook/
│ └── finance-retail-analytics-using-python.ipynb # Analysis notebook
│
├── requirements.txt # Project dependencies
├── README.md # Project documentation
├── LICENSE # License file
└── .gitignore # Git ignore file
Nabankur Ray
Passionate about real-world data-driven solutions
⭐ If you like this project — give it a ⭐ on GitHub — it helps a lot!