MyHateDetect is a dual-stage classification platform for detecting and categorising hate speech in English and Malay texts. Built with a focus on code-switched social media texts (tweets), the system uses multilingual BERT (mBERT) to ensure high accuracy across diverse linguistic contexts.
The binary classification model (Stage 1) and the underlying dataset used in this project have been formally published:
- Research Paper: A bilingual Malay-English social media dataset for binary hate speech detection (Published in Data in Brief, ScienceDirect).
- Official Dataset: A Bilingual Malay-English Social Media Dataset for Binary Hate Speech Detection (Hosted on Mendeley Data).
- Stage 1: Binary classification (hate vs non-hate) determines whether a tweet contains hate speech.
- Stage 2: Multi-label hate type classification identifies the specific nature of the hate speech (Race, Religion, Gender, Sexual Orientation).
The system is fine-tuned on 10,000 bilingual tweets. mBERT was selected for deployment due to its performance in both stages.
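A rough sketch of how the dual-stage flow fits together is shown below. This is illustrative only: the actual pipeline lives in app/stage_predict.py, and the predictor callables and 0.5 threshold are assumptions.

```python
# Conceptual sketch of the dual-stage pipeline; the real implementation is in
# app/stage_predict.py. Stage 1 and Stage 2 predictors are passed in as callables.
from typing import Callable, Dict, List, Sequence

HATE_TYPES = ["Race", "Religion", "Gender", "Sexual Orientation"]

def classify_tweet(
    tweet: str,
    stage1: Callable[[str], bool],              # binary: hate vs non-hate
    stage2: Callable[[str], Sequence[float]],   # per-type probabilities
    threshold: float = 0.5,                     # illustrative cut-off
) -> Dict[str, object]:
    """Stage 1 acts as a filter; Stage 2 runs only on tweets flagged as hate."""
    if not stage1(tweet):
        return {"hate": False, "types": []}
    scores = stage2(tweet)
    types: List[str] = [t for t, s in zip(HATE_TYPES, scores) if s >= threshold]
    return {"hate": True, "types": types}

# Toy usage with stand-in predictors (the real ones are the fine-tuned mBERT models):
print(classify_tweet("example tweet", stage1=lambda t: True, stage2=lambda t: [0.9, 0.7, 0.1, 0.0]))
```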
- Stage 1: Shows high True Positive counts, effectively filtering toxic content.
- Stage 2: Reveals how categories such as Race and Religion often overlap in toxic discourse.

- Dual-Stage Pipeline: Optimised detection architecture. Stage 1 acts as a filter (Hate vs. Non-Hate), while Stage 2 performs deep-dive categorisation.
- Multilingual BERT (mBERT) Integration: Specifically fine-tuned for high accuracy in both English and Malay, outperforming standard monolingual models.
- Role-Based Access Control (RBAC): Secure access for Admins (system management and CSV uploads) and Policymakers (read-only visualisation and trend analysis).
- Zero-Config NLP: Automated NLTK resource setup on first launch.
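A minimal sketch of what the zero-config NLTK setup could look like is shown below. The actual logic lives in app/text_utils.py; the specific resource names here are assumptions.

```python
# Sketch of automatic NLTK setup: download required resources on first launch
# only if they are missing. Resource names are illustrative; see app/text_utils.py.
import nltk

def ensure_nltk_resources() -> None:
    resources = {
        "stopwords": "corpora/stopwords",
        "punkt": "tokenizers/punkt",
    }
    for name, path in resources.items():
        try:
            nltk.data.find(path)            # already downloaded?
        except LookupError:
            nltk.download(name, quiet=True)

ensure_nltk_resources()
```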
```
MyHateDetect/
├── app/
│   ├── templates/         # HTML for dashboard, visualisation, auth
│   ├── routes/            # Flask Blueprints
│   ├── static/            # Logo used in UI
│   ├── stage_predict.py   # Final prediction script (Stage 1 + Stage 2)
│   ├── text_utils.py      # Preprocessing & automatic NLTK setup
│   └── utils.py           # Progress bar and database helpers
├── experiment/
│   ├── stage1/            # Binary classification: training notebooks, model weights & performance visuals
│   └── stage2/            # Multi-label classification: training notebooks, model weights & visuals
├── sample_uploads         # Sample dataset files for tweet uploads and user registration
├── slangdict              # Dictionary for normalising slang and toxic terms
├── sql query/
│   └── myhatedetect.sql   # MySQL database dump
├── requirements.txt       # Dependencies for the web application
├── run.py                 # Entry point for the Flask app
└── README.md              # Project documentation and setup guide
```
1. Clone the Repository

```bash
git clone https://github.com/JunTan03/FYP-MyHateDetect.git
cd FYP-MyHateDetect
```

2. Install Dependencies

```bash
pip install -r requirements.txt
```
3. Model Weights

Due to GitHub's file size limitations (100 MB), the fine-tuned BERT models (model.safetensors) are not included in this repository.

To run the prediction pipeline locally:
- Download the model weights from the [Google Drive folder](https://drive.google.com/drive/folders/11DBAdZg2rDUveGkMJe-EDdC8tza94gKo?usp=drive_link).
- Place the files into the following directories:
  - `experiment/stage1/s1_mb_model/`
  - `experiment/stage2/s2_mb_model/`
- Ensure `config.json` and `tokenizer_config.json` are also present in the same folders.
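Once the weights are in place, they can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, not the project's actual loading code (which lives in app/stage_predict.py); the assumption that label index 1 means "hate" is illustrative.

```python
# Minimal loading sketch for the downloaded Stage 1 weights (Stage 2 loads the same way
# from experiment/stage2/s2_mb_model). See app/stage_predict.py for the real pipeline.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

STAGE1_DIR = "experiment/stage1/s1_mb_model"

tokenizer = AutoTokenizer.from_pretrained(STAGE1_DIR)
model = AutoModelForSequenceClassification.from_pretrained(STAGE1_DIR)
model.eval()

inputs = tokenizer("contoh tweet untuk diuji", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("hate" if logits.argmax(dim=-1).item() == 1 else "non-hate")  # assumes label 1 = hate
```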
4. Database Setup
- Ensure MySQL is running.
- Create the database:

```bash
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS myhatedetect;"
```

- Populate the database using the provided SQL dump:

```bash
mysql -u root -p myhatedetect < "sql query/myhatedetect.sql"
```
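To confirm the import worked, the tables can be listed from Python. This is an optional sanity check, not part of the application; it assumes a MySQL driver such as PyMySQL is installed (the project may use a different driver) and the default root credentials shown above.

```python
# Optional sanity check: list the tables imported from myhatedetect.sql.
# Assumes PyMySQL (pip install pymysql); adjust credentials to your MySQL setup.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="", database="myhatedetect")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES;")
        for (table_name,) in cur.fetchall():
            print(table_name)
finally:
    conn.close()
```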
5. Run the Application

```bash
python run.py
```

Visit: http://localhost:5000
| Role | Email | Password |
|---|---|---|
| Admin | jtan4148@gmail.com | 12345678 |
| Policymaker | (Create via Admin) | (Create via Admin) |
- Column must be `text` or `tweet`
- Duplicate `file_name` and `month` will be skipped
- All inputs are cleaned and language-detected
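For illustration, the sketch below shows a quick local check that an upload file meets the column rule above. The file name and the pandas-based check are hypothetical conveniences, not part of the application itself.

```python
# Hypothetical pre-upload check for a tweets CSV (assumes pandas is installed).
# The upload rule above requires a single text column named "text" or "tweet".
import pandas as pd

df = pd.read_csv("sample_uploads/tweets_jan.csv")  # hypothetical file name

text_col = next((c for c in ("text", "tweet") if c in df.columns), None)
if text_col is None:
    raise ValueError("CSV must contain a 'text' or 'tweet' column")

print(f"{len(df)} rows found in column '{text_col}'")
```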
This project is licensed under the MIT License - see the LICENSE file for details.