Curated and processed Pashto datasets published by ZamAI Labs.
This repository includes custom cleaning, normalization, and consolidation work, plus documentation and attribution to original sources.
- Website: https://zamai.dev
- ZamAI Labs: https://github.com/ZamAI-ORG
- Repo: https://github.com/ZamAI-ORG/pashto-datasets
DATASETS/ contains dataset folders. Each dataset includes:
- SOURCE.md (where it came from)
- LICENSE.md (original license or terms, if applicable)
- raw/ (optional; small samples for testing when permitted)
- processed/ (normalized/cleaned outputs ready for training)
- notes.md (what ZamAI changed)
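The layout above can be checked mechanically. The following is a minimal sketch, not a script shipped with this repository: the function names `missing_entries` and `check_repository` are hypothetical, and it assumes that SOURCE.md, LICENSE.md, notes.md, and processed/ are present in every dataset folder while raw/ is optional.

```python
from pathlib import Path

# Assumption: these entries are expected in every dataset folder.
# raw/ is documented as optional, so it is not checked.
REQUIRED_FILES = ("SOURCE.md", "LICENSE.md", "notes.md")
REQUIRED_DIRS = ("processed",)


def missing_entries(dataset_dir: Path) -> list[str]:
    """Return the expected entries that are absent from one dataset folder."""
    missing = [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (dataset_dir / name).is_dir()]
    return missing


def check_repository(root: Path) -> dict[str, list[str]]:
    """Map each dataset folder under root/DATASETS to its missing entries."""
    datasets_root = root / "DATASETS"
    return {
        d.name: missing_entries(d)
        for d in sorted(datasets_root.iterdir())
        if d.is_dir()
    }


if __name__ == "__main__":
    # Run from the repository root.
    for name, missing in check_repository(Path(".")).items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

Run from the repository root, this prints one line per dataset folder, which may be useful as a quick check before adding a new dataset.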
All datasets in this repository originate from sources that allow redistribution. Each dataset folder includes source links and license/terms to preserve attribution and compliance.
Some datasets include small raw/ samples used for testing and validation. Processed datasets live under processed/.