A collection of utility functions and tools for Natural Language Processing tasks.
This repository provides a set of reusable NLP utilities designed to simplify common text processing tasks. It serves as a personal toolkit for various NLP projects, offering modular, well-documented functions that can be easily integrated into larger applications.
- Text Preprocessing: Cleaning, tokenization, normalization
- Feature Extraction: TF-IDF, word embeddings, n-grams
- Text Analysis: Sentiment analysis, keyword extraction, text statistics
- Data Utilities: Data loading, format conversion, batch processing
- Visualization: Text visualization tools
```
NLP_utils/
├── README.md            # This file
├── nlp_utils/           # Main package
│   ├── __init__.py
│   ├── preprocessing.py # Text cleaning and preprocessing
│   ├── tokenization.py  # Tokenization utilities
│   ├── embeddings.py    # Word embedding tools
│   ├── features.py      # Feature extraction
│   ├── similarity.py    # Text similarity measures
│   └── visualization.py # Text visualization
├── tests/               # Unit tests
│   ├── __init__.py
│   └── test_*.py
├── examples/            # Usage examples
│   └── example_usage.py
├── docs/                # Documentation
├── requirements.txt     # Dependencies
└── setup.py             # Package setup
```
- Python 3.6+
- Dependencies: numpy, pandas, scikit-learn, nltk, spaCy
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd NLP_utils
  ```

- Install the package:

  ```bash
  pip install -e .
  ```

  Or install the dependencies manually:

  ```bash
  pip install -r requirements.txt
  ```

Text preprocessing:

```python
from nlp_utils.preprocessing import clean_text, normalize_text

# Clean text
text = "Your raw text here! Check out https://example.com"
cleaned = clean_text(text, remove_urls=True, remove_punctuation=True)

# Normalize text
normalized = normalize_text(cleaned, lowercase=True, lemmatize=True)
```

Tokenization:

```python
from nlp_utils.tokenization import word_tokenize, sentence_tokenize

words = word_tokenize("This is a sample sentence.")
sentences = sentence_tokenize("First sentence. Second sentence.")
```

Feature extraction:

```python
from nlp_utils.features import extract_tfidf, extract_ngrams

documents = ["First document.", "Second document."]

# TF-IDF vectorization
vectorizer, vectors = extract_tfidf(documents, max_features=1000)

# N-gram extraction
ngrams = extract_ngrams("a sample text", n=2)
```

Text similarity:

```python
from nlp_utils.similarity import cosine_similarity, jaccard_similarity

similarity = cosine_similarity("The cat sat on the mat", "A cat sat on a mat")
```

Functions for cleaning and normalizing text:
- `clean_text()`: Remove unwanted characters, URLs, emails
- `normalize_text()`: Lowercase, lemmatize, stem
- `remove_stopwords()`: Filter out common stopwords
- `fix_encoding()`: Handle encoding issues
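To illustrate the intent, here is a minimal regex-based sketch of a cleaner along the lines of `clean_text()` (hypothetical; the actual implementation in `preprocessing.py` may differ):

```python
import re
import string

def clean_text(text, remove_urls=False, remove_emails=False, remove_punctuation=False):
    """Illustrative cleaner: strips URLs, email addresses, and punctuation."""
    if remove_urls:
        text = re.sub(r"https?://\S+|www\.\S+", "", text)
    if remove_emails:
        text = re.sub(r"\S+@\S+\.\S+", "", text)
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse leftover whitespace

print(clean_text("Your raw text here! Check out https://example.com",
                 remove_urls=True, remove_punctuation=True))
# Your raw text here Check out
```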
Tokenization utilities:
- `word_tokenize()`: Split text into words
- `sentence_tokenize()`: Split text into sentences
- `character_tokenize()`: Character-level tokenization
- `subword_tokenize()`: BPE or WordPiece tokenization
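As a rough sketch, the two basic tokenizers could be approximated with regular expressions (a stand-in for the NLTK/spaCy-backed versions, which handle far more edge cases):

```python
import re

def word_tokenize(text):
    """Split into word tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize(text):
    """Naive sentence splitter on ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("This is a sample sentence."))
# ['This', 'is', 'a', 'sample', 'sentence', '.']
print(sentence_tokenize("First sentence. Second sentence."))
# ['First sentence.', 'Second sentence.']
```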
Word embedding tools:
- `load_glove()`: Load GloVe embeddings
- `load_word2vec()`: Load Word2Vec model
- `get_embeddings()`: Extract embeddings for text
- `compute_similarity_matrix()`: Word similarity matrix
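For reference, GloVe's plain-text format is one `word v1 v2 ...` line per word, so a pure-Python `load_glove()` could be sketched as below (illustrative only; a real implementation would likely return NumPy arrays):

```python
import os
import tempfile

def load_glove(path):
    """Parse GloVe's text format into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Demo against a tiny two-word file in the same format
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vectors = load_glove(f.name)
os.unlink(f.name)
print(vectors["hello"])
# [0.1, 0.2, 0.3]
```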
Feature extraction methods:
- `extract_tfidf()`: TF-IDF features
- `extract_ngrams()`: N-gram features
- `extract_pos_tags()`: Part-of-speech features
- `extract_named_entities()`: NER features
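N-gram extraction is simple enough to show in full; a word-level `extract_ngrams()` might look like this (a sketch, not necessarily the shipped implementation):

```python
def extract_ngrams(text, n=2):
    """Return word-level n-grams as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(extract_ngrams("natural language processing is fun", n=2))
# ['natural language', 'language processing', 'processing is', 'is fun']
```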
Text similarity measures:
- `cosine_similarity()`: Cosine similarity
- `jaccard_similarity()`: Jaccard similarity
- `levenshtein_distance()`: Edit distance
- `semantic_similarity()`: Embedding-based similarity
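Two of these measures have compact pure-Python definitions, sketched here for illustration (the package versions may be vectorized or token-aware):

```python
def jaccard_similarity(text1, text2):
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein_distance(s1, s2):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
print(levenshtein_distance("kitten", "sitting"))         # 3
```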
Run tests with pytest:
```bash
pytest tests/
```

Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-utility`)
- Commit your changes (`git commit -am 'Add new utility'`)
- Push to the branch (`git push origin feature/new-utility`)
- Create a Pull Request
- Add support for multilingual text processing
- Integrate transformer-based utilities
- Add benchmarking tools
- Create comprehensive documentation
- Add Jupyter notebook tutorials
Core dependencies:
- numpy
- pandas
- scikit-learn
- nltk
- spacy
- gensim
Optional dependencies:
- transformers (for BERT utilities)
- matplotlib (for visualization)
- jieba (for Chinese text processing)
[License information to be added]
For questions or suggestions, please open an issue on this repository.
- NLTK and spaCy communities for excellent NLP tools
- scikit-learn for machine learning utilities