ClusterCache is a lightweight semantic search system built on top of the 20 Newsgroups dataset.
It combines vector embeddings, fuzzy clustering, and a custom semantic cache to efficiently retrieve semantically relevant documents while avoiding redundant computations.
The system consists of four major components:
- Embedding & Vector Database
- Fuzzy Clustering
- Semantic Cache
- FastAPI Service
User Query
│
▼
Embedding Model
│
▼
Semantic Cache Check
│
├── Cache Hit → Return cached result
│
└── Cache Miss
        │
        ▼
        Vector Search (FAISS)
        │
        ▼
        Cluster Analysis
        │
        ▼
        Store Result in Cache
        │
        ▼
        Response
This system uses the 20 Newsgroups dataset, which contains approximately 20,000 documents across 20 topic categories.
Dataset link: https://archive.ics.uci.edu/dataset/113/twenty+newsgroups
Extract into: data/20_newsgroups/
The dataset is preprocessed by removing:
- Email headers
- Footers
- Quoted text
This improves the quality of semantic embeddings by reducing noise.
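The three removal steps above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the function names, the regular expressions, and the sample post are all assumptions, and real newsgroup posts are messier than these heuristics cover.

```python
import re

# Heuristics assumed for this sketch; real posts vary widely.
QUOTE_LINE = re.compile(r"^\s*>")                  # reply-quoted text
ATTRIBUTION = re.compile(r"(writes|wrote):\s*$")   # "X writes:" lead-in lines
SIGNATURE_MARK = "--"                              # conventional signature delimiter


def strip_headers(raw: str) -> str:
    """Drop everything up to and including the first blank line."""
    _, sep, body = raw.partition("\n\n")
    return body if sep else raw


def strip_footer(body: str) -> str:
    """Cut the text at the last line that is exactly the signature delimiter."""
    lines = body.split("\n")
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == SIGNATURE_MARK:
            return "\n".join(lines[:i])
    return body


def strip_quotes(body: str) -> str:
    """Remove quoted lines and the attribution lines that introduce them."""
    kept = [
        line
        for line in body.split("\n")
        if not QUOTE_LINE.match(line) and not ATTRIBUTION.search(line)
    ]
    return "\n".join(kept)


def clean_post(raw: str) -> str:
    """Apply header, footer, and quote removal in that order."""
    return strip_quotes(strip_footer(strip_headers(raw))).strip()


# Hypothetical sample post, invented for illustration.
raw_post = (
    "From: alice@example.com\n"
    "Subject: Re: Shuttle launch\n"
    "\n"
    "bob@example.com writes:\n"
    "> When is the next launch?\n"
    "The next launch window opens in March.\n"
    "--\n"
    "Alice | Example Corp\n"
)
cleaned = clean_post(raw_post)
print(cleaned)
```

Only the author's own sentence survives, which is the text the embedding model should see.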
| Component | Technology |
|---|---|
| Embeddings | Sentence Transformers (MiniLM) |
| Vector Database | FAISS |
| Clustering | Gaussian Mixture Model (Soft Clustering) |
| API Framework | FastAPI |
| Similarity | Cosine Similarity |
| Language | Python |
The model all-MiniLM-L6-v2 from Sentence Transformers was chosen because:
- Produces strong semantic embeddings
- Lightweight and fast
- Suitable for semantic search systems
- 384-dimensional embeddings provide a good balance of accuracy and speed
FAISS was selected for vector storage because:
- Efficient similarity search
- Optimized for large embedding datasets
- Widely used in production ML systems
The FAISS IndexFlatL2 index is used for simplicity and accuracy: it performs exact, brute-force search and needs no training step.
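The stack lists cosine similarity while the index ranks by L2 distance. The two agree when embeddings are scaled to unit length, because for unit vectors the squared L2 distance equals `2 - 2 * cosine`. The sketch below checks this in plain Python on made-up 3-dimensional vectors; whether ClusterCache normalizes its embeddings before indexing is an assumption here, not something the project states.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for 384-dimensional embeddings (illustrative only).
query = normalize([0.9, 0.1, 0.3])
unit_docs = {
    "doc_a": normalize([1.0, 0.0, 0.2]),
    "doc_b": normalize([0.1, 0.9, 0.4]),
    "doc_c": normalize([0.5, 0.5, 0.5]),
}

# Highest cosine first versus smallest distance first.
by_cosine = sorted(unit_docs, key=lambda n: cosine(query, unit_docs[n]), reverse=True)
by_l2 = sorted(unit_docs, key=lambda n: squared_l2(query, unit_docs[n]))

print(by_cosine)
print(by_l2)
```

Both orderings come out the same, so an L2 index over normalized embeddings returns the same neighbours a cosine ranking would.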
Instead of assigning each document to a single cluster, we use Gaussian Mixture Models (GMM) to generate probabilistic cluster memberships.
This enables:
- Documents belonging to multiple semantic topics
- Boundary analysis between clusters
- More realistic representation of topic overlap
Example:
Document: "Gun control legislation debate"
Cluster probabilities:
Politics → 0.52
Firearms → 0.37
Law → 0.11
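Membership probabilities like these are the posterior probability of each mixture component given the document. The sketch below computes them for a one-dimensional mixture with hand-picked parameters. The component names, weights, means, and standard deviations are invented for illustration; the real system would fit a mixture over the 384-dimensional embeddings rather than use fixed values.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def memberships(x, components):
    """Posterior probability of each component given the point x.

    components maps a name to (mixture weight, mean, standard deviation).
    """
    weighted = {
        name: weight * gaussian_pdf(x, mean, std)
        for name, (weight, mean, std) in components.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Hypothetical mixture parameters, chosen only to show overlapping clusters.
components = {
    "politics": (0.4, 0.0, 1.0),
    "firearms": (0.3, 2.0, 1.0),
    "law": (0.3, 4.0, 1.5),
}

probs = memberships(1.0, components)
dominant = max(probs, key=probs.get)

for name, p in probs.items():
    print(f"{name}: {p:.2f}")
print("dominant cluster:", dominant)
```

A point that sits between two component means receives substantial probability from both, which is what makes boundary analysis between clusters possible.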
Traditional exact-match caches miss when the same question is phrased differently.
Example:
Query 1: "How does the space shuttle work?"
Query 2: "Explain space shuttle technology"
A traditional cache keyed on the query string treats these as two different entries.
Our semantic cache:
- Embeds each query
- Computes cosine similarity with cached queries
- If similarity > threshold → cache hit
This allows the system to reuse previously computed results.
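The lookup steps above can be sketched as a small class. This is an illustrative outline under stated assumptions, not the contents of `semantic_cache.py`: the class layout, the default threshold of 0.85, the linear scan, and the toy 3-dimensional embeddings are all hypothetical. The returned field names mirror the response fields the API reports.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    threshold: float = 0.85                      # assumed value, tune per workload
    entries: list = field(default_factory=list)  # (query, embedding, result) tuples
    hit_count: int = 0
    miss_count: int = 0

    def lookup(self, embedding):
        """Return the most similar cached entry if it clears the threshold."""
        best, best_score = None, -1.0
        for query, cached_embedding, result in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best, best_score = (query, result), score
        if best is not None and best_score > self.threshold:
            self.hit_count += 1
            return {"cache_hit": True, "matched_query": best[0],
                    "similarity_score": best_score, "result": best[1]}
        self.miss_count += 1
        return {"cache_hit": False, "matched_query": None,
                "similarity_score": None, "result": None}

    def store(self, query, embedding, result):
        """Remember a freshly computed result."""
        self.entries.append((query, embedding, result))

    def stats(self):
        """Summarize cache usage so far."""
        lookups = self.hit_count + self.miss_count
        return {"total_entries": len(self.entries),
                "hit_count": self.hit_count,
                "miss_count": self.miss_count,
                "hit_rate": self.hit_count / lookups if lookups else 0.0}


cache = SemanticCache(threshold=0.85)

# Toy embeddings standing in for real model output.
first = cache.lookup([0.9, 0.1, 0.2])            # empty cache, so a miss
cache.store("How does the space shuttle work?", [0.9, 0.1, 0.2],
            ["document1", "document2"])
second = cache.lookup([0.85, 0.15, 0.25])        # close paraphrase
third = cache.lookup([0.0, 1.0, 0.1])            # unrelated query

print(first["cache_hit"], second["cache_hit"], third["cache_hit"])
print(cache.stats())
```

A linear scan is fine for a small cache; with many entries the cached query embeddings could themselves be held in a vector index.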
The cache similarity threshold determines whether two queries are considered equivalent.
Lower threshold:
- More cache hits
- Risk of irrelevant matches
Higher threshold:
- More accurate
- Fewer cache hits
This parameter therefore trades cache hit rate (efficiency) against the relevance of the results returned.
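To make the trade-off concrete, the sketch below counts how many lookups would be hits at several thresholds. The similarity scores are made up for illustration and are not measurements from ClusterCache.

```python
def count_hits(similarities, threshold):
    """Number of lookups whose best similarity exceeds the threshold."""
    return sum(1 for score in similarities if score > threshold)


# Hypothetical best-match similarities for six incoming queries.
observed = [0.97, 0.91, 0.84, 0.78, 0.62, 0.41]

hits_by_threshold = {t: count_hits(observed, t) for t in (0.6, 0.8, 0.9)}

for threshold, hits in hits_by_threshold.items():
    print(f"threshold={threshold}: {hits} of {len(observed)} lookups hit")
```

Raising the threshold from 0.6 to 0.9 removes the weaker matches first, which are also the ones most likely to be irrelevant.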
Submit a natural language query.
Example request:
POST /query
Body:
{
  "query": "space shuttle technology"
}
Example response:
{
  "query": "space shuttle technology",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": ["document1", "document2"],
  "dominant_cluster": 3
}
Returns cache statistics.
Example:
{
  "total_entries": 42,
  "hit_count": 17,
  "miss_count": 25,
  "hit_rate": 0.405
}
Clears the cache and resets statistics.
Clone the repository
git clone https://github.com/nethra0906/ClusterCache.git
cd ClusterCache
Create virtual environment
python -m venv venv
Activate
Windows:
venv\Scripts\activate
Mac/Linux:
source venv/bin/activate
Install dependencies
pip install -r requirements.txt
Start the FastAPI server
uvicorn main:app --reload
Server will run at
http://127.0.0.1:8000
API documentation available at
http://127.0.0.1:8000/docs
Try queries like:
"space shuttle technology"
"gun control debate"
"computer graphics rendering"
"baseball statistics"
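One way to send these queries is a small standard-library client. This sketch assumes the server is running at the local address shown earlier and that `POST /query` accepts the JSON body from the example request; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local development server address


def build_query_request(query, base_url=BASE_URL):
    """Build a POST /query request carrying the query as a JSON body."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(query, base_url=BASE_URL):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(query, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


def run_examples():
    """Send each sample query; requires the server to be running."""
    samples = [
        "space shuttle technology",
        "gun control debate",
        "computer graphics rendering",
        "baseball statistics",
    ]
    for sample in samples:
        reply = send_query(sample)
        print(sample, "-> cache_hit:", reply["cache_hit"])
```

Call `run_examples()` once the server is up. Sending a reworded version of an earlier query afterwards is a quick way to see whether the semantic cache reports a hit.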
semantic-search-system
│
├── api
│ └── service.py
│
├── cache
│ └── semantic_cache.py
│
├── clustering
│ └── fuzzy_cluster.py
│
├── embeddings
│ ├── generate_embeddings.py
│ └── vector_store.py
│
├── utils
│ ├── load_dataset.py
│ └── query_engine.py
│
├── data
│
├── main.py
├── req.txt
└── README.md
- Cluster-aware cache lookup
- HNSW indexing for faster vector retrieval
- Query intent detection
- Cluster visualization using t-SNE or UMAP
- Distributed vector storage
Nethra Krishnan
B.Tech Computer Science (Data Science)
VIT Vellore