ClusterCache

Semantic Search System with Fuzzy Clustering and Intelligent Query Caching

ClusterCache is a lightweight semantic search system built on top of the 20 Newsgroups dataset.
It combines vector embeddings, fuzzy clustering, and a custom semantic cache to efficiently retrieve semantically relevant documents while avoiding redundant computations.

System Architecture

The system consists of four major components:

Embedding & Vector Database
Fuzzy Clustering
Semantic Cache
FastAPI Service

User Query
   │
   ▼
Embedding Model
   │
   ▼
Semantic Cache Check
   │
   ├── Cache Hit → Return cached result
   │
   └── Cache Miss
           │
           ▼
    Vector Search (FAISS)
           │
           ▼
    Cluster Analysis
           │
           ▼
    Store Result in Cache
           │
           ▼
         Response
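The flow above can be sketched as one function. This is a minimal illustration rather than the project's code: `embed`, `cache_lookup`, `cache_store`, `search`, and `cluster_of` are hypothetical callables standing in for the embedding model, the semantic cache, the FAISS lookup, and the cluster analysis. The response keys mirror the POST /query example in this README.

```python
def answer_query(query, embed, cache_lookup, cache_store, search, cluster_of):
    """Embed the query, try the cache, and fall back to vector search on a miss."""
    vector = embed(query)

    # Semantic cache check: the (hypothetical) lookup returns the cached
    # response fields for a sufficiently similar query, or None.
    hit = cache_lookup(vector)
    if hit is not None:
        return {"query": query, "cache_hit": True, **hit}

    # Cache miss: run the expensive steps, then remember the outcome.
    response = {
        "result": search(vector),
        "dominant_cluster": cluster_of(vector),
    }
    cache_store(query, vector, response)
    return {
        "query": query,
        "cache_hit": False,
        "matched_query": None,
        "similarity_score": None,
        **response,
    }
```

Passing the components in as arguments keeps the control flow visible and makes each stage easy to replace with a stub in tests.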

Dataset

This system uses the 20 Newsgroups dataset, which contains approximately 20,000 documents across 20 topic categories.

Dataset link: https://archive.ics.uci.edu/dataset/113/twenty+newsgroups

Extract into: data/20_newsgroups/

The dataset is preprocessed by removing:

Email headers
Footers
Quoted text

This improves the quality of semantic embeddings by reducing noise.
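The cleaning step can be approximated with a few heuristics. The sketch below is an assumption about how such preprocessing might look, not the project's actual rules: it treats everything before the first blank line as the header block, a line containing only `--` as the start of a signature, and lines starting with `>` or `|` (or ending in "writes:" / "wrote:") as quoted text. The function name `clean_post` is hypothetical.

```python
import re

# Heuristic patterns for quoted lines and quote attributions (not exhaustive).
_QUOTE_RE = re.compile(r"^\s*(>|\|)|^In article\b|(writes|wrote):\s*$")


def clean_post(raw):
    """Strip the header block, a trailing signature, and quoted lines from one post.

    Assumes the raw post begins with a header block followed by a blank line.
    """
    # Headers end at the first blank line.
    _, separator, body = raw.partition("\n\n")
    if not separator:
        body = raw  # no blank line at all: keep the whole text

    lines = body.splitlines()

    # By convention a line holding only "--" introduces the signature.
    for i, line in enumerate(lines):
        if line.rstrip() == "--":
            lines = lines[:i]
            break

    kept = [line for line in lines if not _QUOTE_RE.search(line)]
    return "\n".join(kept).strip()
```

For example, a post with `From:` and `Subject:` headers, one quoted line, one line of new text, and a signature is reduced to just the line of new text.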

Technologies Used

Component	Technology
Embeddings	Sentence Transformers (MiniLM)
Vector Database	FAISS
Clustering	Gaussian Mixture Model (Soft Clustering)
API Framework	FastAPI
Similarity	Cosine Similarity
Language	Python

Design Decisions

Embedding Model

The model all-MiniLM-L6-v2 from Sentence Transformers was chosen because:

Produces strong semantic embeddings
Lightweight and fast
Suitable for semantic search systems
384-dimensional embeddings provide a good balance of accuracy and speed

Vector Database

FAISS was selected for vector storage because:

Efficient similarity search
Optimized for large embedding datasets
Widely used in production ML systems

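The FAISS IndexFlatL2 index is used for simplicity and accuracy: it performs an exact, exhaustive search by L2 distance and needs no training step. On unit-normalized embeddings, ranking by L2 distance is equivalent to ranking by cosine similarity.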
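The technology table lists cosine similarity while the index ranks by L2 distance. For unit-length vectors the two agree, because the squared L2 distance equals 2 - 2 * cosine. The pure-Python sketch below imitates what a flat index does (compare the query against every stored vector) and checks that identity on made-up 3-dimensional vectors. Whether the project normalizes its embeddings before indexing is an assumption here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(query, vectors, k):
    """Exhaustive search: indices of the k vectors closest to query by L2."""
    order = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    return order[:k]


# Toy data (illustrative only).
docs = [normalize(v) for v in ([1, 0, 0], [1, 1, 0], [0, 0, 1])]
query = normalize([2, 1, 0])

by_l2 = flat_search(query, docs, k=3)
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
print(by_l2, by_cosine)  # both orderings are [1, 0, 2]
```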

Fuzzy Clustering

Instead of assigning each document to a single cluster, we use Gaussian Mixture Models (GMM) to generate probabilistic cluster memberships.

This enables:

Documents belonging to multiple semantic topics
Boundary analysis between clusters
More realistic representation of topic overlap

Example:

Document: "Gun control legislation debate"

Cluster probabilities:
Politics → 0.52
Firearms → 0.37
Law → 0.11
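The probabilities in such an example are the posterior responsibilities of each mixture component for a document vector. The sketch below computes them for a diagonal-covariance mixture whose weights, means, and variances are hand-picked for illustration, not fitted on the dataset. Libraries such as scikit-learn expose the same quantity through `GaussianMixture.predict_proba`; whether the project uses that call is not stated here.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def memberships(x, weights, means, variances):
    """Posterior probability of each component given x; the values sum to 1."""
    logs = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-D mixture with three components.
weights = [0.4, 0.4, 0.2]
means = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

probs = memberships([1.5, 0.2], weights, means, variances)
# The index of the largest probability plays the role of "dominant_cluster".
dominant = max(range(len(probs)), key=probs.__getitem__)
```

A document whose largest probability is only slightly above the second largest lies near a cluster boundary, which is the case the soft assignment is meant to expose.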

Semantic Cache

Traditional caches match on the exact query string, so they miss when the same question is phrased differently.

Example:

Query 1: "How does the space shuttle work?"
Query 2: "Explain space shuttle technology"

A traditional cache treats them as different.

Our semantic cache:

Embeds each query
Computes cosine similarity with cached queries
If similarity > threshold → cache hit

This allows the system to reuse previously computed results.
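The three steps above can be sketched as a small in-memory class. This is an illustrative sketch under stated assumptions: the class and method names are hypothetical, the default threshold of 0.85 is a placeholder, and the linear scan over cached queries is the simplest possible lookup rather than a description of the project's implementation.

```python
import math


class SemanticCache:
    """Toy in-memory cache keyed by query embeddings (linear scan)."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # (query, vector, result) tuples
        self.hit_count = 0
        self.miss_count = 0

    @staticmethod
    def _cosine(a, b):
        """Cosine similarity; 0.0 if either vector has zero length."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def lookup(self, vector):
        """Return (query, result, score) for the best match above the threshold, else None."""
        best = None
        for query, cached_vector, result in self.entries:
            score = self._cosine(vector, cached_vector)
            if best is None or score > best[2]:
                best = (query, result, score)
        if best is not None and best[2] > self.threshold:
            self.hit_count += 1
            return best
        self.miss_count += 1
        return None

    def store(self, query, vector, result):
        """Remember the result computed for a query embedding."""
        self.entries.append((query, vector, result))

    def clear(self):
        """Drop all entries and reset the counters."""
        self.entries.clear()
        self.hit_count = 0
        self.miss_count = 0
```

A caller would embed the incoming query, call `lookup`, and only on `None` run the vector search and `store` the outcome.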

Cache Threshold

The cache similarity threshold determines whether two queries are considered equivalent.

Lower threshold:

More cache hits
Risk of irrelevant matches

Higher threshold:

More accurate
Fewer cache hits

This parameter directly controls the trade-off between cache hit rate and result accuracy.
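To make the trade-off concrete, the sketch below counts how many lookups would be treated as hits at different thresholds. The similarity scores are made up for illustration, and `hits_at` is a hypothetical helper.

```python
def hits_at(threshold, similarities):
    """Number of lookups whose best similarity exceeds the threshold."""
    return sum(1 for score in similarities if score > threshold)


# Best-match similarity for five hypothetical incoming queries.
similarities = [0.95, 0.88, 0.81, 0.74, 0.60]

for threshold in (0.7, 0.8, 0.9):
    print(threshold, hits_at(threshold, similarities))
# 0.7 -> 4 hits, 0.8 -> 3 hits, 0.9 -> 1 hit
```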

API Endpoints

POST /query

Submit a natural language query.

Example request:

POST /query

Body:

{
  "query": "space shuttle technology"
}

Example response:

{
  "query": "space shuttle technology",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": ["document1", "document2"],
  "dominant_cluster": 3
}
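A client can call this endpoint with the standard library alone. The sketch below assumes the server is reachable at the local address given under Running the API; the helper names `build_query_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_query_request(text, base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"query": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_query_request("space shuttle technology"))` would return the response as a Python dict.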

GET /cache/stats

Returns cache statistics.

Example:

{
  "total_entries": 42,
  "hit_count": 17,
  "miss_count": 25,
  "hit_rate": 0.405
}
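The `hit_rate` in this example is consistent with hits divided by total lookups: 17 / (17 + 25) ≈ 0.405. A minimal sketch of that calculation follows; the function name and the rounding to three decimals are assumptions.

```python
def cache_stats(total_entries, hit_count, miss_count):
    """Assemble the statistics payload; hit_rate is hits over all lookups."""
    lookups = hit_count + miss_count
    hit_rate = round(hit_count / lookups, 3) if lookups else 0.0
    return {
        "total_entries": total_entries,
        "hit_count": hit_count,
        "miss_count": miss_count,
        "hit_rate": hit_rate,
    }


print(cache_stats(42, 17, 25)["hit_rate"])  # 0.405
```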

DELETE /cache

Clears the cache and resets statistics.

Installation

Clone the repository

git clone https://github.com/nethra0906/ClusterCache.git
cd ClusterCache/semantic-search-system

Create virtual environment

python -m venv venv

Activate

Windows:

venv\Scripts\activate

Mac/Linux:

source venv/bin/activate

Install dependencies

pip install -r req.txt

Running the API

Start the FastAPI server

uvicorn main:app --reload

Server will run at

http://127.0.0.1:8000

API documentation available at

http://127.0.0.1:8000/docs

Example Queries

Try queries like:

"space shuttle technology"
"gun control debate"
"computer graphics rendering"
"baseball statistics"

Project Structure

semantic-search-system
│
├── api
│   └── service.py
│
├── cache
│   └── semantic_cache.py
│
├── clustering
│   └── fuzzy_cluster.py
│
├── embeddings
│   ├── generate_embeddings.py
│   └── vector_store.py
│
├── utils
│   ├── load_dataset.py
│   └── query_engine.py
│
├── data
│
├── main.py
├── req.txt
└── README.md

Possible Future Improvements

Cluster-aware cache lookup
HNSW indexing for faster vector retrieval
Query intent detection
Cluster visualization using t-SNE or UMAP
Distributed vector storage

Author

Nethra Krishnan
B.Tech Computer Science (Data Science)
VIT Vellore
