ClusterCache

Semantic Search System with Fuzzy Clustering and Intelligent Query Caching

ClusterCache is a lightweight semantic search system built on top of the 20 Newsgroups dataset.
It combines vector embeddings, fuzzy clustering, and a custom semantic cache to efficiently retrieve semantically relevant documents while avoiding redundant computations.


System Architecture

The system consists of four major components:

  1. Embedding & Vector Database
  2. Fuzzy Clustering
  3. Semantic Cache
  4. FastAPI Service

Query flow:

User Query
   │
   ▼
Embedding Model
   │
   ▼
Semantic Cache Check
   │
   ├── Cache Hit → Return cached result
   │
   └── Cache Miss
           │
           ▼
    Vector Search (FAISS)
           │
           ▼
    Cluster Analysis
           │
           ▼
    Store Result in Cache
           │
           ▼
         Response

Dataset

This system uses the 20 Newsgroups dataset, which contains approximately 20,000 documents across 20 topic categories.

Dataset link: https://archive.ics.uci.edu/dataset/113/twenty+newsgroups

Extract into: data/20_newsgroups/

The dataset is preprocessed by removing:

  • Email headers
  • Footers
  • Quoted text

This improves the quality of semantic embeddings by reducing noise.
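As a sketch, the stripping steps above might look like the following (a hypothetical helper for illustration, not the repository's actual preprocessing code):

```python
import re

def clean_document(raw: str) -> str:
    """Strip email headers, quoted text, and a signature footer from a
    raw newsgroup post (illustrative helper only)."""
    # Headers: everything up to the first blank line.
    _, _, body = raw.partition("\n\n")
    kept = []
    for line in body.splitlines():
        # Quoted text: lines beginning with '>' (common reply style).
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    body = "\n".join(kept)
    # Footer: drop a trailing signature block introduced by '--'.
    body = re.split(r"\n--\s*\n", body)[0]
    return body.strip()
```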


Technologies Used

Component        Technology
---------        ----------
Embeddings       Sentence Transformers (MiniLM)
Vector Database  FAISS
Clustering       Gaussian Mixture Model (soft clustering)
API Framework    FastAPI
Similarity       Cosine Similarity
Language         Python

Design Decisions

Embedding Model

The model all-MiniLM-L6-v2 from Sentence Transformers was chosen because:

  • It produces strong semantic embeddings
  • It is lightweight and fast
  • It is well suited to semantic search systems
  • Its 384-dimensional embeddings balance accuracy and speed

Vector Database

FAISS was selected for vector storage because:

  • Efficient similarity search
  • Optimized for large embedding datasets
  • Widely used in production ML systems

The FAISS IndexFlatL2 index is used for simplicity and accuracy.


Fuzzy Clustering

Instead of assigning each document to a single cluster, we use Gaussian Mixture Models (GMM) to generate probabilistic cluster memberships.

This enables:

  • Documents belonging to multiple semantic topics
  • Boundary analysis between clusters
  • More realistic representation of topic overlap

Example:

Document: "Gun control legislation debate"

Cluster probabilities:
Politics → 0.52
Firearms → 0.37
Law → 0.11
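A toy sketch of soft clustering with scikit-learn's GaussianMixture (the 2-D data and cluster count are illustrative stand-ins for the real 384-dimensional embeddings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for document embeddings: two loose groups in 2-D.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(3.0, 0.5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# predict_proba returns soft memberships: each row sums to (approximately) 1,
# so a document can belong partially to several clusters.
probs = gmm.predict_proba(X)
print(probs.shape)  # (100, 2)
```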

Semantic Cache

Traditional caches fail when queries are phrased differently.

Example:

Query 1: "How does the space shuttle work?"
Query 2: "Explain space shuttle technology"

An exact-match cache treats them as different keys, so the second query misses despite asking for the same information.

Our semantic cache:

  1. Embeds each query
  2. Computes cosine similarity with cached queries
  3. If similarity > threshold → cache hit

This allows the system to reuse previously computed results.
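The three steps above can be sketched as a minimal class (an assumed interface for illustration; the repository's semantic_cache.py may differ):

```python
import numpy as np

class SemanticCache:
    """Minimal sketch of similarity-based query caching."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entries = []  # list of (unit_embedding, result) pairs

    def lookup(self, embedding: np.ndarray):
        emb = embedding / np.linalg.norm(embedding)
        for cached_emb, result in self.entries:
            # Cosine similarity: dot product of unit vectors.
            if float(emb @ cached_emb) > self.threshold:
                return result  # cache hit
        return None  # cache miss

    def store(self, embedding: np.ndarray, result):
        emb = embedding / np.linalg.norm(embedding)
        self.entries.append((emb, result))

cache = SemanticCache(threshold=0.85)
cache.store(np.array([1.0, 0.0]), "shuttle docs")
print(cache.lookup(np.array([0.99, 0.05])))  # near-duplicate query -> hit
print(cache.lookup(np.array([0.0, 1.0])))    # unrelated query -> None
```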


Cache Threshold

The cache similarity threshold determines whether two queries are considered equivalent.

Lower threshold:

  • More cache hits
  • Risk of irrelevant matches

Higher threshold:

  • More accurate
  • Fewer cache hits

This parameter directly affects system behaviour and efficiency.


API Endpoints

POST /query

Submit a natural language query.

Example request:

POST /query

Body:

{
  "query": "space shuttle technology"
}

Example response:

{
  "query": "space shuttle technology",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": ["document1", "document2"],
  "dominant_cluster": 3
}

GET /cache/stats

Returns cache statistics.

Example:

{
  "total_entries": 42,
  "hit_count": 17,
  "miss_count": 25,
  "hit_rate": 0.405
}

DELETE /cache

Clears the cache and resets statistics.


Installation

Clone the repository

git clone https://github.com/nethra0906/ClusterCache.git
cd ClusterCache

Create virtual environment

python -m venv venv

Activate

Windows:

venv\Scripts\activate

Mac/Linux:

source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Running the API

Start the FastAPI server

uvicorn main:app --reload

The server will run at:

http://127.0.0.1:8000

Interactive API documentation is available at:

http://127.0.0.1:8000/docs

Example Queries

Try queries like:

"space shuttle technology"
"gun control debate"
"computer graphics rendering"
"baseball statistics"

Project Structure

ClusterCache
│
├── api
│   └── service.py
│
├── cache
│   └── semantic_cache.py
│
├── clustering
│   └── fuzzy_cluster.py
│
├── embeddings
│   ├── generate_embeddings.py
│   └── vector_store.py
│
├── utils
│   ├── load_dataset.py
│   └── query_engine.py
│
├── data
│
├── main.py
├── requirements.txt
└── README.md

Possible Future Improvements

  • Cluster-aware cache lookup
  • HNSW indexing for faster vector retrieval
  • Query intent detection
  • Cluster visualization using t-SNE or UMAP
  • Distributed vector storage

Author

Nethra Krishnan
B.Tech Computer Science (Data Science)
VIT Vellore
