Liquid4All/context-1-data-gen

Chroma Context-1 Data Generation

Synthetic data generation pipeline from our technical report.

Generates synthetic multi-hop search tasks across multiple domains. Each domain follows an explore → verify → extend pattern to produce multi-step retrieval tasks.
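The explore → verify → extend loop can be pictured with a minimal sketch. It is not the repository's implementation: the `Hop`, `Task`, `verify`, and `build_task` names are hypothetical, the explorer and extender are plain stub functions standing in for LLM-backed agents, and verification is assumed here to be a simple check that a supporting quote appears in the cited document.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Hop:
    """One retrieval step: a clue plus the quote that supports it (hypothetical)."""
    doc_id: str
    clue: str
    quote: str


@dataclass
class Task:
    """A multi-hop task is an ordered chain of verified hops (hypothetical)."""
    hops: list[Hop] = field(default_factory=list)


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not matter.
    return " ".join(text.lower().split())


def verify(hop: Hop, corpus: dict[str, str]) -> bool:
    # Assumed check: accept a hop only if its quote occurs in the cited document.
    doc = corpus.get(hop.doc_id, "")
    return bool(hop.quote.strip()) and normalize(hop.quote) in normalize(doc)


def build_task(
    explore: Callable[[], Optional[Hop]],
    extend: Callable[[Task], Optional[Hop]],
    corpus: dict[str, str],
    max_hops: int = 3,
) -> Task:
    task = Task()
    candidate = explore()              # explore: propose a first hop
    while candidate is not None and len(task.hops) < max_hops:
        if not verify(candidate, corpus):
            break                      # verify: stop at the first unsupported hop
        task.hops.append(candidate)
        candidate = extend(task)       # extend: propose the next hop
    return task


# Toy corpus and stub agents, invented for illustration only.
corpus = {
    "a": "Acme Corp was founded in 1999 by Jane Roe.",
    "b": "Jane Roe later joined the board of Globex.",
}


def explore() -> Optional[Hop]:
    return Hop("a", "Who founded Acme Corp?", "founded in 1999 by Jane Roe")


def extend(task: Task) -> Optional[Hop]:
    if len(task.hops) == 1:
        return Hop("b", "Which board did that founder join?", "joined the board of Globex")
    return None


task = build_task(explore, extend, corpus)
```

In this toy run both hops pass verification, so `task.hops` holds a two-step chain. A hop whose quote is not found in its document would end the chain early.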

Context-1 model weights are available here.

Setup

# Install dependencies
uv sync

# For all optional dependencies (reranking, patents, indexing)
uv sync --all-extras

# Configure environment
cp .env.example .env  # then fill in API keys

Required environment variables

Variable           Used by
ANTHROPIC_API_KEY  All domains
SERPER_API_KEY     Web (search + scrape)
JINA_API_KEY       Web (backup page fetcher)
OPENAI_API_KEY     Web, SEC, Email, Patents (embeddings)
CHROMA_API_KEY     Web, SEC, Email, Patents (indexing)
CHROMA_DATABASE    Web, SEC, Email, Patents (indexing)
BASETEN_API_KEY    SEC (reranking)
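Before running a domain pipeline, it can help to confirm that the variables listed above are set. The following is a minimal sketch, not part of the repository: the `REQUIRED` mapping and `missing_vars` helper are hypothetical names, the domain keys are chosen for illustration, and the check assumes the variables are already exported into the process environment.

```python
import os
from typing import Mapping, Optional

# Variables shared by every domain that embeds and indexes documents.
_INDEXING = ["OPENAI_API_KEY", "CHROMA_API_KEY", "CHROMA_DATABASE"]

# Per-domain requirements, transcribed from the table above.
REQUIRED = {
    "web": ["ANTHROPIC_API_KEY", "SERPER_API_KEY", "JINA_API_KEY", *_INDEXING],
    "sec": ["ANTHROPIC_API_KEY", *_INDEXING, "BASETEN_API_KEY"],
    "email": ["ANTHROPIC_API_KEY", *_INDEXING],
    "patents": ["ANTHROPIC_API_KEY", *_INDEXING],
}


def missing_vars(domain: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty for a domain."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED[domain] if not env.get(name)]


if __name__ == "__main__":
    for domain in REQUIRED:
        missing = missing_vars(domain)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{domain}: {status}")
```

Passing an explicit mapping as `env` makes the helper easy to exercise without touching the real environment.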

Domains

Each domain has its own pipeline command and README with full documentation:

  • Web — multi-hop search tasks from the open web
  • SEC — SEC filing tasks
  • Patents — patent prior-art tasks
  • Email (Epstein) — email search tasks

Project structure

agentic_search_data_gen/
├── core/                    # Shared base classes and utilities
│   ├── explore.py           # BaseExplorerAgent
│   ├── extend.py            # BaseExtenderAgent
│   ├── verify.py            # BaseVerifier
│   ├── distract.py          # BaseDistractorAgent
│   ├── rerank.py            # Baseten reranker client
│   ├── indexing.py          # ChromaDB indexing utilities
│   └── utils.py             # Anthropic client, token counting, quote matching
├── domains/
│   ├── web/                 # Web search tasks
│   ├── sec/                 # SEC filing tasks
│   ├── patents/             # Patent prior-art tasks
│   └── epstein/             # Email search tasks
