Synthetic data generation pipeline from our technical report.
Generates synthetic multi-hop search tasks across multiple domains. Each domain follows an explore → verify → extend pattern to produce multi-step retrieval tasks.
Context-1 model weights are available here.
```shell
# Install dependencies
uv sync

# For all optional dependencies (reranking, patents, indexing)
uv sync --all-extras

# Configure environment
cp .env.example .env  # then fill in API keys
```

| Variable | Used by |
|---|---|
| `ANTHROPIC_API_KEY` | All domains |
| `SERPER_API_KEY` | Web (search + scrape) |
| `JINA_API_KEY` | Web (backup page fetcher) |
| `OPENAI_API_KEY` | Web, SEC, Email, Patents (embeddings) |
| `CHROMA_API_KEY` | Web, SEC, Email, Patents (indexing) |
| `CHROMA_DATABASE` | Web, SEC, Email, Patents (indexing) |
| `BASETEN_API_KEY` | SEC (reranking) |
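As a quick sanity check before running a pipeline, you can verify that the variables from the table above are set for the domain you plan to use. This helper script is not part of the repo; the variable-to-domain mapping is read off the table, and the domain names are assumed to match the `domains/` subdirectories.

```python
import os

# Required API keys per domain, transcribed from the table above.
REQUIRED_KEYS = {
    "web": ["ANTHROPIC_API_KEY", "SERPER_API_KEY", "JINA_API_KEY",
            "OPENAI_API_KEY", "CHROMA_API_KEY", "CHROMA_DATABASE"],
    "sec": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "CHROMA_API_KEY",
            "CHROMA_DATABASE", "BASETEN_API_KEY"],
    "patents": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "CHROMA_API_KEY",
                "CHROMA_DATABASE"],
    "epstein": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "CHROMA_API_KEY",
                "CHROMA_DATABASE"],
}

def missing_keys(domain: str, env=os.environ) -> list:
    """Return the variables from the table that are unset for a domain."""
    return [k for k in REQUIRED_KEYS[domain] if not env.get(k)]

if __name__ == "__main__":
    for domain in REQUIRED_KEYS:
        missing = missing_keys(domain)
        print(f"{domain:8s}", "ok" if not missing else f"missing: {', '.join(missing)}")
```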
Each domain has its own pipeline command and README with full documentation:
- Web — multi-hop search tasks from the open web
- SEC — SEC filing tasks
- Patents — patent prior-art tasks
- Email (Epstein) — email search tasks
```
agentic_search_data_gen/
├── core/              # Shared base classes and utilities
│   ├── explore.py     # BaseExplorerAgent
│   ├── extend.py      # BaseExtenderAgent
│   ├── verify.py      # BaseVerifier
│   ├── distract.py    # BaseDistractorAgent
│   ├── rerank.py      # Baseten reranker client
│   ├── indexing.py    # ChromaDB indexing utilities
│   └── utils.py       # Anthropic client, token counting, quote matching
├── domains/
│   ├── web/           # Web search tasks
│   ├── sec/           # SEC filing tasks
│   ├── patents/       # Patent prior-art tasks
│   └── epstein/       # Email search tasks
```
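The explore → verify → extend pattern mentioned above can be sketched as a simple loop: an explorer proposes a seed single-hop task, a verifier checks that the answer is grounded in the retrieved documents, and an extender adds further hops. The class names below mirror the files in `core/`, but these minimal stand-ins are illustrative only and are not the repo's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    question: str
    answer: str
    hops: list = field(default_factory=list)  # documents retrieved so far

class ExplorerAgent:
    """Stand-in for BaseExplorerAgent: proposes a seed single-hop task."""
    def explore(self, seed_doc: str) -> Task:
        return Task(question=f"What does {seed_doc} report?",
                    answer=seed_doc, hops=[seed_doc])

class Verifier:
    """Stand-in for BaseVerifier: keeps only tasks grounded in their hops."""
    def verify(self, task: Task) -> bool:
        return task.answer in " ".join(task.hops)

class ExtenderAgent:
    """Stand-in for BaseExtenderAgent: chains one more retrieval hop on."""
    def extend(self, task: Task, doc: str) -> Task:
        task.hops.append(doc)
        task.question += f" And how does {doc} relate?"
        return task

def generate(seed_doc: str, extra_docs: list) -> Optional[Task]:
    task = ExplorerAgent().explore(seed_doc)
    verifier = Verifier()
    if not verifier.verify(task):
        return None
    for doc in extra_docs:
        task = ExtenderAgent().extend(task, doc)
        if not verifier.verify(task):  # drop tasks that break while extending
            return None
    return task
```

Each domain pipeline specializes these base classes for its corpus (web pages, SEC filings, patents, or emails) while reusing the same loop.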