A powerful command-line tool for indexing, searching, and chatting with your documents using AI. Features include document indexing, AI-powered chat, web scraping, and automated diagram generation.
- Document Indexing: Index PDFs, text files, markdown, and more with multi-threaded processing
- AI Chat Interface: Chat with your documents using multi-model support (cloud-based, LAN, local)
- Private AI: Private LLM & SLM support with Ollama and vLLM
- Web Indexing: Index websites and GitHub repositories
- Diagram Generation: Create beautiful diagrams from natural language or document content
- Smart Search: Vector-based semantic search across all indexed content
- Export & Analysis: Export conversations, analyze document collections
git clone https://github.com/SumanthPal/CorpusNote.git
cd CorpusNote
chmod +x installation/install.sh
./installation/install.sh

git clone https://github.com/SumanthPal/CorpusNote.git
cd CorpusNote
powershell -ExecutionPolicy Bypass -File installation\install.ps1

git clone https://github.com/SumanthPal/CorpusNote.git
cd CorpusNote
make -f installation/Makefile install

The installation script will:
- Check Python version (3.8+ required)
- Create a virtual environment
- Install all dependencies
- Set up the corpus command
- Check for optional tools (D2)
- Guide you through initial configuration

If you prefer to install manually:
# Clone repository
git clone https://github.com/SumanthPal/CorpusNote.git
cd CorpusNote
# Create virtual environment
python3 -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install corpus command
pip install -e .
# Configure
corpus config setup

# Activate virtual environment (required every time)
source env/bin/activate # Linux/macOS
# or
env\Scripts\activate # Windows

# Now you can use corpus!
corpus --help
corpus config setup
After installation, you have several options to make the corpus command easily accessible.
Run the provided script to automatically add corpus to your shell:
cd CorpusNote
./scripts/add_corpus_alias.sh
source ~/.zshrc # or ~/.bashrc

This creates a shell function that automatically activates the environment when you run corpus.
Add this to your ~/.zshrc or ~/.bashrc:
# Corpus CLI
corpus() {
cd /path/to/CorpusNote && source env/bin/activate && command corpus "$@"
cd - > /dev/null
}

For system-wide access without shell configuration:
sudo cp installation/corpus-wrapper.sh /usr/local/bin/corpus
sudo chmod +x /usr/local/bin/corpus

# Index a single file
corpus index ~/Documents/paper.pdf
# Index a directory recursively
corpus index ~/Documents/Research
# Index with specific pattern
corpus index ~/Documents --pattern "*.pdf"
# Index with custom thread count
corpus index ~/Documents --workers 8

# Start interactive chat
corpus chat
# Ask a single question
corpus ask "What are the main findings about quantum computing?"
# Chat with filtered documents
corpus chat --filter "quantum*.pdf"

# Interactive diagram mode
corpus diagram
# Generate diagram from description
corpus diagram "user authentication flow" --type flowchart
# Generate from document search
corpus diagram --search "network architecture" --theme professional
# Batch generation
corpus diagram-batch --file queries.txt

corpus index <path> [OPTIONS]
Options:
-r, --recursive/--no-recursive Recursively index subdirectories [default: True]
-f, --force Force re-indexing of existing files
-p, --pattern TEXT File pattern to match (e.g., '*.pdf')
-w, --workers INTEGER Number of threads (default: auto-detect)
--no-threading Disable multi-threading

corpus update <directory> [OPTIONS]
Options:
-r, --recursive/--no-recursive Recursively check subdirectories
-w, --workers INTEGER Number of threads

corpus watch <directory> [OPTIONS]
Options:
-i, --interval INTEGER Check interval in seconds [default: 60]

corpus index-url <url> [OPTIONS]
Options:
-m, --max-pages INTEGER Maximum pages to crawl [default: 50]
--same-domain/--any-domain Only crawl same domain [default: True]
-t, --github-token TEXT GitHub personal access token

corpus index-github <owner/repo> [OPTIONS]
Options:
-t, --token TEXT GitHub personal access token for private repos

corpus chat [OPTIONS]
Options:
-f, --filter TEXT Filter documents by filename pattern
--no-sources Hide source citations

corpus ask <question> [OPTIONS]
Options:
-f, --filter TEXT Filter documents by filename pattern
--no-sources Hide source citations

corpus diagram [query] [OPTIONS]
Options:
-t, --type TEXT Diagram type (flowchart, network, etc.)
--theme TEXT Color theme (default, professional, vibrant)
-l, --layout TEXT Layout engine (dagre, elk)
-s, --search TEXT Create from document search
-e, --export TEXT Export format (png, pdf, etc.)

corpus diagram-batch [OPTIONS]
Options:
-f, --file PATH File containing queries (one per line)
-t, --type TEXT Default diagram type
--theme TEXT Color theme

corpus diagram-gallery [OPTIONS]
Options:
-l, --limit INTEGER Number of diagrams to show [default: 20]
-f, --format TEXT Filter by format (svg, png, pdf)
-s, --sort TEXT Sort by: modified, created, name, size

corpus status

corpus list [OPTIONS]
Options:
-s, --sort TEXT Sort by: name, size, date, chunks
-f, --filter TEXT Filter by filename pattern
-l, --limit INTEGER Number to show [default: 20]

corpus clear [OPTIONS]
Options:
-y, --yes Skip confirmation prompt

corpus remove <filename>

corpus analyze <directory>

corpus export [OPTIONS]
Options:
-o, --output TEXT Output filename

corpus supported

corpus info

The diagram generator supports multiple types:
- flowchart: Process flows, decision trees, workflows
- network: System architecture, infrastructure diagrams
- hierarchy: Organizational charts, tree structures
- sequence: Interaction diagrams, communication flows
- erd: Entity relationship diagrams for databases
- state: State machines and transitions
- mind_map: Concept maps, brainstorming
- gantt: Project timelines (requires D2 Pro)
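
The batch example in this section prefixes each query with one of these type names, as in flowchart:CI/CD pipeline. A minimal Python sketch of how such lines could be split and checked before running a batch. The helper name parse_batch_line and the fallback to a default type are assumptions for illustration, not necessarily how Corpus parses the file:

```python
# Diagram types listed in this README.
DIAGRAM_TYPES = {
    "flowchart", "network", "hierarchy", "sequence",
    "erd", "state", "mind_map", "gantt",
}

def parse_batch_line(line, default_type="flowchart"):
    """Split a 'type:query' line into (diagram_type, query).

    Hypothetical helper: a line without a recognized type prefix
    keeps its full text as the query and uses default_type.
    """
    text = line.strip()
    prefix, sep, rest = text.partition(":")
    if sep and prefix.strip().lower() in DIAGRAM_TYPES:
        return prefix.strip().lower(), rest.strip()
    return default_type, text

for line in ["flowchart:CI/CD pipeline", "network:AWS architecture", "login flow"]:
    print(parse_batch_line(line))
```

Splitting on the first colon only keeps queries that contain their own colons intact.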
# Flowchart with theme
corpus diagram "user registration process" -t flowchart --theme professional
# Network diagram from documents
corpus diagram --search "microservices architecture" -t network
# Batch generation with type hints
echo "flowchart:CI/CD pipeline" > diagrams.txt
echo "network:AWS architecture" >> diagrams.txt
echo "sequence:API authentication flow" >> diagrams.txt
corpus diagram-batch -f diagrams.txt --theme vibrant

Corpus uses a flexible configuration system that can be managed entirely through the CLI.
Run the interactive configuration wizard:
corpus config setup

This will guide you through setting up:
- API keys
- Model selection
- Storage paths
- Processing parameters
# Show all settings
corpus config show
# Show specific setting
corpus config show GEMINI_API_KEY
corpus config show CHUNK_SIZE
# Show full details including lists
corpus config show --all

# Set individual values
corpus config set GEMINI_API_KEY "your-api-key"
corpus config set GEMINI_MODEL "gemini-1.5-pro"
corpus config set CHUNK_SIZE 1500
corpus config set MAX_FILE_SIZE_MB 200
# Set lists (use JSON format)
corpus config set CODE_EXTENSIONS '["py", "js", "ts", "java"]'

# Reset everything to defaults
corpus config reset
# Reset specific setting
corpus config reset CHUNK_SIZE

# Check for any issues
corpus config validate

# Export to different formats
corpus config export --format json > my-config.json
corpus config export --format yaml > my-config.yaml
corpus config export --format env > .env.example

| Setting | Type | Default | Description |
|---|---|---|---|
| API Settings | | | |
| GEMINI_API_KEY | string | - | Your Google Gemini API key |
| GEMINI_MODEL | string | gemini-1.5-flash | Model to use for chat |
| GEMINI_IMG_MODEL | string | gemini-pro-vision | Model for image analysis |
| Storage | | | |
| DB_PATH | string | ~/.corpus/research.db | ChromaDB database location |
| DIAGRAMS_PATH | string | ~/.corpus/diagrams | Where to save generated diagrams |
| COLLECTION_NAME | string | documents | ChromaDB collection name |
| Processing | | | |
| CHUNK_SIZE | int | 1000 | Text chunk size for indexing |
| CHUNK_OVERLAP | int | 200 | Overlap between chunks |
| MIN_CHUNK_LENGTH | int | 50 | Minimum chunk size |
| MAX_FILE_SIZE_MB | int | 100 | Maximum file size to index |
| Search | | | |
| MAX_RESULTS | int | 5 | Default search results |
| MAX_MEMORY | int | 10 | Chat memory length |
| File Types | | | |
| CODE_EXTENSIONS | list | [many] | Code file extensions to index |
| IMAGE_EXTENSIONS | list | [.jpg, .png, etc.] | Image formats to process |
| TEXT_EXTENSIONS | list | [.txt, .md, .pdf] | Text document formats |
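
To make the interaction between CHUNK_SIZE, CHUNK_OVERLAP, and MIN_CHUNK_LENGTH concrete, here is a minimal sliding-window sketch in Python. It assumes sizes are counted in characters and that chunks are cut at fixed offsets. Corpus may instead split on tokens or sentence boundaries, so treat chunk_text as a hypothetical illustration rather than the tool's implementation:

```python
def chunk_text(text, chunk_size=1000, overlap=200, min_length=50):
    """Cut text into overlapping windows (defaults mirror the table above)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        # Drop fragments shorter than min_length, but never drop the only chunk.
        if len(piece) >= min_length or not chunks:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With the defaults, consecutive chunks share 200 characters, so a sentence that straddles a boundary still appears whole in at least one chunk if it is shorter than the overlap.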
You can also use environment variables, especially for sensitive data:
# .env file
GEMINI_API_KEY=your-api-key-here
# Or export directly
export GEMINI_API_KEY="your-api-key"

The configuration system checks environment variables first for API keys.
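
The lookup order described above can be sketched in a few lines of Python. The function name resolve_setting is hypothetical, and the sketch assumes that an empty environment variable counts as unset; Corpus's own loader may differ in both respects:

```python
import os

def resolve_setting(name, file_config, environ=None):
    """Return the environment value if set, else the value from the config file.

    file_config is the already-loaded config mapping (an assumption for
    this sketch); environ defaults to the real process environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value:  # unset or empty variables fall through to the file
        return value
    return file_config.get(name)

file_config = {"GEMINI_API_KEY": "key-from-config-file"}
print(resolve_setting("GEMINI_API_KEY", file_config, {"GEMINI_API_KEY": "key-from-env"}))
print(resolve_setting("GEMINI_API_KEY", file_config, {}))
```

Passing the environment as a parameter keeps the lookup easy to test without touching real credentials.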
Configuration is stored in:
- Config file: ~/.corpus/config.json
- Env file: .env (in current directory)
To see the exact location:
corpus config path

# Change models on the fly
corpus config set GEMINI_MODEL gemini-1.5-pro
# Increase chunk size for larger documents
corpus config set CHUNK_SIZE 2000
# Add custom file extensions
corpus config set CODE_EXTENSIONS '["py","js","rs","go","java"]'
# Validate settings
corpus config validate

# Auto-detect optimal threads
corpus index ~/LargeCollection
# Specify thread count
corpus index ~/LargeCollection --workers 16
# Single-threaded mode (for debugging)
corpus index ~/LargeCollection --no-threading

# Chat with specific documents
corpus chat --filter "machine_learning*.pdf"
# Ask about specific topics
corpus ask "explain transformers" --filter "deep_learning/*"

# Generate, then export to different format
corpus diagram "database schema" --type erd
corpus diagram-gallery # Find the generated file
corpus diagram "query_diagram_20240115_143022.svg" --export png

- Indexing Performance: Use --workers to optimize for your system. Generally, 2-4x CPU cores works well.
- Search Quality: Index related documents together for better context in answers.
- Diagram Generation:
  - Be specific in descriptions for better results
  - Use document search for accuracy when creating architecture diagrams
  - Try different themes for various presentation contexts
- Memory Usage: For large collections, index in batches using --pattern.
- GitHub Indexing: Use a personal access token for better rate limits and private repo access.
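
The worker-count and batching tips can be combined in a small planning script. This is a sketch under assumptions: suggested_workers and plan_batches are hypothetical helpers, the 2x multiplier is just the low end of the range mentioned above, and pattern matching uses Python's fnmatch, which may not match Corpus's own pattern rules exactly. The script only prints commands built from flags documented in this README; it does not run them:

```python
import fnmatch
import os

def suggested_workers(cpu_count=None, multiplier=2):
    """Return multiplier x CPU cores, with a floor of one thread."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, cores * multiplier)

def plan_batches(filenames, patterns):
    """Group filenames under the first pattern each one matches."""
    batches = {pattern: [] for pattern in patterns}
    for name in filenames:
        for pattern in patterns:
            if fnmatch.fnmatchcase(name, pattern):
                batches[pattern].append(name)
                break  # a file belongs to one batch only
    return batches

files = ["paper.pdf", "notes.md", "thesis.pdf", "todo.txt"]
workers = suggested_workers()
for pattern, names in plan_batches(files, ["*.pdf", "*.md"]).items():
    # One indexing command per batch; files matching no pattern are skipped.
    print(f'corpus index ~/Documents --pattern "{pattern}" --workers {workers}  # {len(names)} file(s)')
```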

- "D2 renderer not found"
  - Install D2: curl -fsSL https://d2lang.com/install.sh | sh -s --
  - Ensure D2 is in your PATH
- "API key not found"
  - Run corpus config setup to configure
  - Or set directly: corpus config set GEMINI_API_KEY "your-key"
  - Check with: corpus config show GEMINI_API_KEY
- Configuration issues
  - Validate config: corpus config validate
  - Reset to defaults: corpus config reset
  - Check location: corpus config path
- Indexing failures
  - Check file permissions
  - Ensure files aren't corrupted
  - Try with --no-threading for debugging
  - Check max file size: corpus config show MAX_FILE_SIZE_MB
- Out of memory
  - Reduce chunk size: corpus config set CHUNK_SIZE 500
  - Index in smaller batches
  - Use --workers 1 to reduce memory usage
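
Two of the checks above, whether D2 is on your PATH and whether a file exceeds MAX_FILE_SIZE_MB, are easy to script before a long run. The following Python sketch uses the standard library's shutil.which. The helper names are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes, which may not be how Corpus measures the limit:

```python
import shutil

def d2_available(executable="d2"):
    """True if the D2 executable can be found on PATH."""
    return shutil.which(executable) is not None

def exceeds_limit(size_bytes, max_file_size_mb=100):
    """True if a file of size_bytes is over the configured limit.

    Assumes 1 MB = 1024 * 1024 bytes (an assumption of this sketch).
    """
    return size_bytes > max_file_size_mb * 1024 * 1024

if not d2_available():
    print("D2 not found on PATH; diagram rendering will likely fail")
print(exceeds_limit(150 * 1024 * 1024))  # True: 150 MB is over the 100 MB default
```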
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License - see LICENSE file for details
- Built with ChromaDB for vector storage
- Powered by Google Gemini for AI capabilities
- Diagrams rendered with D2
- CLI interface using Typer
- Beautiful formatting with Rich