A MongoDB Atlas-powered intelligent document processing system that demonstrates how a unified data platform enables sophisticated AI workflows. This implementation showcases three distinct architectural patterns, all built on MongoDB as the backbone:
- Supervisor Multi-Agent Orchestration - MongoDB stores workflow states, document metadata, and assessments across all agents
- Agentic RAG System - MongoDB's vector search and checkpointing enable self-correcting Q&A with conversation memory
- Automated Report Generation - MongoDB's flexible schema stores templates, schedules, and enables semantic search across sections
Why MongoDB is Central: Unlike traditional architectures that require separate databases for structured data, documents, vectors, and operational state, MongoDB Atlas serves as the single source of truth for all data types. This unified approach eliminates data silos, reduces complexity, and enables real-time AI decision-making.
Key Technologies: The system leverages MongoDB Atlas as its foundation, enhanced with LangGraph for orchestration, VoyageAI's voyage-context-3 for embeddings, and AWS Bedrock's Claude 3.5 Sonnet v2 for vision AI.
Prerequisites:
- Required: MongoDB Atlas account with connection string
- Required: VoyageAI API key for embeddings
- Required: AWS account with Bedrock access (Claude 3.5 Sonnet v2)
- Optional: AWS SSO configured for S3 document access
Environment Setup (⚠️ CRITICAL - the demo won't work without this):

```bash
# Create the backend .env file
cd backend
cp .env.example .env  # If example exists
# Edit .env with your actual values (see Configuration & Environment Variables section below)
```
Database Setup:

```bash
# Import seed data (from project root)
cd backend/db/collections/seeds

# OPTIONAL: Only if using S3/GDrive sources (not required for local file upload)
# Copy and configure S3/GDrive settings if needed
cp document_intelligence.buckets.example.json document_intelligence.buckets.json
cp document_intelligence.gdrive.example.json document_intelligence.gdrive.json
# Edit these files with your actual bucket/folder IDs

# Import essential collections
mongoimport --uri "$MONGODB_URI" --collection industry_mappings --file document_intelligence.industry_mappings.json --jsonArray
mongoimport --uri "$MONGODB_URI" --collection report_templates --file document_intelligence.report_templates.json --jsonArray
mongoimport --uri "$MONGODB_URI" --collection agent_personas --file document_intelligence.agent_personas.json --jsonArray

# Import S3/GDrive configs only if using those sources
mongoimport --uri "$MONGODB_URI" --collection buckets --file document_intelligence.buckets.json --jsonArray
mongoimport --uri "$MONGODB_URI" --collection gdrive --file document_intelligence.gdrive.json --jsonArray

# Optional: Import other sample data for testing
# See all available seed files in backend/db/collections/seeds/
```
Quick Demo: For a fast start with local files only:
- Skip S3/GDrive configuration (steps above marked OPTIONAL)
- Use sample documents in `backend/data/seed/docs/`
- View pre-generated reports in `backend/data/seed/reports/`
- Upload your own documents via the API at `http://localhost:8080/api/upload/documents`
Run the Application:

```bash
# From project root
docker-compose up --build
```
Create Vector Search Index (required for Q&A functionality):
- Go to MongoDB Atlas → your cluster → Atlas Search
- Create a new index on the `chunks` collection
- Index name: `document_intelligence_chunks_vector_index`
- Use this JSON configuration:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
```
Key Features:
- Smart Ingestion: Context-aware assessment based on industry and topic
- Pure Vision Understanding: Claude 3.5 Sonnet v2 for document extraction
- Context-Aware Embeddings: Each chunk maintains full document context via voyage-context-3
- Multi-Agent Architecture: Specialized agents with LangGraph orchestration
- Multi-Source Support: Local files, AWS S3, and Google Drive with a unified workflow
- Deduplication: Intelligent caching prevents reprocessing
- Industry-Specific: Configurable mappings for different verticals
- Agentic RAG: Self-correcting Q&A with document grading and query rewriting
- Scheduled Reports: Automated PDF generation with section-specific semantic search
```mermaid
graph TD
    A[Document Sources] --> B[Supervisor Agent<br/>Central Coordinator]
    A1[Local Files] --> A
    A2[AWS S3] --> A
    A3[Google Drive] --> A
    B --> C{Routing Decision}
    C -->|"1. Discover"| D[Scanner Agent<br/>Document Discovery]
    C -->|"2. Assess"| E[Evaluator Agent<br/>Relevance Check]
    C -->|"3. Extract"| F[Extractor Agent<br/>Vision AI]
    C -->|"4. Process"| G[Processor Agent<br/>Chunk & Embed]
    C -->|Complete| H[End]
    D -->|Found Docs| B
    E -->|Assessment| B
    F -->|Markdown| B
    G -->|Stored| B
    subgraph "MongoDB Atlas Storage"
        P1[documents]
        P2[chunks]
        P3[assessments]
        P4[workflows]
    end
    G --> P1
    G --> P2
    E --> P3
    B --> P4
    style B fill:#e1f5fe,stroke:#01579b,stroke-width:3px
    style D fill:#f3e5f5
    style E fill:#e8f5e8
    style F fill:#fff3e0
    style G fill:#fce4ec
```
Pattern: The supervisor pattern is a multi-agent architecture where a central supervisor agent coordinates specialized worker agents. This approach excels when tasks require different types of expertise. Rather than building one agent that manages tool selection across domains, we create focused specialists coordinated by a supervisor who understands the overall workflow.
In our Document Intelligence system, the supervisor agent orchestrates:
- Scanner Agent: Discovers documents from multiple sources (Local files, AWS S3, Google Drive)
- Evaluator Agent: Assesses document relevance based on industry and use case context
- Extractor Agent: Extracts content using vision AI (Claude 3.5 Sonnet v2) and converts to markdown
- Processor Agent: Splits documents into chunks and generates context-aware embeddings (voyage-context-3)
Why Use a Supervisor?
Following the LangChain supervisor pattern, our multi-agent architecture allows us to:
- Partition tools across workers: Each agent has access only to relevant tools (e.g., Scanner has S3/GDrive access, Extractor has vision AI)
- Focused expertise: Each agent has individual prompts and instructions specific to their domain
- Manage complexity: Instead of one agent handling document discovery, evaluation, extraction, AND processing, we separate concerns
- Iterative improvement: If performance degrades, we can improve individual agents without affecting the entire system
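The supervisor's routing decision can be sketched as a plain function over workflow state. The state field names below are illustrative, not the backend's actual schema; in the real system this decision runs inside a LangGraph supervisor node (`backend/agents/supervisor.py`):

```python
# Illustrative sketch of the supervisor's routing decision.
# Field names ("discovered_docs", "assessments", etc.) are hypothetical.

def route_next_agent(state: dict) -> str:
    """Pick the next worker agent based on workflow progress."""
    if not state.get("discovered_docs"):
        return "scanner"      # 1. Discover documents from sources
    if not state.get("assessments"):
        return "evaluator"    # 2. Assess relevance per industry/topic
    if not state.get("markdown"):
        return "extractor"    # 3. Extract content with vision AI
    if not state.get("chunks_stored"):
        return "processor"    # 4. Chunk and embed into MongoDB
    return "END"              # Workflow complete

state = {"discovered_docs": ["report.pdf"], "assessments": ["relevant"]}
print(route_next_agent(state))  # → extractor
```

Each worker returns control to the supervisor when done, so the next call to the router sees the updated state and advances the workflow one stage at a time.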
MongoDB Value for Ingestion:
- Unified Data Platform: Structured metadata, unstructured documents, and vector embeddings all in one database
- Workflow Tracking: Real-time state persistence across all agents
- Document Deduplication: Prevent reprocessing with document status tracking
- Assessment: Industry-specific relevance scores and decisions
```mermaid
graph TD
    A[User Query] --> B[Generate Query or Respond<br/>Retrieval Agent]
    B --> C{Tool Call?}
    C -->|No Tool Call| D[Direct Response]
    C -->|retrieve_documents| E[Retrieve Tool<br/>Semantic Search]
    E --> F[MongoDB Atlas<br/>Vector Search]
    F --> G[Retrieved Chunks]
    G --> H[Grade Documents<br/>Relevance Check]
    H --> I{Relevant?}
    I -->|Yes| J[Generate Answer]
    I -->|No| K[Rewrite Question]
    K --> L[Improved Query]
    L --> B
    J --> M[Answer with Citations]
    D --> N[Response to User]
    M --> N
    subgraph "MongoDB Collections"
        O[chunks]
        P[gradings]
        Q[checkpoints_aio]
        R[agent_personas]
    end
    F --> O
    H --> P
    B --> Q
    B --> R
    style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style H fill:#e8f5e8
    style K fill:#fff3e0
    style J fill:#fce4ec
    style E fill:#d4e157
```
Pattern: Agentic Retrieval-Augmented Generation with Self-Correction
Following the LangGraph Agentic RAG pattern, our Q&A system implements a retrieval agent that makes intelligent decisions about whether to retrieve context from MongoDB Atlas Vector Search or respond directly to the user.
Key Components:
- Retrieval Agent (Query Generator):
  - Decides whether to retrieve context using semantic search or respond directly
  - Uses voyage-context-3 embeddings for semantic search
  - Bound with a retriever tool that searches the MongoDB `chunks` collection
- Document Grader (Conditional Edge):
  - Grades retrieved documents for relevance to the user question
  - Returns a binary score ('yes' or 'no') for each chunk
  - Routes to answer generation if relevant, query rewriting if not
- Query Rewriter:
  - Self-correcting mechanism for when retrieved documents aren't relevant
  - Reformulates the question for better retrieval results
  - Loops back to the retrieval agent for another attempt
- Answer Generator:
  - Synthesizes the final answer from relevant retrieved chunks
  - Maintains citations and source tracking
  - Provides comprehensive answers based on context
MongoDB Integration:
- Semantic Search: Find relevant content based on meaning, not just keywords
- Memory Persistence: `checkpoint_writes_aio` and `checkpoints_aio` collections for conversation state
- Agent Personas: Use-case specific configurations (Credit Rating Analyst, Investment Research Analyst, etc.)
- Grading Storage: Stores document relevance assessments for analysis
Why Agentic RAG?
Unlike traditional RAG that always retrieves, our system:
- Makes intelligent decisions: Knows when retrieval is necessary vs. direct response
- Self-corrects: If retrieved documents aren't relevant, rewrites query and tries again
- Maintains context: Uses MongoDB checkpointing for multi-turn conversations
- Provides transparency: Tracks workflow steps and grading decisions
Workflow Steps:
1. User Query → Generate Query or Respond
   - The LLM decides: answer directly or search for information
   - Has access to the `retrieve_documents` tool via `.bind_tools()`
   - For greetings/simple questions → direct response
   - For document-specific questions → tool call
2. Tool Call Decision → Retrieve Tool
   - If a tool is called: generate a voyage-context-3 embedding
   - Search the MongoDB `chunks` collection with optional document filtering
   - Return the top-k relevant chunks with metadata
3. Grade Documents → Routing Decision
   - Each retrieved chunk is graded for relevance
   - Binary score: 'yes' (relevant) or 'no' (not relevant)
   - Grading results are stored in MongoDB for analysis
4. Answer Generation or Query Rewriting
   - If relevant → generate a comprehensive answer with citations
   - If not relevant → rewrite the query and loop back to step 1
   - A maximum iteration count prevents infinite loops
Implementation Details:
- Uses `MessagesState` for state management
- Uses `tools_condition` for conditional routing after tool calls
- Custom grading prompt for document relevance assessment
- Structured output (`GradeDocuments` schema) for consistent grading
- Thread-based memory with a MongoDB checkpointer for conversation persistence
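The grade-then-route step can be sketched in plain Python. The keyword-overlap grader below is only a stand-in for the LLM call (the real system asks the model for a structured `GradeDocuments` output); `grade_chunk` and `route_after_grading` are illustrative names, not the backend's actual functions:

```python
# Sketch of the grade-and-route step. The grader is a trivial keyword-overlap
# stand-in for the structured-output LLM grader used by the real system.

def grade_chunk(question: str, chunk: str) -> str:
    """Return 'yes' if the chunk shares any meaningful term with the question."""
    q_terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    c_terms = {w.lower().strip("?.,") for w in chunk.split()}
    return "yes" if q_terms & c_terms else "no"

def route_after_grading(question: str, chunks: list[str]) -> str:
    """Route to answer generation if any chunk is relevant, else to rewriting."""
    grades = [grade_chunk(question, c) for c in chunks]
    return "generate_answer" if "yes" in grades else "rewrite_question"

chunks = ["The issuer's credit rating was downgraded to BBB.",
          "Quarterly revenue table."]
print(route_after_grading("What is the credit rating?", chunks))  # → generate_answer
```

When every chunk grades 'no', the router sends the state to the rewriter, which reformulates the question and loops back to retrieval; this is the self-correction loop shown in the diagram above.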
MongoDB Value for Q&A:
- Unified Data Platform: Structured metadata, unstructured text chunks, and vector embeddings coexist seamlessly
- Vector Search: Lightning-fast semantic search across millions of documents
- Conversation Memory: Checkpointing system for multi-turn dialogue persistence
- Agent Personas: Store use-case specific configurations and prompts
Pattern: Automated Report Generation with Section-Specific Semantic Search
The reporting system generates weekly industry reports by leveraging MongoDB's semantic search capabilities to gather relevant content for each report section. The scheduler runs automated jobs that create professional PDF reports for different industry/use case combinations.
Key Components:
- Report Templates:
  - MongoDB stores configurable templates by industry/use case
  - Each template defines the report structure and section-specific prompts
  - Enables customized reports for Credit Rating, Investment Research, etc.
- Section-Specific Semantic Search:
  - Each report section has its own targeted semantic query
  - Searches the `chunks` collection using voyage-context-3 embeddings
  - Context accumulation ensures consistency across sections
- Scheduled Generation:
  - Weekly automated reports (configurable by industry)
  - Tracks report metadata and file paths in the `scheduled_reports` collection
  - Automatic cleanup of old reports (keeps the last 7)
- PDF Output:
  - Professional reports using ReportLab
  - Storage-agnostic output (local filesystem or cloud)
  - Includes key metrics, analysis sections, and a disclaimer
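A section's retrieval query can be sketched as an Atlas `$vectorSearch` aggregation pipeline over the `chunks` collection. The index name matches the one created during setup; `build_section_pipeline`, the zero query vector, and the projected field names are illustrative placeholders for a real voyage-context-3 embedding and the backend's actual schema:

```python
# Sketch of a per-section retrieval query using a MongoDB Atlas
# $vectorSearch aggregation stage (field names are illustrative).

def build_section_pipeline(query_vector: list[float], limit: int = 5) -> list[dict]:
    return [
        {
            "$vectorSearch": {
                "index": "document_intelligence_chunks_vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": limit * 20,  # oversample for better recall
                "limit": limit,
            }
        },
        # Keep only what the report writer needs, plus the similarity score
        {"$project": {"text": 1, "source": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

pipeline = build_section_pipeline([0.0] * 1024)  # 1024 dims for voyage-context-3
print(pipeline[0]["$vectorSearch"]["limit"])  # → 5
```

In the real system such a pipeline would be passed to `collection.aggregate(pipeline)`, once per report section, with each section's prompt embedded as the query vector.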
MongoDB Value for Reports:
- Unified Data Platform: Report templates, document chunks, and vector embeddings all accessible from one database
- Template Storage: Flexible schema for industry-specific report structures
- Semantic Search: Fast content retrieval based on meaning for each report section
- Metadata Tracking: Report generation history and file management
- Scalability: Handle large document corpuses for comprehensive reports
```
/
├── backend/                          # FastAPI Backend
│   ├── agents/                       # Agent Implementations
│   │   ├── agentic_rag_qa.py         # Agentic RAG Q&A system
│   │   ├── evaluator.py              # Document evaluation agent
│   │   ├── extractor.py              # Content extraction agent
│   │   ├── scanner.py                # Document discovery agent
│   │   ├── state.py                  # State management
│   │   └── supervisor.py             # Orchestration agent
│   ├── api/                          # API Layer
│   │   ├── dependencies.py           # FastAPI dependencies
│   │   └── routes/                   # API endpoints
│   │       ├── documents.py          # Document management
│   │       ├── ingestion.py          # Document ingestion
│   │       ├── qa.py                 # Q&A endpoints
│   │       ├── reports.py            # Report generation
│   │       └── upload.py             # Document upload
│   ├── cloud/                        # Cloud Service Integrations
│   │   ├── aws/                      # AWS Services
│   │   │   ├── bedrock/              # AWS Bedrock Integration
│   │   │   │   ├── claude_vision.py  # Claude 3 vision
│   │   │   │   └── client.py         # AWS client setup
│   │   │   └── s3/                   # S3 Storage Integration
│   │   │       ├── bucket_access.py  # S3 operations
│   │   │       └── client.py         # S3 client setup
│   │   └── gdrive/                   # Google Drive Integration
│   │       ├── gdrive_access.py      # Drive operations
│   │       └── simple_extraction.py  # Public scraping
│   ├── config/                       # Configuration Files
│   │   ├── industry_config.py        # Industry mappings
│   │   └── storage_config.py         # Storage config
│   ├── data/                         # Demo Data
│   │   └── seed/                     # Sample files
│   │       ├── docs/                 # Sample documents
│   │       └── reports/              # Sample reports
│   ├── db/                           # Database Layer
│   │   ├── collections/              # MongoDB seed data
│   │   │   └── seeds/                # Config templates
│   │   ├── mongodb_connector.py      # MongoDB connection
│   │   ├── vector_search.py          # Vector search
│   │   └── vector_search_index_creator.py  # Index setup
│   ├── processors/                   # Document Processing
│   │   └── document_processor.py     # Main processor
│   ├── services/                     # Background Services
│   │   ├── document_cache.py         # Caching service
│   │   ├── report_generator.py       # PDF generation
│   │   └── scheduler.py              # Report scheduler
│   ├── tools/                        # LangChain Tools
│   │   ├── document_tools.py         # Document handling
│   │   ├── embedding_tools.py        # Embedding generation
│   │   └── vision_tools.py           # Vision processing
│   ├── vogayeai/                     # VoyageAI Integration
│   │   └── context_embeddings.py     # Context embeddings
│   ├── workflows/                    # LangGraph Workflows
│   │   ├── ingestion_builder.py      # Workflow builder
│   │   └── ingestion_workflow.py     # Ingestion flow
│   ├── main.py                       # FastAPI application
│   └── pyproject.toml                # Python dependencies
├── diagrams/                         # Architecture Diagrams
│   ├── 1_high_level_architecture.png
│   ├── 2_part1_ingestion_multiagent_supervisor.png
│   ├── 3_part1_multiagent_supervisor_pattern_explanation.png
│   ├── 4_part2_QandA_agentic_rag.png
│   ├── 5_part2_agentic_rag_pattern_explanation.png
│   └── 6_part3_scheduled_reports.png
├── environments/                     # Environment Configs
│   ├── prod.yaml                     # Production config
│   └── staging.yaml                  # Staging config
├── docker-compose.yml                # Docker services
├── Dockerfile.backend                # Backend container
├── makefile                          # Build commands
└── README.md                         # This file
```
Configuration lives in a `.env` file in the `backend/` directory. Without proper configuration, the demo will not work.

```bash
# Copy the example file (if available)
cp backend/.env.example backend/.env
# Then edit backend/.env with your actual values
```

```bash
# FSI Document Intelligence Environment Variables

# MongoDB Configuration (REQUIRED)
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
DATABASE_NAME=document_intelligence
APP_NAME="ist.demo.document_intelligence.fsi"

# Collection Names
DOCUMENTS_COLLECTION=documents
CHUNKS_COLLECTION=chunks
ASSESSMENTS_COLLECTION=assessments
GRADINGS_COLLECTION=gradings
LOGS_COLLECTION=logs
AGENT_PERSONAS_COLLECTION=agent_personas

# Search Index
CHUNKS_VECTOR_INDEX="document_intelligence_chunks_vector_index"

# Document Storage Configuration
DOCUMENT_STORAGE_PATH=/docs
ALLOWED_INDUSTRIES=fsi,manufacturing,retail,healthcare,media,insurance
DEFAULT_INDUSTRY=fsi

# Upload Configuration
MAX_UPLOAD_SIZE_MB=1
ALLOWED_FILE_EXTENSIONS=pdf,docx,doc

# AWS Configuration (REQUIRED)
AWS_REGION=us-east-1
S3_BUCKET_NAME="industry-solutions-demos"
S3_BASE_PREFIX="industry/cross/document-intelligence"

# Optional: If using AWS SSO (recommended)
AWS_PROFILE=your-sso-profile-name

# Google Drive Configuration
GDRIVE_ROOT_FOLDER_ID=your-folder-id-here

# Bedrock Model Configuration (REQUIRED)
# Note: Use inference profile ID (us.*) for better availability
BEDROCK_MODEL_ID=us.anthropic.claude-3-5-sonnet-20241022-v2:0
BEDROCK_MAX_TOKENS=8192

# VoyageAI Configuration (REQUIRED)
VOYAGE_API_KEY=your-api-key-here
VOYAGE_MODEL=voyage-context-3

# Chunking Configuration
MAX_FILE_SIZE_MB=1
MAX_PAGES_PER_DOCUMENT=6
CHUNK_SIZE=2000
CHUNK_OVERLAP=0
```

Configuration notes:
- MongoDB URI: Replace with your Atlas connection string
- AWS Profile: Configure SSO with `aws configure sso` for secure access
- VoyageAI Key: Get from VoyageAI
- Google Drive: Use the folder ID from a shareable link (optional)
- Bedrock Model: Claude 3.5 Sonnet v2 with vision capabilities
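With `CHUNK_SIZE=2000` and `CHUNK_OVERLAP=0`, splitting reduces to fixed-size slicing. The sketch below is a stand-in for the project's document processor, assuming character-based chunking:

```python
# Minimal fixed-size chunker matching the CHUNK_SIZE / CHUNK_OVERLAP settings
# above (illustrative; not the backend's actual splitter).

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 0) -> list[str]:
    """Split text into chunk_size-character pieces with optional overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 4500)   # defaults: 2000 chars, no overlap
print([len(c) for c in chunks])   # → [2000, 2000, 500]
```

A non-zero `CHUNK_OVERLAP` would make consecutive chunks share their boundary text, which trades some storage for less context loss at chunk edges.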
The system uses 15 MongoDB collections organized by function:
Sample Data: See `backend/db/collections/seeds/` for JSON seed files to populate these collections for local development.
- chunks: Document chunks with embeddings - stores text segments with voyage-context-3 vectors
- documents: Document metadata - tracks processing status and source information
- assessments: Document evaluation results - stores relevance scores and processing decisions
- workflows: Ingestion workflow tracking - monitors multi-agent processing state
- gradings: Document relevance grading - binary scores from Q&A retrieval assessment
- logs_qa: Q&A session logs - tracks agentic RAG workflow steps and decisions
- agent_personas: Use-case specific AI configurations - stores prompts and capabilities per industry
- checkpoint_writes_aio: LangGraph state writes - async persistence for conversation state
- checkpoints_aio: Conversation memory - stores thread-based dialogue history
- scheduled_reports: Generated report metadata - tracks PDF locations and generation history
- report_templates: Report structure templates - defines sections and prompts by use case
- buckets: S3 bucket configurations - stores AWS bucket paths and access settings
- gdrive: Google Drive configurations - public folder IDs for document scraping
- industry_mappings: Industry classifications - maps industries to relevant topics and keywords
- logs: Workflow execution logs - INFO level logs for monitoring agent decisions
From `backend/pyproject.toml`:

```
# Core Framework
pymongo>=4.10.1              # MongoDB driver
python-dotenv>=1.0.1         # Environment variables
fastapi>=0.115.4             # API framework
uvicorn>=0.32.0              # ASGI server
pydantic>=2.0.0              # Data validation
python-multipart>=0.0.6      # File upload support

# AI/LLM Framework
langgraph>=0.2.0             # Agent orchestration
langchain>=0.3.0             # LLM framework
langchain-mongodb>=0.2.0     # MongoDB integration
langchain-aws>=0.2.0         # AWS Bedrock integration
langchain-voyageai>=0.1.0    # VoyageAI integration
langgraph-store-mongodb>=0.1.0       # State storage
langgraph-checkpoint-mongodb>=0.1.0  # Memory persistence

# Embeddings
voyageai>=0.3.2              # Context-aware embeddings

# Document Processing
pdf2image>=1.16.3            # PDF to image conversion
python-docx>=1.1.0           # DOCX file handling
pillow>=10.0.0               # Image processing
requests>=2.31.0             # HTTP client (Google Drive)

# AWS Services
boto3>=1.35.70               # AWS SDK
botocore>=1.35.70            # AWS core

# Report Generation
schedule>=1.2.0              # Task scheduling
matplotlib>=3.10.6           # Charts and graphs
reportlab>=4.4.4             # PDF generation
```

Required for document processing (install via system package manager):

```bash
# Ubuntu/Debian
apt-get install poppler-utils   # PDF processing
apt-get install libreoffice     # DOC/DOCX conversion

# macOS
brew install poppler            # PDF processing
brew install libreoffice        # DOC/DOCX conversion
```

The system supports ingesting documents from multiple sources:
Security Note: This repository includes only example configuration files (`.example.json`) in `backend/db/collections/seeds/`. You must create your own configuration files with real S3 bucket names and Google Drive folder IDs. Never commit these real configuration files to public repositories. See `backend/db/collections/seeds/README.md` for setup instructions.

Demo Data: Sample documents and reports are available in `backend/data/seed/` for testing:
- Documents: FSI use case examples (credit rating, KYC, loan origination, etc.) in `docs/`
- Reports: Pre-generated PDF reports for each use case in `reports/`
- Initial State: Configuration for demo scenarios in `documents_initial_state_dict.json`
Documents can be uploaded via API and stored in the Docker volume:
```bash
# Upload documents
curl -X POST http://localhost:8080/api/upload/documents \
  -F "files=@document.pdf" \
  -F "industry=fsi" \
  -F "use_case=credit_rating"

# Available use cases: credit_rating, payment_processing_exception, investment_research, kyc_onboarding, loan_origination

# List uploaded documents in an industry/use_case
curl "http://localhost:8080/api/upload/documents/fsi?use_case=credit_rating"

# Delete specific document in an industry/use_case
curl -X DELETE "http://localhost:8080/api/upload/documents/fsi/document.pdf?use_case=credit_rating"

# Delete all documents in an industry/use_case folder
curl -X DELETE "http://localhost:8080/api/upload/documents/fsi?use_case=credit_rating"

# Ingest from local storage
curl -X POST http://localhost:8080/api/ingestion/start \
  -H "Content-Type: application/json" \
  -d '{
    "source_paths": ["@local@/docs/fsi/credit_rating"],
    "workflow_id": "local_fsi_ingestion"
  }'
```

S3 bucket configuration is stored in the MongoDB `buckets` collection. First, create your configuration from the example:
```bash
# Copy the example file
cp backend/db/collections/seeds/document_intelligence.buckets.example.json backend/db/collections/seeds/document_intelligence.buckets.json

# Edit the file to add your S3 bucket details
# Then import the configuration
mongoimport --uri "$MONGODB_URI" --collection buckets --file backend/db/collections/seeds/document_intelligence.buckets.json --jsonArray
```

Google Drive folder configuration is stored in the MongoDB `gdrive` collection. The system uses public folder web scraping (no API keys needed). First, create your configuration from the example:
```bash
# Copy the example file
cp backend/db/collections/seeds/document_intelligence.gdrive.example.json backend/db/collections/seeds/document_intelligence.gdrive.json

# Edit the file to add your Google Drive folder IDs
# Then import the configuration
mongoimport --uri "$MONGODB_URI" --collection gdrive --file backend/db/collections/seeds/document_intelligence.gdrive.json --jsonArray
```

Expected Google Drive folder structure:

```
Document Intelligence Demo/
└── fsi/
    ├── credit_rating/
    ├── investment_research/
    ├── kyc_onboarding/
    ├── loan_origination/
    └── payment_processing_exception/
```
```bash
# Ingest from Google Drive FSI folder
curl -X POST http://localhost:8080/api/ingestion/start \
  -H "Content-Type: application/json" \
  -d '{
    "source_paths": ["@gdrive@fsi/credit_rating"],
    "workflow_id": "gdrive_fsi_ingestion"
  }'
```

The system uses context-aware document assessment based on industry and topic. This file is included in the repository and safe to use directly:

```bash
# Import industry mappings (no sensitive data - safe to use as-is)
mongoimport --uri "$MONGODB_URI" --collection industry_mappings --file backend/db/collections/seeds/document_intelligence.industry_mappings.json --jsonArray
```

Configure your own S3 bucket structure following this pattern:
- FSI: `s3://YOUR-BUCKET/your-path/fsi/`
- Healthcare: `s3://YOUR-BUCKET/your-path/healthcare/`
- Insurance: `s3://YOUR-BUCKET/your-path/insurance/`
- Manufacturing: `s3://YOUR-BUCKET/your-path/manufacturing/`
- Media: `s3://YOUR-BUCKET/your-path/media/`
- Retail: `s3://YOUR-BUCKET/your-path/retail/`
```bash
# Ingest from S3 FSI folder
curl -X POST http://localhost:8080/api/ingestion/start \
  -H "Content-Type: application/json" \
  -d '{
    "source_paths": ["@s3@fsi"],
    "workflow_id": "s3_fsi_ingestion"
  }'

# Ingest from specific S3 subfolder with use case
curl -X POST http://localhost:8080/api/ingestion/start \
  -H "Content-Type: application/json" \
  -d '{
    "source_paths": ["@s3@fsi/credit_rating"],
    "workflow_id": "s3_fsi_credit_rating"
  }'

# Mix local and S3 sources in one workflow
curl -X POST http://localhost:8080/api/ingestion/start \
  -H "Content-Type: application/json" \
  -d '{
    "source_paths": [
      "@local@/docs/fsi/credit_rating",
      "@s3@fsi/reports"
    ],
    "workflow_id": "mixed_sources_ingestion"
  }'
```

The system uses AWS SSO for authentication. No access keys required:
- Configure AWS SSO: `aws configure sso`
- Login: `aws sso login --profile your-profile`
- Set environment variable: `export AWS_PROFILE=your-profile`
All source types use a consistent prefix pattern for clarity:
- Local files: `@local@/docs/{industry}/{use_case}`
- S3 files: `@s3@{industry}` or `@s3@{industry}/{subfolder}`
- Google Drive: `@gdrive@{industry}/{use_case}`
- All three sources can be mixed in the same ingestion workflow

Document paths stored in MongoDB include full source information:
- Local: `@local@/path/to/file.pdf`
- S3: `@s3@bucket-name/path/to/file.pdf`
- Google Drive: `@gdrive@industry/use_case/file.pdf`
```bash
curl -X POST http://localhost:8080/api/ingestion/start \
  -H "Content-Type: application/json" \
  -d '{
    "source_paths": [
      "@local@/docs/fsi/credit_rating",
      "@s3@fsi/reports",
      "@gdrive@fsi/compliance"
    ],
    "workflow_id": "mixed_all_sources"
  }'
```

The system evaluates documents based on their industry and topic context, extracted from the source path:
- Path Analysis: Extracts industry and topic from source paths
  - Example: `@s3@fsi/credit_rating` → Industry: "financial services", Topic: "credit rating"
- Relevance Scoring: Documents are evaluated against:
  - Industry relevance (e.g., is this a financial services document?)
  - Topic relevance (e.g., is this about credit ratings?)
  - Documents matching EITHER criterion are accepted
- Strict Filtering: Automatically rejects:
  - Food receipts, personal documents, entertainment content
  - Documents with no business relevance to the context
  - Test or sample documents
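The path-to-context extraction can be sketched as follows. `parse_source_path` is a hypothetical helper, not the backend's actual function, and it returns the raw path segments (the real evaluator additionally maps `fsi` to "financial services"):

```python
# Illustrative parser for the '@source@industry/topic' prefix pattern.
import re

def parse_source_path(path: str) -> dict:
    """Split '@source@industry/topic/...' into its evaluation context."""
    m = re.match(r"@(local|s3|gdrive)@(.*)", path)
    if not m:
        raise ValueError(f"Unrecognized source path: {path}")
    source, rest = m.groups()
    # Local paths carry a '/docs' storage root that is not an industry
    parts = [p for p in rest.split("/") if p and p != "docs"]
    industry = parts[0] if parts else None
    topic = parts[1] if len(parts) > 1 else None
    return {"source": source, "industry": industry, "topic": topic}

print(parse_source_path("@s3@fsi/credit_rating"))
# → {'source': 's3', 'industry': 'fsi', 'topic': 'credit_rating'}
```

The extracted industry/topic pair is what the evaluator scores documents against in the relevance check described above.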
- fsi: Financial Services
- healthcare: Healthcare
- insurance: Insurance
- manufacturing: Manufacturing
- media: Media and Entertainment
- retail: Retail
```bash
# Ensure you are in the backend/ directory
cd backend

# Run locally
uv run uvicorn main:app --host 0.0.0.0 --port 8080 --reload
```

```bash
# Build and run all services
docker-compose up --build

# Run backend only
docker-compose up document-intelligence-backend
```

The system provides comprehensive API documentation through FastAPI's automatic documentation features:
- Swagger UI: Navigate to `{URL}/docs` (e.g., `http://localhost:8080/docs`)
- ReDoc: Alternative documentation at `{URL}/redoc`
- Document Management: Upload, list, and manage documents
- Ingestion Workflows: Start and monitor document processing
- Q&A System: Query documents with agentic RAG
- Report Generation: Generate and retrieve scheduled reports
- System Health: Status and configuration endpoints
Tip: The Swagger UI at `/docs` provides an interactive interface where you can:
- Explore all available endpoints
- View request/response schemas
- Test API calls directly from your browser
- See real-time responses and error codes
The system implements sophisticated conversation persistence using MongoDB and LangGraph's checkpointing system:
- Thread-Based Sessions: Each conversation has a unique `thread_id` for state isolation
- Async Checkpointing: Non-blocking state persistence using the `checkpoint_writes_aio` collection
- State Recovery: Automatic restoration of conversation context across requests
- Memory Types:
- Working Memory: Active conversation state in LangGraph
- Long-term Memory: Persisted checkpoints in MongoDB
- Session Metadata: Thread IDs, timestamps, and user context
- Session Initiation: Generate unique session ID when user starts Q&A
- State Checkpointing: After each agent decision, state is saved to MongoDB
- Context Retrieval: Previous messages and decisions loaded for continuity
- Memory Cleanup: Optional session cleanup via API endpoints
- `checkpoints_aio`: Stores complete conversation states
- `checkpoint_writes_aio`: Handles async write operations
- `logs_qa`: Tracks session events and agent decisions
- `gradings`: Preserves document relevance assessments per session
```python
# Start new session
session_id = "user-123-session-456"

# Query with memory
response = await qa_system.answer_with_agentic_rag(
    query="What is the credit rating?",
    thread_id=session_id  # Enables conversation memory
)

# Follow-up uses the same session
follow_up = await qa_system.answer_with_agentic_rag(
    query="Why did it change?",  # Understands context from previous question
    thread_id=session_id
)
```

- Ingestion: Documents need systematic, quality-controlled processing
  - Deterministic workflow ensures data integrity
  - Quality gates prevent irrelevant content
  - Sequential processing ensures completeness
- Q&A: User queries need adaptive, intelligent responses
  - Dynamic routing based on query complexity
  - Self-correction through document grading
  - Iterative improvement through query rewriting
- Reports: Automated content generation needs targeted context retrieval
  - Section-specific semantic search for accurate context
  - Context accumulation for report consistency
  - Scheduled generation with fallback mechanisms
- Unified Data Platform: Store structured metadata, unstructured documents, and vector embeddings in a single database
- Vector Search: Lightning-fast semantic search across document collections
- Multi-Collection Intelligence: Specialized collections for workflows, assessments, personas, and more
- Real-time Updates: Live workflow tracking and conversation memory persistence
- Scalability: Atlas handles enterprise-scale document processing with automatic scaling
- AWS credentials are mounted as read-only in Docker
- API keys should be stored in environment variables
- MongoDB connection strings should be secured
- "MongoDB URI must be provided" Error
  - Ensure the `backend/.env` file exists with `MONGODB_URI` set
  - Check that Docker Compose is reading the env file correctly
- "VOYAGE_API_KEY is required" Error
  - Add your VoyageAI API key to `backend/.env`
  - Sign up at https://www.voyageai.com if you don't have one
- AWS Authentication Errors
  - Configure AWS SSO: `aws configure sso`
  - Login: `aws sso login --profile your-profile`
  - Set `AWS_PROFILE` in `backend/.env`
- Document Processing Fails
  - Check `MAX_FILE_SIZE_MB` and `MAX_PAGES_PER_DOCUMENT` in `.env`
  - Ensure Docker has enough memory allocated (at least 4GB)
- S3/Google Drive Not Working
  - Import the MongoDB seed configurations (see Database Setup)
  - Check that bucket/folder IDs are correct in your JSON files
- "Invocation of model ID with on-demand throughput isn't supported" Error
  - Use the inference profile ID instead of the direct model ID
  - For Claude 3.5 Sonnet v2, use `us.anthropic.claude-3-5-sonnet-20241022-v2:0`
  - Check available inference profiles: `aws bedrock list-inference-profiles`
- MongoDB Atlas Vector Search Documentation
- VoyageAI: voyage-context-3
- Concept: Multi-agent systems
- LangGraph Agentic RAG Tutorial
- LangChain Supervisor Pattern
- AWS Bedrock Claude 3.5 Sonnet
See LICENSE file for details.





