Stem Agent

A self-specializing AI agent that evolves from an undifferentiated core into a task-specific specialist through guided differentiation.

Full writeup: docs/writeup.pdf (4 pages).

demo.mp4

The clip above replays a recorded run from docs/example_run/journal.json (no live API calls) and ends on the headline result: F1 rises from 0.000 at baseline to 0.778 after specialization on the 20-sample benchmark. The clip runs 15 seconds and can be paused.

Quick Start

Prerequisites

Python 3.11+
An OpenAI API key

Setup

# Clone the repository
git clone https://github.com/qflen/stem-agent.git
cd stem-agent

# Create virtual environment and install
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Or with uv (faster)
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Configuration

Set your OpenAI API key:

export OPENAI_API_KEY="your-key-here"

# Or create a .env file
echo 'OPENAI_API_KEY=your-key-here' > .env

Usage

# Run the full differentiation process
stem-agent differentiate --domain code_quality_analysis

# Review a Python file with the specialized agent
stem-agent review path/to/file.py

# View evaluation results
stem-agent evaluate

# Pretty-print the evolution journal
stem-agent journal --last

Development

# Run tests
make test

# Run linter
make lint

# Format code
make format

# Run full evaluation
make eval

Architecture

The stem agent follows a biological differentiation metaphor:

UNDIFFERENTIATED → SENSING → DIFFERENTIATING → VALIDATING → SPECIALIZED → EXECUTING
                                    ▲              │
                                    │              │
                                    └── ROLLBACK ──┘
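
One way to read the diagram is as a small state machine in which a failed validation rolls back to DIFFERENTIATING. The sketch below is illustrative only: the state names come from the diagram, but `State`, `ALLOWED`, and `advance` are hypothetical names, not the code in src/stem_agent/core.

```python
from enum import Enum, auto


class State(Enum):
    UNDIFFERENTIATED = auto()
    SENSING = auto()
    DIFFERENTIATING = auto()
    VALIDATING = auto()
    SPECIALIZED = auto()
    EXECUTING = auto()


# Allowed moves as read off the diagram (an interpretation, not the project's table).
# VALIDATING -> DIFFERENTIATING is the ROLLBACK edge.
ALLOWED: dict[State, frozenset[State]] = {
    State.UNDIFFERENTIATED: frozenset({State.SENSING}),
    State.SENSING: frozenset({State.DIFFERENTIATING}),
    State.DIFFERENTIATING: frozenset({State.VALIDATING}),
    State.VALIDATING: frozenset({State.SPECIALIZED, State.DIFFERENTIATING}),
    State.SPECIALIZED: frozenset({State.EXECUTING}),
    State.EXECUTING: frozenset(),
}


def advance(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


if __name__ == "__main__":
    state = State.UNDIFFERENTIATED
    path = (
        State.SENSING,
        State.DIFFERENTIATING,
        State.VALIDATING,
        State.DIFFERENTIATING,  # rollback after a failed validation
        State.VALIDATING,
        State.SPECIALIZED,
        State.EXECUTING,
    )
    for nxt in path:
        state = advance(state, nxt)
    print(state.name)
```

Keeping the transition table as data makes illegal jumps, such as SENSING straight to EXECUTING, fail loudly instead of silently skipping validation.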

Phases

Sensing: Queries an LLM to build structured domain knowledge
Planning: Selects capabilities and designs a multi-pass review architecture
Specialization: Assembles the specialized agent from prompt fragments and tools
Validation: Benchmarks against a ground-truth corpus with regression gates
Execution: The specialized agent reviews code
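
The regression gate in the Validation phase could look roughly like the sketch below. It assumes that a gate means "no tracked metric may fall below its baseline value"; the README does not spell out the actual rule, and `passes_gate` and its `tolerance` parameter are hypothetical.

```python
def passes_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Return True if no baseline metric regresses by more than tolerance.

    Assumption: a metric missing from the candidate counts as 0.0,
    so dropping a metric is treated as a regression.
    """
    return all(
        candidate.get(name, 0.0) >= value - tolerance
        for name, value in baseline.items()
    )


if __name__ == "__main__":
    # F1 values taken from the headline numbers; the gate rule itself is assumed.
    print(passes_gate({"f1": 0.000}, {"f1": 0.778}))
    print(passes_gate({"f1": 0.778}, {"f1": 0.500}))
```

Under this reading, a candidate that fails the gate would trigger the rollback path rather than being promoted to the specialized agent.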

Project Structure

src/stem_agent/
├── core/           # Agent, state machine, journal, config
├── phases/         # Sensing, planning, specialization, validation
├── capabilities/   # Registry, tools, prompt library
├── evaluation/     # Metrics, benchmark, comparator, fixtures
├── ports/          # LLM and storage protocols
└── adapters/       # OpenAI and JSON file implementations
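
The ports/ and adapters/ split suggests that the phases depend on an interface and not on the OpenAI client directly. A minimal sketch of that idea with `typing.Protocol` follows; `LLMPort`, `CannedLLM`, `complete`, and `sense_domain` are hypothetical names chosen for illustration and may not match the project's code.

```python
from typing import Protocol


class LLMPort(Protocol):
    """Hypothetical port: anything that can turn a prompt into text."""

    def complete(self, prompt: str) -> str: ...


class CannedLLM:
    """Hypothetical offline adapter that returns pre-recorded replies."""

    def __init__(self, replies: dict[str, str]) -> None:
        self._replies = dict(replies)

    def complete(self, prompt: str) -> str:
        # Unknown prompts yield an empty string instead of a network call.
        return self._replies.get(prompt, "")


def sense_domain(llm: LLMPort, domain: str) -> str:
    """Ask the injected LLM about a domain; the prompt text is made up."""
    return llm.complete(f"Describe the domain: {domain}")


if __name__ == "__main__":
    fake = CannedLLM(
        {"Describe the domain: code_quality_analysis": "bugs, smells, security"}
    )
    print(sense_domain(fake, "code_quality_analysis"))
```

Because `CannedLLM` satisfies the protocol structurally, code written against the port can run without an API key.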

Evaluation

The benchmark corpus contains 20 Python code samples with ground-truth labels:

5 logic bugs (off-by-one, wrong operators, missing null checks)
4 security vulnerabilities (SQL injection, path traversal, hardcoded credentials)
4 code smells (deep nesting, god functions, dead code)
2 performance issues (N+1 queries, unnecessary copies)
5 clean code samples (adversarial true negatives that look suspicious but are correct)

The evaluation reports precision, recall, F1, and specificity, each measured before and after specialization on the same corpus.
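
For reference, the four metrics follow from confusion counts as in the sketch below. It assumes sample-level scoring (each sample is either flagged or not) and uses made-up counts; `Confusion` and the helper names are hypothetical, and the project's evaluation/ module may score differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # samples with an issue that were flagged
    fp: int  # clean samples that were flagged
    fn: int  # samples with an issue that were missed
    tn: int  # clean samples that were passed


def _ratio(num: float, den: float) -> float:
    # Define 0/0 as 0.0 so an agent with no true positives gets F1 = 0.
    return num / den if den else 0.0


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: Confusion) -> float:
    return _ratio(c.tn, c.tn + c.fp)


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


if __name__ == "__main__":
    # Illustrative counts only: 15 issue samples and 5 clean samples.
    c = Confusion(tp=12, fp=1, fn=3, tn=4)
    print(f"precision={precision(c):.3f} recall={recall(c):.3f}")
    print(f"f1={f1(c):.3f} specificity={specificity(c):.3f}")
```

Specificity is the metric that the clean samples exercise: it drops whenever a correct but suspicious-looking sample is flagged.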

The test suite has 142 deterministic tests and needs no network access.
