Datassert is a high-performance CLI for building a DuckDB-backed assertion store from Babel export files, with a focus on fast local builds and simple command-driven workflows.

```bash
# Install CLI from GitHub
go install github.com/SkyeAv/datassert@latest

# Verify install
datassert --help
```

```bash
# Build a Datassert database from Babel exports
datassert build --babel-dir /path/to/babel
```

| Flag | Required | Default | Description |
|---|---|---|---|
| `--babel-dir` | Yes | N/A | Directory containing Babel `*Class.ndjson.zst` and `*Synonyms.ndjson.zst` files |
| `--db-dir` | No | `./.datassert` | Output directory for sharded DuckDB databases |
| `--batch-size` | No | `50000` | Number of records per Parquet batch |
| `--buffer-size` | No | `2048` | Channel buffer size for synonym file processing |
| `--class-cpu-fraction` | No | `2` | Divisor of `NumCPU()` for class file goroutines |
| `--synonym-cpu-fraction` | No | `4` | Divisor of `NumCPU()` for synonym file goroutines |
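
To make the `--batch-size` setting concrete, here is a minimal Go sketch of splitting records into fixed-size batches. It is an illustration only: the `chunk` helper is hypothetical and not taken from the datassert source, and how datassert actually groups records before writing Parquet is not documented here.

```go
package main

import "fmt"

// chunk splits items into consecutive batches of at most size elements.
// Hypothetical helper for illustration; not part of datassert.
// Treating a non-positive size as "one batch" is an assumption of this sketch.
func chunk[T any](items []T, size int) [][]T {
	if len(items) == 0 {
		return nil
	}
	if size <= 0 {
		return [][]T{items}
	}
	batches := make([][]T, 0, (len(items)+size-1)/size)
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		batches = append(batches, items[start:end])
	}
	return batches
}

func main() {
	// 120000 placeholder records, batched with the documented default of 50000.
	records := make([]int, 120000)
	batches := chunk(records, 50000)
	fmt.Println("batches:", len(batches))
	fmt.Println("records in last batch:", len(batches[len(batches)-1]))
}
```

Under this model, a larger batch size means fewer batches for the same input, at the cost of holding more records in memory at once.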

- `--babel-dir` is scanned for files matching `*Class.ndjson.zst` and `*Synonyms.ndjson.zst`.
- File matching is non-recursive (top-level of the provided directory).
- Staging Parquet files are written to `./.parquet-store/`.
- 16 sharded DuckDB databases are written to `<db-dir>/datassert-shard{0..15}.duckdb`.
- Each shard contains `SOURCES`, `CATEGORIES`, `CURIES`, and `SYNONYMS` tables, sorted and indexed for query performance.
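
The file discovery and output layout described above can be sketched in Go as follows. This is a hedged illustration, not datassert's actual code: `findInputs` and `shardPaths` are hypothetical names, and using `filepath.Glob` for the non-recursive match is an assumption about one reasonable way to do it.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shardCount mirrors the documented number of DuckDB shards.
const shardCount = 16

// findInputs returns top-level files in dir matching the two documented
// patterns. filepath.Glob's "*" does not cross path separators, so
// subdirectories are not searched. Hypothetical helper, not datassert's API.
func findInputs(dir string) (classes, synonyms []string, err error) {
	classes, err = filepath.Glob(filepath.Join(dir, "*Class.ndjson.zst"))
	if err != nil {
		return nil, nil, err
	}
	synonyms, err = filepath.Glob(filepath.Join(dir, "*Synonyms.ndjson.zst"))
	if err != nil {
		return nil, nil, err
	}
	return classes, synonyms, nil
}

// shardPaths lists the database files expected under dbDir,
// following the documented datassert-shard{0..15}.duckdb naming.
func shardPaths(dbDir string) []string {
	paths := make([]string, 0, shardCount)
	for i := 0; i < shardCount; i++ {
		name := fmt.Sprintf("datassert-shard%d.duckdb", i)
		paths = append(paths, filepath.Join(dbDir, name))
	}
	return paths
}

func main() {
	classes, synonyms, err := findInputs("./babel-exports")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Printf("class files: %d, synonym files: %d\n", len(classes), len(synonyms))
	for _, p := range shardPaths("./.datassert") {
		fmt.Println(p)
	}
}
```

A helper like `shardPaths` can be handy after a build to check that all 16 shard files exist before querying them.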

```bash
# Use defaults for db dir and batch size
datassert build --babel-dir ./babel-exports

# Write databases to a custom directory (produces ./data/mydb/datassert-shard{0..15}.duckdb)
datassert build --babel-dir ./babel-exports --db-dir ./data/mydb

# Tune Parquet batch size and concurrency
datassert build --babel-dir ./babel-exports --batch-size 100000 --class-cpu-fraction 1
```

- Displays progress bars for class, synonym, and DuckDB build phases.
- Uses CPU-based concurrency with configurable fractions (`NumCPU()/class-cpu-fraction` and `NumCPU()/synonym-cpu-fraction`).
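
As a rough illustration of the divisor arithmetic, the Go sketch below derives worker counts from the CPU count. The `workerCount` helper is hypothetical, and clamping the result to at least one worker is an assumption of this sketch; the source only states the `NumCPU()/fraction` formula.

```go
package main

import (
	"fmt"
	"runtime"
)

// workerCount divides the CPU count by the divisor using integer division.
// Clamping the divisor and the result to a minimum of 1 is an assumption
// made for this sketch, not documented datassert behavior.
func workerCount(numCPU, divisor int) int {
	if divisor < 1 {
		divisor = 1
	}
	n := numCPU / divisor
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	cpus := runtime.NumCPU()
	// Documented defaults: --class-cpu-fraction 2, --synonym-cpu-fraction 4.
	fmt.Println("class workers:", workerCount(cpus, 2))
	fmt.Println("synonym workers:", workerCount(cpus, 4))
}
```

Because the flags are divisors, a smaller value means more goroutines: on an 8-core machine, `--class-cpu-fraction 1` would give 8 class workers under this model, while the default of 2 would give 4.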
Skye Lane Goetz