Scribe is a web-based OCR correction tool. Upload images or point it at a IIIF manifest, run OCR, then fix the results visually in an image-aligned text editor. All data is stored per-user and the API is defined end-to-end in protobuf with Connect RPC.
The application now runs as separate frontend, api, and worker processes.
The frontend serves the web app and proxies backend routes. The API hosts
Connect RPC, annotation, and IIIF HTTP routes. The worker consumes background
transcription jobs.
Scribe is standardizing on IIIF Presentation 3 AnnotationPage JSON as the
canonical persisted OCR correction model, using the IIIF Text Granularity
Extension for page/block/paragraph/line/word/glyph structure:
https://iiif.io/api/extension/text-granularity/
That means:
- IIIF is the canonical saved correction state
- hOCR, PageXML, ALTO, and plain text are export/import formats
- editor-specific UI state is transient and not the canonical storage model
- revision metadata such as `updated_by`, `updated_at`, and `revision` is stored adjacent to the canonical IIIF payload
- the API exposes annotation and text-editing actions that editor plugins can call directly rather than reimplementing split/join/transcription logic in the browser
- the same project also ships a standalone web app for item ingestion, management, OCR generation, and QA editing
The editor is designed as a custom text-first OCR correction workflow built on top of canonical IIIF annotation state.
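For orientation, a line-level AnnotationPage in that canonical model might look roughly like the sketch below. The IDs and coordinate values are placeholders, and the exact field set Scribe persists may differ; only the Presentation 3 / Text Granularity shape is the point:

```json
{
  "@context": [
    "https://iiif.io/api/extension/text-granularity/context.json",
    "http://iiif.io/api/presentation/3/context.json"
  ],
  "id": "https://example.org/items/123/pages/1/annotations",
  "type": "AnnotationPage",
  "items": [
    {
      "id": "https://example.org/items/123/pages/1/annotations/line-1",
      "type": "Annotation",
      "motivation": "supplementing",
      "textGranularity": "line",
      "body": {
        "type": "TextualBody",
        "value": "Corrected line text",
        "format": "text/plain"
      },
      "target": "https://example.org/items/123/pages/1/canvas#xywh=120,80,900,42"
    }
  ]
}
```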
```shell
cp sample.env .env
cp docker-compose.override-example.yaml docker-compose.override.yaml
bash generate-secrets.sh
docker compose up --build
```

The local override starts the standalone segmentor and image-service
containers. The main api and worker image is now the lean remote-service
build by default and expects those helper services to exist for OCR and image
manipulation work.
The main API/worker binaries build without CGO or local OCR libraries. Local
Tesseract/Leptonica support is isolated to the standalone segmentor image,
which opts in with the localocr build tag; API/worker production builds use
the remoteocr tag and call that service over HTTP.
| Service | URL |
|---|---|
| Web app | http://localhost |
| API + Annotation API | http://localhost:8080 |
| IIIF image API path | http://localhost:8081/iiif/3 |
| Worker health | docker compose logs worker or http://worker:8080/healthz inside the compose network |
The landing page offers four ways to create an item:
| Tab | What happens |
|---|---|
| Image URL | OCR runs immediately; editor opens automatically |
| Single upload | Upload one image; OCR runs; editor opens automatically |
| Multi-upload | Upload several images into one item; appears in the table for editing |
| IIIF Manifest | Fetches all canvases from the manifest; appears in the table |
After OCR, click Edit on any item to open the page editor where you can correct line and word text against the image.
```text
cmd/api/           HTTP API process
cmd/worker/        Background transcription worker process
web/server.mjs     Frontend runtime that serves the SPA and proxies backend paths
internal/
  server/          Connect handlers, canonical AnnotationPage routes, crosswalk routes
  store/           MariaDB access via sqlc
proto/             Protobuf definitions (Buf managed)
web/src/
  main.ts          Router (~10 LOC)
  api/             Connect client wrappers (items, processing, transport)
  pages/           home.ts (landing page), editor.ts (editor shell)
  lib/             Pure utilities
mirador-scribe/
  src/             Repo-owned Mirador v4 OCR editor plugin + annotation adapter
sqlc/              SQL queries + generated Go code
```
Canonical data model:
- Persist one IIIF Presentation 3 `AnnotationPage` per page/canvas
- Use IIIF Text Granularity Extension semantics for line/word/glyph annotations
- Preserve the finest source granularity available during import, such as word boxes from hOCR `ocrx_word`
- Store revision and workflow metadata adjacent to the canonical annotation JSON
- Export repository-facing formats such as hOCR/PageXML/ALTO from that canonical state
API/editor contract:
- the backend is the canonical source for annotation mutations such as line splitting, line joining, word splitting, word joining, and retranscription
- editor plugins should call those API operations and then reload or reconcile the returned IIIF annotations
- this keeps plugin implementations thinner and makes the same API usable from Mirador or other IIIF-capable editors
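As an illustration of that contract, a plugin can invoke a structural edit over Connect's unary JSON protocol instead of splitting locally. The endpoint path matches the AnnotationService routes exposed by the backend, but the request field names (`annotationId`, `charOffset`) are illustrative assumptions, not the actual proto schema:

```typescript
// Build a Connect-style JSON request for a line-split edit. The backend
// applies the split and returns updated IIIF annotations for the editor to
// reload or reconcile.
// NOTE: request/response field names here are assumptions for illustration.
function buildSplitLineRequest(
  baseUrl: string,
  annotationId: string,
  charOffset: number,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `${baseUrl}/scribe.v1.AnnotationService/SplitAnnotationIntoTwoLines`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ annotationId, charOffset }),
    },
  };
}

// Usage in a plugin (sketch):
//   const { url, init } = buildSplitLineRequest(base, id, offset);
//   const res = await fetch(url, init);
//   then reconcile the returned annotations into editor state.
```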
```mermaid
flowchart TD
    browser([Client / Browser])
    subgraph edge[Edge]
        frontend[Cloud Run Frontend<br/>static UI + proxy]
    end
    subgraph vm[Backend VM]
        ingress[VM host :80]
        api[API]
        worker[Worker]
        db[(MariaDB)]
        files[(Uploads / Cache)]
        ingress --> api
        api --> db
        worker --> db
        api --> files
        worker --> files
    end
    subgraph shared[Shared / Private Services]
        vault[Vault]
        cantaloupe[Shared Cantaloupe]
        imageSvc[Image Service<br/>normalize / crop / stitch]
        genericSeg[Generic Segmentor<br/>auto / scribe / tesseract]
        krakenSeg[Kraken Segmentors<br/>one Cloud Run service per segmentation model]
        ollama[Ollama OCR Services<br/>one Cloud Run service per model]
        krakenOCR[Kraken OCR Services<br/>one Cloud Run service per transcription model]
    end
    subgraph external[External Providers]
        openai[OpenAI]
        gemini[Gemini]
    end
    context{Resolved context<br/>segmentation_model<br/>transcription_provider<br/>transcription_model<br/>optional endpoint override}
    browser -->|loads app + API calls| frontend
    frontend -->|proxy to backend origin| ingress
    browser -->|IIIF/image requests| frontend
    frontend -->|/iiif via image-service| imageSvc
    imageSvc -->|reads shared uploads| api
    api -->|startup secret reads| vault
    worker -->|startup secret reads| vault
    api -->|create item / save edits / enqueue jobs| db
    worker -->|read jobs / persist OCR results| db
    api -->|store source images / hOCR cache| files
    worker -->|read source images / write outputs| files
    api -->|normalize / crop / stitch when needed| imageSvc
    worker -->|normalize / crop / stitch when needed| imageSvc
    api -.->|resolve request context| context
    worker -.->|resolve job context| context
    context -->|segmentation_model = auto / scribe / tesseract| genericSeg
    context -->|segmentation_model = kraken or kraken:*| krakenSeg
    context -->|transcription_provider = ollama| ollama
    context -->|transcription_provider = kraken| krakenOCR
    context -->|transcription_provider = openai| openai
    context -->|transcription_provider = gemini| gemini
    context -.->|explicit context URL/audience overrides win| ollama
    context -.->|explicit context URL/audience overrides win| krakenOCR
```
```shell
# Backend
make lint
make test
make build

# Regenerate proto stubs and SQL
make proto
make sqlc
make generate

# Frontend (from web/)
npm install
npm run build
npm run serve

# Frontend container
make build-frontend
```

SQL query definitions live under sqlc/queries. The checked-in internal/db
package mirrors those queries today and should be refreshed from them with
make sqlc when the tool is available in the development environment.
Non-secret runtime settings live in config.yaml. The
container bakes in embedded defaults and then reads /etc/scribe/config.yaml
at startup when it is mounted by Docker Compose or Terraform-managed runtime
deployment. Selected string values in that YAML support ${VAR} and
${VAR:-default} interpolation from the container environment.
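A minimal sketch of that interpolation behavior, assuming shell-like semantics where an unset variable falls back to the default (or the empty string when no default is given); the real parser in Scribe may differ:

```typescript
// Expand ${VAR} and ${VAR:-default} references against an environment map.
function interpolate(value: string, env: Record<string, string>): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}/g,
    (_match, name: string, fallback: string | undefined) =>
      env[name] ?? fallback ?? "",
  );
}
```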
Secrets do not live in .env or config.yaml. They are loaded from Vault on
startup using the paths configured under vault.paths in config.yaml. The
Vault address itself is non-secret and also lives in config.yaml as
vault.address.
On deployed GCP VMs, Scribe authenticates to Vault with the GCP auth method. It first tries a mounted service-account credential file and then falls back to the VM metadata service if that file is not available. A static Vault token is only needed as an optional local-development fallback.
For local Docker Compose only, sample.env defines
SCRIBE_API_IMAGE, SCRIBE_FRONTEND_IMAGE, and the published host ports used
by the local override stack. Copy
docker-compose.override-example.yaml
to docker-compose.override.yaml to run the local frontend on :80 while the
API stays on :8080.
The backend containers now expect a Docker Compose secret file at
./secrets/GOOGLE_APPLICATION_CREDENTIALS, mounted in-container at
/run/secrets/GOOGLE_APPLICATION_CREDENTIALS. That file is provisioned
externally in deployed environments. For local/CI compose runs,
generate-secrets.sh creates a {} placeholder when the file is missing so
the secret mount exists without fabricating real credentials.
When VAULT_ADDRESS is configured, generate-secrets.sh also rewrites
./secrets/mariadb_password and ./secrets/mariadb_root_password from the
Vault secret/scribe/database secret before Docker Compose starts MariaDB, so
the database container and the app use the same credentials source. It does
that through the init-only vault-init Compose service, which signs into Vault
from ./secrets/GOOGLE_APPLICATION_CREDENTIALS inside Docker rather than
calling the metadata server, so it still works when Docker traffic to metadata
is blocked.
Use make vault-secrets to list, read, or update the required app secrets in
Vault. The helper prompts for dev vs prod, uses your current
gcloud auth print-access-token for the proxy's X-Admin-Token, and then
logs into Vault through the google-jwt admin role for your active gcloud
account unless you explicitly override VAULT_TOKEN.
Terraform now treats Vault as two long-lived servers: shared dev and prod.
Preview environments and local dev point at the shared dev Vault. Each
deployment still gets its own Vault GCP auth role, so preview service accounts
do not need to share a single global role binding.
For local Terraform applies, the local deploy helper may need Artifact Registry
push credentials so it can publish missing frontend/OCR GAR images before
Terraform runs. Before running make tf-dev, make tf-preview, or
make tf-prod locally, configure Docker for us-docker.pkg.dev:
```shell
gcloud auth login
gcloud config set project <your-gcp-project-id>
gcloud auth configure-docker us-docker.pkg.dev
```

This repo currently reads Docker auth from ~/.docker/config.json. Your user
also needs write access to projects/<project>/locations/us/repositories/internal.
Global runtime values such as PUBLIC_BASE_URL, VAULT_ADDRESS,
VAULT_GCP_AUTH_ROLE, OLLAMA_URL,
OLLAMA_AUDIENCE, OLLAMA_MODEL_ENDPOINTS_JSON, SEGMENTATION_SERVICE_URL, IMAGE_SERVICE_URL,
SEGMENTATION_MODEL_ENDPOINTS_JSON, KRAKEN_URL, KRAKEN_AUDIENCE,
KRAKEN_MODEL, KRAKEN_MODEL_ENDPOINTS_JSON, and VAULT_TOKEN are now
intended to be injected as container env vars and resolved by config.yaml
interpolation or startup parsing rather than by rewriting the mounted file on
disk.
Contexts can optionally override the global Ollama URL and audience, which is
the recommended setup when each model is deployed as its own cached Cloud Run
service. When the selected Ollama URL points at a private Cloud Run service,
Scribe automatically sends an ID token if the host is a *.run.app service
URL. Set llm.ollama.audience or the context-specific audience only when the
Cloud Run service uses a custom audience. When no explicit context override is
set, Ollama model routing now falls back to OLLAMA_MODEL_ENDPOINTS_JSON
keyed by transcription_model.
Kraken now follows the same one-service-per-model topology. Segmentation routes
by the context segmentation_model through SEGMENTATION_MODEL_ENDPOINTS_JSON,
and Kraken transcription routes by transcription_model through
KRAKEN_MODEL_ENDPOINTS_JSON. Contexts can still override the Kraken
transcription URL and audience directly when needed.
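The routing described above can be sketched as a small resolver, under the assumed precedence: an explicit context override wins, then the `*_MODEL_ENDPOINTS_JSON` map keyed by model, then the global service URL. Function and parameter names are illustrative:

```typescript
// Resolve the service URL for a given model, mirroring the described
// precedence: context override > per-model endpoints JSON > global URL.
function resolveEndpoint(
  contextOverride: string | undefined, // explicit context URL override, if any
  modelEndpointsJson: string,          // e.g. value of KRAKEN_MODEL_ENDPOINTS_JSON
  model: string,                       // e.g. the context's transcription_model
  globalUrl: string,                   // e.g. KRAKEN_URL
): string {
  if (contextOverride) return contextOverride;
  const endpoints: Record<string, string> = JSON.parse(modelEndpointsJson || "{}");
  return endpoints[model] ?? globalUrl;
}
```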
```text
GET    /v1/item-images/{id}/manifest       IIIF Presentation v3 manifest
GET    /v1/item-images/{id}/hocr           Current persisted hOCR document
GET    /v1/item-images/{id}/annotations    IIIF annotation page bootstrap/export
GET    /v1/events                          Server-sent event stream for job + annotation lifecycle events
GET    /auth/me                            Current auth/session state
GET    /auth/google                        Google OAuth login
GET    /auth/callback/google               Google OAuth callback
GET    /auth/api-keys                      List API keys for the active workspace
POST   /auth/api-keys                      Create a workspace-scoped API key
DELETE /auth/api-keys/{key_id}             Revoke a workspace-scoped API key
GET    /auth/provider-secrets              List Vault-backed provider secrets visible in the active workspace
POST   /auth/provider-secrets              Create a workspace- or user-scoped provider secret
DELETE /auth/provider-secrets/{secret_id}  Delete a provider secret
```
The application API is proto-first. New API operations should be defined in protobuf and consumed through generated Connect clients.
Annotation and OCR operations are exposed on these Connect services:
```text
POST /scribe.v1.ItemService/*
POST /scribe.v1.ImageProcessingService/*
POST /scribe.v1.ContextService/*
POST /scribe.v1.AnnotationService/*
```
Plain HTTP routes should exist only when there is a concrete resource-URL
reason not to use RPC. The GET /v1/item-images/{id}/manifest,
GET /v1/item-images/{id}/annotations, and GET /v1/item-images/{id}/hocr
routes are examples of that exception: they expose dereferenceable IIIF/OCR
documents that external viewers and IIIF clients fetch directly.
Google OAuth is the only interactive login path. The shipped runtime does not support anonymous mode, local username/password auth, or an auth toggle.
Every authenticated request runs inside a workspace. Browser sessions default
to the caller's personal workspace, and can target another workspace membership
with X-Scribe-Workspace-ID. API keys are pinned to exactly one workspace and
ignore any workspace override header.
API keys are intended for scoped frontend and integration use cases such as a
Drupal-hosted Mirador plugin calling Scribe to create items, edit annotations,
or save OCR output. They are created by workspace admins via /auth/api-keys
and may be limited by both a workspace role (admin, write, create,
read) and optional scopes such as items:*, annotations:write, or
transcription:read.
Provider API keys are stored separately from the relational DB. Scribe stores only provider-secret metadata in MariaDB and writes the actual secret material to Vault. The contexts page can create personal or workspace-scoped provider secrets. Runtime resolution prefers a personal secret over a workspace secret for the same provider. Gemini is wired end-to-end today for enrichment, background transcription, and other context-driven OCR paths.
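The stated resolution rule, personal secret over workspace secret for the same provider, can be sketched as a pure resolver. The secret-metadata shape here is an assumption; only metadata is modeled because the secret material itself lives in Vault:

```typescript
// Provider-secret metadata as stored in MariaDB (shape is illustrative).
interface ProviderSecret {
  provider: string;               // e.g. "gemini"
  scope: "personal" | "workspace";
  vaultPath: string;              // pointer to the material in Vault
}

// Pick the secret to use for a provider: personal wins over workspace.
function pickProviderSecret(
  secrets: ProviderSecret[],
  provider: string,
): ProviderSecret | undefined {
  const matching = secrets.filter((s) => s.provider === provider);
  return matching.find((s) => s.scope === "personal") ?? matching[0];
}
```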
Connect and HTTP clients may authenticate with either:
```text
Authorization: Bearer <api-key>
X-Scribe-API-Key: <api-key>
```
For browser or plugin code in this repo, web/src/api/transport.ts
exports createScribeTransport(...), which can attach API key and workspace
headers to a Connect transport.
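A minimal sketch of the header conventions above, along the lines of what a transport helper such as createScribeTransport(...) would attach; the option names here are assumptions:

```typescript
// Build the auth/workspace headers described above for a Scribe request.
// Note: the server ignores the workspace header for API-key auth, since
// API keys are pinned to exactly one workspace.
function scribeAuthHeaders(opts: {
  apiKey?: string;
  workspaceId?: string;
}): Record<string, string> {
  const headers: Record<string, string> = {};
  if (opts.apiKey) headers["X-Scribe-API-Key"] = opts.apiKey;
  if (opts.workspaceId) headers["X-Scribe-Workspace-ID"] = opts.workspaceId;
  return headers;
}
```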
Scribe emits a small CloudEvents-style event set from the backend. Clients can
consume those events either through GET /v1/events over SSE or by setting
webhooks.urls in config.yaml to fan out each event
as application/cloudevents+json.
Current event types:
- dev.scribe.transcription.task.started
- dev.scribe.transcription.task.completed
- dev.scribe.transcription.completed
- dev.scribe.transcription.failed
- dev.scribe.annotations.created
- dev.scribe.annotations.published
Use transcription.task.completed to drive per-line progress in the UI. Use
annotations.created and annotations.published for external integrations such
as Islandora. Save does not publish: annotations.published is emitted only
after the explicit POST /scribe.v1.AnnotationService/PublishItemImageEdits
action.
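As a sketch of driving per-line progress from that stream, a client can count task-completion events for a page. The event payload fields here (itemImageId in particular) are assumptions about the CloudEvents data envelope:

```typescript
// Minimal event shape for this sketch; real payloads may carry more fields.
interface ScribeEvent {
  type: string;
  data: { itemImageId: string };
}

// Count completed transcription tasks for one page, to drive a progress bar.
function countCompletedTasks(events: ScribeEvent[], itemImageId: string): number {
  return events.filter(
    (e) =>
      e.type === "dev.scribe.transcription.task.completed" &&
      e.data.itemImageId === itemImageId,
  ).length;
}

// In the browser, events would arrive over SSE, roughly:
//   new EventSource("/v1/events").onmessage = (m) => handle(JSON.parse(m.data));
```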
The deployment/auth refactor plan lives in docs/infra-auth-plan.md. The current deployment shape is:
- separate frontend, api, and worker deployments
- backend Go image stays on the VM
- frontend image is deployed as the optional frontend Cloud Run sidecar next to ppb and proxies backend paths back to the VM
- shared production Cantaloupe managed from this repo's Terraform
- shared private Ollama model services managed from this repo's Terraform
- a shared production HTTPS load balancer for the app and Cantaloupe
- a self-hosted Vault deployment managed from this repo's Terraform
- Google OAuth plus Connect interceptor-based authorization
- Vault-backed storage for user-supplied provider keys
- session hOCR state persisted in the database instead of local disk
Editor-oriented annotation operations are exposed on AnnotationService so
plugins can delegate structural OCR edits to the backend:
```text
POST /scribe.v1.AnnotationService/SplitAnnotationIntoWords
POST /scribe.v1.AnnotationService/SplitAnnotationIntoTwoLines
POST /scribe.v1.AnnotationService/MergeAnnotationsIntoLine
POST /scribe.v1.AnnotationService/MergeWordsIntoLineAnnotation
POST /scribe.v1.AnnotationService/TranscribeAnnotation
POST /scribe.v1.AnnotationService/TranscribeAnnotationPage
```
Contexts bundle the OCR/transcription settings used to process or enrich an image. A context can include:
- a segmentation model
- a transcription provider/model
- additional context-selection metadata used to infer the best context from the supplied image or related metadata
Scribe seeds these system contexts on startup:
- Default: runs both `tesseract` segmentation and the in-repo `scribe` custom segmentor, then keeps whichever finds more words.
- Tesseract OCR: uses Tesseract segmentation and Tesseract transcription directly.
- Scribe Custom: uses the custom segmentor, crops by detected line, sends each line to the configured LLM provider, and assembles the result back into line-level OCR.
- Kraken BLLA: uses Kraken page segmentation with its default BLLA model, then crops by detected line and sends each line to the configured LLM provider.
Set a context segmentation model to kraken to use the default Kraken BLLA
segmenter, or kraken:<model-id> to pin a specific Kraken model.
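Parsing that setting can be sketched as below; the default model identifier returned for the bare `kraken` value is an assumption, not necessarily the string the backend uses internally:

```typescript
// Resolve a context's segmentation_model string to a Kraken model id.
// "kraken" selects the default BLLA model; "kraken:<model-id>" pins one.
function parseKrakenModel(segmentationModel: string): string | null {
  if (segmentationModel === "kraken") return "blla"; // assumed default id
  if (segmentationModel.startsWith("kraken:")) {
    return segmentationModel.slice("kraken:".length);
  }
  return null; // not a Kraken segmentation model
}
```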
When config.yaml sets llm.provider: ollama and llm.ollama.model is left
at its default, Scribe uses glm-ocr:bf16.
When llm.ollama.url targets a private Cloud Run Ollama deployment, the shared
htr provider client now attaches a Google identity token automatically. The
default audience is the Cloud Run service URL; llm.ollama.audience exists for
the rare case where you configured a custom audience explicitly.
For images uploaded or supplied without existing hOCR, the default system flow is:
- Run the Tesseract segmentor and the Scribe custom segmentor.
- Compare the number of detected words.
- Use the winning segmentation path for OCR generation.
- If Tesseract wins, keep Tesseract's text directly.
- If the Scribe segmentor wins, run the line-crop LLM transcription path.
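The comparison step above can be sketched as a small selector; the result shape and the tie-breaking rule are assumptions:

```typescript
// Minimal segmentation result for this sketch.
interface Segmentation {
  engine: "tesseract" | "scribe";
  wordCount: number;
}

// Pick the segmentation with more detected words. The tie-break toward
// Tesseract is an assumption: its text can be kept directly, with no extra
// line-crop LLM calls.
function pickSegmentation(tess: Segmentation, scribe: Segmentation): Segmentation {
  return scribe.wordCount > tess.wordCount ? scribe : tess;
}
```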
The backend exposes context resolution so ingestion and editor operations can choose a context explicitly or let the server pick one when enough information can be inferred. OCR runs are stored with the resolved context so context-level metrics aggregate against the context that was actually used.
Scribe records edit metrics to evaluate context quality. The primary metric today is document-level Levenshtein distance between:
- the plain-text document produced by the app originally
- the plain-text document represented by the user-corrected final result
This gives a simple measure of how much correction a context required. Segmentation quality metrics are planned but still TBD.
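A minimal sketch of that metric, character-level Levenshtein distance between the two plain-text documents:

```typescript
// Levenshtein edit distance using a single rolling row: O(len(b)) memory.
function levenshtein(a: string, b: string): number {
  // prev[j] holds the distance for the previous row of the DP table.
  const prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // value at (i-1, j-1)
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j]; // value at (i-1, j), becomes next diag
      prev[j] = Math.min(
        prev[j] + 1,                                  // deletion
        prev[j - 1] + 1,                              // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1),       // substitution/match
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}
```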
Scribe supports two primary workflows:
- Low/no-touch OCR generation
  - ingest images or manifests
  - generate canonical IIIF annotation pages
  - export hOCR/PageXML/ALTO/plain text
  - optionally publish results back to a parent repository system
- Human QA correction
  - load canonical IIIF annotation pages in the editor
  - edit text and geometry with a text-first workflow
  - save new revisions
  - export or publish corrected results
Word-level OCR accuracy is not a default workflow yet. The current automatic transcription path crops by line and sends each line image to an LLM, then writes the result back as line text. Supporting first-class word-by-word OCR would require substantially more model calls and image crops, which would raise costs materially. For now, Scribe creates line annotations by default and lets the editor split a line into words when finer correction work is actually needed.
