feat: batch OCR processing for all manuscript pages #120

@nikazzio

Summary

Add a "Run OCR on all pages" action that processes every page of a manuscript, either sequentially or in parallel batches, and shows a real-time progress bar.

Motivation

Today OCR runs one page at a time via HTMX polling. For a 200-page manuscript this means 200 manual clicks. Batch processing is the single biggest UX improvement for transcription workflows.

Proposed approach

  • Add a POST /studio/{manifest_id}/ocr/batch endpoint that queues all un-OCR'd pages as a single job.
  • Reuse the existing JobManager / export job infrastructure for progress tracking.
  • Stream progress via SSE or HTMX polling (page N/M, estimated time remaining).
  • Allow cancellation mid-batch.
  • Write each page's OCR result to the transcription DB as it completes (incremental save).
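The loop described in the bullets above could look roughly like the sketch below. It is a minimal illustration and does not use the project's JobManager API: `BatchProgress`, `run_ocr_batch`, `ocr_page`, `save_result`, and `on_progress` are all hypothetical names, and the ETA is a simple average-per-page estimate.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class BatchProgress:
    """Snapshot a progress endpoint could serialize (hypothetical shape)."""
    done: int = 0
    total: int = 0
    cancelled: bool = False
    eta_seconds: Optional[float] = None


def run_ocr_batch(
    pages: Iterable[int],
    ocr_page: Callable[[int], str],           # hypothetical: OCR one page, return its text
    save_result: Callable[[int, str], None],  # hypothetical: persist one page's text
    cancel: threading.Event,
    on_progress: Callable[[BatchProgress], None] = lambda p: None,
) -> BatchProgress:
    """Process pages one by one, saving each result as soon as it is ready."""
    pending = list(pages)
    progress = BatchProgress(total=len(pending))
    started = time.monotonic()

    for page in pending:
        # Check for cancellation between pages; completed pages stay saved.
        if cancel.is_set():
            progress.cancelled = True
            on_progress(progress)
            break

        text = ocr_page(page)
        save_result(page, text)  # incremental save: one write per finished page

        progress.done += 1
        elapsed = time.monotonic() - started
        remaining = progress.total - progress.done
        # Naive ETA: average time per finished page times pages left.
        progress.eta_seconds = elapsed / progress.done * remaining
        on_progress(progress)

    return progress
```

In this sketch a cancel request takes effect only between pages, so a page that is already being processed is still saved. Whether that matches the intended cancellation semantics is an open design choice.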

Acceptance criteria

  • "Run OCR on all pages" button in Studio toolbar
  • Progress indicator (pages done / total, ETA)
  • Cancellation support
  • Results saved incrementally — partial OCR is preserved on cancel
  • No regression on single-page OCR flow

Technical notes

  • Rate-limit OCR API calls to respect provider TOS (configurable concurrency).
  • Consider a max_concurrent_ocr setting in config.
  • Existing src/universal_iiif_core/services/export/service.py job pattern is a good template.
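One way the configurable concurrency mentioned above might be enforced is a bounded thread pool, sketched here with the standard library only. The function name `ocr_pages_bounded`, the `ocr_page` callable, and the default of 2 are assumptions for illustration; only the `max_concurrent_ocr` name comes from the notes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def ocr_pages_bounded(
    pages: Iterable[int],
    ocr_page: Callable[[int], str],  # hypothetical provider call for one page
    max_concurrent_ocr: int = 2,     # assumed default; would come from config
) -> Dict[int, str]:
    """Run OCR with at most `max_concurrent_ocr` calls in flight at once."""
    if max_concurrent_ocr < 1:
        raise ValueError("max_concurrent_ocr must be at least 1")

    results: Dict[int, str] = {}
    # The pool size caps how many provider calls run simultaneously.
    with ThreadPoolExecutor(max_workers=max_concurrent_ocr) as pool:
        futures = {pool.submit(ocr_page, page): page for page in pages}
        # Collect results in completion order, which is where an
        # incremental save per page could be placed.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

This bounds the number of simultaneous calls, not the number of requests per second. If a provider's terms specify a request rate, an additional delay or token-bucket limiter would be needed on top of it.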

Metadata

Assignees

No one assigned

Labels

  • area:studio: Studio workspace and tabs
  • minor: Increments the minor version when adding new functionality in a backward-compatible manner.
  • priority:P1: High priority
  • status:ready: Ready to be implemented
  • type:feature: New user-facing feature
