feat(iccu): first-class Internet Culturale provider #159

Merged
nikazzio merged 6 commits into main from feat/iccu-internetculturale-provider
Apr 19, 2026

@nikazzio nikazzio commented Apr 18, 2026

Summary

  • Onboard Internet Culturale (ICCU) as a first-class Discovery provider instead of a minimal resolver stub.
  • Fetch MAG/XML upstream, convert to IIIF v2 on the fly, and wire manifest/preview/download/viewer/Studio flows through it.
  • Native pagination (pag=N, 20/page), total-result extraction, per-library network policy, dedicated Network & Libraries settings card, and full docs coverage.

Why

ICCU is the gateway for ~50 Italian libraries — Laurenziana, Marciana, BNCF/BNCR, Estense, Marucelliana and many smaller partners — and none of them expose their content via native IIIF. Making ICCU usable end-to-end is what unlocks the Italian corpus in Scriptoria.

What changed

Core pipeline

  • resolvers/mag_parser.py — canvas image URLs now come from the real <page src=...> attribute (the /jmms/thumbnail?page=N endpoint silently ignores the page parameter). build_viewer_url points at the canonical /jmms/iccuviewer/iccu.jsp (the older viewresource URL is blank for several teche, including BNCF). New probe_magparser_url for the discovery probe flow.
  • resolvers/manifest_fetch.py (new) — shared fetch_manifest_dict() that dispatches ICCU URLs to fetch_and_convert and leaves native IIIF on HTTPClient.get_json. Used by analyze_manifest, persist_prefetch_light, _quick_manifest_has_native_pdf, Studio remote read, and Library metadata refresh.
  • resolvers/search/internetculturale.py — native pagination via pag=N, parse of "Pagina X di Y (Z risultati trovati)" into raw meta.
  • logic/downloader.py / downloader_runtime.py — PageDownloader._locate_direct_image_url / _fetch_direct handle canvases with a plain resource.@id (no IIIF service). Manifests carrying the _iccu marker enable allow_partial_finalize so teaser records land the pages that actually exist instead of being discarded.
  • providers.py — ICCU helper text, filter metadata, paginability.
  • Studio workspace.py / manifest_helpers.py — Mirador receives the converted manifest through the new proxy endpoint.
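
The direct-image and partial-finalize behaviour described above could look roughly like the sketch below. It is an illustration only, not the PR's code: locate_direct_image_url and should_finalize are hypothetical stand-ins for PageDownloader._locate_direct_image_url and the allow_partial_finalize check, and the canvas layout assumed is a plain IIIF v2 canvas.

```python
from __future__ import annotations

from typing import Any, Optional


def locate_direct_image_url(canvas: dict[str, Any]) -> Optional[str]:
    """Return resource.@id for a IIIF v2 canvas that has no Image API service.

    Hypothetical helper: returns None when a service is present (the normal
    IIIF path should handle it) or when no usable URL is found.
    """
    for image in canvas.get("images", []):
        resource = image.get("resource") or {}
        if resource.get("service"):
            return None
        url = resource.get("@id")
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            return url
    return None


def should_finalize(downloaded: int, declared: int, allow_partial: bool) -> bool:
    """Decide whether a download may be finalized.

    Assumption: a teaser record declares more pages than the server serves,
    so partial finalize accepts any non-empty set of downloaded pages.
    """
    if downloaded == 0:
        return False
    return downloaded == declared or allow_partial
```

With this shape, a teaser record that declares many pages but serves only one would still be finalized when the manifest carries the _iccu marker, and discarded otherwise.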

UI

  • routes/discovery.py + routes/discovery_handlers.py — new endpoint GET /api/iccu/manifest?url=... serves the IIIF JSON with CORS headers so Mirador can mount it. internetculturale added to _PAGINATABLE_STRATEGIES; ic_type propagated through resolve_manifest and load_more_results.
  • components/discovery_results.py — "Mostrati X di Y risultati" header when totals are known; ic_type preserved across Load More; ICCU fallback in _provider_viewer_fallback.
  • components/settings/panes/network.py — dedicated Internet Culturale card matching the existing Gallica/Vaticana/Bodleian/Institut pattern.

Network policy

network_policy.py — new internet_culturale key with a conservative custom policy (2 workers, 1.0–3.0 s delay, 300 s cooldown on 403/429, 40 req per 60 s burst). Aliases cover Internet Culturale, ICCU, internetculturale and the display variant.
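
As a rough illustration of the policy values and alias handling listed above, the sketch below models them with a dataclass. The field names, the alias table, and this normalize_library_key body are assumptions made for the example; only the numeric defaults and the target key internet_culturale come from this PR, and the display-variant alias is left out because its exact text is not given here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryPolicy:
    """Hypothetical container for one per-library network policy."""

    workers: int
    min_delay_s: float
    max_delay_s: float
    cooldown_s: int        # applied after a 403/429 response
    burst_requests: int    # max requests allowed ...
    burst_window_s: int    # ... within this window


# Defaults quoted in the PR description for Internet Culturale.
ICCU_POLICY = LibraryPolicy(
    workers=2,
    min_delay_s=1.0,
    max_delay_s=3.0,
    cooldown_s=300,
    burst_requests=40,
    burst_window_s=60,
)

# Assumed alias table; the real LIBRARY_KEY_ALIASES may be shaped differently.
_ALIASES = {
    "internet culturale": "internet_culturale",
    "iccu": "internet_culturale",
    "internetculturale": "internet_culturale",
}


def normalize_library_key(name: str) -> str:
    """Map a display name or alias to a canonical library key."""
    key = " ".join(name.strip().lower().split())
    return _ALIASES.get(key, key.replace(" ", "_"))


def pick_delay(policy: LibraryPolicy, rng: random.Random) -> float:
    """Pick a randomized delay between requests inside the policy bounds."""
    return rng.uniform(policy.min_delay_s, policy.max_delay_s)
```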

Tests

tests/test_iccu_unit.py + tests/fixtures/iccu_mag_sample.xml — 12 unit tests covering parser, resolver URL/OAI handling, teca inference, probe fallback and direct-image helper.

Docs

  • docs/reference/provider-support.md — ICCU in registry, matrix, dedicated Per-Provider Notes section, filters list.
  • docs/reference/configuration.md — new library key + rationale of defaults.
  • docs/CONFIG_REFERENCE.md — internet_culturale in supported libraries and in the paginable provider list.
  • docs/guides/discovery-and-library.md — dedicated paragraph on ICCU in the pagination section.

Test plan

  • pytest tests/ — full suite green (live tests skipped as configured)
  • ruff check src/ tests/ clean on all touched files (sole remaining C901 is pre-existing in config_manager.py, unrelated)
  • ruff format src/ tests/ clean
  • Live end-to-end: full download of a 73-page BNCF manuscript (all canvases finalized, zero residuals in temp/)
  • Live end-to-end: partial-finalize path on a teaser record (BNCF CFIE004771) lands the single available scan
  • Live end-to-end: search "dante" returns 20 of 2408 results, page 1/2/3 distinct, zero overlap
  • Live: normalize_library_key("ICCU") → "internet_culturale"; per-library policy resolved correctly
  • UI smoke: "Mostrati X di Y risultati" header + Carica altri + Network pane Internet Culturale card

Refs #105 (Internet Culturale portion of the epic)

nikazzio and others added 4 commits April 18, 2026 23:17
Adds full support for the ICCU aggregator (internetculturale.it) which covers
Biblioteca Medicea Laurenziana, Biblioteca Nazionale Marciana, BNCF, BNCR and
~50+ Italian institutions via the MAG/XML API.

Changes:
- resolvers/mag_parser.py: MAG XML → IIIF v2 manifest converter
  - parse_mag_xml(): parse bibinfo, pages, build standard IIIF v2 manifest
  - IccuMetadata: structured metadata (library, city, sbn_code, shelfmark, oai_id, teca)
  - fetch_and_convert(): download + convert magparser endpoint
  - is_iccu_magparser_url(): intercept hook for downloader pipeline
  - Uses defusedxml for safe XML parsing
- resolvers/internetculturale.py: resolver class
  - can_resolve(): matches internetculturale.it URLs and known OAI prefixes
  - get_manifest_url(): extracts OAI ID + teca, returns magparser URL
  - OAI prefix → teca lookup table (Marciana, BML, MagTeca-ICCU)
- resolvers/search/internetculturale.py: HTML scraper for IC manuscript search
  - Targets /it/16/search?instance=magindice&channel__typeTipo=Manoscritto
  - Extracts OAI ID, teca, title, author, date, library, thumbnail per result
- providers.py: register Internet Culturale provider (sort_order=5, first in UI)
- logic/downloader.py: intercept magparser URLs before get_json() call
- resolvers/discovery.py + search/__init__.py: wire search_internetculturale
- tests/test_providers.py: update sort order assertion
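
To make the MAG XML to IIIF v2 step more concrete, here is a minimal sketch of the conversion idea. It is not parse_mag_xml(): the input is a simplified, hypothetical XML shape with <page src="..."> elements, the function name pages_to_manifest is invented, the _iccu marker value is assumed to be True, and the standard-library parser is used for self-containment where the PR uses defusedxml. Canvas width and height, which IIIF v2 expects, are omitted because the sample input carries no dimensions.

```python
import xml.etree.ElementTree as ET
from typing import Any


def pages_to_manifest(xml_text: str, manifest_id: str, label: str) -> dict[str, Any]:
    """Build a bare IIIF Presentation v2 manifest from <page src=...> elements."""
    root = ET.fromstring(xml_text)
    canvases = []
    for index, page in enumerate(root.iter("page"), start=1):
        src = page.get("src")
        if not src:
            continue  # skip pages that declare no image
        canvas_id = f"{manifest_id}/canvas/{index}"
        canvases.append(
            {
                "@id": canvas_id,
                "@type": "sc:Canvas",
                "label": str(index),
                "images": [
                    {
                        "@type": "oa:Annotation",
                        "motivation": "sc:painting",
                        "on": canvas_id,
                        # Plain image URL, no IIIF Image API service.
                        "resource": {"@id": src, "@type": "dctypes:Image"},
                    }
                ],
            }
        )
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": manifest_id,
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
        "_iccu": True,  # assumed marker value
    }
```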

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Default is now 'all' (no filter), which searches all material types
- Added _IC_FILTER with options: all, manuscripts, printed books, music, photographs
- Added _make_ic_adapter in providers.py, which passes ic_type from the payload
- Fixed HTML parser bugs: split on block-item-search-result, title from h2.dc_title, author from h2.dc_creator, OAI ID from span.dc_id

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Turn Internet Culturale (ICCU) into a fully wired Discovery provider rather
than a minimal resolver. The upstream exposes MAG/XML rather than native
IIIF, so the integration fetches the MAG document, converts it to a IIIF v2
manifest on the fly, and drives search, preview, download and viewer flows
through that manifest.

Key points:

- mag_parser: parse real <page src=...> attributes to build per-canvas image
  URLs under /jmms/cacheman/... The /jmms/thumbnail?page=N endpoint silently
  ignores the page parameter and must not be used for pagination.
- manifest_fetch: shared fetch_manifest_dict() that dispatches ICCU URLs to
  fetch_and_convert and leaves native IIIF manifests on HTTPClient.get_json.
  Wired into Discovery preview, add_to_library, add_and_download, PDF
  capability probe and Studio remote read.
- Search: native pagination via pag=N (paginate_pageNum is ignored by the
  server), parse of "Pagina X di Y (Z risultati trovati)" into raw meta,
  surface as "Mostrati X di Y risultati" + Carica altri in the results list.
- Provider registry: internetculturale strategy is paginable, ic_type filter
  survives Load More, helper text explains the aggregator scope.
- Viewer: build_viewer_url now points to the canonical JSP viewer
  /jmms/iccuviewer/iccu.jsp (the older viewresource path renders blank for
  several teche, notably BNCF). External link propagation fixed for preview,
  list cards and library refresh paths.
- Downloader: PageDownloader handles canvases without a IIIF service by
  downloading resource.@id directly; manifests with the _iccu marker enable
  partial finalize so teaser records (more pages declared than served) still
  land the available scans instead of being discarded.
- Internal proxy: new GET /api/iccu/manifest?url=... serves the converted
  manifest as JSON with CORS headers so Mirador can mount it in Studio.
- Network policy: internet_culturale key registered in LIBRARY_KEY_ALIASES
  and LIBRARY_KEYS with a conservative default custom policy (2 workers,
  1.0-3.0s delay, 300s cooldown on 403/429, 40 req per 60s burst).
- Settings UI: dedicated Network & Libraries card for Internet Culturale
  matching the pattern used for Gallica/Vaticana/Bodleian/Institut.
- Tests: unit coverage for MAG parser, resolver URL/OAI handling, teca
  inference, probe fallback and direct-image download helper.
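
The "Pagina X di Y (Z risultati trovati)" extraction mentioned in the Search item can be sketched with a single regular expression. The function name, the returned key names, and the handling of a dot as a thousands separator are assumptions for illustration; the scraper in this PR may structure its raw meta differently.

```python
import re
from typing import Optional

# Matches e.g. "Pagina 1 di 121 (2408 risultati trovati)".
_PAGINATION_RE = re.compile(
    r"Pagina\s+(\d+)\s+di\s+(\d+)\s*\(\s*([\d.]+)\s+risultati\s+trovati\s*\)",
    re.IGNORECASE,
)


def parse_pagination(html_text: str) -> Optional[dict[str, int]]:
    """Extract current page, total pages and total results from result HTML."""
    match = _PAGINATION_RE.search(html_text)
    if match is None:
        return None
    page, total_pages, total_results = match.groups()
    return {
        "page": int(page),
        "total_pages": int(total_pages),
        # Assumption: a dot, if present, is a thousands separator.
        "total_results": int(total_results.replace(".", "")),
    }
```

A caller could then request the next page by incrementing pag while page is below total_pages.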

Refs #105
…ork policy

- reference/provider-support.md: add ICCU to the registry listing, the
  provider matrix row, a dedicated Per-Provider Notes section (MAG/XML
  behavior, pag parameter, teaser records, viewer JSP, Image API level 0,
  Estense separation), and list the ic_type filter.
- reference/configuration.md: mention the internet_culturale library key in
  the settings.network.libraries overview and document the conservative
  default policy.
- CONFIG_REFERENCE.md: add internet_culturale to the supported libraries
  list, mark it as a paginable discovery provider, and note the fixed
  upstream page size.
- guides/discovery-and-library.md: extend the pagination section with a
  paragraph on ICCU (aggregator scope, "Mostrati X di Y", partial records,
  on-the-fly manifest conversion).

Refs #105
@nikazzio nikazzio changed the title from "feat: add Internet Culturale (ICCU) provider" to "feat(iccu): first-class Internet Culturale provider" Apr 19, 2026
@nikazzio nikazzio added area:discovery Discovery search and provider resolution area:providers IIIF provider integrations minor Increments the minor version when adding new functionality in a backward-compatible manner. priority:P1 High priority type:feature New user-facing feature and removed minor Increments the minor version when adding new functionality in a backward-compatible manner. labels Apr 19, 2026
…upstream page size

The previous has_more check compared len(results) against settings.discovery.
max_results_per_provider. Internet Culturale has a fixed upstream page size
of 20 and could never satisfy `len >= max_results` when the user bumped the
setting above 20, so the Load More button silently disappeared.

Introduce _compute_has_more, which prefers the provider-reported
_search_total_pages when available (ICCU fills it during HTML scraping) and
falls back to the legacy heuristic otherwise. This keeps pagination
consistent regardless of the configured result cap.
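
A minimal sketch of the decision this commit describes, assuming the provider-reported page count is available in a metadata dict under _search_total_pages. The signature and the name compute_has_more are illustrative and not necessarily those of _compute_has_more.

```python
from typing import Any, Sequence


def compute_has_more(
    results: Sequence[Any],
    raw_meta: dict[str, Any],
    current_page: int,
    max_results: int,
) -> bool:
    """Decide whether a Load More control should be offered."""
    total_pages = raw_meta.get("_search_total_pages")
    if isinstance(total_pages, int) and total_pages > 0:
        # Provider told us how many pages exist: trust that.
        return current_page < total_pages
    # Legacy heuristic: a full page suggests more results may follow.
    return len(results) >= max_results
```

With a fixed upstream page size of 20 and a configured cap of 50, the legacy heuristic alone returns False on every page, which is the disappearing Load More button described above; the page-count branch avoids that.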

Refs #105
…der list

The ICCU integration is functional but noticeably less reliable than the
native-IIIF providers: upstream records are often teasers with only the
frontispiece actually served, image quality is variable, and there is no
tile/zoom layer server-side. Mark the provider as BETA across the product
surface and stop putting it first in the Discovery select.

- providers.py: label now "Internet Culturale (ICCU) [BETA]", helper text
  and not_found_hint carry BETA disclaimers and a warning about partial
  records, sort_order bumped from 5 to 98 so Vaticana (10) becomes the
  default option again and ICCU sits at the end of the non-generic list.
- Settings Network & Libraries pane: tab button and card title annotated
  with [BETA].
- Docs (provider-support, configuration, CONFIG_REFERENCE, discovery
  guide): ICCU marked as BETA, the registry list order reflects the new
  sort_order, the per-provider note explains the limitations explicitly.
- test_providers: assertion updated to reflect the new sort order.

Refs #105
@nikazzio nikazzio merged commit 5e4bfb5 into main Apr 19, 2026
6 checks passed
@nikazzio nikazzio deleted the feat/iccu-internetculturale-provider branch April 19, 2026 21:10
Labels

area:discovery Discovery search and provider resolution area:providers IIIF provider integrations minor Increments the minor version when adding new functionality in a backward-compatible manner. priority:P1 High priority type:feature New user-facing feature
