feat(iccu): first-class Internet Culturale provider #159
Merged
Adds full support for the ICCU aggregator (internetculturale.it), which covers Biblioteca Medicea Laurenziana, Biblioteca Nazionale Marciana, BNCF, BNCR and ~50+ Italian institutions via the MAG/XML API.

Changes:
- resolvers/mag_parser.py: MAG XML → IIIF v2 manifest converter
  - parse_mag_xml(): parse bibinfo, pages, build standard IIIF v2 manifest
  - IccuMetadata: structured metadata (library, city, sbn_code, shelfmark, oai_id, teca)
  - fetch_and_convert(): download + convert magparser endpoint
  - is_iccu_magparser_url(): intercept hook for downloader pipeline
  - Uses defusedxml for safe XML parsing
- resolvers/internetculturale.py: resolver class
  - can_resolve(): matches internetculturale.it URLs and known OAI prefixes
  - get_manifest_url(): extracts OAI ID + teca, returns magparser URL
  - OAI prefix → teca lookup table (Marciana, BML, MagTeca-ICCU)
- resolvers/search/internetculturale.py: HTML scraper for IC manuscript search
  - Targets /it/16/search?instance=magindice&channel__typeTipo=Manoscritto
  - Extracts OAI ID, teca, title, author, date, library, thumbnail per result
- providers.py: register Internet Culturale provider (sort_order=5, first in UI)
- logic/downloader.py: intercept magparser URLs before get_json() call
- resolvers/discovery.py + search/__init__.py: wire search_internetculturale
- tests/test_providers.py: update sort order assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
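A minimal sketch of the MAG XML → IIIF v2 conversion described in this commit. Only the `<page src=...>` attribute and the `/jmms/cacheman/` path come from this PR; the surrounding element names (`mag`, `bibinfo`, `title`, `pages`), the `BASE` host prefix and the function name `mag_to_iiif_v2` are assumptions made for illustration, and the real `parse_mag_xml()` may be structured differently. The PR uses defusedxml; the standard-library parser is used here only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET  # the PR uses defusedxml for untrusted input

# Hypothetical, simplified magparser response. Only <page src=...> is taken
# from the PR text; every other element name is an assumption.
SAMPLE_MAG = """<mag>
  <bibinfo><title>Sample codex</title></bibinfo>
  <pages>
    <page src="/jmms/cacheman/sample/0001.jpg"/>
    <page src="/jmms/cacheman/sample/0002.jpg"/>
  </pages>
</mag>"""

BASE = "https://www.internetculturale.it"  # assumed prefix for relative src values


def mag_to_iiif_v2(xml_text: str, manifest_id: str) -> dict:
    """Build a IIIF Presentation v2 manifest dict from a MAG-like document."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//title", default="Untitled")
    canvases = []
    for index, page in enumerate(root.iter("page"), start=1):
        src = page.get("src")
        if not src:
            continue  # skip pages that declare no image
        canvas_id = f"{manifest_id}/canvas/{index}"
        image_url = src if src.startswith("http") else BASE + src
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": str(index),
            # height/width are omitted: this sketch has no dimension data,
            # although a strictly valid v2 canvas requires them.
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                # Plain image URL, no IIIF image service attached.
                "resource": {"@id": image_url, "@type": "dctypes:Image"},
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": manifest_id,
        "@type": "sc:Manifest",
        "label": title,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }
```

Each canvas carries a bare `resource.@id` with no `service` entry, so a consumer of this manifest has to download the image URL directly rather than request tiles.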
- Default is now 'all' (no filter): searches all material types
- Added _IC_FILTER with options: all, manuscripts, printed books, music, photographs
- Added _make_ic_adapter in providers.py, which passes ic_type from the payload
- Fixed HTML parser bugs: split on block-item-search-result, title from h2.dc_title, author from h2.dc_creator, OAI ID from span.dc_id

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
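The selectors named in the parser fix can be illustrated with a small sketch. It uses the standard-library `HTMLParser` instead of the string split the commit describes, and the sample markup, the `ResultParser` class and the `parse_results` helper are all hypothetical; only the class names `block-item-search-result`, `dc_title`, `dc_creator` and `dc_id` come from the commit message.

```python
from html.parser import HTMLParser

# (tag, class) pairs from the commit message mapped to output field names.
FIELDS = {
    ("h2", "dc_title"): "title",
    ("h2", "dc_creator"): "author",
    ("span", "dc_id"): "oai_id",
}

# Hypothetical markup; the real Internet Culturale page is more complex.
SAMPLE_HTML = """
<div class="block-item-search-result">
  <h2 class="dc_title">Codex A</h2>
  <h2 class="dc_creator">Anon.</h2>
  <span class="dc_id">oai:example:1</span>
</div>
<div class="block-item-search-result">
  <h2 class="dc_title">Codex B</h2>
  <span class="dc_id">oai:example:2</span>
</div>
"""


class ResultParser(HTMLParser):
    """Collect one dict per block-item-search-result element."""

    def __init__(self) -> None:
        super().__init__()
        self.results: list[dict] = []
        self._field = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "block-item-search-result" in classes:
            self.results.append({})  # a new result block starts here
            return
        for (wanted_tag, wanted_class), field in FIELDS.items():
            if tag == wanted_tag and wanted_class in classes and self.results:
                self._field = field

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            current = self.results[-1]
            current[self._field] = (current.get(self._field, "") + " " + text).strip()

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self._field = None  # stop capturing at the end of the element


def parse_results(html_text: str) -> list[dict]:
    parser = ResultParser()
    parser.feed(html_text)
    return parser.results
```

Results without an author simply lack the `author` key, which keeps missing fields distinguishable from empty ones.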
Turn Internet Culturale (ICCU) into a fully wired Discovery provider rather than a minimal resolver. The upstream exposes MAG/XML rather than native IIIF, so the integration fetches the MAG document, converts it to a IIIF v2 manifest on the fly, and drives search, preview, download and viewer flows through that manifest.

Key points:
- mag_parser: parse real <page src=...> attributes to build per-canvas image URLs under /jmms/cacheman/... The /jmms/thumbnail?page=N endpoint silently ignores the page parameter and must not be used for pagination.
- manifest_fetch: shared fetch_manifest_dict() that dispatches ICCU URLs to fetch_and_convert and leaves native IIIF manifests on HTTPClient.get_json. Wired into Discovery preview, add_to_library, add_and_download, PDF capability probe and Studio remote read.
- Search: native pagination via pag=N (paginate_pageNum is ignored by the server); the "Pagina X di Y (Z risultati trovati)" header is parsed into raw meta and surfaced in the results list as "Mostrati X di Y risultati" ("Showing X of Y results") plus a "Carica altri" ("Load more") button.
- Provider registry: internetculturale strategy is paginable, ic_type filter survives Load More, helper text explains the aggregator scope.
- Viewer: build_viewer_url now points to the canonical JSP viewer /jmms/iccuviewer/iccu.jsp (the older viewresource path renders blank for several teche, notably BNCF). External link propagation fixed for preview, list cards and library refresh paths.
- Downloader: PageDownloader handles canvases without a IIIF service by downloading resource.@id directly; manifests with the _iccu marker enable partial finalize so teaser records (more pages declared than served) still land the available scans instead of being discarded.
- Internal proxy: new GET /api/iccu/manifest?url=... serves the converted manifest as JSON with CORS headers so Mirador can mount it in Studio.
- Network policy: internet_culturale key registered in LIBRARY_KEY_ALIASES and LIBRARY_KEYS with a conservative default custom policy (2 workers, 1.0-3.0s delay, 300s cooldown on 403/429, 40 req per 60s burst).
- Settings UI: dedicated Network & Libraries card for Internet Culturale matching the pattern used for Gallica/Vaticana/Bodleian/Institut.
- Tests: unit coverage for MAG parser, resolver URL/OAI handling, teca inference, probe fallback and direct-image download helper.

Refs #105
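The downloader behaviour in the key points above can be sketched as two small decisions: pick `resource.@id` when a canvas has no IIIF service, and accept an incomplete download only for ICCU-converted manifests. The function names, the placement of `_iccu` as a top-level manifest key and the "at least one page" rule are assumptions for illustration, not the project's actual `PageDownloader` code.

```python
from typing import Optional


def locate_direct_image_url(canvas: dict) -> Optional[str]:
    """Return resource.@id when the canvas has no IIIF image service.

    Returns None when a service is present (the regular IIIF pipeline
    should handle it) or when no image URL can be found.
    """
    for image in canvas.get("images", []):
        resource = image.get("resource") or {}
        if resource.get("service"):
            return None  # IIIF service available, not a direct download
        url = resource.get("@id")
        if url:
            return url
    return None


def should_finalize(manifest: dict, declared: int, downloaded: int) -> bool:
    """Decide whether a download may be finalized.

    Complete downloads always finalize. Partial downloads finalize only
    when the manifest carries the _iccu marker (assumed here to be a
    top-level key) and at least one page was actually retrieved.
    """
    if downloaded >= declared:
        return True
    return bool(manifest.get("_iccu")) and downloaded > 0
```

With this rule a teaser record that declares ten pages but serves one still lands that one scan, while a native IIIF manifest missing pages is treated as a failed download.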
…ork policy

- reference/provider-support.md: add ICCU to the registry listing, the provider matrix row, a dedicated Per-Provider Notes section (MAG/XML behavior, pag parameter, teaser records, viewer JSP, Image API level 0, Estense separation), and list the ic_type filter.
- reference/configuration.md: mention the internet_culturale library key in the settings.network.libraries overview and document the conservative default policy.
- CONFIG_REFERENCE.md: add internet_culturale to the supported libraries list, mark it as a paginable discovery provider, and note the fixed upstream page size.
- guides/discovery-and-library.md: extend the pagination section with a paragraph on ICCU (aggregator scope, "Mostrati X di Y", partial records, on-the-fly manifest conversion).

Refs #105
…upstream page size

The previous has_more check compared len(results) against settings.discovery.max_results_per_provider. Internet Culturale has a fixed upstream page size of 20 and could never satisfy `len >= max_results` when the user bumped the setting above 20, so the Load More button silently disappeared.

Introduce _compute_has_more, which prefers the provider-reported _search_total_pages when available (ICCU fills it during HTML scraping) and falls back to the legacy heuristic otherwise. This keeps pagination consistent regardless of the configured result cap.

Refs #105
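The fix can be sketched in two steps: read the page totals from the "Pagina X di Y (Z risultati trovati)" header, then prefer them over the length heuristic. The function names and signatures below are illustrative rather than the project's actual `_compute_has_more`, and treating `.` as a thousands separator in the result count is an assumption.

```python
import re
from typing import Optional

# Matches e.g. "Pagina 2 di 7 (134 risultati trovati)".
_HEADER_RE = re.compile(
    r"Pagina\s+(\d+)\s+di\s+(\d+)\s*\(\s*([\d.]+)\s+risultati\s+trovati\s*\)"
)


def parse_results_header(html_text: str) -> Optional[dict]:
    """Extract current page, total pages and total results, if present."""
    match = _HEADER_RE.search(html_text)
    if match is None:
        return None
    page, total_pages, total = match.groups()
    return {
        "page": int(page),
        "total_pages": int(total_pages),
        # Assumption: "." may appear as a thousands separator.
        "total_results": int(total.replace(".", "")),
    }


def compute_has_more(
    result_count: int,
    max_results: int,
    page: Optional[int] = None,
    total_pages: Optional[int] = None,
) -> bool:
    """Prefer provider-reported totals; fall back to the length heuristic."""
    if page is not None and total_pages is not None:
        return page < total_pages
    return result_count >= max_results  # legacy heuristic
```

With a fixed page size of 20 and a cap of 50, `compute_has_more(20, 50)` is false under the legacy heuristic, while `compute_has_more(20, 50, page=2, total_pages=7)` correctly reports that more pages exist.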
…der list

The ICCU integration is functional but noticeably less reliable than the native-IIIF providers: upstream records are often teasers with only the frontispiece actually served, image quality is variable, and there is no tile/zoom layer server-side. Mark the provider as BETA across the product surface and stop putting it first in the Discovery select.

- providers.py: label now "Internet Culturale (ICCU) [BETA]", helper text and not_found_hint carry BETA disclaimers and a warning about partial records, sort_order bumped from 5 to 98 so Vaticana (10) becomes the default option again and ICCU sits at the end of the non-generic list.
- Settings Network & Libraries pane: tab button and card title annotated with [BETA].
- Docs (provider-support, configuration, CONFIG_REFERENCE, discovery guide): ICCU marked as BETA, the registry list order reflects the new sort_order, the per-provider note explains the limitations explicitly.
- test_providers: assertion updated to reflect the new sort order.

Refs #105
Summary
…pag=N, 20/page), total-result extraction, per-library network policy, dedicated Network & Libraries settings card, and full docs coverage.

Why
ICCU is the gateway for ~50 Italian libraries — Laurenziana, Marciana, BNCF/BNCR, Estense, Marucelliana and many smaller partners — and none of them expose their content via native IIIF. Making ICCU usable end-to-end is what unlocks the Italian corpus in Scriptoria.
What changed
Core pipeline
- `resolvers/mag_parser.py`: canvas image URLs now come from the real `<page src=...>` attribute (the `/jmms/thumbnail?page=N` endpoint silently ignores the page parameter). `build_viewer_url` points at the canonical `/jmms/iccuviewer/iccu.jsp` (the older `viewresource` URL is blank for several teche, including BNCF). New `probe_magparser_url` for the discovery probe flow.
- `resolvers/manifest_fetch.py` (new): shared `fetch_manifest_dict()` that dispatches ICCU URLs to `fetch_and_convert` and leaves native IIIF on `HTTPClient.get_json`. Used by `analyze_manifest`, `persist_prefetch_light`, `_quick_manifest_has_native_pdf`, Studio remote read, and Library metadata refresh.
- `resolvers/search/internetculturale.py`: native pagination via `pag=N`, parse of "Pagina X di Y (Z risultati trovati)" into raw meta.
- `logic/downloader.py` / `downloader_runtime.py`: `PageDownloader._locate_direct_image_url` / `_fetch_direct` handle canvases with a plain `resource.@id` (no IIIF service). Manifests carrying the `_iccu` marker enable `allow_partial_finalize` so teaser records land the pages that actually exist instead of being discarded.
- `providers.py`: ICCU helper text, filter metadata, paginability.
- `workspace.py` / `manifest_helpers.py`: Mirador receives the converted manifest through the new proxy endpoint.

UI
- `routes/discovery.py` + `routes/discovery_handlers.py`: new endpoint `GET /api/iccu/manifest?url=...` serves the IIIF JSON with CORS headers so Mirador can mount it. `internetculturale` added to `_PAGINATABLE_STRATEGIES`; `ic_type` propagated through `resolve_manifest` and `load_more_results`.
- `components/discovery_results.py`: "Mostrati X di Y risultati" header when totals are known; `ic_type` preserved across Load More; ICCU fallback in `_provider_viewer_fallback`.
- `components/settings/panes/network.py`: dedicated Internet Culturale card matching the existing Gallica/Vaticana/Bodleian/Institut pattern.

Network policy
- `network_policy.py`: new `internet_culturale` key with a conservative custom policy (2 workers, 1.0–3.0 s delay, 300 s cooldown on 403/429, 40 req per 60 s burst). Aliases cover `Internet Culturale`, `ICCU`, `internetculturale` and the display variant.

Tests
- `tests/test_iccu_unit.py` + `tests/fixtures/iccu_mag_sample.xml`: 12 unit tests covering parser, resolver URL/OAI handling, teca inference, probe fallback and direct-image helper.

Docs
- `docs/reference/provider-support.md`: ICCU in registry, matrix, dedicated Per-Provider Notes section, filters list.
- `docs/reference/configuration.md`: new library key + rationale of defaults.
- `docs/CONFIG_REFERENCE.md`: internet_culturale in supported libraries and in the paginable provider list.
- `docs/guides/discovery-and-library.md`: dedicated paragraph on ICCU in the pagination section.

Test plan
- `pytest tests/`: full suite green (live tests skipped as configured)
- `ruff check src/ tests/`: clean on all touched files (sole remaining C901 is pre-existing in `config_manager.py`, unrelated)
- `ruff format src/ tests/`: clean
- …temp/)
- `normalize_library_key("ICCU")` → `"internet_culturale"`; per-library policy resolved correctly

Refs #105 (Internet Culturale portion of the epic)
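The alias normalization and per-library policy lookup checked in the test plan can be sketched as follows. The policy numbers and the three named aliases come from this PR; the `LibraryPolicy` dataclass, its field names, the `POLICIES` table and the `policy_for` helper are hypothetical, and the real `normalize_library_key` in `network_policy.py` may work differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryPolicy:
    """Hypothetical container for the per-library throttling settings."""
    workers: int
    min_delay_s: float
    max_delay_s: float
    cooldown_s: int        # applied after a 403/429 response
    burst_requests: int
    burst_window_s: int


# Aliases named in the PR description, lower-cased for lookup.
LIBRARY_KEY_ALIASES = {
    "internet culturale": "internet_culturale",
    "iccu": "internet_culturale",
    "internetculturale": "internet_culturale",
    "internet_culturale": "internet_culturale",
}

# Conservative defaults quoted in the PR: 2 workers, 1.0-3.0 s delay,
# 300 s cooldown, 40 requests per 60 s burst.
POLICIES = {
    "internet_culturale": LibraryPolicy(2, 1.0, 3.0, 300, 40, 60),
}


def normalize_library_key(name: str) -> str:
    """Map a display name or alias to its canonical library key."""
    key = " ".join(name.strip().lower().split())  # collapse whitespace
    return LIBRARY_KEY_ALIASES.get(key, key.replace(" ", "_"))


def policy_for(name: str) -> Optional[LibraryPolicy]:
    """Resolve the policy for any alias; None when no custom policy exists."""
    return POLICIES.get(normalize_library_key(name))
```

Routing every alias through one canonical key means the settings card, the downloader and the search adapter all resolve to the same throttling policy regardless of which spelling they were handed.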