Cloud storage support (Azure, S3, GCS) #1087

Open
SamirMoustafa wants to merge 14 commits into scverse:main from SamirMoustafa:cloud-storage-support
SamirMoustafa commented Mar 2, 2026

Cloud storage support (Azure, S3, GCS)

Summary

Add read/write support for SpatialData on remote object storage via UPath. This fixes the issue reported in #999, where sd.read_zarr(UPath("s3://...")) failed because SpatialData.path did not accept UPath. PR #971 ("add remote support") pursued the same goal but remains a draft, blocked on zarr v3/ome-zarr and async fsspec after dask unpinning. This PR delivers working remote support in three steps: it fixes the path setter, wraps fsspec in an async filesystem where current zarr requires one, and tests Azure, S3, and GCS against Docker emulators. It also addresses #441 (private remote object storage): credentials are passed either as UPath keyword arguments or through a pre-opened zarr.Group (e.g. read_zarr(zarr.open("s3://...", storage_options={...}))).

Supported features

  • Path handling: SpatialData.path accepts None, str, Path, or UPath (enables remote-backed objects).
  • Read: SpatialData.read(upath) and read_zarr(upath) for Azure Blob (az://), S3 (s3://), and GCS (gs://) using universal-pathlib (UPath). For private stores, read_zarr(zarr_group) is also supported when the store is opened with zarr.open(..., storage_options=...).
  • Write: sdata.write(upath) and element-level writes to the same backends; parquet (points/shapes) and zarr (raster/tables) written via fsspec with async filesystem support where required.
  • Consolidated metadata: reading and writing consolidated metadata on remote stores (e.g. zmetadata) is supported.
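
The path-handling rule in the list above (None, str, Path, or UPath) can be sketched as a property setter. This is a minimal illustration, not the PR's code: the class name BackedObject is hypothetical, and converting a plain str to a local Path is an assumption made only for this sketch. The upath import is optional so the sketch also runs without universal-pathlib installed.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

try:  # universal-pathlib is optional for this sketch
    from upath import UPath
except ImportError:
    UPath = None


class BackedObject:
    """Hypothetical stand-in for an object that remembers its backing store."""

    def __init__(self) -> None:
        self._path = None

    @property
    def path(self):
        return self._path

    @path.setter
    def path(self, value: Any) -> None:
        if value is None:
            # Not backed by any store.
            self._path = None
        elif UPath is not None and isinstance(value, UPath):
            # Keep remote paths as-is so their storage options survive.
            self._path = value
        elif isinstance(value, Path):
            self._path = value
        elif isinstance(value, str):
            # Assumption of this sketch: plain strings are local paths.
            self._path = Path(value)
        else:
            raise TypeError("path must be None, str, Path, or UPath")
```

Under this assumption a remote URL should be passed as a UPath rather than a str, because Path("s3://bucket/data.zarr") would collapse the double slash and lose the scheme semantics.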

Testing

Remote storage is tested with Docker-based emulators (Azurite for Azure, moto for S3, fake-gcs-server for GCS). In CI we build tests/io/remote_storage/Dockerfile.emulators, start the emulators on Ubuntu, then run the full test suite including tests/io/remote_storage/. These remote-storage tests run only on Ubuntu (Linux), because they depend on Docker; on Windows and macOS we skip tests/io/remote_storage/ and run the rest of the suite. To run the remote tests locally you need Docker and can start the emulators with the same image and ports (5000, 10000, 4443) as in the workflow.
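
When starting the emulators locally, it helps to wait until their ports accept connections before launching the tests. The following is a minimal sketch of such a readiness check using only the standard library; the function names and default timeouts are hypothetical and are not taken from the PR's workflow. Only the port numbers (5000, 10000, 4443) come from the description above.

```python
import socket
import time

# Ports named in the PR description; which emulator listens on which
# port is not stated there, so no mapping is assumed here.
EMULATOR_PORTS = (5000, 10000, 4443)


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the service is not up yet.
            time.sleep(interval)
    return False


def wait_for_emulators(host: str = "127.0.0.1", ports=EMULATOR_PORTS, timeout: float = 30.0) -> bool:
    """Return True only if every port becomes reachable within its timeout."""
    return all(wait_for_port(host, port, timeout=timeout) for port in ports)
```

A successful TCP connection only shows that something is listening; an emulator may still need a moment before it serves requests, so a per-service health request would be a stricter check.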

Example (three providers)

```python
from upath import UPath
from spatialdata import SpatialData

# Azure Blob Storage
az_path = UPath("az://my-container/data.zarr", connection_string="<your-connection-string>")
sdata = SpatialData.read(az_path)

# Amazon S3 (e.g. public bucket or custom endpoint)
s3_path = UPath(
    "s3://bucket/data.zarr",
    endpoint_url="https://s3.embl.de",  # omit for default AWS
    anon=True,
)
sdata = SpatialData.read(s3_path)

# Google Cloud Storage
gs_path = UPath("gs://my-bucket/data.zarr", token="anon", project="my-project")
sdata = SpatialData.read(gs_path)

# Write works the same way (any provider)
# sdata.write(az_path)
```

Credentials and options are passed through UPath (e.g. connection_string, endpoint_url, anon, token, project) as supported by the underlying fsspec backend.


Release notes

  • Add cloud storage support: read and write SpatialData from/to Azure Blob, S3, and GCS using UPath. SpatialData.path now accepts UPath in addition to str and Path. Fixes initialization from remote stores (e.g. S3) as reported in #999. Addresses #441 (private remote object storage).

SamirMoustafa and others added 13 commits February 28, 2026 02:13
Patch da.to_zarr so ome_zarr's **kwargs are forwarded as zarr_array_kwargs,
avoiding FutureWarning and keeping behavior correct.
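
The kwargs-forwarding idea in this commit can be sketched with a generic wrapper. This is an illustration under stated assumptions, not the PR's patch: forward_extra_kwargs and fake_to_zarr are hypothetical names, and fake_to_zarr is a stand-in whose signature is invented for the demo and does not mirror dask's to_zarr. Only the target name zarr_array_kwargs comes from the commit message.

```python
import functools
from typing import Any, Callable


def forward_extra_kwargs(
    func: Callable[..., Any],
    known: frozenset,
    target: str = "zarr_array_kwargs",
) -> Callable[..., Any]:
    """Wrap func so keyword arguments outside `known` are collected into `target`."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Pop every keyword the wrapped function does not name explicitly.
        extra = {key: kwargs.pop(key) for key in list(kwargs) if key not in known}
        if extra:
            merged = dict(kwargs.get(target) or {})
            merged.update(extra)
            kwargs[target] = merged
        return func(*args, **kwargs)

    return wrapper


def fake_to_zarr(arr, url, component=None, zarr_array_kwargs=None):
    """Stand-in writer (hypothetical signature) that reports what it received."""
    return {"component": component, "zarr_array_kwargs": zarr_array_kwargs or {}}


patched = forward_extra_kwargs(fake_to_zarr, frozenset({"component", "zarr_array_kwargs"}))
result = patched("array", "store", component="0", chunks=(64, 64))
```

Here the unknown keyword chunks ends up inside zarr_array_kwargs, while component is passed through unchanged.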
- _FsspecStoreRoot, _get_store_root for path-like store roots (local + fsspec)
- _storage_options_from_fs for parquet writes to Azure/S3/GCS
- _remote_zarr_store_exists, _ensure_async_fs for UPath/FsspecStore
- Extend _resolve_zarr_store for UPath and _FsspecStoreRoot with async fs
- _backed_elements_contained_in_path, _is_element_self_contained accept UPath
- path and _path accept Path | UPath; setter allows UPath
- write() accepts file_path: str | Path | UPath | None (None uses path)
- _validate_can_safely_write_to_path handles UPath and remote store existence
- _write_element accepts Path | UPath; skip local subfolder checks for UPath
- __repr__ and _get_groups_for_element use path without forcing Path()
…table, zarr

- Resolve store via _resolve_zarr_store in read paths (points, shapes, raster, table)
- Use _get_store_root for parquet paths; read/write parquet with storage_options for fsspec
- io_shapes: upload parquet to Azure/S3/GCS via temp file when path is _FsspecStoreRoot
- io_zarr: _get_store_root, UPath in _get_groups_for_element and _write_consolidated_metadata; set sdata.path to UPath when store is remote
- pyproject.toml: adlfs, gcsfs, moto[server], pytest-timeout in test extras
- Dockerfile.emulators: moto, Azurite, fake-gcs-server for tests/io/remote_storage/
… emulator config

- full_sdata fixture: two regions for table categorical (avoids 404 on remote read)
- tests/io/remote_storage/conftest.py: bucket/container creation, resilient async shutdown
- tests/io/remote_storage/test_remote_storage.py: parametrized Azure/S3/GCS roundtrip and write tests
- Added "dimension_separator" to the frozenset of internal keys that should not be passed to zarr.Group.create_array(), ensuring compatibility with various zarr versions.
- Updated test to set region labels for full_sdata table, allowing the test_set_table_annotates_spatialelement to succeed without errors.
- Updated the `test_subset` function to exclude labels and poly from the default table, ensuring accurate subset validation.
- Enhanced `test_validate_table_in_spatialdata` to assert that both regions (labels2d and poly) are correctly annotated in the table.
- Adjusted `test_labels_table_joins` to restrict the table to labels2d, ensuring the join returns the expected results.
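
The "dimension_separator" item above describes dropping internal keys before keyword arguments reach zarr.Group.create_array(). A minimal sketch of that filtering follows; the names INTERNAL_KEYS and strip_internal_keys are hypothetical, and only the "dimension_separator" key is taken from the commit message (the PR's frozenset contains other keys that are not listed here).

```python
from typing import Any, Mapping

# Only "dimension_separator" is known from the commit message; the real
# frozenset in the PR holds additional keys that are not reproduced here.
INTERNAL_KEYS = frozenset({"dimension_separator"})


def strip_internal_keys(kwargs: Mapping[str, Any]) -> dict:
    """Return a copy of kwargs without keys that must not be forwarded."""
    return {key: value for key, value in kwargs.items() if key not in INTERNAL_KEYS}


array_kwargs = strip_internal_keys({"dimension_separator": "/", "chunks": (256, 256)})
```

Filtering into a new dict leaves the caller's mapping untouched, which matters when the same keyword arguments are reused for several arrays.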
…inux

- Added steps to build and run storage emulators (S3, Azure, GCS) using Docker, specifically for the Ubuntu environment.
- Implemented a wait mechanism to ensure emulators are ready before running tests.
- Adjusted test execution to skip remote storage tests on non-Linux platforms.
- Wrapped the fsspec async sync function to prevent RuntimeError "Loop is not running" during process exit when using remote storage (Azure, S3, GCS).
- Ensured compatibility with async session management in the _utils module.
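
One way to read the "Loop is not running" fix is as a wrapper that swallows exactly that RuntimeError and re-raises everything else. The sketch below shows the pattern with a stand-in function; tolerate_stopped_loop and fake_sync are hypothetical names, and the PR's actual wrapper around fsspec may differ (for example, it may only apply during interpreter shutdown).

```python
import functools
from typing import Any, Callable


def tolerate_stopped_loop(sync_func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap sync_func so a 'Loop is not running' RuntimeError returns None instead."""

    @functools.wraps(sync_func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return sync_func(*args, **kwargs)
        except RuntimeError as exc:
            if "Loop is not running" in str(exc):
                # The event loop is already gone; nothing is left to clean up.
                return None
            raise

    return wrapper


def fake_sync(loop_running: bool) -> str:
    """Stand-in for a sync helper that fails once its event loop has stopped."""
    if not loop_running:
        raise RuntimeError("Loop is not running")
    return "done"


safe_sync = tolerate_stopped_loop(fake_sync)
```

Matching on the message keeps unrelated RuntimeErrors visible, at the cost of depending on the exact wording of the error.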

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 85.45455% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.85%. Comparing base (094b869) to head (42c3133).

Files with missing lines                Patch %   Lines
src/spatialdata/_io/_utils.py           75.25%    24 Missing ⚠️
src/spatialdata/_core/spatialdata.py    73.91%    6 Missing ⚠️
src/spatialdata/_io/io_shapes.py        98.21%    1 Missing ⚠️
src/spatialdata/_io/io_zarr.py          90.90%    1 Missing ⚠️
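
The headline numbers can be cross-checked against the per-file rows. The sketch below sums the missing lines and back-computes the patch size from the reported percentage. The figure of 220 changed lines is an inference from these two numbers and is not stated in the report.

```python
# Missing lines per file, copied from the Codecov table above.
missing_by_file = {
    "src/spatialdata/_io/_utils.py": 24,
    "src/spatialdata/_core/spatialdata.py": 6,
    "src/spatialdata/_io/io_shapes.py": 1,
    "src/spatialdata/_io/io_zarr.py": 1,
}
total_missing = sum(missing_by_file.values())

# Reported patch coverage in percent.
patch_coverage = 85.45455

# Inferred (not reported): number of changed lines that Codecov tracked.
inferred_patch_lines = round(total_missing / (1 - patch_coverage / 100))
covered_lines = inferred_patch_lines - total_missing

# Recompute the percentage from the inferred counts as a consistency check.
recomputed = round(100 * covered_lines / inferred_patch_lines, 5)
```

With these inputs the per-file counts add up to the 32 missing lines in the headline, and 188 covered lines out of an inferred 220 reproduce the reported 85.45455%.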
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1087      +/-   ##
==========================================
- Coverage   91.96%   91.85%   -0.12%     
==========================================
  Files          51       52       +1     
  Lines        7729     7907     +178     
==========================================
+ Hits         7108     7263     +155     
- Misses        621      644      +23     
Files with missing lines                    Coverage Δ
src/spatialdata/_io/__init__.py             100.00% <100.00%> (ø)
src/spatialdata/_io/_dask_zarr_compat.py    100.00% <100.00%> (ø)
src/spatialdata/_io/io_points.py            98.03% <100.00%> (+0.31%) ⬆️
src/spatialdata/_io/io_raster.py            93.89% <100.00%> (ø)
src/spatialdata/_io/io_table.py             90.90% <100.00%> (+0.43%) ⬆️
src/spatialdata/_io/io_shapes.py            96.12% <98.21%> (+1.18%) ⬆️
src/spatialdata/_io/io_zarr.py              92.52% <90.90%> (+0.14%) ⬆️
src/spatialdata/_core/spatialdata.py        91.80% <73.91%> (-0.13%) ⬇️
src/spatialdata/_io/_utils.py               84.35% <75.25%> (-2.40%) ⬇️