
perf: release the GIL during s2 boolean operations #115

Open
thodson-usgs wants to merge 8 commits into benbovy:main from thodson-usgs:perf-gil-release

Conversation

thodson-usgs commented Apr 22, 2026

⚠️ I am not a CPython expert. Claude identified this performance improvement. I iterated with it for an hour to develop this PR, but before proceeding further, I'm seeking feedback from someone with more experience.

Summary

Wrap the s2geog::s2_boolean_operation call inside BooleanOp::operator() in a py::gil_scoped_release scope. Callers can then parallelize boolean ops across Python threads; index extraction stays under the GIL, and only the pure-C++ s2 work runs with the GIL released.

 PyObjectGeography operator()(PyObjectGeography a, PyObjectGeography b) const {
     const auto& a_index = a.as_geog_ptr()->geog_index();
     const auto& b_index = b.as_geog_ptr()->geog_index();
-    std::unique_ptr<s2geog::Geography> geog_out =
-        s2geog::s2_boolean_operation(a_index, b_index, m_op_type, m_options);
+    std::unique_ptr<s2geog::Geography> geog_out;
+    {
+        py::gil_scoped_release release;
+        geog_out = s2geog::s2_boolean_operation(a_index, b_index, m_op_type, m_options);
+    }
     return make_py_geography(std::move(geog_out));
 }

Why this is safe. The released-GIL block only reads the two const S2ShapeIndex& references, and MutableS2ShapeIndex documents this pattern as thread-safe in s2/mutable_s2shape_index.h:

For efficiency, updates are batched together and applied lazily on the first subsequent query. Locking is used to ensure that MutableS2ShapeIndex has the same thread-safety properties as "vector": const methods are thread-safe, while non-const methods are not thread-safe.

That covers both the pre-built case and the lazy-first-build case; in the latter, s2's internal wait-mutex serializes the one-time build across threads.
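That contract can be modeled in pure Python. The sketch below is a toy stand-in (the hypothetical LazyIndex class is illustrative, not s2's implementation); it shows why concurrent first queries are safe when the one-time build is serialized by an internal mutex:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LazyIndex:
    """Toy model of the MutableS2ShapeIndex contract: the first query
    triggers a one-time build, serialized by an internal mutex, after
    which all queries are read-only and therefore thread-safe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._built = None
        self.build_count = 0  # instrumentation: must end up at exactly 1

    def query(self):
        if self._built is None:          # fast path once built
            with self._lock:             # the "wait-mutex" for the first build
                if self._built is None:  # re-check under the lock
                    self.build_count += 1
                    self._built = "index-contents"
        return self._built


idx = LazyIndex()
n_threads = 8
barrier = threading.Barrier(n_threads)


def reader(_):
    barrier.wait()  # every thread hits the fresh index simultaneously
    return idx.query()


with ThreadPoolExecutor(max_workers=n_threads) as pool:
    results = list(pool.map(reader, range(n_threads)))

assert results == ["index-contents"] * n_threads
assert idx.build_count == 1  # the lazy build ran exactly once
```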

Motivation

In a downstream gridding experiment, spherely.intersection over a large object array dominated the total wall time. py::vectorize holds the GIL throughout the batched call, so chunking the array across a ThreadPoolExecutor gave no speedup on its own. Releasing the GIL around the s2 work lets a batched call fan out across threads — in that experiment, end-to-end setup time dropped by about half.

Measured speedups

macOS arm64, 12-core M-series (6 P-cores + 6 E-cores), Python 3.13. Median of 3 runs of the benchmark script below.

threads   main      patched   speedup
1         995 ms    1032 ms
2         1038 ms   575 ms    1.80×
4         997 ms    395 ms    2.61×
8         1017 ms   628 ms    1.64×

Unpatched main does not scale with threads, as expected; single-threaded runtime is unchanged within noise.

Benchmark

Paste into a file and run against the installed spherely. No dependencies beyond spherely + numpy.

"""Minimal benchmark for PR #115 — threaded scaling of spherely.intersection."""

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import spherely


def grid_cells(ny, nx, lat=(-80.0, 80.0), lon=(-175.0, 175.0)):
    lats = np.linspace(lat[0], lat[1], ny + 1)
    lons = np.linspace(lon[0], lon[1], nx + 1)
    out = np.empty(ny * nx, dtype=object)
    for j in range(ny):
        for i in range(nx):
            shell = [
                (lons[i], lats[j]),
                (lons[i + 1], lats[j]),
                (lons[i + 1], lats[j + 1]),
                (lons[i], lats[j + 1]),
            ]
            out[j * nx + i] = spherely.create_polygon(shell, oriented=True)
    return out


# Two same-resolution grids, slightly offset so every element-wise pair
# overlaps. Replicated 25x for ~1 s serial runtime.
tgt = grid_cells(50, 100)
src = grid_cells(50, 100, lat=(-78.5, 81.5), lon=(-173.0, 177.0))
dst = np.tile(tgt, 25)
src_pairs = np.tile(src, 25)
n = len(dst)


def serial():
    spherely.intersection(dst, src_pairs)


def threaded(workers):
    splits = np.array_split(np.arange(n), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda s: spherely.intersection(dst[s], src_pairs[s]), splits))


def median_ms(fn, trials=3):
    fn()  # warm
    runs = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        runs.append(time.perf_counter() - t0)
    return 1000 * sorted(runs)[len(runs) // 2]


print(f"1 thread  : {median_ms(serial):6.0f} ms")
for w in (2, 4, 8):
    print(f"{w} threads : {median_ms(lambda w=w: threaded(w)):6.0f} ms")

Tests

tests/test_boolean_operations_concurrency.py adds two tests (5 items with parametrization) that target the failure modes specific to this change:

  • test_concurrent_shared_inputs_match_serial, parametrized over the four boolean ops, exercises concurrent reads on shared index state. Each op runs on a 4-thread ThreadPoolExecutor behind a threading.Barrier so all threads enter the released-GIL window simultaneously; outputs are compared bit-for-bit against a serial reference.
  • test_concurrent_lazy_index_race does the same but on freshly-constructed Geographies, to surface any first-access materialization inside the index object that runs with the GIL released.

Open questions for the maintainer

  1. TSAN. I built spherely with -fsanitize=thread against the conda-forge s2geography / s2geometry and ran the concurrency tests; no warnings. Caveat: TSAN only instruments what was compiled with the flag, and conda-forge's s2geography is not — so the wrapper is covered but the index internals aren't. Wiring TSAN into CI would need either a TSAN-built s2geography variant (doesn't exist on conda-forge) or building s2geography from source in the CI job. Left out of CI for now; happy to add it gated on merge if you'd like.
  2. Scope. Only BooleanOp is patched. Predicates (intersects, contains, within, touches) go through the same index-based code path and would benefit identically; happy to expand in this PR or leave for a follow-up.

Appendix: performance canary

Not included in the PR — this test ran a 4000-pair overlapping-polygon workload across a ThreadPoolExecutor to directly confirm the GIL was released. It's CPU-intensive and timing-sensitive (reliable on workstation hardware, flaky on shared CI runners). Reproduced here for anyone who wants to verify the release locally.

import os
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pytest
import spherely

N_THREADS = os.cpu_count() or 1


@pytest.mark.slow
def test_gil_release_actually_enables_parallelism():
    """Directly confirms the GIL was released: if it wasn't, the threaded
    path serializes on the GIL and runs no faster than a single thread."""
    if N_THREADS < 2:
        pytest.skip("need at least 2 cores for a meaningful speedup measurement")

    big1 = spherely.from_wkt("POLYGON ((-80 -40, 80 -40, 80 40, -80 40, -80 -40))")
    big2 = spherely.from_wkt("POLYGON ((-40 -80, 40 -80, 40 80, -40 80, -40 -80))")
    n = 4000
    dst = np.array([big1] * n, dtype=object)
    src = np.array([big2] * n, dtype=object)

    def serial():
        spherely.intersection(dst, src)

    def threaded():
        splits = np.array_split(np.arange(n), N_THREADS)

        def worker(idx):
            spherely.intersection(dst[idx], src[idx])

        with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
            list(pool.map(worker, splits))

    def median_of(fn, trials=3):
        fn()  # warm any caches
        samples = []
        for _ in range(trials):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
        return sorted(samples)[len(samples) // 2]

    t_serial = median_of(serial)
    t_threaded = median_of(threaded)
    # Expect at least 1.3× on 4 cores. Well below the theoretical 4× to
    # tolerate bench noise, but any number <1.1 means the release almost
    # certainly did not land.
    assert t_threaded < t_serial / 1.3, (
        f"threaded={t_threaded * 1000:.1f}ms was not meaningfully faster than "
        f"serial={t_serial * 1000:.1f}ms — GIL release may not be in effect"
    )

On the 12-core M-series machine used for the speedup table above it passes with threaded ≈ 2.6× serial; on unpatched main it fails with threaded ≈ serial within noise.

thodson-usgs and others added 8 commits April 21, 2026 17:49
BooleanOp::operator() extracts the geography indices under the GIL
(needed for Python ref access) and then wraps the s2geography call in
py::gil_scoped_release so callers can parallelise across Python threads
with ThreadPoolExecutor. The fetched index references are owned by the
input Geographies which py::vectorize keeps alive for the duration of
the batched call, so no Python state is touched during the release.

Measured with a downstream xarray-regrid s2 build (180x360 -> 60x120
lat/lon, 114720 candidate pairs) on a 12-core M-series Mac:

  serial         1115 ms
  threads=2       641 ms  (1.74x)
  threads=4       473 ms  (2.36x)
  threads=8       663 ms

End-to-end regridder __init__ drops from 1507 ms to 787 ms (-48%).
Serial performance is unchanged (1126 ms on master vs 1115 ms with
the patch, within run-to-run noise).

Nine tests in tests/test_boolean_operations_concurrency.py that cover
the failure modes the gil_scoped_release change could introduce:

Correctness (8 tests):
  - Threaded vs serial output match for all four boolean ops on shared
    input arrays (intersection / union / difference / symmetric_difference).
  - Lazy-index race: fresh Geographies hit simultaneously from N threads
    via a threading.Barrier to maximise first-access contention.
  - Mixed operations on the same inputs from concurrent threads.
  - Python-side GC/allocation churn happening in parallel with the
    released-GIL boolean op.
  - Input Geography refcounts unchanged after many concurrent runs.

Performance canary (1 test, @pytest.mark.slow):
  - Asserts threaded < serial / 1.3 on a 4k-pair all-overlapping workload.
    Fails on unpatched main (threaded ~ serial within noise) and passes
    on the patched branch (~2.4x on 4 cores). Serves as a build-time
    regression check for the release itself, independent of correctness.

Measured on the patched branch: all 9 pass in ~16 s on macOS arm64 /
Python 3.14 / 12 cores. On unmodified main, the canary fails in ~1 s
with `assert 0.0956 < (0.0948 / 1.3)` - a clear signal that the release
is missing.

Recommended: run the whole file under ThreadSanitizer in CI. The
docstring at the top notes the CFLAGS incantation.

No behavior change. Long lines wrapped / unwrapped to match the
project's configured black and clang-format styles; misplaced import
moved above its use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The perf canary test_gil_release_actually_enables_parallelism asserts
at least a 1.3x speedup from threading. It's reliable on workstation-
class hardware (2-2.4x on a 12-core M-series Mac) but the shared
GitHub Actions Ubuntu runners deliver 1.22-1.27x — below the bar,
above machine-level noise. The @pytest.mark.slow decorator on the
test was intended to keep it out of CI but was never wired to a
deselect mechanism, and the marker itself wasn't registered (pytest
emitted PytestUnknownMarkWarning in every CI run).

Register the slow marker in pyproject.toml and add a conftest.py that
skips slow-marked tests unless --run-slow is passed. The canary stays
runnable on demand (pytest --run-slow) but no longer gates CI on a
flaky perf threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The lock's self-referenced spherely sdist hash captures the pyproject.toml
content, so any edit to that file (including the new [tool.pytest.ini_options]
section) invalidates the stored sha256 and fails `pixi install --locked`
in the `Tests via pixi` CI job. Regenerate the hash; no dependency changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove test_gil_release_actually_enables_parallelism. The threaded vs
serial timing threshold is too tight for shared CI runners and the
canary is not needed as a gate -- the eight remaining correctness and
stress tests already confirm the release doesn't corrupt output.

With the canary gone, the @pytest.mark.slow registration and the
conftest.py that gated it are no longer needed. Also reverts the
pyproject.toml [tool.pytest.ini_options] addition and the
pixi.lock self-hash update that the manifest edit required.

The remaining concurrency test file is trimmed to match the style
of the rest of tests/: no module docstring, no section dividers,
short or omitted test docstrings, no __future__ import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove three tests whose failure modes are invariants of pybind11 or
Python, not of the gil_scoped_release wrap added here:

  - test_concurrent_mixed_operations: different ops on shared inputs
    is the same s2 code path as the parametrized shared-inputs test,
    just less focused.
  - test_intersection_with_parallel_python_churn: tests the release/
    re-acquire boundary against concurrent GC, but that boundary is a
    pybind11 contract. A failure here would indicate a pybind11 bug,
    not a spherely one.
  - test_refcounts_stable_after_concurrent_runs: the change doesn't
    touch refcount logic; no plausible way for a leak to appear here
    that wouldn't also break the non-threaded path.

What remains targets the two concerns specific to this change:

  - test_concurrent_shared_inputs_match_serial (4 parametrized ops):
    concurrent reads on shared index state must not corrupt output.
  - test_concurrent_lazy_index_race: first-access materialization
    inside the index object during the released-GIL window on fresh
    Geographies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Geography::geog_index() is non-const: on first access it lazily
populates its backing unique_ptr. That write must happen under the
GIL, which means both index references have to be fetched above the
gil_scoped_release scope, not inside it. Add a two-line comment so a
future edit doesn't reorder those lines and race the lazy init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs marked this pull request as ready for review April 22, 2026 15:42