
perf: release the GIL during s2 boolean operations #115

Open
thodson-usgs wants to merge 8 commits into benbovy:main from thodson-usgs:perf-gil-release

Conversation

thodson-usgs commented Apr 22, 2026

⚠️ I am not a CPython expert. Claude identified this performance improvement. I iterated with it for an hour to develop this PR, but before proceeding further, I'm seeking feedback from someone with more experience.

Summary

Wrap the s2geog::s2_boolean_operation call inside BooleanOp::operator() in a py::gil_scoped_release scope. Callers can then parallelize boolean ops across Python threads; index extraction stays under the GIL, and only the pure-C++ s2 work runs with the GIL released.

 PyObjectGeography operator()(PyObjectGeography a, PyObjectGeography b) const {
     const auto& a_index = a.as_geog_ptr()->geog_index();
     const auto& b_index = b.as_geog_ptr()->geog_index();
-    std::unique_ptr<s2geog::Geography> geog_out =
-        s2geog::s2_boolean_operation(a_index, b_index, m_op_type, m_options);
+    std::unique_ptr<s2geog::Geography> geog_out;
+    {
+        py::gil_scoped_release release;
+        geog_out = s2geog::s2_boolean_operation(a_index, b_index, m_op_type, m_options);
+    }
     return make_py_geography(std::move(geog_out));
 }

Why this is safe. The released-GIL block only reads the two const S2ShapeIndex& references, and MutableS2ShapeIndex documents this pattern as thread-safe in s2/mutable_s2shape_index.h:

For efficiency, updates are batched together and applied lazily on the first subsequent query. Locking is used to ensure that MutableS2ShapeIndex has the same thread-safety properties as "vector": const methods are thread-safe, while non-const methods are not thread-safe.

That covers both the pre-built case and the lazy-first-build case; in the latter, s2's internal wait-mutex serializes the one-time build across threads.
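That contract can be modeled in pure Python. The sketch below is a toy stand-in (the hypothetical LazyIndex class is illustrative, not s2's implementation); it shows why concurrent first queries are safe when the one-time build is serialized by an internal mutex:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LazyIndex:
    """Toy model of the MutableS2ShapeIndex contract: the first query
    triggers a one-time build, serialized by an internal mutex, after
    which all queries are read-only and therefore thread-safe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._built = None
        self.build_count = 0  # instrumentation: must end up at exactly 1

    def query(self):
        if self._built is None:          # fast path once built
            with self._lock:             # the "wait-mutex" for the first build
                if self._built is None:  # re-check under the lock
                    self.build_count += 1
                    self._built = "index-contents"
        return self._built


idx = LazyIndex()
n_threads = 8
barrier = threading.Barrier(n_threads)


def reader(_):
    barrier.wait()  # every thread hits the fresh index simultaneously
    return idx.query()


with ThreadPoolExecutor(max_workers=n_threads) as pool:
    results = list(pool.map(reader, range(n_threads)))

assert results == ["index-contents"] * n_threads
assert idx.build_count == 1  # the lazy build ran exactly once
```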

Motivation

In a downstream gridding experiment, spherely.intersection over a large object array dominated the total wall time. py::vectorize holds the GIL throughout the batched call, so chunking the array across a ThreadPoolExecutor gave no speedup on its own. Releasing the GIL around the s2 work lets a batched call fan out across threads — in that experiment, end-to-end setup time dropped by about half.

Measured speedups

macOS arm64, 12-core M-series (6 P-cores + 6 E-cores), Python 3.13. Median of 3 runs of the benchmark script below.

threads   main      patched   speedup
1         995 ms    1032 ms
2         1038 ms   575 ms    1.80×
4         997 ms    395 ms    2.61×
8         1017 ms   628 ms    1.64×

Unpatched main does not scale with threads, as expected; single-threaded runtime is unchanged within noise.

Benchmark

Paste into a file and run against the installed spherely. No dependencies beyond spherely + numpy.

"""Minimal benchmark for PR #115 — threaded scaling of spherely.intersection."""

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import spherely


def grid_cells(ny, nx, lat=(-80.0, 80.0), lon=(-175.0, 175.0)):
    lats = np.linspace(lat[0], lat[1], ny + 1)
    lons = np.linspace(lon[0], lon[1], nx + 1)
    out = np.empty(ny * nx, dtype=object)
    for j in range(ny):
        for i in range(nx):
            shell = [
                (lons[i], lats[j]),
                (lons[i + 1], lats[j]),
                (lons[i + 1], lats[j + 1]),
                (lons[i], lats[j + 1]),
            ]
            out[j * nx + i] = spherely.create_polygon(shell, oriented=True)
    return out


# Two same-resolution grids, slightly offset so every element-wise pair
# overlaps. Replicated 25x for ~1 s serial runtime.
tgt = grid_cells(50, 100)
src = grid_cells(50, 100, lat=(-78.5, 81.5), lon=(-173.0, 177.0))
dst = np.tile(tgt, 25)
src_pairs = np.tile(src, 25)
n = len(dst)


def serial():
    spherely.intersection(dst, src_pairs)


def threaded(workers):
    splits = np.array_split(np.arange(n), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda s: spherely.intersection(dst[s], src_pairs[s]), splits))


def median_ms(fn, trials=3):
    fn()  # warm
    runs = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        runs.append(time.perf_counter() - t0)
    return 1000 * sorted(runs)[len(runs) // 2]


print(f"1 thread  : {median_ms(serial):6.0f} ms")
for w in (2, 4, 8):
    print(f"{w} threads : {median_ms(lambda w=w: threaded(w)):6.0f} ms")

Tests

tests/test_boolean_operations_concurrency.py adds two tests (5 items with parametrization) that target the failure modes specific to this change:

  • test_concurrent_shared_inputs_match_serial, parametrized over the four boolean ops, exercises concurrent reads on shared index state. Each op runs on a 4-thread ThreadPoolExecutor behind a threading.Barrier so all threads enter the released-GIL window simultaneously; outputs are compared bit-for-bit against a serial reference.
  • test_concurrent_lazy_index_race does the same but on freshly-constructed Geographies, to surface any first-access materialization inside the index object that runs with the GIL released.

Open questions for the maintainer

  1. TSAN. I built spherely with -fsanitize=thread against the conda-forge s2geography / s2geometry and ran the concurrency tests; no warnings. Caveat: TSAN only instruments what was compiled with the flag, and conda-forge's s2geography is not — so the wrapper is covered but the index internals aren't. Wiring TSAN into CI would need either a TSAN-built s2geography variant (doesn't exist on conda-forge) or building s2geography from source in the CI job. Left out of CI for now; happy to add it gated on merge if you'd like.
  2. Scope. Only BooleanOp is patched. Predicates (intersects, contains, within, touches) go through the same index-based code path and would benefit identically; happy to expand in this PR or leave for a follow-up.

Appendix: performance canary

Not included in the PR — this test ran a 4000-pair overlapping-polygon workload across a ThreadPoolExecutor to directly confirm the GIL was released. It's CPU-intensive and timing-sensitive (reliable on workstation hardware, flaky on shared CI runners). Reproduced here for anyone who wants to verify the release locally.

import os
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pytest
import spherely

N_THREADS = os.cpu_count() or 1


@pytest.mark.slow
def test_gil_release_actually_enables_parallelism():
    """Directly confirms the GIL was released: if it wasn't, the threaded
    path serializes on the GIL and runs no faster than a single thread."""
    if N_THREADS < 2:
        pytest.skip("need at least 2 cores for a meaningful speedup measurement")

    big1 = spherely.from_wkt("POLYGON ((-80 -40, 80 -40, 80 40, -80 40, -80 -40))")
    big2 = spherely.from_wkt("POLYGON ((-40 -80, 40 -80, 40 80, -40 80, -40 -80))")
    n = 4000
    dst = np.array([big1] * n, dtype=object)
    src = np.array([big2] * n, dtype=object)

    def serial():
        spherely.intersection(dst, src)

    def threaded():
        splits = np.array_split(np.arange(n), N_THREADS)

        def worker(idx):
            spherely.intersection(dst[idx], src[idx])

        with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
            list(pool.map(worker, splits))

    def median_of(fn, trials=3):
        fn()  # warm any caches
        samples = []
        for _ in range(trials):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
        return sorted(samples)[len(samples) // 2]

    t_serial = median_of(serial)
    t_threaded = median_of(threaded)
    # Expect at least 1.3× on 4 cores. Well below the theoretical 4× to
    # tolerate bench noise, but any number <1.1 means the release almost
    # certainly did not land.
    assert t_threaded < t_serial / 1.3, (
        f"threaded={t_threaded * 1000:.1f}ms was not meaningfully faster than "
        f"serial={t_serial * 1000:.1f}ms — GIL release may not be in effect"
    )

On the 12-core M-series machine used for the speedup table above it passes with threaded ≈ 2.6× serial; on unpatched main it fails with threaded ≈ serial within noise.

thodson-usgs and others added 8 commits April 21, 2026 17:49
BooleanOp::operator() extracts the geography indices under the GIL
(needed for Python ref access) and then wraps the s2geography call in
py::gil_scoped_release so callers can parallelise across Python threads
with ThreadPoolExecutor. The fetched index references are owned by the
input Geographies which py::vectorize keeps alive for the duration of
the batched call, so no Python state is touched during the release.

Measured with a downstream xarray-regrid s2 build (180x360 -> 60x120
lat/lon, 114720 candidate pairs) on a 12-core M-series Mac:

  serial         1115 ms
  threads=2       641 ms  (1.74x)
  threads=4       473 ms  (2.36x)
  threads=8       663 ms

End-to-end regridder __init__ drops from 1507 ms to 787 ms (-48%).
Serial performance is unchanged (1126 ms on master vs 1115 ms with
the patch, within run-to-run noise).

Nine tests in tests/test_boolean_operations_concurrency.py that cover
the failure modes the gil_scoped_release change could introduce:

Correctness (8 tests):
  - Threaded vs serial output match for all four boolean ops on shared
    input arrays (intersection / union / difference / symmetric_difference).
  - Lazy-index race: fresh Geographies hit simultaneously from N threads
    via a threading.Barrier to maximise first-access contention.
  - Mixed operations on the same inputs from concurrent threads.
  - Python-side GC/allocation churn happening in parallel with the
    released-GIL boolean op.
  - Input Geography refcounts unchanged after many concurrent runs.

Performance canary (1 test, @pytest.mark.slow):
  - Asserts threaded < serial / 1.3 on a 4k-pair all-overlapping workload.
    Fails on unpatched main (threaded ~ serial within noise) and passes
    on the patched branch (~2.4x on 4 cores). Serves as a build-time
    regression check for the release itself, independent of correctness.

Measured on the patched branch: all 9 pass in ~16 s on macOS arm64 /
Python 3.14 / 12 cores. On unmodified main, the canary fails in ~1 s
with `assert 0.0956 < (0.0948 / 1.3)` - a clear signal that the release
is missing.

Recommended: run the whole file under ThreadSanitizer in CI. The
docstring at the top notes the CFLAGS incantation.

No behavior change. Long lines wrapped / unwrapped to match the
project's configured black and clang-format styles; misplaced import
moved above its use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The perf canary test_gil_release_actually_enables_parallelism asserts
at least a 1.3x speedup from threading. It's reliable on workstation-
class hardware (2-2.4x on a 12-core M-series Mac) but the shared
GitHub Actions Ubuntu runners deliver 1.22-1.27x — below the bar,
above machine-level noise. The @pytest.mark.slow decorator on the
test was intended to keep it out of CI but was never wired to a
deselect mechanism, and the marker itself wasn't registered (pytest
emitted PytestUnknownMarkWarning in every CI run).

Register the slow marker in pyproject.toml and add a conftest.py that
skips slow-marked tests unless --run-slow is passed. The canary stays
runnable on demand (pytest --run-slow) but no longer gates CI on a
flaky perf threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The lock's self-referenced spherely sdist hash captures the pyproject.toml
content, so any edit to that file (including the new [tool.pytest.ini_options]
section) invalidates the stored sha256 and fails `pixi install --locked`
in the `Tests via pixi` CI job. Regenerate the hash; no dependency changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove test_gil_release_actually_enables_parallelism. The threaded vs
serial timing threshold is too tight for shared CI runners and the
canary is not needed as a gate -- the eight remaining correctness and
stress tests already confirm the release doesn't corrupt output.

With the canary gone, the @pytest.mark.slow registration and the
conftest.py that gated it are no longer needed. Also reverts the
pyproject.toml [tool.pytest.ini_options] addition and the
pixi.lock self-hash update that the manifest edit required.

The remaining concurrency test file is trimmed to match the style
of the rest of tests/: no module docstring, no section dividers,
short or omitted test docstrings, no __future__ import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove three tests whose failure modes are invariants of pybind11 or
Python, not of the gil_scoped_release wrap added here:

  - test_concurrent_mixed_operations: different ops on shared inputs
    is the same s2 code path as the parametrized shared-inputs test,
    just less focused.
  - test_intersection_with_parallel_python_churn: tests the release/
    re-acquire boundary against concurrent GC, but that boundary is a
    pybind11 contract. A failure here would indicate a pybind11 bug,
    not a spherely one.
  - test_refcounts_stable_after_concurrent_runs: the change doesn't
    touch refcount logic; no plausible way for a leak to appear here
    that wouldn't also break the non-threaded path.

What remains targets the two concerns specific to this change:

  - test_concurrent_shared_inputs_match_serial (4 parametrized ops):
    concurrent reads on shared index state must not corrupt output.
  - test_concurrent_lazy_index_race: first-access materialization
    inside the index object during the released-GIL window on fresh
    Geographies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Geography::geog_index() is non-const: on first access it lazily
populates its backing unique_ptr. That write must happen under the
GIL, which means both index references have to be fetched above the
gil_scoped_release scope, not inside it. Add a two-line comment so a
future edit doesn't reorder those lines and race the lazy init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs marked this pull request as ready for review April 22, 2026 15:42