perf: release the GIL during s2 boolean operations#115
Open
thodson-usgs wants to merge 8 commits intobenbovy:mainfrom
Open
perf: release the GIL during s2 boolean operations#115thodson-usgs wants to merge 8 commits intobenbovy:mainfrom
thodson-usgs wants to merge 8 commits intobenbovy:mainfrom
Conversation
BooleanOp::operator() extracts the geography indices under the GIL (needed for Python ref access) and then wraps the s2geography call in py::gil_scoped_release so callers can parallelise across Python threads with ThreadPoolExecutor. The fetched index references are owned by the input Geographies which py::vectorize keeps alive for the duration of the batched call, so no Python state is touched during the release. Measured with a downstream xarray-regrid s2 build (180x360 -> 60x120 lat/lon, 114720 candidate pairs) on a 12-core M-series Mac: serial 1115 ms threads=2 641 ms (1.74x) threads=4 473 ms (2.36x) threads=8 663 ms End-to-end regridder __init__ drops from 1507 ms to 787 ms (-48%). Serial performance is unchanged (1126 ms on master vs 1115 ms with the patch, within run-to-run noise).
Nine tests in tests/test_boolean_operations_concurrency.py that cover
the failure modes the gil_scoped_release change could introduce:
Correctness (8 tests):
- Threaded vs serial output match for all four boolean ops on shared
input arrays (intersection / union / difference / symmetric_difference).
- Lazy-index race: fresh Geographies hit simultaneously from N threads
via a threading.Barrier to maximise first-access contention.
- Mixed operations on the same inputs from concurrent threads.
- Python-side GC/allocation churn happening in parallel with the
released-GIL boolean op.
- Input Geography refcounts unchanged after many concurrent runs.
Performance canary (1 test, @pytest.mark.slow):
- Asserts threaded < serial / 1.3 on a 4k-pair all-overlapping workload.
Fails on unpatched main (threaded ~ serial within noise) and passes
on the patched branch (~2.4x on 4 cores). Serves as a build-time
regression check for the release itself, independent of correctness.
Measured on the patched branch: all 9 pass in ~16 s on macOS arm64 /
Python 3.14 / 12 cores. On unmodified main, the canary fails in ~1 s
with \`assert 0.0956 < (0.0948 / 1.3)\` - a clear signal that the release
is missing.
Recommended: run the whole file under ThreadSanitizer in CI. The
docstring at the top notes the CFLAGS incantation.
No behavior change. Long lines wrapped / unwrapped to match the project's configured black and clang-format styles; misplaced import moved above its use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The perf canary test_gil_release_actually_enables_parallelism asserts at least a 1.3x speedup from threading. It's reliable on workstation- class hardware (2-2.4x on a 12-core M-series Mac) but the shared GitHub Actions Ubuntu runners deliver 1.22-1.27x — below the bar, above machine-level noise. The @pytest.mark.slow decorator on the test was intended to keep it out of CI but was never wired to a deselect mechanism, and the marker itself wasn't registered (pytest emitted PytestUnknownMarkWarning in every CI run). Register the slow marker in pyproject.toml and add a conftest.py that skips slow-marked tests unless --run-slow is passed. The canary stays runnable on demand (pytest --run-slow) but no longer gates CI on a flaky perf threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lock's self-referenced spherely sdist hash captures the pyproject.toml content, so any edit to that file (including the new [tool.pytest.ini_options] section) invalidates the stored sha256 and fails `pixi install --locked` in the `Tests via pixi` CI job. Regenerate the hash; no dependency changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove test_gil_release_actually_enables_parallelism. The threaded vs serial timing threshold is too tight for shared CI runners and the canary is not needed as a gate -- the eight remaining correctness and stress tests already confirm the release doesn't corrupt output. With the canary gone, the @pytest.mark.slow registration and the conftest.py that gated it are no longer needed. Also reverts the pyproject.toml [tool.pytest.ini_options] addition and the pixi.lock self-hash update that the manifest edit required. The remaining concurrency test file is trimmed to match the style of the rest of tests/: no module docstring, no section dividers, short or omitted test docstrings, no __future__ import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove three tests whose failure modes are invariants of pybind11 or
Python, not of the gil_scoped_release wrap added here:
- test_concurrent_mixed_operations: different ops on shared inputs
is the same s2 code path as the parametrized shared-inputs test,
just less focused.
- test_intersection_with_parallel_python_churn: tests the release/
re-acquire boundary against concurrent GC, but that boundary is a
pybind11 contract. A failure here would indicate a pybind11 bug,
not a spherely one.
- test_refcounts_stable_after_concurrent_runs: the change doesn't
touch refcount logic; no plausible way for a leak to appear here
that wouldn't also break the non-threaded path.
What remains targets the two concerns specific to this change:
- test_concurrent_shared_inputs_match_serial (4 parametrized ops):
concurrent reads on shared index state must not corrupt output.
- test_concurrent_lazy_index_race: first-access materialization
inside the index object during the released-GIL window on fresh
Geographies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Geography::geog_index() is non-const: on first access it lazily populates its backing unique_ptr. That write must happen under the GIL, which means both index references have to be fetched above the gil_scoped_release scope, not inside it. Add a two-line comment so a future edit doesn't reorder those lines and race the lazy init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wrap
s2geog::s2_boolean_operationinsideBooleanOp::operator()in apy::gil_scoped_release. Callers can then parallelize boolean ops across Python threads; index extraction stays under the GIL and only the pure-C++ s2 work runs with the GIL released.PyObjectGeography operator()(PyObjectGeography a, PyObjectGeography b) const { const auto& a_index = a.as_geog_ptr()->geog_index(); const auto& b_index = b.as_geog_ptr()->geog_index(); - std::unique_ptr<s2geog::Geography> geog_out = - s2geog::s2_boolean_operation(a_index, b_index, m_op_type, m_options); + std::unique_ptr<s2geog::Geography> geog_out; + { + py::gil_scoped_release release; + geog_out = s2geog::s2_boolean_operation(a_index, b_index, m_op_type, m_options); + } return make_py_geography(std::move(geog_out)); }Why this is safe. The released-GIL block only reads the two
const S2ShapeIndex&references, andMutableS2ShapeIndexdocuments this pattern as thread-safe ins2/mutable_s2shape_index.h:That covers both the pre-built case and the lazy-first-build case; in the latter, s2's internal wait-mutex serializes the one-time build across threads.
Motivation
In a downstream gridding experiment,
spherely.intersectionover a large object array dominated the total wall time.py::vectorizeholds the GIL throughout the batched call, so chunking the array across aThreadPoolExecutorgave no speedup on its own. Releasing the GIL around the s2 work lets a batched call fan out across threads — in that experiment, end-to-end setup time dropped by about half.Measured speedups
macOS arm64, 12-core M-series (6 P-cores + 6 E-cores), Python 3.13. Median of 3 runs of the benchmark script below.
Unpatched
maindoes not scale with threads, as expected; single-threaded runtime is unchanged within noise.Benchmark
Paste into a file and run against the installed spherely. No dependencies beyond spherely + numpy.
Tests
tests/test_boolean_operations_concurrency.pyadds two tests (5 items with parametrization) that target the failure modes specific to this change:test_concurrent_shared_inputs_match_serial, parametrized over the four boolean ops, exercises concurrent reads on shared index state. Each op runs on a 4-threadThreadPoolExecutorbehind athreading.Barrierso all threads enter the released-GIL window simultaneously; outputs are compared bit-for-bit against a serial reference.test_concurrent_lazy_index_racedoes the same but on freshly-constructed Geographies, to surface any first-access materialization inside the index object that runs with the GIL released.Open questions for the maintainer
-fsanitize=threadagainst the conda-forge s2geography / s2geometry and ran the concurrency tests; no warnings. Caveat: TSAN only instruments what was compiled with the flag, and conda-forge's s2geography is not — so the wrapper is covered but the index internals aren't. Wiring TSAN into CI would need either a TSAN-built s2geography variant (doesn't exist on conda-forge) or building s2geography from source in the CI job. Left out of CI for now; happy to add it gated on merge if you'd like.BooleanOpis patched. Predicates (intersects,contains,within,touches) go through the same index-based code path and would benefit identically; happy to expand in this PR or leave for a follow-up.Appendix: performance canary
Not included in the PR — this test ran a 4000-pair overlapping-polygon workload across a
ThreadPoolExecutorto directly confirm the GIL was released. It's CPU-intensive and timing-sensitive (reliable on workstation hardware, flaky on shared CI runners). Reproduced here for anyone who wants to verify the release locally.On the 12-core M-series machine used for the speedup table above it passes with threaded ≈ 2.6× serial; on unpatched
mainit fails with threaded ≈ serial within noise.