Warp determinism #1355

Draft
mmacklin wants to merge 10 commits into NVIDIA:main from mmacklin:warp-deterministic

Conversation

@mmacklin
Collaborator

@mmacklin mmacklin commented Apr 10, 2026

Description

Add a deterministic execution mode for supported atomic patterns via wp.config.deterministic, with module-level and per-kernel overrides through the existing module options system.

When enabled, floating-point accumulation atomics (atomic_add, atomic_sub, atomic_min, atomic_max) are transparently redirected through a deterministic scatter-sort-reduce path, and counter / allocator patterns that consume the atomic return value use an automatic two-pass count-scan-execute path. This provides bit-exact reproducible CUDA results across runs without requiring users to manually rewrite kernels.
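The scatter-sort-reduce idea behind the accumulation path can be sketched in plain Python (an illustrative model only — the actual implementation records to device buffers and sorts with CUB):

```python
def deterministic_scatter_add(n_bins, contributions):
    """Reduce (dest_index, thread_id, value) records in a fixed order.

    contributions: (dest_index, thread_id, value) tuples, in any arrival
    order. Sorting by (dest_index, thread_id) fixes the summation order,
    so the floating-point result is bit-identical across runs regardless
    of the order in which threads happened to scatter their records.
    """
    out = [0.0] * n_bins
    for dest, _tid, value in sorted(contributions):
        out[dest] += value
    return out

# Two runs with different arrival orders produce identical results.
records = [(0, 2, 0.1), (1, 0, 0.2), (0, 1, 0.3), (0, 0, 0.7)]
a = deterministic_scatter_add(2, records)
b = deterministic_scatter_add(2, list(reversed(records)))
assert a == b
```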

This PR also adds targeted fixes and coverage for deterministic launch edge cases:

  • suppress side effects during the Phase 0 counting pass
  • separate scatter buffers by target and reduction op
  • improve scatter buffer capacity accounting
  • preserve wp.launch(..., record_cmd=True) support for deterministic kernels

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Test plan

Verified with:

uv run warp/tests/test_deterministic.py
uvx pre-commit run --files warp/_src/codegen.py warp/_src/context.py warp/_src/deterministic.py warp/tests/test_deterministic.py
uvx pre-commit run --files design/deterministic-execution.md warp/_src/context.py

The deterministic test module covers:

  • reproducibility for float accumulation atomics
  • += lowering to deterministic atomics
  • float64 support
  • multi-array and 2D indexing cases
  • counter / allocator reproducibility and correctness
  • mixed counter + accumulation kernels
  • per-kernel module option override
  • phase-0 side-effect suppression
  • mixed reduce ops on the same array
  • record_cmd=True support for deterministic launches

New feature / enhancement

import numpy as np
import warp as wp

wp.init()
wp.config.deterministic = True

@wp.kernel
def scatter_add(values: wp.array(dtype=wp.float32),
                indices: wp.array(dtype=wp.int32),
                out: wp.array(dtype=wp.float32)):
    tid = wp.tid()
    wp.atomic_add(out, indices[tid], values[tid])

n = 1024
values_np = np.ones(n, dtype=np.float32)
indices_np = np.arange(n, dtype=np.int32) % 16

values = wp.array(values_np, dtype=wp.float32, device="cuda")
indices = wp.array(indices_np, dtype=wp.int32, device="cuda")

results = []
for _ in range(5):
    out = wp.zeros(16, dtype=wp.float32, device="cuda")
    wp.launch(scatter_add, dim=n, inputs=[values, indices], outputs=[out], device="cuda")
    results.append(out.numpy())

# Bit-exact reproducibility across runs
for i in range(1, len(results)):
    np.testing.assert_array_equal(results[0], results[i])

Summary by CodeRabbit

  • New Features

    • Added a deterministic execution mode for atomics with global/module/kernel toggles, configurable record limits and debug diagnostics; supports ordered scatter reductions and two-pass counter/allocator semantics with GPU implementation and CPU fallbacks.
  • Documentation

    • Added a detailed design specification and changelog entry describing modes, supported patterns, limitations, and configuration semantics.
  • Tests

    • Added extensive deterministic tests covering reproducibility, correctness, overrides, capacity/overflow, and capture/replay.
  • Chores

    • Build system extended to include native deterministic sources for CPU/GPU.

OpenClaw Bot and others added 3 commits April 10, 2026 03:03
Introduce wp.config.deterministic flag that makes floating-point atomic
operations produce bit-exact reproducible results across runs. Two atomic
usage patterns are handled transparently:

Pattern A (accumulation): atomic_add/sub with unused return values are
redirected to scatter buffers during kernel execution, then sorted by
(dest_index, thread_id) and reduced in fixed order post-kernel via CUB
radix sort and a custom segmented reduce kernel.

Pattern B (counter/allocator): atomic_add with consumed return values
(slot = atomic_add(counter, 0, 1)) use automatic two-pass execution:
Phase 0 records per-thread contributions with side effects suppressed,
prefix sum computes deterministic offsets, Phase 1 re-executes with
deterministic slot assignments.

Both patterns can coexist in a single kernel. Integer atomics with
unused return values are left unchanged (already deterministic). CPU
execution is unaffected (already sequential). Configurable at global,
module, and kernel levels.
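The two-pass counter flow described above can be modeled in a few lines of Python (two_pass_allocate is a hypothetical helper name; the real kernels run on device, with side effects suppressed during Phase 0):

```python
def two_pass_allocate(requests):
    """Deterministic slot assignment for a counter/allocator pattern.

    requests[i] is the number of slots thread i would claim via an
    atomic add on a counter. Phase 0 records per-thread counts, an
    exclusive prefix sum turns them into base offsets, and Phase 1
    hands each thread its deterministic slot range.
    """
    # Phase 0: record per-thread contributions (side effects suppressed
    # in the real kernel).
    contrib = list(requests)

    # Exclusive prefix sum -> per-thread base offsets.
    prefix, running = [], 0
    for c in contrib:
        prefix.append(running)
        running += c
    total = running  # == prefix[-1] + contrib[-1]

    # Phase 1: thread i uses prefix[i] as its slot base, independent of
    # hardware scheduling order.
    return prefix, total

offsets, total = two_pass_allocate([2, 0, 3, 1])
# offsets == [0, 2, 2, 5], total == 6
```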

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
@copy-pr-bot

copy-pr-bot bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds an opt-in deterministic execution mode for supported atomic operations: new global/module/kernel config flags, codegen interception to emit scatter or two‑phase counter patterns, deterministic metadata and buffer management, CUDA device/CPU stubs for deterministic sort‑reduce, build updates, and extensive tests and benchmarks.

Changes

Cohort / File(s) — Summary

  • Configuration & Design (CHANGELOG.md, design/deterministic-execution.md, warp/config.py): Introduces wp.config.deterministic and deterministic_debug, adds a changelog entry and a design doc describing determinism modes, supported atomic patterns, limits, and debug behavior.
  • Codegen & Deterministic Metadata (warp/_src/codegen.py, warp/_src/deterministic.py): Adds an interceptable atomic allowlist and order-dependent classification, DeterministicMeta and target dataclasses, codegen paths for Pattern A (scatter sort-reduce) and Pattern B (two-pass counter), assign-time return-use tracking, store suppression in phase 0, and helpers for target deduplication and buffer sizing.
  • Context, Kernel Options & Launch Runtime (warp/_src/context.py): Adds kernel decorator params deterministic and deterministic_max_records, per-kernel/module resolution and hashing to populate det_meta, and DeterministicLaunch and _launch_deterministic to allocate buffers, orchestrate phase runs, update counters, and invoke the post-kernel sort/reduce; augments generated kernel signatures with hidden deterministic parameters and includes a new ctypes binding.
  • Native Device Support & Headers (warp/native/deterministic.cu, warp/native/deterministic.h, warp/native/deterministic.cpp, warp/native/warp.h): Adds a CUDA implementation of deterministic scatter/sort + segmented reduction with scalar-type dispatch, a device scatter helper, the header API (wp::deterministic::scatter), the public entry wp_deterministic_sort_reduce_device, and a CPU stub for non-CUDA builds.
  • Build System (build_lib.py): Adds native/deterministic.cpp to the CPU build units and native/deterministic.cu to the CUDA build units so deterministic native code is compiled into the libraries.
  • Tests & Suite Integration (warp/tests/test_deterministic.py, warp/tests/unittest_suites.py, warp/tests/test_unique_module.py): Adds comprehensive deterministic tests covering scatter, counters, mixed ops, float64, capacity/overflow, per-kernel/module overrides, CUDA graph recording/capture, and module-hashing tests; integrates TestDeterministic into the default suite.
  • Benchmarks (asv/benchmarks/atomics.py): Adds deterministic benchmark variants and sizes, new deterministic kernels, and CUDA graph capture for repeated timed replays to measure determinism overhead.

Sequence Diagram(s)

sequenceDiagram
    participant User as User Kernel
    participant Python as Python Runtime
    participant CUDA as CUDA Device
    participant ScatterBuf as Scatter Buffers
    participant SortReduce as Sort-Reduce Kernel
    participant Dest as Destination Array

    User->>CUDA: Launch kernel (scatter pattern)
    CUDA->>ScatterBuf: wp::deterministic::scatter (pack key+value, inc counter)
    CUDA->>Python: Kernel returns
    Python->>ScatterBuf: Read record count
    Python->>SortReduce: Call device sort-reduce entry
    SortReduce->>ScatterBuf: Radix sort keys/values
    SortReduce->>Dest: Deterministic segment-wise reduce into dest
    SortReduce->>Python: Complete
    Python->>User: Return results
sequenceDiagram
    participant User as User Kernel
    participant Python as Python Runtime
    participant Phase0 as Phase 0 (Count)
    participant Scan as Prefix Scan
    participant Phase1 as Phase 1 (Execute)
    participant Counter as Counter Array

    User->>Python: Launch deterministic kernel (counter pattern)
    Python->>Phase0: Run kernel with _wp_det_phase=0 (suppress side effects)
    Phase0->>Python: Accumulated per-counter counts
    Python->>Scan: Compute deterministic prefix sums (offsets)
    Scan->>Python: Offsets ready
    Python->>Phase1: Run kernel with _wp_det_phase=1 (use offsets)
    Phase1->>Counter: Writeback using deterministic offsets
    Phase1->>Python: Complete
    Python->>User: Return results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 58.02%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title "Warp determinism" is vague and generic; while it references determinism (a real part of the changeset), it is too broad to convey the primary change to someone scanning pull request history. Consider a more descriptive title, such as "Add deterministic execution mode for atomic operations" or "Implement configurable deterministic atomics with scatter-sort-reduce".

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps bot commented Apr 10, 2026

Greptile Summary

This PR introduces a comprehensive deterministic execution mode for Warp atomics via wp.config.deterministic. Floating-point atomics (atomic_add/sub/min/max) are transparently redirected through a scatter-sort-reduce path (Pattern A), while counter/allocator patterns that consume the return value use an automatic two-pass count-scan-execute flow (Pattern B). The feature is configurable at global, module, and kernel levels, with thorough test coverage for reproducibility, overrides, graph capture, and edge cases.

  • The double array_scan over contrib at context.py:7657–7664 computes an exclusive scan (needed for Phase 1) and then immediately runs a second inclusive scan solely to read the total from its last element — this O(N) pass and its buffer allocation are unnecessary since total = prefix[dim_size-1] + contrib[dim_size-1] is available after the first scan.
  • run_sort_reduce passes capacity (the full pre-allocated buffer size, up to max(dim_size, 1024)) as the record count to the native sort-reduce entry point; the actual record count sitting in _counter is silently discarded, causing CUB to sort and reduce dead -1-keyed slots, potentially doing 10× or more unnecessary work for small launches.
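The first finding can be illustrated with a small sketch: a single exclusive scan already yields the grand total, so the second inclusive scan (and its temporary buffer) adds nothing:

```python
# One exclusive scan suffices: the total equals the last prefix entry
# plus the last contribution, so no second (inclusive) scan is needed.
contrib = [3, 0, 2, 5, 1]

prefix, running = [], 0
for c in contrib:          # exclusive scan, one O(N) pass
    prefix.append(running)
    running += c

total = prefix[-1] + contrib[-1]
assert total == sum(contrib)  # == 11
```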

Confidence Score: 4/5

Safe to merge for the primary use cases (forward-pass float32/float64 atomics), but open P1 findings from prior threads (backward kernel UB, float16 sort mismatch, counter dest_offset hardcoding) should be addressed before landing.

Score of 4 reflects that several prior-thread P1 issues remain unresolved: the backward kernel receives extra det_params in its signature but the adjoint launch path never supplies them (UB for tape-backward through deterministic kernels), float16 atomics in deterministic mode are silently broken, and the counter array total is always written to index 0 regardless of the actual atomic index. New findings are both P2 performance concerns (double scan, capacity-as-count) that do not block correctness for float32/float64 forward launches.

warp/_src/codegen.py (backward kernel det_param signature) and warp/_src/context.py (_launch_deterministic counter total update, double scan) warrant the most attention before merge.

Important Files Changed

  • warp/_src/deterministic.py — New module implementing the scatter-sort-reduce (Pattern A) and two-pass counter (Pattern B) abstractions; capacity-as-count in run_sort_reduce passes more elements than needed to the native sort.
  • warp/_src/context.py — Adds DeterministicLaunch, _launch_deterministic, and the two-pass counter orchestration; the double array_scan for the counter total is wasteful, and the counter total is hardcoded to dest_offset=0 (flagged in a prior thread).
  • warp/_src/codegen.py — Adds _emit_deterministic_atomic and hidden det_params to both forward and backward kernel signatures; backward params are appended but never supplied at adjoint launch (flagged in a prior thread).
  • warp/native/deterministic.cu — Implements CUB-based radix sort + segmented reduce with scalar and composite type paths; a half-precision path exists but was flagged in a prior thread for a potential type mismatch in the scalar_run_to_run path.
  • warp/native/deterministic.h — Defines the device-side scatter helper template; correctly packs (dest_idx, thread_id) into a 64-bit sort key and handles overflow with an optional debug print.
  • warp/tests/test_deterministic.py — Comprehensive test coverage for all patterns, overrides, graph capture, and capacity overflow; wp.clear_kernel_cache() in __main__ violates project guidelines (flagged in a prior thread).
  • warp/config.py — Adds the deterministic and deterministic_debug config options with correct default values and documentation.
  • warp/tests/unittest_suites.py — Correctly imports and registers TestDeterministic in default_suite per project guidelines.
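The 64-bit sort key layout mentioned for deterministic.h can be sketched as follows (pack_key is a hypothetical name; the point is that putting the destination index in the high bits and the thread id in the low bits makes a single radix sort produce a fixed per-destination reduction order):

```python
def pack_key(dest_idx: int, thread_id: int) -> int:
    """Pack (dest_idx, thread_id) into one 64-bit radix-sort key.

    High 32 bits order records by destination; low 32 bits break ties
    by thread id, fixing the order in which values are reduced.
    """
    return ((dest_idx & 0xFFFFFFFF) << 32) | (thread_id & 0xFFFFFFFF)

# Destination dominates the ordering, thread id breaks ties.
assert pack_key(1, 0) > pack_key(0, 0xFFFFFFFF)
assert pack_key(3, 7) & 0xFFFFFFFF == 7
```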

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Launch as wp.launch()
    participant Det as _launch_deterministic()
    participant GPU as CUDA Device
    participant SortReduce as wp_deterministic_sort_reduce_device

    User->>Launch: wp.launch(kernel, dim=N, ...)
    Launch->>Launch: detect det_meta on kernel.adj
    Launch->>Det: _launch_deterministic(kernel, params, ...)

    Note over Det: Allocate scatter/counter buffers

    alt Pattern B (counter): has_counter
        Det->>GPU: Phase 0 kernel launch (side-effects suppressed, scatter disabled)
        Det->>GPU: array_scan(contrib to prefix, inclusive=False)
        Det->>GPU: array_scan(contrib to inclusive_out, inclusive=True)
        Det->>GPU: copy(inclusive_out[-1] to counter_arr[0])
        Det->>GPU: Phase 1 kernel launch (deterministic slot assignment)
    else Pattern A only (scatter)
        Det->>GPU: Single kernel launch (writes to scatter buffer)
    end

    alt has_scatter
        Det->>SortReduce: sort-reduce scatter buffer (capacity elements)
        SortReduce->>GPU: CUB RadixSort keys + values/indices
        SortReduce->>GPU: apply_reduced_runs / deterministic_reduce_kernel
        SortReduce->>GPU: write aggregates to dest_array
    end

    Det-->>Launch: return
    Launch-->>User: return (or DeterministicLaunch if record_cmd=True)

Reviews (5): Last reviewed commit: "Add deterministic mode levels"

Comment on lines +231 to +237
if target.value_ctype in ("float", "wp::half"):
    fn = runtime.core.wp_deterministic_sort_reduce_float_device
elif target.value_ctype == "double":
    fn = runtime.core.wp_deterministic_sort_reduce_double_device
else:
    warp_utils.warn(f"Unsupported value type '{target.value_ctype}' for deterministic sort-reduce.")
    continue

P1 float16 sort-reduce type mismatch

wp::half values are allocated in a warp.float16 buffer (2 bytes/element), but run_sort_reduce dispatches them to wp_deterministic_sort_reduce_float_device, which reinterprets the pointer as float* (4 bytes/element). This means CUB sorts garbage data and the reduce accumulates incorrect values into the destination — the wrong number of bytes is read per record.

This path is reachable because is_float_type returns True for warp.float16 and warp_type_to_ctype returns "wp::half", so any kernel with a float16 atomic_add in deterministic mode will hit this bug silently.

The fix is either to add a dedicated wp_deterministic_sort_reduce_half_device function templated on __half, or to document and enforce that float16 atomics are not supported in deterministic mode (raising an error in _emit_deterministic_atomic).
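The byte-width mismatch is easy to demonstrate with Python's struct module (an illustration of the reinterpretation, not Warp code):

```python
import struct

# Two float16 values occupy the same 4 bytes as a single float32.
# Reinterpreting a float16 buffer through a float* therefore reads the
# wrong number of bytes per record, and the sorted values are garbage.
half_buf = struct.pack("<2e", 1.0, 1.0)      # two half-precision 1.0s, 4 bytes
as_float = struct.unpack("<f", half_buf)[0]  # one bogus float32

assert len(half_buf) == 4
assert as_float != 1.0  # bears no relation to the original inputs
```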

Comment on lines +789 to +790
if __name__ == "__main__":
    wp.clear_kernel_cache()

P1 wp.clear_kernel_cache() in test file violates project guidelines

The project's AGENTS.md states explicitly: "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files — not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

The test plan in the PR description runs this file directly (uv run warp/tests/test_deterministic.py), which triggers this __main__ block.

Suggested change
-if __name__ == "__main__":
-    wp.clear_kernel_cache()
+if __name__ == "__main__":
+    unittest.main()

Comment on lines +7595 to +7603
inclusive_out = warp.empty(shape=(dim_size,), dtype=warp.int32, device=device)
warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True)
# Find the counter array in fwd_args and add the total.
for j, arg in enumerate(kernel.adj.args):
    if arg.label == ct.array_var_label:
        counter_arr = fwd_args[j]
        # Copy the total (last element of inclusive scan) to the counter.
        warp.copy(counter_arr, inclusive_out, dest_offset=0, src_offset=dim_size - 1, count=1)
        break

P2 Counter total update hardcodes dest_offset=0

The runtime always writes the total count to index 0 of the counter array (dest_offset=0). If a kernel uses wp.atomic_add(counter, N, 1) with N != 0, the counter element at N is never updated after the two-pass launch — the user-visible counter value would remain stale. The common usage is N=0, but it is worth either documenting this constraint or storing the index per CounterTarget and using it here.
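A minimal sketch of the hazard (plain Python, hypothetical values):

```python
# The runtime writes the launch total to counter[0], but the kernel may
# have targeted counter[N] with N != 0 — that element stays stale.
counter = [0, 0, 0]
target_index = 2   # kernel did the equivalent of atomic_add(counter, 2, 1)
launch_total = 7

counter[0] = launch_total            # current behavior: dest_offset=0 hardcoded
assert counter[target_index] == 0    # user-visible counter never updated
```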


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
warp/native/deterministic.h (1)

41-49: Clarify the non-CUDA branch comment.

Line 41 says “direct accumulation,” but Lines 42-48 are a no-op. Rewording this would prevent confusion during maintenance.

✏️ Suggested comment tweak
-    // CPU path: direct accumulation (CPU kernels are sequential).
+    // Non-CUDA path: no-op for this helper (CPU execution does not use scatter buffers).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/native/deterministic.h` around lines 41 - 49, The comment "CPU path:
direct accumulation (CPU kernels are sequential)" is misleading because the
non-CUDA branch simply voids a set of unused variables (keys, values, counter,
capacity, dest_flat_idx, thread_id, value); change the comment near that void
list in deterministic.h to clearly state these variables are intentionally
unused in the non-CUDA build (e.g., "Non-CUDA build: no per-thread accumulation
— explicitly mark these kernel-specific variables as unused") so maintainers
understand why the (void)XXX lines are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 7-11: Update the CHANGELOG entry (mentioning
wp.config.deterministic) to be API-level and include a GitHub issue/PR
reference: remove internal implementation details like "scatter-sort-reduce" and
"two-pass execution with prefix-sum-based slot assignment", instead describe the
user-visible change (e.g., "Added a deterministic execution mode for atomic
operations via wp.config.deterministic = True that makes atomic accumulations
reproducible across runs"), append a short note about scope
(global/module/kernel) and add the GH issue/PR number (e.g., "See `#1234`") and
affected version/release tag.

In `@design/deterministic-execution.md`:
- Around line 108-109: The doc currently mixes exclusive and inclusive scan
conventions: update the design to use a single convention for
wp.utils.array_scan(writeback rule) — either always document it as exclusive
(inclusive=False) and state that the total is computed as prefix[last] +
contrib[last], or document it as inclusive (inclusive=True) and state that the
total is prefix[last]; make the descriptions at the wp.utils.array_scan example
near "Prefix sum" (previously showing inclusive=False) and the section that
references "total comes from the last element" (lines ~159-160) consistent by
choosing one convention and adjusting the text to reference
wp.utils.array_scan(contrib, prefix, inclusive=<True/False>) and the
corresponding rule for computing the thread-local total.

In `@warp/_src/codegen.py`:
- Around line 42-44: The frozenset _DET_ORDER_DEPENDENT_ATOMICS (containing
"atomic_cas" and "atomic_exch") is defined but unused; either remove it or
implement the promised "warn but don't intercept" behavior: locate the code path
that dispatches/handles atomic ops (e.g., the function that processes atomic
intrinsics or emits atomics such as the atomic handling/emit function in
codegen.py), and when an atomic op name is in _DET_ORDER_DEPENDENT_ATOMICS, emit
a clear warning (use warnings.warn or the module logger) stating that
order-dependent atomics are not intercepted and will run with native ordering,
then continue normal processing without interception; if you prefer removal,
delete the _DET_ORDER_DEPENDENT_ATOMICS constant and any related comment and add
a brief unit test or code comment documenting the choice.
- Around line 1881-1882: The fallback that sets flat_idx_expr = "0" when ndim >
4 is unsafe because it silently writes everything to index 0; replace this
silent fallback with an explicit error: detect the unsupported ndim case in the
same code path in codegen.py (the branch that currently assigns flat_idx_expr)
and raise a clear exception (e.g., ValueError or RuntimeError) that includes the
invalid ndim value and a message stating Warp arrays support up to 4 dimensions;
do not assign "0" as a default index.

In `@warp/_src/context.py`:
- Around line 7477-7480: set_param_at_index_from_ctype() and
set_params_from_ctypes() mutate self.params but do not keep self.fwd_args in
sync, so _launch_deterministic() can replay using stale array objects; update
those methods to mirror the logic in set_param_at_index(): when adjoint is False
and the target index is within range(len(self.fwd_args)), assign the new value
into self.fwd_args[index] (or for bulk updates, update the corresponding
slice/indices) — or refactor those methods to call set_param_at_index(index,
value, adjoint) for each changed param so fwd_args stays consistent with params
for deterministic replay.
- Around line 7589-7602: The two-pass path must preserve the caller stream and
keep the counter's initial value: ensure the intermediate scans
(warp._src.utils.array_scan calls), the warp.copy that writes the total back
into the counter array, and the subsequent run_sort_reduce invocation are
executed on the same stream passed into this path (propagate the local stream
object into those calls or use stream-aware variants) and when writing the
counter combine the existing counter value with this-launch total (read the
current counter_arr[0], add inclusive_out[dim_size-1] and write the sum back)
instead of overwriting; update references around array_scan, inclusive_out,
counter_arr (found via kernel.adj.args and fwd_args), warp.copy and
run_sort_reduce so they all use the caller stream and perform an atomic/ordered
add of the previous counter value plus the new total.

In `@warp/_src/deterministic.py`:
- Around line 140-161: The mapping and helpers currently claim support for
warp.float16 but the native reducer entrypoint
wp_deterministic_sort_reduce_float_device (and the C++ reinterpret-casts in
native/deterministic.cu) do not handle 16-bit halves; update _WARP_TO_CTYPE and
the conversion helpers to fail fast for half: remove or change the "wp::half"
mapping so warp_type_to_ctype raises for warp.float16 (or explicitly check and
raise in is_float_type/warp_type_to_ctype), and add a clear ValueError
mentioning warp.float16 and wp_deterministic_sort_reduce_float_device so any
attempt to use half reductions immediately errors until a real half reducer is
implemented.

In `@warp/tests/test_deterministic.py`:
- Around line 789-790: Remove the call to wp.clear_kernel_cache() from this test
module (including the __main__ block) because wp.clear_kernel_cache() is
disallowed in test files; simply delete the line invoking
wp.clear_kernel_cache() so no cache-clearing is performed here.

In `@warp/tests/unittest_suites.py`:
- Line 142: The default_suite() function imports TestDeterministic but never
includes it in the test_classes list, so deterministic tests are skipped; update
the test_classes array/variable inside default_suite() to include
TestDeterministic (alongside the other classes), ensuring the symbol
TestDeterministic is added to the list used to build the suite returned by
default_suite().

---

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2e7bc7a6-3bed-4194-bd96-d94feaed2cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 53a7bf5 and 5ec9b25.

📒 Files selected for processing (13)
  • CHANGELOG.md
  • build_lib.py
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/deterministic.h
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
  • warp/tests/unittest_suites.py

Comment on lines +7 to +11
- Add deterministic execution mode for atomic operations via `wp.config.deterministic = True`.
Floating-point atomic accumulations use a scatter-sort-reduce strategy for bit-exact
reproducibility across runs. Counter/allocator atomics (where the return value is used)
use automatic two-pass execution with prefix-sum-based slot assignment. Configurable at
the global, module, and kernel level.

⚠️ Potential issue | 🟠 Major

Changelog entry should include a GH reference and stay API-level.

Lines 7-11 describe internal mechanics but do not include an issue/PR reference.

📝 Suggested rewrite
-- Add deterministic execution mode for atomic operations via `wp.config.deterministic = True`.
-  Floating-point atomic accumulations use a scatter-sort-reduce strategy for bit-exact
-  reproducibility across runs. Counter/allocator atomics (where the return value is used)
-  use automatic two-pass execution with prefix-sum-based slot assignment. Configurable at
-  the global, module, and kernel level.
+- Add deterministic atomic execution mode via `wp.config.deterministic = True`, with global, module, and kernel-level control for reproducible results across CUDA runs ([GH-1355](https://github.com/NVIDIA/warp/pull/1355)).

As per coding guidelines: "If a change modifies user-facing behavior, append an entry ... include issue refs ... and avoid internal implementation details."


Comment on lines +108 to +109
2. *Prefix sum*: ``wp.utils.array_scan(contrib, prefix, inclusive=False)``
computes deterministic per-thread offsets.

⚠️ Potential issue | 🟡 Minor

Use one scan convention in the design doc.

Line 108 documents an exclusive scan (inclusive=False), but Lines 159-160 say the total comes from the last element of an inclusive scan. Please make those two sections describe the same writeback rule.

Also applies to: 159-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 108 - 109, The doc currently
mixes exclusive and inclusive scan conventions: update the design to use a
single convention for wp.utils.array_scan(writeback rule) — either always
document it as exclusive (inclusive=False) and state that the total is computed
as prefix[last] + contrib[last], or document it as inclusive (inclusive=True)
and state that the total is prefix[last]; make the descriptions at the
wp.utils.array_scan example near "Prefix sum" (previously showing
inclusive=False) and the section that references "total comes from the last
element" (lines ~159-160) consistent by choosing one convention and adjusting
the text to reference wp.utils.array_scan(contrib, prefix,
inclusive=<True/False>) and the corresponding rule for computing the
thread-local total.
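Either convention works as long as the writeback rule matches it; a small NumPy sketch (illustrative names, not the Warp API) shows that the two total rules agree:

```python
import numpy as np

# Illustrative model of the two scan conventions: contrib[i] is thread i's
# contribution count, prefix is the scratch buffer the design doc describes.
contrib = np.array([2, 0, 3, 1], dtype=np.int32)

# Exclusive scan (inclusive=False): prefix[i] = sum(contrib[:i]).
exclusive = np.concatenate(([0], np.cumsum(contrib)[:-1]))
total_exclusive = exclusive[-1] + contrib[-1]  # writeback: last offset + last count

# Inclusive scan (inclusive=True): prefix[i] = sum(contrib[:i + 1]).
inclusive = np.cumsum(contrib)
total_inclusive = inclusive[-1]  # writeback: just the last element

assert total_exclusive == total_inclusive == 6
```

Whichever convention the implementation actually uses, the doc's writeback rule should quote the matching total formula.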

Comment on lines +42 to +44
# Atomics that are inherently order-dependent (warn but don't intercept).
_DET_ORDER_DEPENDENT_ATOMICS = frozenset({"atomic_cas", "atomic_exch"})


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for usages of _DET_ORDER_DEPENDENT_ATOMICS across the codebase
rg -n "_DET_ORDER_DEPENDENT_ATOMICS" --type=py

Repository: NVIDIA/warp

Length of output: 150


Remove or implement the warning logic for _DET_ORDER_DEPENDENT_ATOMICS.

The frozenset is defined but never used in the codebase. The PR description mentions "warn but don't intercept" for order-dependent atomics like atomic_cas and atomic_exch, but no warning implementation exists. Either add the warning logic or remove the unused constant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/codegen.py` around lines 42 - 44, The frozenset
_DET_ORDER_DEPENDENT_ATOMICS (containing "atomic_cas" and "atomic_exch") is
defined but unused; either remove it or implement the promised "warn but don't
intercept" behavior: locate the code path that dispatches/handles atomic ops
(e.g., the function that processes atomic intrinsics or emits atomics such as
the atomic handling/emit function in codegen.py), and when an atomic op name is
in _DET_ORDER_DEPENDENT_ATOMICS, emit a clear warning (use warnings.warn or the
module logger) stating that order-dependent atomics are not intercepted and will
run with native ordering, then continue normal processing without interception;
if you prefer removal, delete the _DET_ORDER_DEPENDENT_ATOMICS constant and any
related comment and add a brief unit test or code comment documenting the
choice.
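If the warning route is chosen, a minimal sketch could look like the following. The helper name and dispatch hook are assumptions; the real interception site in codegen.py may differ:

```python
import warnings

# Hypothetical "warn but don't intercept" guard; only the frozenset contents
# come from the PR, the helper itself is illustrative.
_DET_ORDER_DEPENDENT_ATOMICS = frozenset({"atomic_cas", "atomic_exch"})

def check_deterministic_atomic(func_key: str) -> None:
    """Warn when an inherently order-dependent atomic is left un-intercepted."""
    if func_key in _DET_ORDER_DEPENDENT_ATOMICS:
        warnings.warn(
            f"{func_key} is order-dependent and is not intercepted by the "
            "deterministic mode; it will run with native ordering.",
            stacklevel=2,
        )
```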

Comment on lines +1881 to +1882
        else:
            flat_idx_expr = "0"

⚠️ Potential issue | 🟡 Minor

Silent fallback for ndim > 4 produces incorrect index.

When ndim > 4, the code silently falls back to flat_idx_expr = "0", which would cause all writes to target index 0, corrupting results without any warning or error.

Warp arrays support up to 4 dimensions, so this should ideally raise an error to catch any future changes or edge cases.

Proposed fix
         elif ndim == 4:
             flat_idx_expr = (
                 f"(var_{idx_loaded_list[0]} * var_{arr_loaded}.shape[1] * var_{arr_loaded}.shape[2] * var_{arr_loaded}.shape[3] "
                 f"+ var_{idx_loaded_list[1]} * var_{arr_loaded}.shape[2] * var_{arr_loaded}.shape[3] "
                 f"+ var_{idx_loaded_list[2]} * var_{arr_loaded}.shape[3] + var_{idx_loaded_list[3]})"
             )
         else:
-            flat_idx_expr = "0"
+            raise WarpCodegenError(
+                f"Deterministic atomics not supported for arrays with {ndim} dimensions (max 4)"
+            )

Suggested change
         else:
-            flat_idx_expr = "0"
+            raise WarpCodegenError(
+                f"Deterministic atomics not supported for arrays with {ndim} dimensions (max 4)"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/codegen.py` around lines 1881 - 1882, The fallback that sets
flat_idx_expr = "0" when ndim > 4 is unsafe because it silently writes
everything to index 0; replace this silent fallback with an explicit error:
detect the unsupported ndim case in the same code path in codegen.py (the branch
that currently assigns flat_idx_expr) and raise a clear exception (e.g.,
ValueError or RuntimeError) that includes the invalid ndim value and a message
stating Warp arrays support up to 4 dimensions; do not assign "0" as a default
index.
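The per-dimension expressions above implement the standard row-major (C-order) flat index; a NumPy check with illustrative shapes confirms the 4D formula matches `np.ravel_multi_index`:

```python
import numpy as np

# Row-major flat index, same shape as the generated 4D expression.
shape = (2, 3, 4, 5)
idx = (1, 2, 3, 4)
flat = (
    idx[0] * shape[1] * shape[2] * shape[3]
    + idx[1] * shape[2] * shape[3]
    + idx[2] * shape[3]
    + idx[3]
)
assert flat == int(np.ravel_multi_index(idx, shape)) == 119
```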

Comment on lines +7477 to +7480
    def set_param_at_index(self, index: int, value: Any, adjoint: bool = False):
        super().set_param_at_index(index, value, adjoint)
        if not adjoint and index < len(self.fwd_args):
            self.fwd_args[index] = value

⚠️ Potential issue | 🟡 Minor

Keep fwd_args consistent for the raw-ctype replay mutators too.

_launch_deterministic() uses self.fwd_args as the source of truth for destination/counter arrays, but this subclass only syncs it in set_param_at_index(). Calls through set_param_at_index_from_ctype() / set_params_from_ctypes() still mutate self.params without updating self.fwd_args, so a recorded deterministic launch can replay into the old array object even after the caller swapped the packed argument.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7477 - 7480,
set_param_at_index_from_ctype() and set_params_from_ctypes() mutate self.params
but do not keep self.fwd_args in sync, so _launch_deterministic() can replay
using stale array objects; update those methods to mirror the logic in
set_param_at_index(): when adjoint is False and the target index is within
range(len(self.fwd_args)), assign the new value into self.fwd_args[index] (or
for bulk updates, update the corresponding slice/indices) — or refactor those
methods to call set_param_at_index(index, value, adjoint) for each changed param
so fwd_args stays consistent with params for deterministic replay.
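A simplified stand-in for the suggested refactor: route every mutator through `set_param_at_index` so `fwd_args` can never drift from `params`. The class name and signatures here are illustrative, not the real `Launch` subclass in context.py:

```python
from typing import Any

class DeterministicLaunchSketch:
    """Illustrative launch object whose fwd_args mirrors params."""

    def __init__(self, params: list, fwd_args: list):
        self.params = list(params)
        self.fwd_args = list(fwd_args)

    def set_param_at_index(self, index: int, value: Any, adjoint: bool = False):
        self.params[index] = value
        if not adjoint and index < len(self.fwd_args):
            self.fwd_args[index] = value

    def set_param_at_index_from_ctype(self, index: int, value: Any, adjoint: bool = False):
        # Reuse the canonical mutator instead of touching self.params directly.
        self.set_param_at_index(index, value, adjoint)

    def set_params_from_ctypes(self, values: list, adjoint: bool = False):
        for i, v in enumerate(values):
            self.set_param_at_index(i, v, adjoint)
```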

Comment on lines +7589 to +7602
        warp._src.utils.array_scan(contrib, prefix, inclusive=False)

        # Write the total count to the actual counter array so user code
        # that reads it after the launch sees the correct value.
        # Total = exclusive_prefix[-1] + contrib[-1].
        # Use inclusive scan's last element = total
        inclusive_out = warp.empty(shape=(dim_size,), dtype=warp.int32, device=device)
        warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True)
        # Find the counter array in fwd_args and add the total.
        for j, arg in enumerate(kernel.adj.args):
            if arg.label == ct.array_var_label:
                counter_arr = fwd_args[j]
                # Copy the total (last element of inclusive scan) to the counter.
                warp.copy(counter_arr, inclusive_out, dest_offset=0, src_offset=dim_size - 1, count=1)

⚠️ Potential issue | 🟠 Major

Preserve the caller stream and the counter’s initial value in the two-pass path.

The phase-0/phase-1 kernel launches run on stream, but the scans, counter copy-back, and final run_sort_reduce() do not. On an explicit non-current stream, those steps can reorder or escape capture. Separately, the prefix sum starts at 0 and Line 7602 overwrites the counter with only this launch’s total, so allocator patterns with a non-zero incoming counter will reuse slots from 0 instead of continuing from the existing value.

Also applies to: 7631-7631

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7589 - 7602, The two-pass path must
preserve the caller stream and keep the counter's initial value: ensure the
intermediate scans (warp._src.utils.array_scan calls), the warp.copy that writes
the total back into the counter array, and the subsequent run_sort_reduce
invocation are executed on the same stream passed into this path (propagate the
local stream object into those calls or use stream-aware variants) and when
writing the counter combine the existing counter value with this-launch total
(read the current counter_arr[0], add inclusive_out[dim_size-1] and write the
sum back) instead of overwriting; update references around array_scan,
inclusive_out, counter_arr (found via kernel.adj.args and fwd_args), warp.copy
and run_sort_reduce so they all use the caller stream and perform an
atomic/ordered add of the previous counter value plus the new total.
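A host-side NumPy sketch of the suggested writeback rule: combine the incoming counter value with this launch's total instead of overwriting it (names are illustrative):

```python
import numpy as np

def write_back_counter(counter: np.ndarray, inclusive_scan: np.ndarray) -> None:
    """Add this launch's total to the counter, preserving the incoming value."""
    launch_total = int(inclusive_scan[-1])
    counter[0] += launch_total  # not counter[0] = launch_total

counter = np.array([10], dtype=np.int32)          # non-zero incoming value
contrib = np.array([1, 0, 2, 1], dtype=np.int32)  # per-thread counts
write_back_counter(counter, np.cumsum(contrib))
assert counter[0] == 14                           # 10 old + 4 new
```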

Comment on lines +789 to +790
if __name__ == "__main__":
    wp.clear_kernel_cache()

⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from the test module.

wp.clear_kernel_cache() is explicitly disallowed in test files, including __main__, because concurrent clears are not multi-process-safe.

🧹 Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."


Suggested change
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 789 - 790, Remove the call to
wp.clear_kernel_cache() from this test module (including the __main__ block)
because wp.clear_kernel_cache() is disallowed in test files; simply delete the
line invoking wp.clear_kernel_cache() so no cache-clearing is performed here.

from warp.tests.test_conditional import TestConditional
from warp.tests.test_constant_precision import TestConstantPrecision
from warp.tests.test_context import TestContext
from warp.tests.test_deterministic import TestDeterministic

⚠️ Potential issue | 🟠 Major

Add TestDeterministic to default_suite() class list.

Line 142 imports TestDeterministic, but it is not added to test_classes (Lines 225-361), so the default suite will still skip deterministic tests.

🔧 Suggested fix
     test_classes = [
         TestAdam,
         TestArithmetic,
         TestArray,
         TestArrayReduce,
         TestAsync,
         TestAtomic,
         TestAtomicBitwise,
         TestAtomicCAS,
         TestBlockDimDispatch,
         TestBool,
         TestBuiltinsResolution,
         TestBvh,
         TestClangCUDA,
         TestClosestPointEdgeEdgeMethods,
         TestCodeGen,
         TestCodeGenInstancing,
         TestConditional,
         TestConstants,
         TestConstantPrecision,
         TestContext,
+        TestDeterministic,
         TestCopy,
         TestCpuPrecompiledHeaders,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/unittest_suites.py` at line 142, The default_suite() function
imports TestDeterministic but never includes it in the test_classes list, so
deterministic tests are skipped; update the test_classes array/variable inside
default_suite() to include TestDeterministic (alongside the other classes),
ensuring the symbol TestDeterministic is added to the list used to build the
suite returned by default_suite().

Signed-off-by: OpenClaw Bot <bot@openclaw.ai>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
warp/_src/deterministic.py (1)

140-143: ⚠️ Potential issue | 🔴 Critical

Fail fast for warp.float16 deterministic reductions.

warp.float16 is still wired through the deterministic scatter/reduce path, but the native entry points only reinterpret buffers as float* or double*. Half buffers will therefore be read with the wrong element size and produce corrupted reductions. Please reject this path until a real half reducer exists.

Minimal safe guard
-        if target.value_ctype in ("float", "wp::half"):
+        if target.value_ctype == "float":
             fn = runtime.core.wp_deterministic_sort_reduce_float_device
+        elif target.value_ctype == "wp::half":
+            raise RuntimeError("Deterministic float16 atomics are not supported yet.")
         elif target.value_ctype == "double":
             fn = runtime.core.wp_deterministic_sort_reduce_double_device

Also applies to: 159-161, 177-180, 219-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 140 - 143, The mapping
_WARP_TO_CTYPE currently includes warp.float16 but the deterministic
scatter/reduce path does not support half-precision and will read buffers with
wrong element size; update deterministic.py to fail fast whenever warp.float16
is encountered in the deterministic reduction code paths by removing or guarding
the warp.float16 entry in _WARP_TO_CTYPE and adding an explicit check that
raises a clear exception (or returns an error) when a reducer or conversion
function (the code paths that reference _WARP_TO_CTYPE) sees warp.float16;
ensure the exception message names warp.float16 and the deterministic reduction
path so callers get a clear rejection until a proper half reducer is
implemented.
warp/tests/test_deterministic.py (1)

955-956: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is disallowed in tests and can crash parallel CI runs.

Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 955 - 956, Remove the
disallowed kernel-cache clear from the module-level __main__ block: delete the
call to wp.clear_kernel_cache() found in the if __name__ == "__main__": section
of warp/tests/test_deterministic.py; do not replace it with any cache-clearing
call (wp.clear_lto_cache or similar) and ensure no other module-scope or
__main__-scoped cache-clear calls remain.
design/deterministic-execution.md (1)

111-112: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the doc.

Lines 111-112 describe an exclusive scan, but Lines 170-172 still say the total comes from the last element of an inclusive scan. Please make the writeback rule consistent with the convention the implementation actually uses.

Also applies to: 170-172

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 111 - 112, The doc is
inconsistent about scan convention: wp.utils.array_scan(contrib, prefix,
inclusive=False) is described as exclusive at lines 111-112 but later (lines
170-172) claims the total comes from the last element of an inclusive scan; pick
one convention and make the writeback rule consistent with the actual
implementation. Update the description of wp.utils.array_scan, the
example/notation for "prefix" and the writeback rule (the statement about where
the total/last-offset is read) so they all use the same convention (either
inclusive or exclusive) and mention the inclusive flag (inclusive=False/True) in
the writeback explanation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@warp/_src/codegen.py`:
- Around line 1647-1650: Pattern B (two-pass deterministic atomic interception)
only runs when _det_in_assign is set during emit_Assign(), so atomic calls
nested in subscripts, call arguments, or larger expressions (e.g.,
output[wp.atomic_add(...)]) never use adj._emit_deterministic_atomic and remain
nondeterministic; update the compiler to propagate the
"deterministic-assignment" context beyond plain Assign RHS evaluation by
checking _det_in_assign (or equivalent flag) within expression emitters that can
produce targets/indices/args — specifically modify emit_Subscript, emit_Call
(and other expression emitters referenced around lines 3336-3345) to consult
adj._emit_deterministic_atomic for funcs in _DET_INTERCEPTABLE_ATOMICS (using
func.is_builtin() and func.key) and call
adj._emit_deterministic_atomic(bound_args, return_type, output, output_list)
when the flag is active so atomics inside subscripts/args follow the two-pass
deterministic path.
- Around line 1834-1849: The Pattern B counter/allocator block that runs when
return_is_consumed currently assumes an add-style atomic and unconditionally
emits contrib += value and prefix returns; restrict this path to only add-style
atomics: inside the branch guarded by return_is_consumed (the block that calls
get_or_create_counter_target and writes _wp_det_contrib and _wp_det_prefix via
adj.add_forward), inspect the counter target's atomic/op kind (from the target
returned by get_or_create_counter_target — e.g., target.op, target.atomic_op, or
whatever property names the CounterTarget uses) and only emit Pattern B when
that property indicates an atomic_add; otherwise fail fast (raise a clear
NotImplementedError or emit an error) explaining that consumed-return semantics
for atomic_sub/atomic_max are not implemented. Ensure the check is placed before
emitting the adj.add_forward code so non-add atomics do not get incorrect
rewrites.

In `@warp/_src/deterministic.py`:
- Around line 191-200: The scratch buffers in allocate_counter_buffers always
use warp.int32 which truncates non-32-bit counters; update
allocate_counter_buffers to inspect each CounterTarget.value_ctype and allocate
contrib/prefix with the matching dtype (e.g., warp.int64 or warp.uint64) so the
generated ABI names (_wp_det_contrib/_wp_det_prefix) match the counter width, or
if supporting only 32-bit is preferred, raise an explicit error when a
CounterTarget has a non-32-bit value_ctype; modify allocate_counter_buffers to
perform this dtype-selection/check using the CounterTarget.value_ctype before
creating warp.zeros/warp.empty.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 111-112: The doc is inconsistent about scan convention:
wp.utils.array_scan(contrib, prefix, inclusive=False) is described as exclusive
at lines 111-112 but later (lines 170-172) claims the total comes from the last
element of an inclusive scan; pick one convention and make the writeback rule
consistent with the actual implementation. Update the description of
wp.utils.array_scan, the example/notation for "prefix" and the writeback rule
(the statement about where the total/last-offset is read) so they all use the
same convention (either inclusive or exclusive) and mention the inclusive flag
(inclusive=False/True) in the writeback explanation.

In `@warp/_src/deterministic.py`:
- Around line 140-143: The mapping _WARP_TO_CTYPE currently includes
warp.float16 but the deterministic scatter/reduce path does not support
half-precision and will read buffers with wrong element size; update
deterministic.py to fail fast whenever warp.float16 is encountered in the
deterministic reduction code paths by removing or guarding the warp.float16
entry in _WARP_TO_CTYPE and adding an explicit check that raises a clear
exception (or returns an error) when a reducer or conversion function (the code
paths that reference _WARP_TO_CTYPE) sees warp.float16; ensure the exception
message names warp.float16 and the deterministic reduction path so callers get a
clear rejection until a proper half reducer is implemented.

In `@warp/tests/test_deterministic.py`:
- Around line 955-956: Remove the disallowed kernel-cache clear from the
module-level __main__ block: delete the call to wp.clear_kernel_cache() found in
the if __name__ == "__main__": section of warp/tests/test_deterministic.py; do
not replace it with any cache-clearing call (wp.clear_lto_cache or similar) and
ensure no other module-scope or __main__-scoped cache-clear calls remain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4ca0283f-84b8-4939-90c1-9a656d3d0098

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec9b25 and 72e8e3f.

📒 Files selected for processing (8)
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cu
  • warp/native/deterministic.h
  • warp/tests/test_deterministic.py
✅ Files skipped from review due to trivial changes (2)
  • warp/native/deterministic.h
  • warp/native/deterministic.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • warp/_src/context.py

Comment on lines +1647 to +1650
        if adj.det_meta is not None and func.is_builtin() and func.key in _DET_INTERCEPTABLE_ATOMICS:
            det_output = adj._emit_deterministic_atomic(func, bound_args, return_type, output, output_list)
            if det_output is not None:
                return det_output

⚠️ Potential issue | 🟠 Major

Pattern B only triggers for plain assignments.

_det_in_assign is only flipped while emit_Assign() evaluates the RHS, so consumed atomics inside subscripts, call args, or larger expressions never take the two-pass path. For example, output[wp.atomic_add(counter, 0, 1)] = value still runs the native atomic and remains nondeterministic.

Also applies to: 3336-3345

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/codegen.py` around lines 1647 - 1650, Pattern B (two-pass
deterministic atomic interception) only runs when _det_in_assign is set during
emit_Assign(), so atomic calls nested in subscripts, call arguments, or larger
expressions (e.g., output[wp.atomic_add(...)]) never use
adj._emit_deterministic_atomic and remain nondeterministic; update the compiler
to propagate the "deterministic-assignment" context beyond plain Assign RHS
evaluation by checking _det_in_assign (or equivalent flag) within expression
emitters that can produce targets/indices/args — specifically modify
emit_Subscript, emit_Call (and other expression emitters referenced around lines
3336-3345) to consult adj._emit_deterministic_atomic for funcs in
_DET_INTERCEPTABLE_ATOMICS (using func.is_builtin() and func.key) and call
adj._emit_deterministic_atomic(bound_args, return_type, output, output_list)
when the flag is active so atomics inside subscripts/args follow the two-pass
deterministic path.

Comment on lines +1834 to +1849
        if return_is_consumed:
            # Pattern B: Counter/Allocator
            target = get_or_create_counter_target(adj.det_meta, arr_var.label, value_ctype)
            N = target.index

            val_loaded = loaded_args[-1]  # already loaded above

            adj.add_forward("#ifdef __CUDA_ARCH__", skip_replay=True)
            adj.add_forward(
                f"if (_wp_det_phase == 0) {{ "
                f"_wp_det_contrib_{N}[_idx] += var_{val_loaded}; "
                f"var_{output} = {zero_literal}; "
                f"}} else {{ "
                f"var_{output} = static_cast<{value_ctype}>(_wp_det_prefix_{N}[_idx]); "
                f"_wp_det_prefix_{N}[_idx] += var_{val_loaded}; "
                f"}}",

⚠️ Potential issue | 🔴 Critical

Only atomic_add matches this counter rewrite.

Once return_is_consumed is true, the generated code always does contrib += value and returns prefix offsets. That is only valid for add-style counters; slot = wp.atomic_sub(...) or old = wp.atomic_max(...) will return nonsense instead of the atomic’s previous value. Please gate Pattern B to atomic_add (or fail fast) until the other consumed-return semantics are implemented.

Safe short-term guard
-        if return_is_consumed:
+        if return_is_consumed:
+            if func.key != "atomic_add":
+                return None
             # Pattern B: Counter/Allocator
             target = get_or_create_counter_target(adj.det_meta, arr_var.label, value_ctype)

Suggested change
         if return_is_consumed:
+            if func.key != "atomic_add":
+                return None
             # Pattern B: Counter/Allocator
             target = get_or_create_counter_target(adj.det_meta, arr_var.label, value_ctype)
             N = target.index

             val_loaded = loaded_args[-1]  # already loaded above

             adj.add_forward("#ifdef __CUDA_ARCH__", skip_replay=True)
             adj.add_forward(
                 f"if (_wp_det_phase == 0) {{ "
                 f"_wp_det_contrib_{N}[_idx] += var_{val_loaded}; "
                 f"var_{output} = {zero_literal}; "
                 f"}} else {{ "
                 f"var_{output} = static_cast<{value_ctype}>(_wp_det_prefix_{N}[_idx]); "
                 f"_wp_det_prefix_{N}[_idx] += var_{val_loaded}; "
                 f"}}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/codegen.py` around lines 1834 - 1849, The Pattern B
counter/allocator block that runs when return_is_consumed currently assumes an
add-style atomic and unconditionally emits contrib += value and prefix returns;
restrict this path to only add-style atomics: inside the branch guarded by
return_is_consumed (the block that calls get_or_create_counter_target and writes
_wp_det_contrib and _wp_det_prefix via adj.add_forward), inspect the counter
target's atomic/op kind (from the target returned by
get_or_create_counter_target — e.g., target.op, target.atomic_op, or whatever
property names the CounterTarget uses) and only emit Pattern B when that
property indicates an atomic_add; otherwise fail fast (raise a clear
NotImplementedError or emit an error) explaining that consumed-return semantics
for atomic_sub/atomic_max are not implemented. Ensure the check is placed before
emitting the adj.add_forward code so non-add atomics do not get incorrect
rewrites.

@greptile-apps

greptile-apps bot commented Apr 10, 2026


Comment on lines +219 to +220
        if target.value_ctype in ("float", "wp::half"):
            fn = runtime.core.wp_deterministic_sort_reduce_float_device

P1 float16 dispatched to float sort-reduce

wp::half values are stored in 2-byte slots, but wp_deterministic_sort_reduce_float_device reinterprets the values pointer as float* (4 bytes per element). CUB will sort pairs that are 2× too large and the reduce kernel will read the wrong bytes, silently producing incorrect results. This path is reachable because is_float_type(warp.float16) is True.

Fix: add a dedicated wp_deterministic_sort_reduce_half_device (templated on __half), or raise an error in _emit_deterministic_atomic when scalar_dtype is warp.float16 to block the unsupported path.
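The element-size hazard can be demonstrated on the host with NumPy: viewing a float16 buffer as float32 halves the element count and yields unrelated values, analogous to what the mistyped device reinterpret would read:

```python
import numpy as np

# Reinterpret a 2-byte-per-element buffer as 4-byte elements.
halves = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float16)
as_float = halves.view(np.float32)

assert as_float.size == halves.size // 2               # 2x too few elements
assert not np.allclose(as_float, halves[:2].astype(np.float32))  # garbage values
```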

Generalize deterministic scatter-reduce to operate on Warp leaf types instead of only scalar float values. This preserves existing component-wise atomic semantics for vectors and matrices while keeping the graph-capture-safe fixed-capacity launch path.

Signed-off-by: OpenClaw Bot <bot@openclaw.ai>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
warp/tests/test_deterministic.py (1)

1143-1145: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is explicitly disallowed in test files and can break parallel test runs.

Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 1143 - 1145, Remove the call
to wp.clear_kernel_cache() from the module-level __main__ block; locate the if
__name__ == "__main__": block (where wp.clear_kernel_cache() is called before
unittest.main) and delete that wp.clear_kernel_cache() invocation so tests no
longer call kernel-cache clearing from the test module.
design/deterministic-execution.md (1)

114-115: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the design.

The Pattern B section documents array_scan(..., inclusive=False), but the writeback section still says the total comes from the last element of an inclusive scan. Please pick one convention and make the total-count rule match it everywhere.

Also applies to: 173-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 114 - 115, The docs use two
different scan conventions; standardize on one and make all references
consistent: choose whether wp.utils.array_scan(contrib, prefix, inclusive=True)
or inclusive=False is the canonical API, then update Pattern B and the writeback
section (and the other occurrence at lines ~173-175) so the phrase "total comes
from the last element" matches that convention (e.g., if using inclusive=False,
state that total = prefix[-1] + contrib[-1] or if inclusive=True, state total =
prefix[-1]); update any explanatory text and examples to reference
wp.utils.array_scan(..., inclusive=...) and the single total-count rule
consistently.
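The two scan conventions and their total-count rules can be illustrated with NumPy standing in for `wp.utils.array_scan` (the `contrib` values are hypothetical Phase 0 counts, not data from the PR):

```python
import numpy as np

# Per-thread contribution counts from a hypothetical Phase 0 counting pass.
contrib = np.array([2, 0, 3, 1], dtype=np.int32)

# Inclusive scan: prefix[i] = sum(contrib[:i+1]); total = prefix[-1].
inclusive = np.cumsum(contrib)

# Exclusive scan: prefix[i] = sum(contrib[:i]); total = prefix[-1] + contrib[-1].
exclusive = inclusive - contrib

total_inclusive = int(inclusive[-1])
total_exclusive = int(exclusive[-1] + contrib[-1])

# Both conventions yield the same total; only the writeback rule differs.
assert total_inclusive == total_exclusive == int(contrib.sum())
```

Whichever convention the doc settles on, the writeback text just needs to quote the matching rule.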
🧹 Nitpick comments (1)
warp/tests/test_deterministic.py (1)

692-753: Assert the actual slot order, not just counts/permutations.

counter_kernel should yield output == data_np, and conditional_counter_kernel should yield output[:expected_count] == data_np[data_np > threshold]. Right now a stable but wrong permutation would still pass because these tests only check counts, sorting, and cross-run reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 692 - 753, Update the two
tests to assert actual slot order rather than just counts/permutations: in
test_counter_correctness replace the sorted/permutation check with a direct
equality assertion that output.numpy() (or output.numpy().tolist()) exactly
equals data_np (ensuring dtype/shape match) to verify counter_kernel produces
output == data_np; in test_conditional_counter replace the permutation-based
checks with a direct equality assertion that output.numpy()[:expected_count]
equals data_np[data_np > threshold] to verify conditional_counter_kernel
preserves the original relative order of selected elements; keep the existing
cross-run determinism checks but change the comparisons to elementwise equality
rather than sorted/comparison of permutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@warp/_src/context.py`:
- Around line 2198-2205: ModuleHasher.hash_kernel currently only hashes
kernel.key and kernel.adj, which allows reuse of stale cached modules when
per-kernel overrides (e.g., kernel.options like deterministic or
deterministic_max_records) change; update ModuleHasher.hash_kernel to
incorporate kernel.options (or at minimum the deterministic and
deterministic_max_records flags/values) into the computed hash so that
Module.load and the unique-module cache distinguish kernels with different
per-kernel deterministic overrides, and ensure any code paths that call
hash_kernel (e.g., Module.load) generate distinct hashes when those option
values differ.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a316d310-4647-410d-a32d-1acc2968e8e5

📥 Commits

Reviewing files that changed from the base of the PR and between 72e8e3f and 48e7207.

📒 Files selected for processing (8)
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • warp/native/deterministic.cpp
  • warp/_src/context.py
  • warp/_src/codegen.py
  • warp/_src/deterministic.py

@shi-eric shi-eric added this to the 1.14.0 milestone Apr 10, 2026
Replace the capacity-style deterministic scatter option with deterministic_max_records so the API matches the codegen lower-bound model. This lets users size dynamic-loop workloads in terms of per-thread atomic record counts instead of raw buffer capacity.

Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Comment on lines +5197 to +5213
# Deterministic mode: backward kernel also gets scatter buffer params
# for deterministic gradient accumulation.
if device == "cuda" and adj.det_meta is not None and adj.det_meta.needs_deterministic:
    det = adj.det_meta
    if det.has_counter:
        reverse_args.append("int _wp_det_phase")
        for ct in det.counter_targets:
            reverse_args.append(f"int* _wp_det_contrib_{ct.index}")
            reverse_args.append(f"int* _wp_det_prefix_{ct.index}")
    for st in det.scatter_targets:
        reverse_args.append(f"int64_t* _wp_scatter_keys_{st.index}")
        reverse_args.append(f"{st.value_ctype}* _wp_scatter_vals_{st.index}")
        reverse_args.append(f"int* _wp_scatter_ctr_{st.index}")
        reverse_args.append(f"int* _wp_scatter_overflow_{st.index}")
        reverse_args.append(f"int _wp_scatter_cap_{st.index}")
    if det.has_scatter:
        reverse_args.append("int _wp_det_debug")

P1 Backward kernel gains det params that are never supplied at launch

The backward kernel signature is extended with up to 5*N_scatter + 3*N_counter + 1 extra parameters here, but the adjoint launch path in context.py is guarded by not adjoint (line 7798), so it goes through the plain CUDA launch at line 7901 — which builds kernel_params from only the user-facing args. The CUDA driver then reads beyond the end of that ctypes array to satisfy the additional declared parameters, which is undefined behaviour on the host side.

The backward body does not actually use these params (// deterministic scatter replay (skipped) replay strings are just comments), so there are two clean fixes:

# Option A — don't add det params to the backward kernel at all
# (backward determinism is not implemented yet)
if device == "cuda" and adj.det_meta is not None and adj.det_meta.needs_deterministic:
    det = adj.det_meta
    # (remove this block entirely until backward determinism is wired up)

# Option B — keep the params and wire up the backward deterministic launch
if det_meta is not None and det_meta.needs_deterministic and device.is_cuda:
    # adjoint path also calls _launch_deterministic (or similar)

Option A is safer given the current state of the implementation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (5)
warp/_src/context.py (2)

7628-7641: ⚠️ Potential issue | 🟠 Major

Keep the count/scan/reduce fixups on stream, and accumulate the existing counter value.

array_scan(), warp.copy(), and run_sort_reduce() all omit stream here, so an explicit non-current stream or graph capture can reorder or escape the deterministic pass. Also, Line 7641 writes only this launch's total into counter_arr[0]; a counter that started at N will end at total instead of N + total.

Also applies to: 7670-7670

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7628 - 7641, The code omits the stream
argument and overwrites existing counter values; update calls to
warp._src.utils.array_scan(contrib, prefix, inclusive=False) and the inclusive
scan warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well
as warp.copy and any run_sort_reduce invocations to pass the current non-default
stream (preserve graph capture/determinism), and change the write into
counter_arr (found by iterating kernel.adj.args for ct.array_var_label and
obtained from fwd_args[j]) to add the inclusive_out last-element to the existing
counter value instead of replacing it (i.e., read counter_arr[0], add
inclusive_out[dim_size-1], and write the sum back using warp.copy on the given
stream); apply the same stream+accumulate fix to the similar block around the
other occurrence near the 7670 location.
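The accumulate-instead-of-overwrite rule can be sketched with NumPy standing in for the device buffers; the helper name is hypothetical, and the real path would issue `warp.copy` on the bound stream rather than host-side arithmetic:

```python
import numpy as np

def write_back_counter(counter, contrib):
    """Fold this launch's total into the pre-existing counter value.

    counter: length-1 array holding the pre-launch count N.
    contrib: per-thread contribution counts from the counting pass.
    """
    total = int(np.cumsum(contrib)[-1])  # last element of the inclusive scan
    counter[0] += total                  # N + total, not just total
    return counter

counter = np.array([10], dtype=np.int32)       # counter started at N = 10
contrib = np.array([1, 0, 2], dtype=np.int32)  # this launch allocates 3 slots
write_back_counter(counter, contrib)
```

With the overwrite bug, the counter would end at 3; the accumulate form leaves it at 13.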

7505-7508: ⚠️ Potential issue | 🟠 Major

Raw-ctype array updates are still unsafe for deterministic replays.

Line 7507 only keeps fwd_args aligned for set_param_at_index(). The inherited set_param_at_index_from_ctype() / set_params_from_ctypes() can still replace an array descriptor while _launch_deterministic() later uses the stale Python array object for warp.copy() and run_sort_reduce(). Recorded deterministic launches can therefore replay into the wrong array, or fail once fwd_args and self.params diverge. If these APIs stay exposed, DeterministicLaunch needs to reject array-ctype updates or carry the owning Warp array alongside the packed descriptor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7505 - 7508, The deterministic replay bug:
set_param_at_index only updates fwd_args but inherited
set_param_at_index_from_ctype / set_params_from_ctypes can replace raw C-type
array descriptors in self.params leaving fwd_args stale, which breaks
_launch_deterministic and later warp.copy/run_sort_reduce replay; to fix, either
(A) make DeterministicLaunch reject/raise when set_param_at_index_from_ctype or
set_params_from_ctypes is used with array-ctypes, or (B) ensure those
ctype-updating paths also update the corresponding Python owning Warp array
stored in fwd_args (and keep self.params and fwd_args synchronized) so that
set_param_at_index_from_ctype / set_params_from_ctypes maintain alignment with
fwd_args before _launch_deterministic runs.
design/deterministic-execution.md (1)

114-115: ⚠️ Potential issue | 🟡 Minor

Use one scan convention in the design doc.

Line 114 documents wp.utils.array_scan(contrib, prefix, inclusive=False), but Lines 176-178 say the total comes from the last element of an inclusive scan. Please make both sections describe the same writeback rule. If the scan stays exclusive, the total is prefix[-1] + contrib[-1].

Also applies to: 176-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 114 - 115, The doc currently
mixes scan conventions: update both occurrences to use the same convention for
wp.utils.array_scan(contrib, prefix, inclusive=False) and the writeback rule;
either change the function call to inclusive=True and state the total is
prefix[-1], or keep inclusive=False and change the total calculation text (the
section that currently says the total comes from the last element of an
inclusive scan) to the exclusive rule: total = prefix[-1] + contrib[-1]; make
this consistent for the wp.utils.array_scan(contrib, prefix, inclusive=False)
mention and the later total computation paragraph.
warp/tests/test_deterministic.py (1)

1190-1192: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is explicitly disallowed in test files and can crash parallel test runs. unittest.main() is enough here.

🧹 Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 1190 - 1192, Remove the
explicit wp.clear_kernel_cache() call from the module-level __main__ block in
this test file; keep only the unittest.main(verbosity=2) invocation so the test
can run without clearing kernel caches (do not add wp.clear_lto_cache() either).
Locate the __main__ guard containing wp.clear_kernel_cache() and delete that
call, ensuring the block now simply calls unittest.main(...) and nothing else.
warp/_src/deterministic.py (1)

228-231: ⚠️ Potential issue | 🟠 Major

Counter scratch buffers still hard-code wp.int32.

CounterTarget.value_ctype is tracked, but contrib and prefix are always allocated as wp.int32. Any deterministic wp.int64/wp.uint64 counter will silently wrap in Phase 0/1. Please make these scratch buffers follow value_ctype, or fail fast for non-32-bit counters until the wider ABI is wired through end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 228 - 231, The loop that allocates
counter scratch buffers currently hardcodes dtype=warp.int32 for contrib and
prefix; change the allocation to use each CounterTarget's value_ctype (e.g., use
dtype=target.value_ctype or map that C type to the corresponding warp dtype) so
contrib and prefix match CounterTarget.value_ctype and avoid silent wrapping, or
add an explicit fast-fail that raises if target.value_ctype is not a 32-bit
integer type (when full 64-bit ABI isn't supported). Update the allocations in
the for _target in counter_targets block (the contrib and prefix
warp.zeros/warp.empty calls) to derive dtype from the target's value_ctype and
keep shape= (dim_size,) and device=device.
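One way to implement the suggested fix is a dtype lookup with a fast-fail for unsupported widths. This is a sketch with NumPy dtypes in place of Warp dtypes; the mapping, helper name, and error wording are all assumptions:

```python
import numpy as np

# C type string -> allocation dtype; only 32-bit counters are wired up.
_CTYPE_TO_DTYPE = {"int": np.int32, "unsigned int": np.uint32}

def allocate_counter_scratch(value_ctype, dim_size):
    dtype = _CTYPE_TO_DTYPE.get(value_ctype)
    if dtype is None:
        # Fail fast instead of silently wrapping a 64-bit counter.
        raise NotImplementedError(
            f"deterministic counters require a 32-bit integer type, got {value_ctype!r}"
        )
    contrib = np.zeros(dim_size, dtype=dtype)
    prefix = np.zeros(dim_size, dtype=dtype)
    return contrib, prefix

contrib, prefix = allocate_counter_scratch("int", 8)
```

The same shape works if the scratch buffers later follow `value_ctype` for wider types: extend the mapping instead of the call sites.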
🧹 Nitpick comments (2)
warp/tests/test_deterministic.py (1)

850-879: test_module_option_override() doesn't prove the override is active on CUDA.

A single approximate-sum assertion will pass whether this kernel took the deterministic path or the normal atomic path, especially on CPU where both paths are deterministic. Please mirror test_kernel_decorator_override() here: run the kernel several times on CUDA and assert bit-exact equality across runs, then keep the sum check as a secondary sanity check if you want.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 850 - 879,
test_module_option_override currently only checks approximate sums so it doesn't
verify the per-module deterministic=True override on CUDA; update the test
(function test_module_option_override) to mirror test_kernel_decorator_override
by: when device is CUDA (check device == "cuda" or via wp.get_device_name/device
backend), run the kernel per_kernel_det multiple times (e.g., 3+ runs) into
separate outputs and assert bit-exact equality of the outputs across runs (use
output.numpy() equality checks) to prove determinism, while retaining the
existing sum/assert_allclose sanity check as a secondary assertion; ensure you
still toggle wp.config.deterministic = False around the launch to verify the
per-module override takes effect.
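The multi-run bit-exact check the comment asks for can be factored into a small helper; the helper name is hypothetical, and the "kernel" below is a NumPy fixed-order reduction standing in for a CUDA launch:

```python
import numpy as np

def assert_bitwise_deterministic(run_kernel, n_runs=3):
    """Run a kernel several times and require bit-exact equal outputs.

    run_kernel: callable returning the kernel's result (e.g. output.numpy()).
    """
    baseline = run_kernel()
    for _ in range(n_runs - 1):
        np.testing.assert_array_equal(run_kernel(), baseline)
    return baseline

# Stand-in kernel: a float32 sum performed in a fixed order is repeatable.
data = np.linspace(0.0, 1.0, 1024, dtype=np.float32)
result = assert_bitwise_deterministic(lambda: np.add.reduce(data, dtype=np.float32))
```

Keeping the approximate-sum check as a secondary assertion is still worthwhile; it catches a deterministic-but-wrong result that bit-exact comparison alone would accept.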
warp/_src/deterministic.py (1)

193-193: Annotate device with DeviceLike.

These helper signatures take device untyped, which diverges from the repo's device-parameter convention and makes this launch path harder to type-check consistently.

As per coding guidelines, "Use DeviceLike type annotation (from warp._src.context) for device parameters. Import under TYPE_CHECKING to avoid circular imports."

Also applies to: 222-222, 235-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` at line 193, The device parameter in
allocate_scatter_buffers should be annotated with DeviceLike: import DeviceLike
from warp._src.context inside a TYPE_CHECKING block (from typing import
TYPE_CHECKING) to avoid circular imports, then change the function signature to
accept device: DeviceLike; apply the same change to the other two nearby helper
functions in this module that take a device parameter so all device-typed helper
signatures use DeviceLike consistently.
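The TYPE_CHECKING pattern the guideline describes looks roughly like this (the function body is a placeholder; only the import structure and annotation are the point):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for static type checking, avoiding a circular
    # runtime import of warp._src.context from this module.
    from warp._src.context import DeviceLike

def allocate_scatter_buffers(capacity: int, device: DeviceLike) -> None:
    """Sketch of the annotated signature; real helper allocates Warp arrays."""
    ...
```

With `from __future__ import annotations`, the `DeviceLike` annotation is never evaluated at runtime, so the module imports cleanly even though the name only exists under TYPE_CHECKING.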

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f76535da-266c-45ce-9df4-48647bb5e6f1

📥 Commits

Reviewing files that changed from the base of the PR and between 48e7207 and 3170255.

📒 Files selected for processing (4)
  • design/deterministic-execution.md
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/tests/test_deterministic.py

Comment on lines +7510 to +7526
def launch(self, stream: Stream | None = None) -> None:
    if stream is None:
        stream = self.device.stream

    _launch_deterministic(
        self.kernel,
        self.hooks,
        self.params,
        self.bounds,
        self.device,
        stream,
        self.max_blocks,
        self.block_dim,
        self.det_meta,
        self.fwd_args,
        module_exec=self.module_exec,
    )

⚠️ Potential issue | 🟡 Minor

Preserve the normal forward-hook null check in deterministic launches.

The regular CUDA path raises if hooks.forward is missing, but both this branch and DeterministicLaunch.launch() jump straight into _launch_deterministic(). If wp_cuda_get_kernel() returned None, we'll pass a null kernel handle to wp_cuda_launch_kernel() instead of surfacing the clearer Python-side error.

Also applies to: 7796-7825

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7510 - 7526, Before calling
_launch_deterministic in launch, guard the deterministic path with the same null
checks used by the regular CUDA path: verify self.kernel is not None and that
self.hooks.forward exists; if either check fails raise a clear Python-side
RuntimeError (or appropriate exception) with a descriptive message instead of
calling _launch_deterministic with a null kernel handle. Apply the same change
to DeterministicLaunch.launch to keep behavior consistent.

Comment on lines +68 to +74
class CounterTarget:
    """Tracks a Pattern B (counter/allocator) atomic target array during codegen."""

    array_var_label: str  # label of the target array Var
    value_ctype: str      # C type of the counter value (e.g., "int")
    index: int = 0        # counter buffer index (assigned during codegen)


⚠️ Potential issue | 🟠 Major

Pattern B collapses all counters in the same array into one scan.

CounterTarget is keyed only by array_var_label, and the Pattern B codegen path only passes that label into get_or_create_counter_target(). A kernel like wp.atomic_add(counter, bucket, 1) will therefore share one (contrib, prefix) stream across every bucket value, so Phase 1 can only hand out offsets from one global sequence. Please either include the logical counter index in the deterministic metadata or reject non-constant-zero counter indices in deterministic mode.

Also applies to: 128-139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 68 - 74, CounterTarget is currently
keyed only by array_var_label causing get_or_create_counter_target() to collapse
all logical counters (e.g., wp.atomic_add(counter, bucket, 1) across different
bucket indices) into one global sequence; fix by including the logical index in
the CounterTarget key/metadata (add a field for the logical counter index and
propagate it from the codegen caller into get_or_create_counter_target()), or
alternatively enforce deterministic mode by detecting non-constant-zero indices
in wp.atomic_add (and reject/raise an error) so Phase 1 cannot hand out global
offsets for variable indices; update all related uses (CounterTarget class,
get_or_create_counter_target(), and the call sites that build Pattern B
metadata) to reflect the new keying or the rejection behavior.
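The first suggested fix, keying each logical counter by array label plus element index, could look like this sketch (the key class, registry, and helper shape are assumptions about how the metadata might be extended):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CounterKey:
    """Identify a logical counter by array and element index, not array alone."""
    array_var_label: str
    logical_index: int  # constant index into the counter array

_targets: dict = {}

def get_or_create_counter_target(label, logical_index):
    """Each (array, index) pair gets its own (contrib, prefix) scan stream."""
    key = CounterKey(label, logical_index)
    if key not in _targets:
        _targets[key] = {"index": len(_targets), "key": key}
    return _targets[key]

a = get_or_create_counter_target("counter", 0)
b = get_or_create_counter_target("counter", 1)  # distinct scan stream
```

Under the original label-only keying, `a` and `b` would collapse into one target and share a single global offset sequence, which is the bug the comment describes.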

OpenClaw Bot added 4 commits April 12, 2026 22:11
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
warp/_src/context.py (3)

7532-7535: ⚠️ Potential issue | 🟠 Major

Keep fwd_args synchronized for ctype-based param mutators too.

Line 7532 only syncs self.fwd_args for set_param_at_index(). Calls through set_param_at_index_from_ctype() / set_params_from_ctypes() can still leave fwd_args stale, so deterministic replay can post-process the wrong arrays.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7532 - 7535, The fwd_args list is only
updated in set_param_at_index, so calls that mutate params via ctype helpers
leave fwd_args stale; modify set_param_at_index_from_ctype and
set_params_from_ctypes to also synchronize self.fwd_args (either by delegating
to set_param_at_index for each change or by applying the same
index-check-and-assign logic used there), honoring the adjoint flag so adjoint
updates do not overwrite fwd_args.

7541-7553: ⚠️ Potential issue | 🟡 Minor

Preserve the forward-hook null guard in deterministic launch routing.

These deterministic branches call _launch_deterministic() without the regular forward-hook existence check. If hooks.forward is missing, this can pass a null kernel handle to CUDA launch instead of raising the clearer Python-side error.

Also applies to: 7825-7854

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7541 - 7553, The deterministic launch
branches call _launch_deterministic(...) without checking for hooks.forward,
which can allow a null kernel handle to reach CUDA; add the same forward-hook
null guard used in the non-deterministic path: verify hooks.forward (or
equivalent forward kernel handle) is present before invoking
_launch_deterministic, and if missing raise the same Python-side error/exception
used elsewhere so we fail early and clearly; apply the same fix to the other
deterministic branch around the second occurrence noted (near the 7825-7854
block).

7657-7670: ⚠️ Potential issue | 🟠 Major

Two-pass deterministic path still breaks explicit-stream ordering and counter continuity.

The scans at Lines 7657 and 7664, the counter writeback at Line 7670, and the sort-reduce at Line 7699 are not stream-bound in this path. On non-current streams or during graph capture this can reorder work. Also, Line 7670 overwrites the counter with this launch's total instead of adding it to the pre-existing counter value.

Also applies to: 7699-7699

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7657 - 7670, The two-pass deterministic
path uses array_scan and warp.copy without binding them to the current
stream/capture and overwrites the counter rather than adding to the pre-existing
value; update the logic around warp._src.utils.array_scan(contrib, ...), the
inclusive_out buffer, and the writeback that finds counter_arr via
kernel.adj.args and ct.array_var_label so that (1) all scans, the inclusive_out
usage, the warp.copy writeback, and the later sort-reduce are submitted/bound to
the same current stream/capture to preserve ordering, and (2) when writing the
total into counter_arr you first read the existing counter value from
counter_arr and add the inclusive_out last element (the this-launch total) to it
(or perform an atomic-add equivalent) instead of blindly overwriting; ensure the
same stream-binding fix is applied to the sort-reduce at the other location
(around the existing sort-reduce call).
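The add-versus-overwrite half of the fix can be sketched in pure Python, with a list standing in for the device-side counter and the helper name purely hypothetical (the real path goes through `warp.copy`):

```python
def write_back_counter(counter, inclusive_scan):
    # Total contributed by this launch = last element of the inclusive scan.
    launch_total = inclusive_scan[-1] if inclusive_scan else 0
    # Add to the pre-existing counter value instead of overwriting it.
    counter[0] += launch_total

counter = [7]                           # allocations left by earlier launches
write_back_counter(counter, [3, 3, 5, 10])
assert counter[0] == 17                 # 7 + 10, not 10
```

The same rule keeps allocator semantics consistent across repeated launches: each launch appends its total rather than resetting the running count.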
design/deterministic-execution.md (1)

129-130: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the design doc.

Pattern B says the prefix pass uses inclusive=False, but the writeback section says the total comes from “the last element of the inclusive scan.” Those are different rules, so the doc currently describes two incompatible implementations.

Also applies to: 191-193

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 129 - 130, The doc
inconsistently specifies scan conventions: Pattern B states the prefix pass uses
wp.utils.array_scan(contrib, prefix, inclusive=False) while the writeback
section refers to “the last element of the inclusive scan”; pick one convention
and update all references to match. Concretely, decide whether prefix scans are
inclusive or exclusive, then change mentions in Pattern B, the writeback
section, and other occurrences (e.g., lines ~191-193) so they all reference the
same wp.utils.array_scan(..., inclusive=...) behavior and explain how the total
is derived (either from the last element of the inclusive scan or from
last+contrib for exclusive).
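For reference, the two conventions and their total-derivation rules can be contrasted in plain Python, with `itertools.accumulate` standing in for `wp.utils.array_scan`:

```python
from itertools import accumulate

contrib = [3, 0, 2, 5]

# Inclusive scan: prefix[i] includes contrib[i]; the total is the last element.
inclusive = list(accumulate(contrib))           # [3, 3, 5, 10]
total_from_inclusive = inclusive[-1]

# Exclusive scan: prefix[i] excludes contrib[i]; the total is last + contrib[-1].
exclusive = [0] + inclusive[:-1]                # [0, 3, 3, 5]
total_from_exclusive = exclusive[-1] + contrib[-1]

assert total_from_inclusive == total_from_exclusive == sum(contrib) == 10
```

Either convention works; the doc just needs to pick one and derive the total with the matching rule.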
warp/_src/deterministic.py (2)

283-293: ⚠️ Potential issue | 🟠 Major

Match the counter scratch buffers to the counter width.

CounterTarget.value_ctype is tracked, but allocate_counter_buffers() always allocates wp.int32 for both scratch arrays. That will truncate deterministic int64/uint64 counters unless they are rejected earlier. Either allocate these buffers from value_ctype, or fail fast for non-32-bit counters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 283 - 293, The counter scratch
buffers are always allocated as warp.int32 in allocate_counter_buffers, which
will truncate 64-bit counters; update allocate_counter_buffers to use each
CounterTarget's value_ctype (CounterTarget.value_ctype) when creating contrib
and prefix (i.e., pass dtype=_target.value_ctype to warp.zeros/warp.empty) or
alternatively raise/assert if _target.value_ctype is not a 32-bit type so
non-32-bit counters fail fast; ensure you reference the function
allocate_counter_buffers and the CounterTarget.value_ctype field when making the
change.
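The shape of the suggested change can be sketched with `ctypes` standing in for Warp's device allocations (the function and field names mirror the review's description, not the actual module):

```python
import ctypes

# Counter widths we accept; anything else fails fast, per the review's alternative.
SUPPORTED_COUNTER_CTYPES = (ctypes.c_int32, ctypes.c_uint32,
                            ctypes.c_int64, ctypes.c_uint64)

def allocate_counter_buffers(value_ctype, num_records):
    if value_ctype not in SUPPORTED_COUNTER_CTYPES:
        raise TypeError(f"unsupported counter dtype: {value_ctype.__name__}")
    # Allocate scratch from the counter's own width, not a hard-coded int32.
    contrib = (value_ctype * num_records)()
    prefix = (value_ctype * num_records)()
    return contrib, prefix

contrib, _ = allocate_counter_buffers(ctypes.c_int64, 4)
assert ctypes.sizeof(contrib) == 4 * 8   # 64-bit elements, no truncation
```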

189-200: ⚠️ Potential issue | 🟠 Major

Don't merge every logical counter in an array into one deterministic target.

get_or_create_counter_target() only keys on array_var_label. If deterministic mode ever sees wp.atomic_add(counter, bucket, 1), every bucket in that array will reuse one contrib/prefix stream and Phase 1 will hand out slots from a single global sequence. Please include the logical counter index in the target identity, or reject non-constant-zero counter indices before creating a CounterTarget.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 189 - 200,
get_or_create_counter_target currently deduplicates targets solely by
array_var_label which causes all array elements to share one deterministic
counter; update get_or_create_counter_target to include the logical counter
index in the identity (e.g., use a tuple of (array_var_label, logical_index) or
add a logical_index field to CounterTarget) so each distinct bucket gets its own
CounterTarget, and before creating a new CounterTarget validate the provided
index expression is a constant zero (or reject/non-deterministic indices) if
your design only allows index 0; ensure you update the lookup loop to compare
the new key (array_var_label plus logical index) and append the new
CounterTarget with the logical index populated when creating it.
🧹 Nitpick comments (2)
warp/_src/context.py (1)

1339-1341: Validate deterministic_max_records early for type/range.

This value is currently accepted without validation and only coerced later in launch. Adding upfront int and non-negative checks during option ingestion would fail fast and avoid silent coercion paths.

Also applies to: 2583-2585

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 1339 - 1341, Validate
deterministic_max_records immediately before assigning into kernel_options:
ensure the value is an int (or can be safely converted) and is >= 0, raising a
TypeError for non-int types and a ValueError for negative values so callers fail
fast; then assign kernel_options["deterministic_max_records"] =
deterministic_max_records only after validation. Apply the same validation logic
at the other ingestion site referenced around lines 2583-2585 to keep behavior
consistent.
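A minimal fail-fast validator along the lines the nitpick suggests (the option name comes from the review; the helper itself is illustrative):

```python
def validate_deterministic_max_records(value):
    # Reject bools explicitly: isinstance(True, int) is True in Python.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(
            f"deterministic_max_records must be an int, got {type(value).__name__}")
    if value < 0:
        raise ValueError(f"deterministic_max_records must be >= 0, got {value}")
    return value

assert validate_deterministic_max_records(1024) == 1024
```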
warp/tests/test_unique_module.py (1)

170-205: Keep the hash assertion runnable on CPU-only jobs.

Line 172 skips the whole test, but the module.name comparison at Lines 189-193 is device-independent. Splitting this into a CPU-safe hashing assertion plus a CUDA-only launch check would preserve coverage for the unique-module hashing change on non-CUDA runners.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_unique_module.py` around lines 170 - 205, The test currently
skips the entire test when CUDA is unavailable, but the module name hashing
assertion in test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.
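The proposed split looks roughly like this (a self-contained `unittest` sketch; the module names and the CUDA flag are stand-ins for `_scatter_*.module.name` and `wp.is_cuda_available()`):

```python
import unittest

CUDA_AVAILABLE = False  # stand-in for wp.is_cuda_available() on a CPU-only runner

class TestUniqueModule(unittest.TestCase):
    def test_kernel_options_affect_unique_module_identity(self):
        # Device-independent: the module-identity hashing check runs on every job.
        name_normal = "warp_scatter_abc123"   # stand-in for _scatter_normal.module.name
        name_det = "warp_scatter_def456"      # stand-in for the deterministic variant
        self.assertNotEqual(name_normal, name_det)
        # CUDA-only: guard just the launch portion, not the whole test.
        if not CUDA_AVAILABLE:
            self.skipTest("deterministic launch checks require CUDA")
        # ... allocate device arrays and wp.launch here ...

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestUniqueModule))
```

The hashing assertion executes before the skip, so CPU-only CI still covers the unique-module change while the test reports as skipped rather than failed.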
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@asv/benchmarks/atomics.py`:
- Around line 242-243: The class attribute params (and param_names) in the
benchmark classes are defined as lists which creates mutable class state; change
them to tuples so they are immutable (e.g., replace params = (["normal",
"deterministic"], [1, 65536], DETERMINISTIC_BENCHMARK_SIZES) with a
tuple-of-tuples and param_names likewise) and apply the same change to the
second occurrence around lines 317-318; update the definitions referenced as
params and param_names in asv/benchmarks/atomics.py so both benchmark classes
use immutable tuples instead of lists.

In `@warp/_src/codegen.py`:
- Around line 1785-1788: The early-return that skips codegen for integer atomics
(the condition using is_float_type(scalar_dtype) and return_is_consumed) is
unsafe when a counting pass exists; modify that condition to also detect whether
the current kernel contains any consumed/counter atomic (i.e., only return None
when not is_float_type(scalar_dtype) AND not return_is_consumed AND there is no
consumed/counter atomic in the kernel). Use the existing kernel/AST context or
add a predicate (e.g., has_consumed_counter_atomic) to check for consumed
atomics before returning so phase 0 will not double-apply native integer atomics
in mixed kernels.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 129-130: The doc inconsistently specifies scan conventions:
Pattern B states the prefix pass uses wp.utils.array_scan(contrib, prefix,
inclusive=False) while the writeback section refers to “the last element of the
inclusive scan”; pick one convention and update all references to match.
Concretely, decide whether prefix scans are inclusive or exclusive, then change
mentions in Pattern B, the writeback section, and other occurrences (e.g., lines
~191-193) so they all reference the same wp.utils.array_scan(..., inclusive=...)
behavior and explain how the total is derived (either from the last element of
the inclusive scan or from last+contrib for exclusive).

In `@warp/_src/context.py`:
- Around line 7532-7535: The fwd_args list is only updated in
set_param_at_index, so calls that mutate params via ctype helpers leave fwd_args
stale; modify set_param_at_index_from_ctype and set_params_from_ctypes to also
synchronize self.fwd_args (either by delegating to set_param_at_index for each
change or by applying the same index-check-and-assign logic used there),
honoring the adjoint flag so adjoint updates do not overwrite fwd_args.
- Around line 7541-7553: The deterministic launch branches call
_launch_deterministic(...) without checking for hooks.forward, which can allow a
null kernel handle to reach CUDA; add the same forward-hook null guard used in
the non-deterministic path: verify hooks.forward (or equivalent forward kernel
handle) is present before invoking _launch_deterministic, and if missing raise
the same Python-side error/exception used elsewhere so we fail early and
clearly; apply the same fix to the other deterministic branch around the second
occurrence noted (near the 7825-7854 block).
- Around line 7657-7670: The two-pass deterministic path uses array_scan and
warp.copy without binding them to the current stream/capture and overwrites the
counter rather than adding to the pre-existing value; update the logic around
warp._src.utils.array_scan(contrib, ...), the inclusive_out buffer, and the
writeback that finds counter_arr via kernel.adj.args and ct.array_var_label so
that (1) all scans, the inclusive_out usage, the warp.copy writeback, and the
later sort-reduce are submitted/bound to the same current stream/capture to
preserve ordering, and (2) when writing the total into counter_arr you first
read the existing counter value from counter_arr and add the inclusive_out last
element (the this-launch total) to it (or perform an atomic-add equivalent)
instead of blindly overwriting; ensure the same stream-binding fix is applied to
the sort-reduce at the other location (around the existing sort-reduce call).

In `@warp/_src/deterministic.py`:
- Around line 283-293: The counter scratch buffers are always allocated as
warp.int32 in allocate_counter_buffers, which will truncate 64-bit counters;
update allocate_counter_buffers to use each CounterTarget's value_ctype
(CounterTarget.value_ctype) when creating contrib and prefix (i.e., pass
dtype=_target.value_ctype to warp.zeros/warp.empty) or alternatively
raise/assert if _target.value_ctype is not a 32-bit type so non-32-bit counters
fail fast; ensure you reference the function allocate_counter_buffers and the
CounterTarget.value_ctype field when making the change.
- Around line 189-200: get_or_create_counter_target currently deduplicates
targets solely by array_var_label which causes all array elements to share one
deterministic counter; update get_or_create_counter_target to include the
logical counter index in the identity (e.g., use a tuple of (array_var_label,
logical_index) or add a logical_index field to CounterTarget) so each distinct
bucket gets its own CounterTarget, and before creating a new CounterTarget
validate the provided index expression is a constant zero (or
reject/non-deterministic indices) if your design only allows index 0; ensure you
update the lookup loop to compare the new key (array_var_label plus logical
index) and append the new CounterTarget with the logical index populated when
creating it.

---

Nitpick comments:
In `@warp/_src/context.py`:
- Around line 1339-1341: Validate deterministic_max_records immediately before
assigning into kernel_options: ensure the value is an int (or can be safely
converted) and is >= 0, raising a TypeError for non-int types and a ValueError
for negative values so callers fail fast; then assign
kernel_options["deterministic_max_records"] = deterministic_max_records only
after validation. Apply the same validation logic at the other ingestion site
referenced around lines 2583-2585 to keep behavior consistent.

In `@warp/tests/test_unique_module.py`:
- Around line 170-205: The test currently skips the entire test when CUDA is
unavailable, but the module name hashing assertion in
test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 493e06fd-9309-43dc-ad53-7cc5a97aa0b8

📥 Commits

Reviewing files that changed from the base of the PR and between 3170255 and e34a93c.

📒 Files selected for processing (11)
  • asv/benchmarks/atomics.py
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
  • warp/tests/test_unique_module.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • warp/native/warp.h
  • warp/native/deterministic.cpp
  • warp/config.py

Comment on lines +242 to +243
params = (["normal", "deterministic"], [1, 65536], DETERMINISTIC_BENCHMARK_SIZES)
param_names = ["mode", "num_outputs", "num_elements"]

⚠️ Potential issue | 🟡 Minor

Make ASV parameter metadata immutable.

params is a class attribute in both benchmark classes, and using lists here is what Ruff is flagging. Switching these to tuples avoids shared mutable state and keeps the benchmark module lint-clean.

♻️ Minimal fix
-    params = (["normal", "deterministic"], [1, 65536], DETERMINISTIC_BENCHMARK_SIZES)
+    params = (("normal", "deterministic"), (1, 65536), tuple(DETERMINISTIC_BENCHMARK_SIZES))
     param_names = ["mode", "num_outputs", "num_elements"]
@@
-    params = (["normal", "deterministic"], DETERMINISTIC_BENCHMARK_SIZES)
+    params = (("normal", "deterministic"), tuple(DETERMINISTIC_BENCHMARK_SIZES))
     param_names = ["mode", "num_elements"]

Also applies to: 317-318

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 243-243: Mutable default value for class attribute

(RUF012)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@asv/benchmarks/atomics.py` around lines 242 - 243, The class attribute params
(and param_names) in the benchmark classes are defined as lists which creates
mutable class state; change them to tuples so they are immutable (e.g., replace
params = (["normal", "deterministic"], [1, 65536],
DETERMINISTIC_BENCHMARK_SIZES) with a tuple-of-tuples and param_names likewise)
and apply the same change to the second occurrence around lines 317-318; update
the definitions referenced as params and param_names in
asv/benchmarks/atomics.py so both benchmark classes use immutable tuples instead
of lists.

Comment on lines +1785 to +1788
# Integer atomics with associative+commutative ops (add/sub/min/max)
# that don't use the return value are already deterministic — skip.
if not is_float_type(scalar_dtype) and not return_is_consumed:
return None # fall through to normal codegen

⚠️ Potential issue | 🔴 Critical

Phase 0 still double-applies native integer atomics in mixed kernels.

This early fallback is only safe when there is no counting pass. If the same kernel also contains a consumed counter atomic, phase 0 runs once before phase 1, so an unchanged call like wp.atomic_add(int_out, i, 1) executes twice and overcounts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/codegen.py` around lines 1785 - 1788, The early-return that skips
codegen for integer atomics (the condition using is_float_type(scalar_dtype) and
return_is_consumed) is unsafe when a counting pass exists; modify that condition
to also detect whether the current kernel contains any consumed/counter atomic
(i.e., only return None when not is_float_type(scalar_dtype) AND not
return_is_consumed AND there is no consumed/counter atomic in the kernel). Use
the existing kernel/AST context or add a predicate (e.g.,
has_consumed_counter_atomic) to check for consumed atomics before returning so
phase 0 will not double-apply native integer atomics in mixed kernels.
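A toy model of the hazard (pure Python; every name is illustrative): in a mixed kernel the counting pass runs the whole kernel body once before the execute pass, so a native integer atomic that is not suppressed in phase 0 fires twice.

```python
def run_kernel(int_out, suppress_plain_atomics):
    # Stand-in for an unchanged wp.atomic_add(int_out, i, 1) in the kernel body.
    if not suppress_plain_atomics:
        int_out[0] += 1

def two_pass_launch(int_out, suppress_in_phase0):
    run_kernel(int_out, suppress_plain_atomics=suppress_in_phase0)  # phase 0: count
    run_kernel(int_out, suppress_plain_atomics=False)               # phase 1: execute

buggy = [0]
two_pass_launch(buggy, suppress_in_phase0=False)  # early fallback leaves atomic live
assert buggy[0] == 2                              # overcounted

fixed = [0]
two_pass_launch(fixed, suppress_in_phase0=True)   # phase 0 suppresses side effects
assert fixed[0] == 1
```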
