feat: GPU dyn dispatch patches support #7563

Merged
0ax1 merged 31 commits into develop from ad/cuda-patches-v2
Apr 22, 2026

0ax1 commented Apr 20, 2026

Integrates the structural plumbing for patches and applies them in source and scalar ops within the GPU dynamic dispatch kernel. None of the CUDA dynamic dispatch benchmarks regressed after adding patches support.

As part of this change, the fastlanes lane count for a given type is now determined at compile time via FL_LANES<type>.
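
As a rough illustration of a compile-time lane count, the sketch below mirrors the idea in Rust. It assumes the usual FastLanes convention of a 1024-bit virtual register, so a type that is W bits wide has 1024 / W lanes. The trait name `FlLanes` and the constant `FL_REGISTER_BITS` are hypothetical; this is not the kernel's actual `FL_LANES<type>` definition.

```rust
/// Assumed width of the FastLanes virtual register, in bits.
const FL_REGISTER_BITS: usize = 1024;

/// Hypothetical Rust analogue of `FL_LANES<type>`: the lane count is an
/// associated constant, so it is resolved at compile time.
trait FlLanes {
    const LANES: usize;
}

macro_rules! impl_fl_lanes {
    ($($t:ty),*) => {
        $(impl FlLanes for $t {
            // lanes = register width in bits / element width in bits
            const LANES: usize = FL_REGISTER_BITS / (8 * std::mem::size_of::<$t>());
        })*
    };
}

impl_fl_lanes!(u8, u16, u32, u64);
```

Under these assumptions `<u32 as FlLanes>::LANES` is 32 and `<u64 as FlLanes>::LANES` is 16, and both can be used wherever a constant expression is required, such as an array length.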

0ax1 requested a review from myrrc April 20, 2026 11:02
0ax1 changed the title from "chore: add patches_ptr to BitunpackParams and AlpParams" to "chore: GPU dyn dispatch plumbing" April 20, 2026
0ax1 added the changelog/chore (A trivial change) label April 20, 2026
0ax1 force-pushed the ad/cuda-patches-v2 branch from 1e6e8a5 to 4f2397d April 20, 2026 11:04
0ax1 marked this pull request as draft April 20, 2026 11:42
0ax1 force-pushed the ad/cuda-patches-v2 branch from 4f2397d to af8e9cc April 20, 2026 13:36
Structural plumbing for per-op exception patches in the fused
dynamic dispatch kernel. Adds PackedPatchesHeader and kernel
helpers (patch_fl_chunk, patch_all_fl_chunks) but does not yet
populate patches_ptr - all constructors initialize it to 0.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 force-pushed the ad/cuda-patches-v2 branch from af8e9cc to 9d3b988 April 20, 2026 13:46

codspeed-hq Bot commented Apr 20, 2026

Merging this PR will not alter performance

✅ 1163 untouched benchmarks
⏩ 1462 skipped benchmarks [1]


Comparing ad/cuda-patches-v2 (a187e0c) with develop (3e00b5a)

Footnotes

  1. 1462 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

0ax1 added 14 commits April 20, 2026 16:18
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Use inline GPUPatches/PatchesCursor for both BitPacked and ALP patches
in the fused dispatch kernel, matching the standalone kernel pattern.

BitPacked patches are applied after bitunpack into shared memory scratch,
before scalar ops. ALP patches are applied after the tile loop writes to
global memory (output stage) or after scalar ops in shared memory (input
stages).

The output Stage is stored in shared memory to reduce per-thread register
pressure from uniform plan data.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
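
The ordering in the commit above (bit-unpack into scratch, apply BitPacked patches, then run scalar ops) can be modelled on the CPU as follows. This is only a sketch under stated assumptions: `decode_chunk` is a hypothetical helper, the scratch buffer is a plain slice rather than shared memory, and frame-of-reference addition stands in for an arbitrary scalar op.

```rust
/// CPU model of one stage (hypothetical helper, not the kernel code).
/// `scratch` already holds the bit-unpacked values of one chunk.
/// `patches` are (position, exception value) pairs in the same encoded domain.
fn decode_chunk(scratch: &mut [u32], patches: &[(usize, u32)], reference: u32) {
    // 1. Overwrite exception positions with their patch values.
    for &(idx, value) in patches {
        if let Some(slot) = scratch.get_mut(idx) {
            *slot = value;
        }
    }
    // 2. Only then run the scalar op (here: frame-of-reference addition),
    //    so patched values pass through the same op as every other value.
    for v in scratch.iter_mut() {
        *v = v.wrapping_add(reference);
    }
}
```

If the two steps were swapped, a patched position would end up holding the raw exception value without the reference added.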
…register spills without it

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…ut-stage patches

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
BP patches applied inside bitunpack() after each FL chunk, matching
the standalone kernel pattern. ALP patches applied inside scalar_op()
ALP case after decode. No separate patch passes in execute_*_stage.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…s_ptr from PackedStage

ALP patches are now carried on AlpParams.patches_ptr inside the ScalarOp,
matching how BP patches are carried on BitunpackParams.patches_ptr inside
the SourceOp. PackedStage no longer needs a separate alp_patches_ptr field.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
- walk_slice handles Slice(ALP) by manually slicing the encoded child
- Sliced BitPacked and FoR(BitPacked) with patches work end-to-end
- Fix n_chunks computation: use chunk_offsets array length, matching CPU
- Sliced ALP patches need offset adjustment (test ignored, TODO)

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…apping

Use Patches::slice(offset..offset+len) when manually slicing ALP in
walk_slice, so chunk_offsets, indices, and offset are adjusted for the
sliced range. Sliced BP patches work end-to-end.

Sliced ALP patches test remains ignored — Patches::slice has an offset
mapping bug where patches outside the slice range leak into the output.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Replace PatchesCursor with a simple linear scan over all patches in the
ALP scalar_op. PatchesCursor's chunk-based offset arithmetic gave wrong
positions for sliced arrays (the coordinate system mismatch between
output-relative abs_pos and original-array chunk indices). The linear
scan directly subtracts patches.offset from absolute patch indices to
get output positions — simple, correct, and cheap for sparse patches.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
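
The linear scan described in the commit above might look roughly like this on the CPU. The struct `PatchesView` and its field names are hypothetical, and `f32` is chosen only for illustration; the one detail taken from the commit message is that `offset` is subtracted from each absolute patch index to obtain the output position.

```rust
/// Hypothetical read-only view of the patch data.
struct PatchesView<'a> {
    /// Absolute positions in the original (unsliced) array.
    indices: &'a [u64],
    /// Exception values, one per index.
    values: &'a [f32],
    /// Start of the slice within the original array.
    offset: u64,
}

/// Scan every patch and write the ones that land inside `output`.
fn apply_patches_linear(output: &mut [f32], patches: &PatchesView<'_>) {
    for (&abs_idx, &value) in patches.indices.iter().zip(patches.values) {
        // Patches before the slice start have no output position.
        let Some(pos) = abs_idx.checked_sub(patches.offset) else {
            continue;
        };
        // Patches past the slice end are skipped by the bounds check.
        if let Some(slot) = output.get_mut(pos as usize) {
            *slot = value;
        }
    }
}
```

The cost is linear in the total number of patches per call, which is why the commit message describes the approach as cheap only when patches are sparse.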
…nk positions

Replace chunk_base reconstruction with direct within-chunk position
comparison. Each value computes orig_pos = output_pos + patches.offset,
then chunk = orig_pos / 1024 and within = orig_pos % 1024. PatchesCursor
returns within-chunk indices, so comparing patch.index == within is
correct regardless of slice offset.

Remove unused scatter_patches_output helper.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
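
The position arithmetic from the commit above, written out as a small sketch. The function names `locate` and `lookup` and the `(u16, f32)` patch layout are assumptions for illustration; the chunk size of 1024 and the formulas come from the commit message.

```rust
/// FastLanes chunk size used in the commit message.
const FL_CHUNK: usize = 1024;

/// Map an output position to (chunk, within-chunk position) in the
/// original, unsliced array.
fn locate(output_pos: usize, patches_offset: usize) -> (usize, usize) {
    let orig_pos = output_pos + patches_offset;
    (orig_pos / FL_CHUNK, orig_pos % FL_CHUNK)
}

/// Look up a patch for `within` among one chunk's patches, given as
/// (within-chunk index, value) pairs (hypothetical layout).
fn lookup(chunk_patches: &[(u16, f32)], within: usize) -> Option<f32> {
    chunk_patches
        .iter()
        .find(|(idx, _)| *idx as usize == within)
        .map(|&(_, v)| v)
}
```

For a slice starting at offset 1500, output position 0 maps to chunk 1, within-chunk position 476, so the comparison against within-chunk patch indices does not depend on where the slice begins.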
…nk boundaries

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
- u8, u16, u64 BitPacked with patches
- Dict with patched BitPacked codes (input stage patches)
- Patches at FL chunk boundaries (1023, 1024, 2047, 2048)
- Large array (100K elements, multi-block)
- Nullable array with patches
- All-patches extreme case (bit_width=1, every value is a patch)

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 changed the title from "chore: GPU dyn dispatch plumbing" to "feat: GPU dyn dispatch patches support" Apr 21, 2026
0ax1 added 6 commits April 21, 2026 13:57
- Fix double-counted offset in ALP scalar_op: subtract chunk_start from
  absolute chunk to get correct index into sliced chunk_offsets array.
  Fixes CUDA_ERROR_ILLEGAL_ADDRESS for slices with offset >= 1024.
- Widen PatchesCursor::remaining from uint8_t to uint32_t to handle
  chunks with >255 patches.
- Zero-initialize GPUPatches before serialization to avoid uninitialized
  padding bytes (technically UB under Rust memory model).
- Remove local FL_CHUNK shadow in execute_output_stage.
- Add test_sliced_alp_with_patches_large_offset (offset=1500).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
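
Two of the fixes listed above can be illustrated with a short sketch. It assumes that `chunk_start` is the index of the first chunk the slice overlaps (that is, `patches_offset / 1024`) and that the sliced `chunk_offsets` array begins at that chunk; both are assumptions, as are the function names.

```rust
const FL_CHUNK: usize = 1024;

/// Index into the sliced `chunk_offsets` array for an output position.
/// Assumption: the sliced array starts at chunk `patches_offset / FL_CHUNK`.
fn local_chunk_index(output_pos: usize, patches_offset: usize) -> usize {
    let abs_chunk = (output_pos + patches_offset) / FL_CHUNK;
    let chunk_start = patches_offset / FL_CHUNK;
    // Using `abs_chunk` directly would count the offset twice and read past
    // the end of the sliced array once `patches_offset >= FL_CHUNK`.
    abs_chunk - chunk_start
}

/// Shows why an 8-bit `remaining` counter is too narrow: the cast wraps
/// silently once a chunk holds more than 255 patches.
fn remaining_as_u8(count: u32) -> u8 {
    count as u8
}
```

With an offset of 1500, output position 0 lies in absolute chunk 1 but in local chunk 0, which is consistent with the reported failure appearing only for offsets of 1024 or more.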
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…dispatch.cu, and bit_unpack_gen.rs

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 added 4 commits April 21, 2026 14:45
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…age field

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
- Move patches from loose Stage fields to op-associated storage:
  source_patches for SourceOp, (ScalarOp, Option<Patches>) tuples
- Consolidate 7 patch tests into 3 rstest parametrized groups with
  unsliced/sliced/large-offset cases
- Fix walk_for scalar_ops push for new tuple type

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
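
The op-associated storage described above could be shaped roughly like this. Every type here is a simplified stand-in written for illustration; the real `Stage`, `SourceOp`, `ScalarOp`, and `Patches` types in the codebase have different fields and variants.

```rust
/// Simplified stand-in for the patch data.
#[derive(Debug, Clone, PartialEq)]
struct Patches {
    indices: Vec<u64>,
    values: Vec<f32>,
    offset: usize,
}

/// Simplified stand-ins for the op enums.
#[derive(Debug, Clone, PartialEq)]
enum SourceOp {
    Bitunpack { bit_width: u8 },
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarOp {
    FrameOfReference { reference: i64 },
    Alp,
}

/// Patches sit next to the op they belong to instead of in loose fields.
struct Stage {
    source: SourceOp,
    source_patches: Option<Patches>,
    scalar_ops: Vec<(ScalarOp, Option<Patches>)>,
}

impl Stage {
    /// Number of ops in this stage that carry patches.
    fn patched_op_count(&self) -> usize {
        usize::from(self.source_patches.is_some())
            + self.scalar_ops.iter().filter(|(_, p)| p.is_some()).count()
    }
}
```

Pairing each scalar op with its optional patches keeps the two in step when ops are pushed or reordered, which is presumably why `walk_for` needed the adjustment mentioned in the last bullet.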
…ar patches

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 force-pushed the ad/cuda-patches-v2 branch from 1b97331 to 8c7987f April 21, 2026 15:13
0ax1 added 5 commits April 21, 2026 15:14
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 marked this pull request as ready for review April 21, 2026 15:40
0ax1 requested review from a10y and robert3005 April 21, 2026 15:40
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 added the changelog/feature (A new feature) label and removed the changelog/chore (A trivial change) label Apr 21, 2026
Comment thread vortex-cuda/src/dynamic_dispatch/plan_builder.rs
Comment thread vortex-cuda/src/dynamic_dispatch/plan_builder.rs
0ax1 requested a review from a10y April 22, 2026 13:31
0ax1 merged commit 029fb66 into develop Apr 22, 2026
76 of 83 checks passed
0ax1 deleted the ad/cuda-patches-v2 branch April 22, 2026 13:47
Labels

changelog/feature (A new feature)

2 participants