feat: GPU dyn dispatch patches support #7563
Merged
Force-pushed from 1e6e8a5 to 4f2397d
Force-pushed from 4f2397d to af8e9cc
Structural plumbing for per-op exception patches in the fused dynamic dispatch kernel. Adds PackedPatchesHeader and kernel helpers (patch_fl_chunk, patch_all_fl_chunks) but does not yet populate patches_ptr - all constructors initialize it to 0. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Force-pushed from af8e9cc to 9d3b988
Merging this PR will not alter performance
Use inline GPUPatches/PatchesCursor for both BitPacked and ALP patches in the fused dispatch kernel, matching the standalone kernel pattern. BitPacked patches are applied after bitunpack into shared-memory scratch, before scalar ops. ALP patches are applied after the tile loop writes to global memory (output stage) or after scalar ops in shared memory (input stages). The output Stage is stored in shared memory to reduce per-thread register pressure from uniform plan data.
…register spills without it
…ut-stage patches
BP patches are applied inside bitunpack() after each FL chunk, matching the standalone kernel pattern. ALP patches are applied inside the scalar_op() ALP case after decode. No separate patch passes remain in execute_*_stage.
…s_ptr from PackedStage

ALP patches are now carried on AlpParams.patches_ptr inside the ScalarOp, matching how BP patches are carried on BitunpackParams.patches_ptr inside the SourceOp. PackedStage no longer needs a separate alp_patches_ptr field.
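A minimal host-side sketch of the layout this commit describes: patches ride on the op parameter structs rather than on PackedStage. The struct and field names below are assumptions based on the commit text, not the actual kernel headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter structs after the refactor: each op that can carry
// exceptions owns its own patches pointer.
struct BitunpackParams {
    uint64_t src_ptr;       // packed input buffer (device address)
    uint8_t  bit_width;
    uint64_t patches_ptr;   // BP exception patches; 0 means "no patches"
};

struct AlpParams {
    uint8_t  exponent;
    uint8_t  factor;
    uint64_t patches_ptr;   // ALP exception patches; 0 means "no patches"
};

// Both ops share one convention for "patches present".
inline bool has_patches(uint64_t patches_ptr) { return patches_ptr != 0; }
```

The design choice is symmetry: the kernel can check `params.patches_ptr` at the point where the op executes, with no stage-level bookkeeping.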
- walk_slice handles Slice(ALP) by manually slicing the encoded child
- Sliced BitPacked and FoR(BitPacked) with patches work end-to-end
- Fix n_chunks computation: use chunk_offsets array length, matching CPU
- Sliced ALP patches need offset adjustment (test ignored, TODO)
…apping

Use Patches::slice(offset..offset+len) when manually slicing ALP in walk_slice, so chunk_offsets, indices, and offset are adjusted for the sliced range. Sliced BP patches work end-to-end. The sliced ALP patches test remains ignored: Patches::slice has an offset-mapping bug where patches outside the slice range leak into the output.
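A hypothetical host-side model of what a correct Patches::slice should produce, per the commit above: keep only patches whose absolute index falls in [start, start + len) and record start as the new offset so later stages can rebase indices. The struct and function are illustrative, not the Vortex API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified patches model: absolute positions plus the slice offset.
struct Patches {
    std::vector<uint64_t> indices;  // absolute positions in the original array
    uint64_t offset = 0;            // start of the slice
};

// Intended slice semantics: filter to the range, remember the new offset.
// The bug described above is patches *outside* this range surviving the cut.
Patches slice(const Patches& p, uint64_t start, uint64_t len) {
    Patches out;
    out.offset = start;
    for (uint64_t idx : p.indices)
        if (idx >= start && idx < start + len)  // drop out-of-range patches
            out.indices.push_back(idx);
    return out;
}
```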
Replace PatchesCursor with a simple linear scan over all patches in the ALP scalar_op. PatchesCursor's chunk-based offset arithmetic gave wrong positions for sliced arrays (a coordinate-system mismatch between output-relative abs_pos and original-array chunk indices). The linear scan directly subtracts patches.offset from absolute patch indices to get output positions: simple, correct, and cheap for sparse patches.
…nk positions

Replace chunk_base reconstruction with a direct within-chunk position comparison. Each value computes orig_pos = output_pos + patches.offset, then chunk = orig_pos / 1024 and within = orig_pos % 1024. PatchesCursor returns within-chunk indices, so comparing patch.index == within is correct regardless of the slice offset. Remove the unused scatter_patches_output helper.
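The coordinate mapping above is pure integer arithmetic and can be sketched directly, assuming the FastLanes chunk size of 1024. Names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// FastLanes chunk size assumed by the commit above.
constexpr uint64_t FL_CHUNK = 1024;

struct ChunkPos {
    uint64_t chunk;   // chunk number in the original (unsliced) array
    uint64_t within;  // within-chunk index, comparable to cursor patch indices
};

// Map an output-relative position back to original-array coordinates, then
// split into (chunk, within-chunk). Because the split happens after adding
// the slice offset, the within-chunk index is correct for any slice.
ChunkPos to_chunk_pos(uint64_t output_pos, uint64_t slice_offset) {
    uint64_t orig_pos = output_pos + slice_offset;
    return { orig_pos / FL_CHUNK, orig_pos % FL_CHUNK };
}
```

For example, with a slice offset of 1500, output position 0 lands in original chunk 1 at within-chunk index 476.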
…nk boundaries
- u8, u16, u64 BitPacked with patches
- Dict with patched BitPacked codes (input-stage patches)
- Patches at FL chunk boundaries (1023, 1024, 2047, 2048)
- Large array (100K elements, multi-block)
- Nullable array with patches
- All-patches extreme case (bit_width=1, every value is a patch)
- Fix double-counted offset in the ALP scalar_op: subtract chunk_start from the absolute chunk to get the correct index into the sliced chunk_offsets array. Fixes CUDA_ERROR_ILLEGAL_ADDRESS for slices with offset >= 1024.
- Widen PatchesCursor::remaining from uint8_t to uint32_t to handle chunks with more than 255 patches.
- Zero-initialize GPUPatches before serialization to avoid uninitialized padding bytes (technically UB under the Rust memory model).
- Remove the local FL_CHUNK shadow in execute_output_stage.
- Add test_sliced_alp_with_patches_large_offset (offset=1500).
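The first fix above is again a coordinate rebase: a sliced patches struct only carries chunk_offsets for the chunks inside the slice, so the absolute chunk number must have the slice's first chunk subtracted before it is used as an index. A minimal sketch, assuming the 1024-value FastLanes chunk (names illustrative):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t FL_CHUNK = 1024;

// Index into a *sliced* chunk_offsets array: rebase the absolute chunk number
// by the first chunk covered by the slice. Without the subtraction the index
// is double-counted and reads past the array (the illegal-address crash).
uint64_t sliced_chunk_index(uint64_t output_pos, uint64_t slice_offset) {
    uint64_t chunk_start = slice_offset / FL_CHUNK;            // first chunk in slice
    uint64_t abs_chunk   = (output_pos + slice_offset) / FL_CHUNK;
    return abs_chunk - chunk_start;                            // 0-based into sliced array
}
```

With offset=1500 (the new test's case), output position 0 maps to absolute chunk 1 but sliced index 0, which is why offsets >= 1024 were the failing regime.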
…dispatch.cu, and bit_unpack_gen.rs
…age field
- Move patches from loose Stage fields to op-associated storage: source_patches for SourceOp, (ScalarOp, Option<Patches>) tuples for scalar ops
- Consolidate 7 patch tests into 3 rstest-parametrized groups with unsliced/sliced/large-offset cases
- Fix the walk_for scalar_ops push for the new tuple type
…ar patches
Force-pushed from 1b97331 to 8c7987f
a10y reviewed Apr 21, 2026
a10y approved these changes Apr 22, 2026
Integrates the structural plumbing and applies patches in source and scalar ops within the GPU dynamic dispatch kernel. None of the CUDA dynamic dispatch benchmarks regressed with patches support added.
As part of this change, the FastLanes lane count for a given type is now determined at compile time via FL_LANES<type>.
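FastLanes operates on 1024-bit virtual registers, so the lane count for a type is 1024 divided by its bit width. A minimal sketch of how a compile-time FL_LANES could be defined (the actual definition in the PR may differ):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative compile-time lane count, assuming the FastLanes convention of
// a 1024-bit virtual register: lanes = 1024 / (bits per element).
template <typename T>
constexpr int FL_LANES = 1024 / (8 * static_cast<int>(sizeof(T)));

static_assert(FL_LANES<uint8_t>  == 128, "128 x 8-bit lanes");
static_assert(FL_LANES<uint16_t> == 64,  "64 x 16-bit lanes");
static_assert(FL_LANES<uint32_t> == 32,  "32 x 32-bit lanes");
static_assert(FL_LANES<uint64_t> == 16,  "16 x 64-bit lanes");
```

Resolving this at compile time lets the kernel's unpack loops use the lane count as a constant, avoiding a runtime lookup per type.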