feat: GPU dyn dispatch patches support #7563
Merged
Force-pushed from 1e6e8a5 to 4f2397d
Force-pushed from 4f2397d to af8e9cc
Structural plumbing for per-op exception patches in the fused dynamic dispatch kernel. Adds PackedPatchesHeader and kernel helpers (patch_fl_chunk, patch_all_fl_chunks) but does not yet populate patches_ptr - all constructors initialize it to 0. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Force-pushed from af8e9cc to 9d3b988
Merging this PR will not alter performance
Use inline GPUPatches/PatchesCursor for both BitPacked and ALP patches in the fused dispatch kernel, matching the standalone kernel pattern. BitPacked patches are applied after bitunpack into shared-memory scratch, before scalar ops. ALP patches are applied after the tile loop writes to global memory (output stage) or after scalar ops in shared memory (input stages). The output Stage is stored in shared memory to reduce per-thread register pressure from uniform plan data.
…register spills without it
…ut-stage patches
BP patches are applied inside bitunpack() after each FL chunk, matching the standalone kernel pattern. ALP patches are applied inside the scalar_op() ALP case after decode. No separate patch passes remain in execute_*_stage.
…s_ptr from PackedStage

ALP patches are now carried on AlpParams.patches_ptr inside the ScalarOp, matching how BP patches are carried on BitunpackParams.patches_ptr inside the SourceOp. PackedStage no longer needs a separate alp_patches_ptr field.
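A minimal host-side sketch of the layout this commit describes: patches ride on the op parameter structs rather than on PackedStage. The struct and field names below are assumptions based on the commit text, not the actual kernel headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter structs after the refactor: each op that can carry
// exceptions owns its own patches pointer.
struct BitunpackParams {
    uint64_t src_ptr;       // packed input buffer (device address)
    uint8_t  bit_width;
    uint64_t patches_ptr;   // BP exception patches; 0 means "no patches"
};

struct AlpParams {
    uint8_t  exponent;
    uint8_t  factor;
    uint64_t patches_ptr;   // ALP exception patches; 0 means "no patches"
};

// Both ops share one convention for "patches present".
inline bool has_patches(uint64_t patches_ptr) { return patches_ptr != 0; }
```

The design choice is symmetry: the kernel can check `params.patches_ptr` at the point where the op executes, with no stage-level bookkeeping.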
- walk_slice handles Slice(ALP) by manually slicing the encoded child
- Sliced BitPacked and FoR(BitPacked) with patches work end-to-end
- Fix n_chunks computation: use chunk_offsets array length, matching CPU
- Sliced ALP patches need offset adjustment (test ignored, TODO)
…apping

Use Patches::slice(offset..offset+len) when manually slicing ALP in walk_slice, so chunk_offsets, indices, and offset are adjusted for the sliced range. Sliced BP patches work end-to-end. The sliced ALP patches test remains ignored: Patches::slice has an offset-mapping bug where patches outside the slice range leak into the output.
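A hypothetical host-side model of what a correct Patches::slice should produce, per the commit above: keep only patches whose absolute index falls in [start, start + len) and record start as the new offset so later stages can rebase indices. The struct and function are illustrative, not the Vortex API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified patches model: absolute positions plus the slice offset.
struct Patches {
    std::vector<uint64_t> indices;  // absolute positions in the original array
    uint64_t offset = 0;            // start of the slice
};

// Intended slice semantics: filter to the range, remember the new offset.
// The bug described above is patches *outside* this range surviving the cut.
Patches slice(const Patches& p, uint64_t start, uint64_t len) {
    Patches out;
    out.offset = start;
    for (uint64_t idx : p.indices)
        if (idx >= start && idx < start + len)  // drop out-of-range patches
            out.indices.push_back(idx);
    return out;
}
```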
Replace PatchesCursor with a simple linear scan over all patches in the ALP scalar_op. PatchesCursor's chunk-based offset arithmetic gave wrong positions for sliced arrays (a coordinate-system mismatch between output-relative abs_pos and original-array chunk indices). The linear scan directly subtracts patches.offset from absolute patch indices to get output positions: simple, correct, and cheap for sparse patches.
…nk positions

Replace chunk_base reconstruction with a direct within-chunk position comparison. Each value computes orig_pos = output_pos + patches.offset, then chunk = orig_pos / 1024 and within = orig_pos % 1024. PatchesCursor returns within-chunk indices, so comparing patch.index == within is correct regardless of the slice offset. Remove the unused scatter_patches_output helper.
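The coordinate mapping above is pure integer arithmetic and can be sketched directly, assuming the FastLanes chunk size of 1024. Names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// FastLanes chunk size assumed by the commit above.
constexpr uint64_t FL_CHUNK = 1024;

struct ChunkPos {
    uint64_t chunk;   // chunk number in the original (unsliced) array
    uint64_t within;  // within-chunk index, comparable to cursor patch indices
};

// Map an output-relative position back to original-array coordinates, then
// split into (chunk, within-chunk). Because the split happens after adding
// the slice offset, the within-chunk index is correct for any slice.
ChunkPos to_chunk_pos(uint64_t output_pos, uint64_t slice_offset) {
    uint64_t orig_pos = output_pos + slice_offset;
    return { orig_pos / FL_CHUNK, orig_pos % FL_CHUNK };
}
```

For example, with a slice offset of 1500, output position 0 lands in original chunk 1 at within-chunk index 476.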
…nk boundaries
- u8, u16, u64 BitPacked with patches
- Dict with patched BitPacked codes (input-stage patches)
- Patches at FL chunk boundaries (1023, 1024, 2047, 2048)
- Large array (100K elements, multi-block)
- Nullable array with patches
- All-patches extreme case (bit_width=1, every value is a patch)
- Fix double-counted offset in the ALP scalar_op: subtract chunk_start from the absolute chunk to get the correct index into the sliced chunk_offsets array. Fixes CUDA_ERROR_ILLEGAL_ADDRESS for slices with offset >= 1024.
- Widen PatchesCursor::remaining from uint8_t to uint32_t to handle chunks with more than 255 patches.
- Zero-initialize GPUPatches before serialization to avoid uninitialized padding bytes (technically UB under the Rust memory model).
- Remove the local FL_CHUNK shadow in execute_output_stage.
- Add test_sliced_alp_with_patches_large_offset (offset=1500).
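The first fix above is again a coordinate rebase: a sliced patches struct only carries chunk_offsets for the chunks inside the slice, so the absolute chunk number must have the slice's first chunk subtracted before it is used as an index. A minimal sketch, assuming the 1024-value FastLanes chunk (names illustrative):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t FL_CHUNK = 1024;

// Index into a *sliced* chunk_offsets array: rebase the absolute chunk number
// by the first chunk covered by the slice. Without the subtraction the index
// is double-counted and reads past the array (the illegal-address crash).
uint64_t sliced_chunk_index(uint64_t output_pos, uint64_t slice_offset) {
    uint64_t chunk_start = slice_offset / FL_CHUNK;            // first chunk in slice
    uint64_t abs_chunk   = (output_pos + slice_offset) / FL_CHUNK;
    return abs_chunk - chunk_start;                            // 0-based into sliced array
}
```

With offset=1500 (the new test's case), output position 0 maps to absolute chunk 1 but sliced index 0, which is why offsets >= 1024 were the failing regime.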
…dispatch.cu, and bit_unpack_gen.rs
…age field
- Move patches from loose Stage fields to op-associated storage: source_patches for SourceOp, (ScalarOp, Option<Patches>) tuples for scalar ops
- Consolidate 7 patch tests into 3 rstest-parametrized groups with unsliced/sliced/large-offset cases
- Fix the walk_for scalar_ops push for the new tuple type
…ar patches
Force-pushed from 1b97331 to 8c7987f
a10y reviewed Apr 21, 2026
a10y approved these changes Apr 22, 2026
Integrates the structural plumbing and applies patches in source and scalar ops within the GPU dynamic dispatch kernel. None of the CUDA dynamic dispatch benchmarks regressed with patches support added.
As part of this change, the FastLanes lane count for a given type is now determined at compile time via FL_LANES<type>.
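FastLanes operates on 1024-bit virtual registers, so the lane count for a type is 1024 divided by its bit width. A minimal sketch of how a compile-time FL_LANES could be defined (the actual definition in the PR may differ):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative compile-time lane count, assuming the FastLanes convention of
// a 1024-bit virtual register: lanes = 1024 / (bits per element).
template <typename T>
constexpr int FL_LANES = 1024 / (8 * static_cast<int>(sizeof(T)));

static_assert(FL_LANES<uint8_t>  == 128, "128 x 8-bit lanes");
static_assert(FL_LANES<uint16_t> == 64,  "64 x 16-bit lanes");
static_assert(FL_LANES<uint32_t> == 32,  "32 x 32-bit lanes");
static_assert(FL_LANES<uint64_t> == 16,  "16 x 64-bit lanes");
```

Resolving this at compile time lets the kernel's unpack loops use the lane count as a constant, avoiding a runtime lookup per type.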