
data-parallel patched ALP standalone kernel #7576

Open

a10y wants to merge 7 commits into develop from aduffy/alp-patched

Conversation

@a10y (Contributor) commented Apr 20, 2026

Summary

Follow-up to #7440

This changes ALP execution on CUDA. Previously, we ran two kernel passes: one to perform ALP decoding into global memory, and a second to apply patches.

This PR follows the same approach used previously for bit-unpacking, pushing patching into the decoding kernel itself. We assign one FastLanes 1024-element block to each warp (32 threads) and perform decoding and patching in a single kernel pass.

Testing

Unit tests were added covering simple and edge cases (multi-chunk inputs, and a mix of chunks with and without patches).

@a10y a10y force-pushed the aduffy/alp-patched branch from 9dc91b1 to 949acdd Compare April 20, 2026 19:40
@a10y a10y added the changelog/feature A new feature label Apr 20, 2026
a10y commented Apr 20, 2026

Benchmark improvements vs. develop on a GH200:

~/vortex$ nvidia-smi
Mon Apr 20 22:09:35 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000000:DD:00.0 Off |                  Off |
| N/A   35C    P0             74W /  700W |       3MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[Screenshot: benchmark results, captured 2026-04-20 18:08]

(Outdated review thread on vortex-cuda/src/kernel/encodings/alp.rs)
a10y added 4 commits April 21, 2026 08:46
@a10y a10y marked this pull request as ready for review April 21, 2026 15:33
(Outdated review thread on vortex-cuda/kernels/src/alp.cu)
@a10y a10y force-pushed the aduffy/alp-patched branch from e51a081 to bc78994 Compare April 21, 2026 15:43