
Fix link to BenchmarkList in README.md#1

Open
zqb-all wants to merge 403 commits into cupbop:master from zqb-all:patch-1

Conversation


@zqb-all zqb-all commented Apr 7, 2026

No description provided.

parnenziniGT and others added 30 commits February 13, 2024 20:33
Due to the following warning messages:
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v2. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/

update build.yml to use Node.js 20
Since actions/checkout@v2 makes use of Node.js 16, we need to upgrade to actions/checkout@v3 instead (which uses Node.js 20). As mentioned here:
https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/
Node.js 16 is going to be deprecated in Spring 2024.
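The change amounts to bumping one action version in the workflow file; a minimal sketch of the relevant step in build.yml (surrounding step names are illustrative):

```yaml
steps:
  # Before: actions/checkout@v2 runs on Node.js 16 and triggers the
  # deprecation warning quoted above.
  - uses: actions/checkout@v3   # runs on Node.js 20
```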
chihyoA and others added 29 commits March 30, 2026 22:00
- CMakeLists.txt: forward LLVM_PREFIX, TOOLDIR, VORTEX_HOME,
  CuPBoP_PATH, RISCV_TOOLCHAIN to kernel Makefile so
  cudaKernelImpl_64.o is rebuilt during cmake build
- Remove pre-built .o files from git (they were stale and
  missing __nv_erfcf, causing blackscholes link failure)
- .o files already in .gitignore, will be ignored going forward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes ~60+ errs()/dbgs() prints that dump IR to stderr
during normal compilation. Keeps cupbop_debug() guarded prints
and actual error messages.

Files: performance.cpp, memory_hierarchy.cpp, insert_sync.cpp,
insert_warp_loop.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures cudaKernelImpl_64.o is always rebuilt from source,
preventing stale .o files from causing link errors
(e.g. missing __nv_erfcf in blackscholes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
performance.cpp: wrap analysis registration/running prints
cg_sync.cpp: wrap CG Sync detection print

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI_RUN_ARGS=32 to avoid timeout (was 40min+ at 128).
Correctness check changed from golden diff to result file
existence check since golden was generated at size 128.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Golden generated from NVIDIA GPU with args "32 10".
Reverts CI correctness check back to golden diff comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cupbop_debug() is static in tool.cpp, not accessible from
cg_sync.cpp. Remove the debug print entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was static in tool.cpp, causing build errors in cg_sync.cpp
and performance.cpp which also call cupbop_debug().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now using shared inline definition from tool.h.
Fixes redefinition error in generate_wrapper.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous CI runs leave Perf_counter files that don't have
matching ci_status files for the current run, showing as ❓.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes ci_status, ci_output, Perf_counter, result_vortex files
from all examples before building, so summary only shows
current run results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tions

Indirect calls (e.g. llvm.lifetime.end) have null getCalledFunction().
Using getCalledOperand()->getName() on these crashes with
"dyn_cast on a non-existent value".

Fixed in:
- tool.cpp: InlineAsm null check (line 1057), has_warp_barrier,
  has_barrier, has_block_barrier
- insert_warp_loop.cpp: getParallelRegionBefore (line 1296),
  runOnFunction (line 1498)
- handle_sync.cpp: split_block_by_sync (line 37)
- insert_sync.cpp: InsertConditionalForBarrier (line 474)

All use getCalledFunction() instead of getCalledOperand() and
add !Call->getCalledFunction() null check.

Fixes wedford-cuda, score-cuda, marchingCubes-cuda SCHE_0 crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
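The guarded pattern described above can be sketched with stand-in types (`StubCall` and `is_barrier_call` are illustrative, not CuPBoP's actual API): the point is to test the result of `getCalledFunction()` before using the callee's name, since an indirect call has no called `Function`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Stand-in for LLVM's CallInst: a direct call has a resolvable callee name,
// an indirect call (getCalledFunction() == nullptr) does not.
struct StubCall {
    std::optional<std::string> callee;
};

// Mirror of the fixed pattern: null-check the callee before reading its
// name, and treat indirect calls as "not a barrier" instead of crashing.
bool is_barrier_call(const StubCall &call) {
    if (!call.callee)   // indirect call, e.g. through llvm.lifetime.end operand
        return false;   // skip it rather than dereference a non-existent value
    return *call.callee == "llvm.nvvm.barrier0";
}
```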
…shes

- Add warp_shfl/warp_vote/vote_count definitions to generated kernel_wrapper.cpp
  for SCHE_0/1 (Vortex kernel had no definition, only CPU threadPool runtime did)
- Set setDSOLocal(true) on warp_shfl/warp_vote/vote_count in init.cpp (required
  for RISC-V TLS, matching other TLS globals)
- Fix sche_data.inner_loop_cond null crash: initialize inner loop variables in
  add_mapping_variable when need_nested_loop && intra_warp_loop
- Fix last 2 getCalledOperand()->getName() calls in insert_warp_loop.cpp
  (lines 1212, 1640) with null-check + getCalledFunction()

Tested: wedford-cuda, score-cuda, bh-cuda SCHE_0 all PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…er detection

- tool.cpp: Fix threadIdx.y warp size constant 4→32, add %block_size_y for 3D
  blocks, implement threadIdx.z = flat_tid / (block_size_x * block_size_y)
- generate_wrapper.cpp: Include block_size_z in block_size calculation (x*y*z)
- warp_func.cpp: Add shfl idx and xor variants to replaceWarpShflFlat,
  detect C++ mangled __shfl*_sync functions in FLAT mode (declaration-only
  wrappers like _Z16__shfl_down_syncifii), remove dead shfl wrapper functions
  after FLAT replacement to prevent RISC-V backend crash
- marchingCubes-cuda: Reduce countingThreadNumLv2 1024→256 (Vortex HW limit 512)

Tested: wedford/score/bh SCHE_0 PASS, jacobi SCHE_0 builds with warp_shfl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
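The 3-D index recovery described above can be illustrated in plain C++. Only the threadIdx.z formula is stated in the commit; the x and y formulas below are assumed from the standard CUDA row-major flattening convention.

```cpp
#include <cassert>

struct Tid { unsigned x, y, z; };

// Recover threadIdx.{x,y,z} from a flat thread id for a block of
// dimensions (bx, by, bz):
//   x = flat % bx
//   y = (flat / bx) % by
//   z = flat / (bx * by)   <- the formula named in the commit
Tid unflatten(unsigned flat, unsigned bx, unsigned by) {
    return Tid{ flat % bx, (flat / bx) % by, flat / (bx * by) };
}
```

The matching total block size used by the wrapper is then bx * by * bz, which is the generate_wrapper.cpp change noted above.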
…n errors)

The hasCppWrappers logic in replaceWarpShfl caused:
- score-cuda: replaceAllUses type mismatch assertion
- marchingCubes/jaccard: instruction domination errors
- srad_v2: LLVM UnifyLoopExits assertion

Revert to original NVVM-intrinsic-only detection path. C++ wrapper
calls (_Z16__shfl_down_sync etc) will be resolved at link time by
cudaKernelImpl. Keep shfl idx/xor variants and isShflCall lambda
for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add has_warp_shfl_usage() to detect warp_shfl global users and activate
  nested loop (intra + inter warp) for kernels using shuffle emulation
- Skip llvm.lifetime intrinsics in alloca divergence and non-PR checks
  (only when nested loop is active, to avoid regression on bh/backprop)
- Prevent double context save: skip allocas with _intra_warp_/_inter_warp_
  suffix in all three AddContextSaveRestore call sites
- Guard isDivergenceSource intra_warp_index check with need_nested_loop
  to avoid unnecessary divergence propagation in single-loop kernels
- Add inline_shfl_helpers() for transitive inlining of helper functions
  containing shfl calls, with barrier0 insertion after each inlined call
- Re-run inline_func_with_tid + remove_cuda_built_in after shfl helpers
- Skip shfl inline in inline_warp_level_func for SCHE_0 (handled by FLAT)
- Fix warp_shfl GEP: use builder.CreateGEP/CreateLoad directly to prevent
  constant folding that caused type mismatch in score-cuda
- Use barrier0 (CreateInterWarpBarrier) in replaceWarpShflFlat for proper
  parallel region construction

Tested: wedford block_y=1 SCHE_0 PASSED, dotproduct PASSED, backprop PASSED

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jaccard outputs PASSED for weighted case and FAIL for unweighted case.
grep -q "PASSED" matched the first, ignoring the FAIL. Add ! grep "^FAIL"
check so any FAIL in ci_output.txt causes CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
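The tightened check — require a PASSED somewhere and reject any line beginning with FAIL — can be sketched as a predicate over the output (`ci_passed` is an illustrative name; the actual check is the pair of greps quoted above):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True only if some line contains "PASSED" and no line starts with "FAIL",
// mirroring: grep -q "PASSED" ci_output.txt && ! grep -q "^FAIL" ci_output.txt
bool ci_passed(const std::string &output) {
    std::istringstream in(output);
    std::string line;
    bool saw_passed = false;
    while (std::getline(in, line)) {
        if (line.rfind("FAIL", 0) == 0)   // any leading-FAIL line is fatal
            return false;
        if (line.find("PASSED") != std::string::npos)
            saw_passed = true;
    }
    return saw_passed;
}
```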
- Guard inline_shfl_helpers with SCHE_0 check — it was running for SCHE_2
  too, causing timeout regressions (wedford SCHE_2: 1min → timeout)
- Restrict FLAT shfl replacement to kernel functions only — helper functions
  like parallel_prefix_sum had their shfl replaced but weren't wrapped by
  warp loop, causing domination errors
- Fix shfl operand domination: always save shfl_offset to alloca before
  barrier, load after barrier. split_block_by_sync splits at barrier0,
  separating operand definition from use. Save result to alloca with
  per-user loads for cross-BB result users.
- Skip shfl_* allocas in insert_warp_loop context save — these are shfl
  emulation temporaries, not user data. Converting them to per-thread
  arrays breaks shfl emulation.

Tested: jaccard SCHE_0 builds, wedford SCHE_2 PASSED, dotproduct SCHE_0 PASSED

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce JACOBI_N=16 MAX_ITERS=10 (was 32/100) to fit CI timeout
- Switch CI verification to golden comparison (was PASSED grep)
- Generate golden from CUDA GPU with same settings
- jacobi SCHE_2 PASSED (Error=0.0855817 matches GPU exactly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add lowerswitch pass after kernelTranslator in common.mk to eliminate
  switch instructions that crash Vortex's UnifyLoopExits pass
- Add KERNEL_OPT variable (default -O3) to allow per-example override;
  srad_v2 uses -O0 to prevent CFGSimplification from recreating switches
- Expand lower_constant_expr to handle ConstantExpr in store value operand
  and atomicrmw pointer operand (needed for huffman LOCALMEM=0)
- Use per-instruction operand replacement instead of CE-wide user
  replacement to avoid cross-function domination errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce srad_v2 input to 16x16 (was 128x128) for CI timeout
- Generate golden_output.txt from CUDA GPU with matching args
- Revert KERNEL_OPT and lowerswitch workarounds in common.mk
  (fix is in Vortex LLVM: LowerSwitch moved before UnifyLoopExits)
- Revert common.mk Step 5 to original clang++ -O3 compilation

Requires Vortex LLVM patch: RISCVTargetMachine.cpp LowerSwitch
unconditionally + after CFGSimplification in divergence chain.

Tested: srad_v2 SCHE_0 PASSED (values match CUDA GPU golden)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA's __device__ qualifier puts global variables in addrspace(1), but
Vortex/RISC-V uses flat address space (0). Without conversion,
addrspacecast produces invalid pointers at runtime.

- New mem_device2global() in memory_hierarchy.cpp
- Filters out CUDA builtins (extern_weak), wrapper_global_ (from
  mem_share2global_sche_2), and declarations
- Called after mem_constant2global in init_block
- cc-cuda: addrspace(1) for topL/posL/topH/posH now converted
  (runtime error remains — separate issue)

Tested: dotproduct PASSED (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…var persistence

- mem_device2global: convert __device__ variables from addrspace(1) to 0
  Filters out CUDA builtins, wrapper_global_, and declarations
  Uses ExternalLinkage so wrapper can reference device symbols
  Writes DeviceVariables to kernel_meta.log for host runtime

- Fix kernel name matching for static __global__ functions
  C++ mangles static functions with L prefix (_ZL4init vs _ZL19init)
  Host and device use different name lengths after L
  Runtime now strips _Z, L, and digits before comparing (stripPrefix)

- Host-side device variable read-back after kernel execution
  Reads DeviceVariables from kernel_meta.log + lookup_global_symbols.txt
  After each kernel launch, reads device vars via vx_copy_from_dev
  Updates memcpy_symbol entries so next launch restores them after BSS zeroing

- lower_constant_expr: per-instruction operand replacement
- KernelTranslation.cpp: flush fout before generate_wrapper

Tested: dotproduct PASSED (no regression), cc-cuda kernel name matching fixed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
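The name normalization described above might look like the following sketch (stripPrefix's exact runtime behavior is assumed): drop the "_Z" prefix, an optional internal-linkage "L", and the length digits, so that _ZL4init and a differently-counted mangling of the same static kernel compare equal on the trailing identifier.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch of the stripPrefix normalization from the commit message:
// "_ZL4init" -> "init", "_Z4init" -> "init". Only the leading mangling
// components are stripped; the remaining identifier is compared as-is.
std::string strip_prefix(std::string s) {
    if (s.rfind("_Z", 0) == 0) s.erase(0, 2);          // drop "_Z"
    if (!s.empty() && s[0] == 'L') s.erase(0, 1);      // drop internal-linkage "L"
    std::size_t i = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
        ++i;                                            // skip the length digits
    return s.substr(i);
}
```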
…nation

- cfd: guard device_var read-back with if(device_vars_registered) to prevent
  reading back all memcpy_symbol entries when no device vars registered
- cc-cuda: complete __device__ variable persistence pipeline:
  - stripMangle for nm -C demangled name matching
  - vx_copy_from_dev via kernel buffer handle (BSS read-back)
  - Small test graph (16 nodes, 2 components) for CI
- score SCHE_0: add post-warp-loop domination fixup using DemoteRegToStack
  for SSA values broken by inter_warp loop restructuring
- generate_wrapper: fix null string crash when VORTEX_ARCHITECTURE unset
- init.cpp: revert lower_constant_expr to per-type (load/store/GEP/atomicrmw)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NUM_CLASSES=4, NUM_PRIORS=64, BATCH_SIZE=1 (was 16/256/4).
Score SCHE_0 still under investigation — warp shuffle scan
context save issue with multi-warp inter_warp loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- warp_func.cpp: emit cupbop.shfl.barrier instead of barrier0 for shfl
  emulation in FLAT mode. This prevents split_block_by_sync from breaking
  if-guard structure (e.g., if(threadIdx.x < WARP_SIZE)).
- insert_warp_loop.cpp: recognize cupbop.shfl.barrier in remove_barrier,
  convert to real barrier0 after parallel regions are determined.
  getParallelRegionBefore ignores cupbop.shfl.barrier so scan code stays
  in same region as if guard.
- huffman: CI input changed to test32k.in (was test_tiny_1k.in which
  caused pack2 grid=0, never executing the packing kernel).

Score SCHE_0 still under investigation — histogram produces different
results from SCHE_2 (bins[0]=38 vs 10), unrelated to shfl barrier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 participants