
Fix link to BenchmarkList in README.md#1

Open
zqb-all wants to merge 403 commits into cupbop:master from zqb-all:patch-1

Conversation


@zqb-all zqb-all commented Apr 7, 2026

No description provided.

parnenziniGT and others added 30 commits February 13, 2024 20:33
Due to the following warning messages:
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v2. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/

update build.yml to use Node.js 20
Since actions/checkout@v2 makes use of Node.js 16, we need to upgrade to actions/checkout@v3 instead (which uses Node.js 20). As mentioned here:
https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/
Node.js 16 is going to be deprecated in Spring 2024.
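The change amounts to bumping one action version in the workflow file; a minimal sketch of the relevant step in build.yml (surrounding step names are illustrative):

```yaml
steps:
  # Before: actions/checkout@v2 runs on Node.js 16 and triggers the
  # deprecation warning quoted above.
  - uses: actions/checkout@v3   # runs on Node.js 20
```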
chihyoA and others added 29 commits March 30, 2026 22:00
- CMakeLists.txt: forward LLVM_PREFIX, TOOLDIR, VORTEX_HOME,
  CuPBoP_PATH, RISCV_TOOLCHAIN to kernel Makefile so
  cudaKernelImpl_64.o is rebuilt during cmake build
- Remove pre-built .o files from git (they were stale and
  missing __nv_erfcf, causing blackscholes link failure)
- .o files already in .gitignore, will be ignored going forward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes ~60+ errs()/dbgs() prints that dump IR to stderr
during normal compilation. Keeps cupbop_debug() guarded prints
and actual error messages.

Files: performance.cpp, memory_hierarchy.cpp, insert_sync.cpp,
insert_warp_loop.cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures cudaKernelImpl_64.o is always rebuilt from source,
preventing stale .o files from causing link errors
(e.g. missing __nv_erfcf in blackscholes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
performance.cpp: wrap analysis registration/running prints
cg_sync.cpp: wrap CG Sync detection print

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI_RUN_ARGS=32 to avoid timeout (was 40min+ at 128).
Correctness check changed from golden diff to result file
existence check since golden was generated at size 128.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Golden generated from NVIDIA GPU with args "32 10".
Reverts CI correctness check back to golden diff comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cupbop_debug() is static in tool.cpp, not accessible from
cg_sync.cpp. Remove the debug print entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was static in tool.cpp, causing build errors in cg_sync.cpp
and performance.cpp which also call cupbop_debug().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now using shared inline definition from tool.h.
Fixes redefinition error in generate_wrapper.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous CI runs leave Perf_counter files that don't have
matching ci_status files for the current run, showing as ❓.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes ci_status, ci_output, Perf_counter, result_vortex files
from all examples before building, so summary only shows
current run results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tions

Indirect calls (e.g. llvm.lifetime.end) have null getCalledFunction().
Using getCalledOperand()->getName() on these crashes with
"dyn_cast on a non-existent value".

Fixed in:
- tool.cpp: InlineAsm null check (line 1057), has_warp_barrier,
  has_barrier, has_block_barrier
- insert_warp_loop.cpp: getParallelRegionBefore (line 1296),
  runOnFunction (line 1498)
- handle_sync.cpp: split_block_by_sync (line 37)
- insert_sync.cpp: InsertConditionalForBarrier (line 474)

All use getCalledFunction() instead of getCalledOperand() and
add !Call->getCalledFunction() null check.

Fixes wedford-cuda, score-cuda, marchingCubes-cuda SCHE_0 crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
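The guarded pattern described above can be sketched with stand-in types (`StubCall` and `is_barrier_call` are illustrative, not CuPBoP's actual API): the point is to test the result of `getCalledFunction()` before using the callee's name, since an indirect call has no called `Function`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Stand-in for LLVM's CallInst: a direct call has a resolvable callee name,
// an indirect call (getCalledFunction() == nullptr) does not.
struct StubCall {
    std::optional<std::string> callee;
};

// Mirror of the fixed pattern: null-check the callee before reading its
// name, and treat indirect calls as "not a barrier" instead of crashing.
bool is_barrier_call(const StubCall &call) {
    if (!call.callee)   // indirect call, e.g. through llvm.lifetime.end operand
        return false;   // skip it rather than dereference a non-existent value
    return *call.callee == "llvm.nvvm.barrier0";
}
```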
…shes

- Add warp_shfl/warp_vote/vote_count definitions to generated kernel_wrapper.cpp
  for SCHE_0/1 (Vortex kernel had no definition, only CPU threadPool runtime did)
- Set setDSOLocal(true) on warp_shfl/warp_vote/vote_count in init.cpp (required
  for RISC-V TLS, matching other TLS globals)
- Fix sche_data.inner_loop_cond null crash: initialize inner loop variables in
  add_mapping_variable when need_nested_loop && intra_warp_loop
- Fix last 2 getCalledOperand()->getName() calls in insert_warp_loop.cpp
  (lines 1212, 1640) with null-check + getCalledFunction()

Tested: wedford-cuda, score-cuda, bh-cuda SCHE_0 all PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…er detection

- tool.cpp: Fix threadIdx.y warp size constant 4→32, add %block_size_y for 3D
  blocks, implement threadIdx.z = flat_tid / (block_size_x * block_size_y)
- generate_wrapper.cpp: Include block_size_z in block_size calculation (x*y*z)
- warp_func.cpp: Add shfl idx and xor variants to replaceWarpShflFlat,
  detect C++ mangled __shfl*_sync functions in FLAT mode (declaration-only
  wrappers like _Z16__shfl_down_syncifii), remove dead shfl wrapper functions
  after FLAT replacement to prevent RISC-V backend crash
- marchingCubes-cuda: Reduce countingThreadNumLv2 1024→256 (Vortex HW limit 512)

Tested: wedford/score/bh SCHE_0 PASS, jacobi SCHE_0 builds with warp_shfl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
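The 3-D index recovery described above can be illustrated in plain C++. Only the threadIdx.z formula is stated in the commit; the x and y formulas below are assumed from the standard CUDA row-major flattening convention.

```cpp
#include <cassert>

struct Tid { unsigned x, y, z; };

// Recover threadIdx.{x,y,z} from a flat thread id for a block of
// dimensions (bx, by, bz):
//   x = flat % bx
//   y = (flat / bx) % by
//   z = flat / (bx * by)   <- the formula named in the commit
Tid unflatten(unsigned flat, unsigned bx, unsigned by) {
    return Tid{ flat % bx, (flat / bx) % by, flat / (bx * by) };
}
```

The matching total block size used by the wrapper is then bx * by * bz, which is the generate_wrapper.cpp change noted above.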
…n errors)

The hasCppWrappers logic in replaceWarpShfl caused:
- score-cuda: replaceAllUses type mismatch assertion
- marchingCubes/jaccard: instruction domination errors
- srad_v2: LLVM UnifyLoopExits assertion

Revert to original NVVM-intrinsic-only detection path. C++ wrapper
calls (_Z16__shfl_down_sync etc) will be resolved at link time by
cudaKernelImpl. Keep shfl idx/xor variants and isShflCall lambda
for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add has_warp_shfl_usage() to detect warp_shfl global users and activate
  nested loop (intra + inter warp) for kernels using shuffle emulation
- Skip llvm.lifetime intrinsics in alloca divergence and non-PR checks
  (only when nested loop is active, to avoid regression on bh/backprop)
- Prevent double context save: skip allocas with _intra_warp_/_inter_warp_
  suffix in all three AddContextSaveRestore call sites
- Guard isDivergenceSource intra_warp_index check with need_nested_loop
  to avoid unnecessary divergence propagation in single-loop kernels
- Add inline_shfl_helpers() for transitive inlining of helper functions
  containing shfl calls, with barrier0 insertion after each inlined call
- Re-run inline_func_with_tid + remove_cuda_built_in after shfl helpers
- Skip shfl inline in inline_warp_level_func for SCHE_0 (handled by FLAT)
- Fix warp_shfl GEP: use builder.CreateGEP/CreateLoad directly to prevent
  constant folding that caused type mismatch in score-cuda
- Use barrier0 (CreateInterWarpBarrier) in replaceWarpShflFlat for proper
  parallel region construction

Tested: wedford block_y=1 SCHE_0 PASSED, dotproduct PASSED, backprop PASSED

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jaccard outputs PASSED for weighted case and FAIL for unweighted case.
grep -q "PASSED" matched the first, ignoring the FAIL. Add ! grep "^FAIL"
check so any FAIL in ci_output.txt causes CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
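The tightened check — require a PASSED somewhere and reject any line beginning with FAIL — can be sketched as a predicate over the output (`ci_passed` is an illustrative name; the actual check is the pair of greps quoted above):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True only if some line contains "PASSED" and no line starts with "FAIL",
// mirroring: grep -q "PASSED" ci_output.txt && ! grep -q "^FAIL" ci_output.txt
bool ci_passed(const std::string &output) {
    std::istringstream in(output);
    std::string line;
    bool saw_passed = false;
    while (std::getline(in, line)) {
        if (line.rfind("FAIL", 0) == 0)   // any leading-FAIL line is fatal
            return false;
        if (line.find("PASSED") != std::string::npos)
            saw_passed = true;
    }
    return saw_passed;
}
```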
- Guard inline_shfl_helpers with SCHE_0 check — it was running for SCHE_2
  too, causing timeout regressions (wedford SCHE_2: 1min → timeout)
- Restrict FLAT shfl replacement to kernel functions only — helper functions
  like parallel_prefix_sum had their shfl replaced but weren't wrapped by
  warp loop, causing domination errors
- Fix shfl operand domination: always save shfl_offset to alloca before
  barrier, load after barrier. split_block_by_sync splits at barrier0,
  separating operand definition from use. Save result to alloca with
  per-user loads for cross-BB result users.
- Skip shfl_* allocas in insert_warp_loop context save — these are shfl
  emulation temporaries, not user data. Converting them to per-thread
  arrays breaks shfl emulation.

Tested: jaccard SCHE_0 builds, wedford SCHE_2 PASSED, dotproduct SCHE_0 PASSED

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce JACOBI_N=16 MAX_ITERS=10 (was 32/100) to fit CI timeout
- Switch CI verification to golden comparison (was PASSED grep)
- Generate golden from CUDA GPU with same settings
- jacobi SCHE_2 PASSED (Error=0.0855817 matches GPU exactly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add lowerswitch pass after kernelTranslator in common.mk to eliminate
  switch instructions that crash Vortex's UnifyLoopExits pass
- Add KERNEL_OPT variable (default -O3) to allow per-example override;
  srad_v2 uses -O0 to prevent CFGSimplification from recreating switches
- Expand lower_constant_expr to handle ConstantExpr in store value operand
  and atomicrmw pointer operand (needed for huffman LOCALMEM=0)
- Use per-instruction operand replacement instead of CE-wide user
  replacement to avoid cross-function domination errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce srad_v2 input to 16x16 (was 128x128) for CI timeout
- Generate golden_output.txt from CUDA GPU with matching args
- Revert KERNEL_OPT and lowerswitch workarounds in common.mk
  (fix is in Vortex LLVM: LowerSwitch moved before UnifyLoopExits)
- Revert common.mk Step 5 to original clang++ -O3 compilation

Requires Vortex LLVM patch: RISCVTargetMachine.cpp LowerSwitch
unconditionally + after CFGSimplification in divergence chain.

Tested: srad_v2 SCHE_0 PASSED (values match CUDA GPU golden)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA's __device__ qualifier puts global variables in addrspace(1), but
Vortex/RISC-V uses flat address space (0). Without conversion,
addrspacecast produces invalid pointers at runtime.

- New mem_device2global() in memory_hierarchy.cpp
- Filters out CUDA builtins (extern_weak), wrapper_global_ (from
  mem_share2global_sche_2), and declarations
- Called after mem_constant2global in init_block
- cc-cuda: addrspace(1) for topL/posL/topH/posH now converted
  (runtime error remains — separate issue)

Tested: dotproduct PASSED (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…var persistence

- mem_device2global: convert __device__ variables from addrspace(1) to 0
  Filters out CUDA builtins, wrapper_global_, and declarations
  Uses ExternalLinkage so wrapper can reference device symbols
  Writes DeviceVariables to kernel_meta.log for host runtime

- Fix kernel name matching for static __global__ functions
  C++ mangles static functions with L prefix (_ZL4init vs _ZL19init)
  Host and device use different name lengths after L
  Runtime now strips _Z, L, and digits before comparing (stripPrefix)

- Host-side device variable read-back after kernel execution
  Reads DeviceVariables from kernel_meta.log + lookup_global_symbols.txt
  After each kernel launch, reads device vars via vx_copy_from_dev
  Updates memcpy_symbol entries so next launch restores them after BSS zeroing

- lower_constant_expr: per-instruction operand replacement
- KernelTranslation.cpp: flush fout before generate_wrapper

Tested: dotproduct PASSED (no regression), cc-cuda kernel name matching fixed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
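The name normalization described above might look like the following sketch (stripPrefix's exact runtime behavior is assumed): drop the "_Z" prefix, an optional internal-linkage "L", and the length digits, so that _ZL4init and a differently-counted mangling of the same static kernel compare equal on the trailing identifier.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch of the stripPrefix normalization from the commit message:
// "_ZL4init" -> "init", "_Z4init" -> "init". Only the leading mangling
// components are stripped; the remaining identifier is compared as-is.
std::string strip_prefix(std::string s) {
    if (s.rfind("_Z", 0) == 0) s.erase(0, 2);          // drop "_Z"
    if (!s.empty() && s[0] == 'L') s.erase(0, 1);      // drop internal-linkage "L"
    std::size_t i = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
        ++i;                                            // skip the length digits
    return s.substr(i);
}
```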
…nation

- cfd: guard device_var read-back with if(device_vars_registered) to prevent
  reading back all memcpy_symbol entries when no device vars registered
- cc-cuda: complete __device__ variable persistence pipeline:
  - stripMangle for nm -C demangled name matching
  - vx_copy_from_dev via kernel buffer handle (BSS read-back)
  - Small test graph (16 nodes, 2 components) for CI
- score SCHE_0: add post-warp-loop domination fixup using DemoteRegToStack
  for SSA values broken by inter_warp loop restructuring
- generate_wrapper: fix null string crash when VORTEX_ARCHITECTURE unset
- init.cpp: revert lower_constant_expr to per-type (load/store/GEP/atomicrmw)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NUM_CLASSES=4, NUM_PRIORS=64, BATCH_SIZE=1 (was 16/256/4).
Score SCHE_0 still under investigation — warp shuffle scan
context save issue with multi-warp inter_warp loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- warp_func.cpp: emit cupbop.shfl.barrier instead of barrier0 for shfl
  emulation in FLAT mode. This prevents split_block_by_sync from breaking
  if-guard structure (e.g., if(threadIdx.x < WARP_SIZE)).
- insert_warp_loop.cpp: recognize cupbop.shfl.barrier in remove_barrier,
  convert to real barrier0 after parallel regions are determined.
  getParallelRegionBefore ignores cupbop.shfl.barrier so scan code stays
  in same region as if guard.
- huffman: CI input changed to test32k.in (was test_tiny_1k.in which
  caused pack2 grid=0, never executing the packing kernel).

Score SCHE_0 still under investigation — histogram produces different
results from SCHE_2 (bins[0]=38 vs 10), unrelated to shfl barrier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 participants