
[WIP] Guard host managed-memory access on concurrentManagedAccess=0 #1769

Closed
rwgk wants to merge 2 commits into NVIDIA:main from rwgk:guard_host_managed-memory_access_on_CMA_zero

Conversation

Contributor

rwgk commented Mar 16, 2026

xref: #1576 (comment)

This PR:

  1. Guards host managed-memory access on CMA=0
    Adds a small helper (in helpers/buffers.py) that calls Device.sync() (or
    otherwise ensures no work is in flight) before any host memset/memcmp of
    managed memory when concurrentManagedAccess == 0. This is targeted and
    keeps behavior unchanged on CMA=1 systems.

Guard host-side memset/memcmp in test helpers on CMA=0 by syncing the
device before touching managed allocations.
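The guard can be sketched roughly as follows. This is a hypothetical illustration, not the actual helpers/buffers.py code; the `device` object is assumed (duck-typed) to expose a `properties.concurrent_managed_access` attribute and a `sync()` method, as cuda.core's Device does.

```python
# Hypothetical sketch of the CMA=0 guard described above -- NOT the actual
# helpers/buffers.py implementation. `device` is assumed (duck-typed) to
# expose `properties.concurrent_managed_access` and `sync()`.


def guard_host_managed_access(device):
    """Make host memset/memcmp of managed memory safe on CMA=0 devices.

    When concurrentManagedAccess == 0, the host must not touch managed
    allocations while GPU work may still be in flight, so synchronize
    first. On CMA=1 systems this is a no-op, leaving behavior unchanged.
    """
    if not device.properties.concurrent_managed_access:
        device.sync()


if __name__ == "__main__":
    # Duck-typed stand-ins so the sketch runs without a GPU.
    class _Props:
        def __init__(self, cma):
            self.concurrent_managed_access = cma

    class _Device:
        def __init__(self, cma):
            self.properties = _Props(cma)
            self.synced = False

        def sync(self):
            self.synced = True

    cma0 = _Device(cma=0)
    guard_host_managed_access(cma0)  # CMA=0: syncs before host access
    cma1 = _Device(cma=1)
    guard_host_managed_access(cma1)  # CMA=1: no-op
    print(cma0.synced, cma1.synced)
```

Call sites in the test helpers would invoke this immediately before any host-side memset/memcmp of a managed buffer.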

Made-with: Cursor
Contributor

copy-pr-bot Bot commented Mar 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Contributor Author

rwgk commented Mar 16, 2026

There are no flakes in 100 trials with this PR at commit b611a87:

smc120-0009.ipp2a2.colossus.nvidia.com:/home/scratch.rgrossekunst_sw/logs_mirror/rdc-gitbash/logs/qa_tests_multi_pr1769_commit_001_b611a870 $ analyze_qa_tests_logs.py trial*log.txt
================================================================================
QA Test Logs Analysis Summary
================================================================================

Total files analyzed: 100
Files with no flakes (all passed): 100
Files with failures: 0
Files with errors: 0
Files with crashes: 0

✓ All files have no flakes, errors, or crashes - all tests passed!

================================================================================
Overall Statistics
================================================================================

Total tests passed (across all files): 325916
Total tests failed (across all files): 0
Total tests skipped (across all files): 28084
Total test errors (across all files): 0

================================================================================
SKIPPED Summary
================================================================================

   800  SKIPPED [1] tests\test_nvfatbin.py:304 - nvcc found on PATH but failed to compile a trivial input.
   600  SKIPPED [1] tests\example_tests\utils.py:43: skip C - \Users\rgrossekunst\wrk\forked\cuda-python\cuda_core\tests\example_tests\..\..\examples\strided_memory_view_cpu.py
   200  SKIPPED [1] tests\example_tests\utils.py:37 - torch not installed, skipping related tests
   200  SKIPPED [1] tests\nvml\test_compute_mode.py:20 - Test not supported on Windows
   200  SKIPPED [1] tests\nvml\test_device.py:148 - No permission to set power management limit
   200  SKIPPED [1] tests\nvml\test_device.py:165 - No permission to set temperature threshold
   200  SKIPPED [1] tests\nvml\test_init.py:38 - Test not supported on Windows
   200  SKIPPED [1] tests\nvml\test_page_retirement.py:47 - device doesn't support ECC for NNNNNNNNNNNNNNN
   200  SKIPPED [1] tests\nvml\test_page_retirement.py:75 - page_retirement not supported for NNNNNNNNNNNNNNN
   200  SKIPPED [1] tests\nvml\test_pynvml.py:53 - No MIG devices found
   200  SKIPPED [1] tests\test_cufile.py:19: could not import 'cuda.bindings.cufile' - No module named 'cuda.bindings.cufile'
   200  SKIPPED [2] tests\nvml\test_pynvml.py:66 - Not supported on WSL or Windows
   200  SKIPPED [2] tests\nvml\test_pynvml.py:77 - Not supported on WSL or Windows
   200  SKIPPED [6] tests\example_tests\utils.py:37 - cupy not installed, skipping related tests
   200  SKIPPED [9] cuda\bindings\_test_helpers\arch_check.py:55 - Unsupported call for device architecture AMPERE on device 'NVIDIA RTX ANNNN'
   100  SKIPPED (Two or more
   100  SKIPPED [18] tests\test_utils.py:486: could not import 'cupy' - No module named 'cupy'
   100  SKIPPED [1] examples\0_Introduction\simpleP2P_test.py:48 - Two or more GPUs with Peer-to-Peer access capability are required
   100  SKIPPED [1] examples\0_Introduction\systemWideAtomics_test.py:172 - Atomics not supported on Windows
   100  SKIPPED [1] tests\graph\test_device_launch.py:133 - Device-side graph launch requires Hopper (sm_90+) architecture
   100  SKIPPED [1] tests\graph\test_device_launch.py:81 - Device-side graph launch requires Hopper (sm_90+) architecture
   100  SKIPPED [1] tests\memory_ipc\test_event_ipc.py:98 - Device does not support IPC
   100  SKIPPED [1] tests\memory_ipc\test_peer_access.py:23 - Test requires at least N GPUs
   100  SKIPPED [1] tests\system\test_nvml_context.py:54 - Probably a non-WSL system
   100  SKIPPED [1] tests\system\test_system_device.py:107 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:263 - Events not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:313 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:343 - Persistence mode not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:407 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:462 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:477 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:492 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:502 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:97 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_events.py:16 - System events not supported on WSL or Windows
   100  SKIPPED [1] tests\test_device.py:367 - Test requires at least 2 CUDA devices
   100  SKIPPED [1] tests\test_device.py:415 - Test requires at least 2 CUDA devices
   100  SKIPPED [1] tests\test_launcher.py:123 - Driver or GPU not new enough for thread block clusters
   100  SKIPPED [1] tests\test_launcher.py:93 - Driver or GPU not new enough for thread block clusters
   100  SKIPPED [1] tests\test_linker.py:114 - nvjitlink requires lto for ptx linking
   100  SKIPPED [1] tests\test_linker.py:204 - driver backend test
   100  SKIPPED [1] tests\test_load_nvidia_dynamic_lib_using_mocker.py:105 - Windows support for cupti not yet implemented
   100  SKIPPED [1] tests\test_load_nvidia_dynamic_lib_using_mocker.py:137 - Windows support for cupti not yet implemented
   100  SKIPPED [1] tests\test_load_nvidia_dynamic_lib_using_mocker.py:58 - Windows support for cupti not yet implemented
   100  SKIPPED [1] tests\test_memory.py:1188 - IPC not implemented for Windows
   100  SKIPPED [1] tests\test_memory.py:1272 - IPC not implemented for Windows
   100  SKIPPED [1] tests\test_memory.py:1299 - IPC not implemented for Windows
   100  SKIPPED [1] tests\test_memory.py:1449 - Driver rejects IPC-enabled mempool creation on this platform
   100  SKIPPED [1] tests\test_memory.py:864 - This test requires a device that doesn't support GPU Direct RDMA
   100  SKIPPED [1] tests\test_memory_peer_access.py:14 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_memory_peer_access.py:147 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_memory_peer_access.py:51 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_memory_peer_access.py:84 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_module.py:405 - Device with compute capability 90 or higher is required for cluster support
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:101 - Device does not support IPC
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:124 - Device does not support IPC
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:22 - Device does not support IPC
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:49 - Device does not support IPC
   100  SKIPPED [1] tests\test_program.py:211 - device_float128 requires sm_100 or later
   100  SKIPPED [1] tests\test_utils.py - CuPy is not installed
   100  SKIPPED [1] tests\test_utils.py:220 - CuPy is not installed
   100  SKIPPED [1] tests\test_utils.py:510: could not import 'cupy' - No module named 'cupy'
   100  SKIPPED [1] tests\test_utils.py:618 - CuPy is not installed
   100  SKIPPED [1] tests\test_utils_env_vars.py:135 - Exercising symlinks intentionally omitted for simplicity
   100  SKIPPED [1] tests\test_utils_env_vars.py:173 - Exercising symlinks intentionally omitted for simplicity
   100  SKIPPED [24] tests\memory_ipc\test_leaks.py:82 - mempool allocation handle is not using fds or psutil is unavailable
   100  SKIPPED [27] tests\conftest.py:57 - Device does not support managed memory pool operations
   100  SKIPPED [2] tests\memory_ipc\test_event_ipc.py:114 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_event_ipc.py:21 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_ipc_duplicate_import.py:64 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_leaks.py:26 - mempool allocation handle is not using fds or psutil is unavailable
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:111 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:162 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:18 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:60 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_peer_access.py:62 - Test requires at least N GPUs
   100  SKIPPED [2] tests\memory_ipc\test_send_buffers.py:20 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_send_buffers.py:72 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_serialize.py:138 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_serialize.py:26 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_serialize.py:82 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_workerpool.py:112 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_workerpool.py:30 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_workerpool.py:67 - Device does not support IPC
   100  SKIPPED [2] tests\test_launcher.py:285 - cupy not installed
   100  SKIPPED [2] tests\test_module.py:390 - Device with compute capability 90 or higher is required for cluster support
   100  SKIPPED [2] tests\test_object_protocols.py:317 - requires multi-GPU
   100  SKIPPED [2] tests\test_object_protocols.py:357 - requires multi-GPU
   100  SKIPPED [2] tests\test_utils.py:453 - CuPy is not installed
   100  SKIPPED [2] tests\test_utils.py:634 - CuPy is not installed
   100  SKIPPED [2] tests\test_utils.py:665 - PyTorch is not installed
   100  SKIPPED [2] tests\test_utils.py:702 - CuPy is not installed
   100  SKIPPED [3] tests\graph\test_capture_alloc.py:149 - auto_free_on_launch not supported on Windows
   100  SKIPPED [3] tests\test_utils.py - got empty parameter set for (in_arr, use_stream)
   100  SKIPPED [4] tests\test_utils.py:416 - CuPy is not installed
   100  SKIPPED [6] ..\cuda_bindings\cuda\bindings\_test_helpers\arch_check.py:55 - Unsupported call for device architecture AMPERE on device 'NVIDIA RTX ANNNN'
   100  SKIPPED [7] tests\test_module.py:346 - Test requires numba to be installed
   100  SKIPPED [8] tests\memory_ipc\test_errors.py:22 - Device does not support IPC
   100  SKIPPED [8] tests\memory_ipc\test_event_ipc.py:132 - Device does not support IPC
    73  SKIPPED [1] tests\test_graphics.py:62: Could not create GL context/buffer: TypeError - Argument 'itemsize' has incorrect type (expected int, got getset_descriptor)
    30  SKIPPED [1] tests\test_graphics.py:126: Could not create GL context/texture: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     7  SKIPPED [3] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     6  SKIPPED [12] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     6  SKIPPED [4] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     5  SKIPPED [5] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     4  SKIPPED [15] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     4  SKIPPED [2] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     4  SKIPPED [7] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     4  SKIPPED [8] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     3  SKIPPED [1] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     3  SKIPPED [6] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     2  SKIPPED [11] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     2  SKIPPED [13] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     2  SKIPPED [9] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     1  SKIPPED [10] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.
     1  SKIPPED [14] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using : - cuCtxFromGreenCtx API.

Additional sanity check: grep -a OSError *.txt (no output)

Contributor Author

rwgk commented Mar 16, 2026

Surprise: There are also no flakes with main at commit 3ed5217 (what this PR is based on):

smc120-0009.ipp2a2.colossus.nvidia.com:/home/scratch.rgrossekunst_sw/logs_mirror/rdc-gitbash/logs/qa_tests_multi_cuda-python_main_at_3ed52171 $ analyze_qa_tests_logs.py trial*log.txt
================================================================================
QA Test Logs Analysis Summary
================================================================================

Total files analyzed: 100
Files with no flakes (all passed): 100
Files with failures: 0
Files with errors: 0
Files with crashes: 0

✓ All files have no flakes, errors, or crashes - all tests passed!

================================================================================
Overall Statistics
================================================================================

Total tests passed (across all files): 325987
Total tests failed (across all files): 0
Total tests skipped (across all files): 28015
Total test errors (across all files): 0

================================================================================
SKIPPED Summary
================================================================================

   800  SKIPPED [1] tests\test_nvfatbin.py:304 - nvcc found on PATH but failed to compile a trivial input.
   600  SKIPPED [1] tests\example_tests\utils.py:43: skip C - \Users\rgrossekunst\wrk\forked\cuda-python\cuda_core\tests\example_tests\..\..\examples\thread_block_cluster.py
   200  SKIPPED [1] tests\example_tests\utils.py:37 - torch not installed, skipping related tests
   200  SKIPPED [1] tests\nvml\test_compute_mode.py:20 - Test not supported on Windows
   200  SKIPPED [1] tests\nvml\test_device.py:148 - No permission to set power management limit
   200  SKIPPED [1] tests\nvml\test_device.py:165 - No permission to set temperature threshold
   200  SKIPPED [1] tests\nvml\test_init.py:38 - Test not supported on Windows
   200  SKIPPED [1] tests\nvml\test_page_retirement.py:47 - device doesn't support ECC for NNNNNNNNNNNNNNN
   200  SKIPPED [1] tests\nvml\test_page_retirement.py:75 - page_retirement not supported for NNNNNNNNNNNNNNN
   200  SKIPPED [1] tests\nvml\test_pynvml.py:53 - No MIG devices found
   200  SKIPPED [1] tests\test_cufile.py:19: could not import 'cuda.bindings.cufile' - No module named 'cuda.bindings.cufile'
   200  SKIPPED [2] tests\nvml\test_pynvml.py:66 - Not supported on WSL or Windows
   200  SKIPPED [2] tests\nvml\test_pynvml.py:77 - Not supported on WSL or Windows
   200  SKIPPED [6] tests\example_tests\utils.py:37 - cupy not installed, skipping related tests
   200  SKIPPED [9] cuda\bindings\_test_helpers\arch_check.py:55 - Unsupported call for device architecture AMPERE on device 'NVIDIA RTX ANNNN'
   100  SKIPPED (Two or more
   100  SKIPPED [18] tests\test_utils.py:486: could not import 'cupy' - No module named 'cupy'
   100  SKIPPED [1] examples\0_Introduction\simpleP2P_test.py:48 - Two or more GPUs with Peer-to-Peer access capability are required
   100  SKIPPED [1] examples\0_Introduction\systemWideAtomics_test.py:172 - Atomics not supported on Windows
   100  SKIPPED [1] tests\graph\test_device_launch.py:133 - Device-side graph launch requires Hopper (sm_90+) architecture
   100  SKIPPED [1] tests\graph\test_device_launch.py:81 - Device-side graph launch requires Hopper (sm_90+) architecture
   100  SKIPPED [1] tests\memory_ipc\test_event_ipc.py:98 - Device does not support IPC
   100  SKIPPED [1] tests\memory_ipc\test_peer_access.py:23 - Test requires at least N GPUs
   100  SKIPPED [1] tests\system\test_nvml_context.py:54 - Probably a non-WSL system
   100  SKIPPED [1] tests\system\test_system_device.py:107 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:263 - Events not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:313 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:343 - Persistence mode not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:407 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:462 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:477 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:492 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:502 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_device.py:97 - Device attributes not supported on WSL or Windows
   100  SKIPPED [1] tests\system\test_system_events.py:16 - System events not supported on WSL or Windows
   100  SKIPPED [1] tests\test_device.py:367 - Test requires at least 2 CUDA devices
   100  SKIPPED [1] tests\test_device.py:415 - Test requires at least 2 CUDA devices
   100  SKIPPED [1] tests\test_launcher.py:123 - Driver or GPU not new enough for thread block clusters
   100  SKIPPED [1] tests\test_launcher.py:93 - Driver or GPU not new enough for thread block clusters
   100  SKIPPED [1] tests\test_linker.py:114 - nvjitlink requires lto for ptx linking
   100  SKIPPED [1] tests\test_linker.py:204 - driver backend test
   100  SKIPPED [1] tests\test_load_nvidia_dynamic_lib_using_mocker.py:105 - Windows support for cupti not yet implemented
   100  SKIPPED [1] tests\test_load_nvidia_dynamic_lib_using_mocker.py:137 - Windows support for cupti not yet implemented
   100  SKIPPED [1] tests\test_load_nvidia_dynamic_lib_using_mocker.py:58 - Windows support for cupti not yet implemented
   100  SKIPPED [1] tests\test_memory.py:1188 - IPC not implemented for Windows
   100  SKIPPED [1] tests\test_memory.py:1272 - IPC not implemented for Windows
   100  SKIPPED [1] tests\test_memory.py:1299 - IPC not implemented for Windows
   100  SKIPPED [1] tests\test_memory.py:1449 - Driver rejects IPC-enabled mempool creation on this platform
   100  SKIPPED [1] tests\test_memory.py:864 - This test requires a device that doesn't support GPU Direct RDMA
   100  SKIPPED [1] tests\test_memory_peer_access.py:14 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_memory_peer_access.py:147 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_memory_peer_access.py:51 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_memory_peer_access.py:84 - Test requires at least N GPUs
   100  SKIPPED [1] tests\test_module.py:405 - Device with compute capability 90 or higher is required for cluster support
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:101 - Device does not support IPC
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:124 - Device does not support IPC
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:22 - Device does not support IPC
   100  SKIPPED [1] tests\test_multiprocessing_warning.py:49 - Device does not support IPC
   100  SKIPPED [1] tests\test_program.py:211 - device_float128 requires sm_100 or later
   100  SKIPPED [1] tests\test_utils.py - CuPy is not installed
   100  SKIPPED [1] tests\test_utils.py:220 - CuPy is not installed
   100  SKIPPED [1] tests\test_utils.py:510: could not import 'cupy' - No module named 'cupy'
   100  SKIPPED [1] tests\test_utils.py:618 - CuPy is not installed
   100  SKIPPED [1] tests\test_utils_env_vars.py:135 - Exercising symlinks intentionally omitted for simplicity
   100  SKIPPED [1] tests\test_utils_env_vars.py:173 - Exercising symlinks intentionally omitted for simplicity
   100  SKIPPED [24] tests\memory_ipc\test_leaks.py:82 - mempool allocation handle is not using fds or psutil is unavailable
   100  SKIPPED [27] tests\conftest.py:57 - Device does not support managed memory pool operations
   100  SKIPPED [2] tests\memory_ipc\test_event_ipc.py:114 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_event_ipc.py:21 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_ipc_duplicate_import.py:64 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_leaks.py:26 - mempool allocation handle is not using fds or psutil is unavailable
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:111 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:162 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:18 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_memory_ipc.py:60 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_peer_access.py:62 - Test requires at least N GPUs
   100  SKIPPED [2] tests\memory_ipc\test_send_buffers.py:20 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_send_buffers.py:72 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_serialize.py:138 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_serialize.py:26 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_serialize.py:82 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_workerpool.py:112 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_workerpool.py:30 - Device does not support IPC
   100  SKIPPED [2] tests\memory_ipc\test_workerpool.py:67 - Device does not support IPC
   100  SKIPPED [2] tests\test_launcher.py:285 - cupy not installed
   100  SKIPPED [2] tests\test_module.py:390 - Device with compute capability 90 or higher is required for cluster support
   100  SKIPPED [2] tests\test_object_protocols.py:317 - requires multi-GPU
   100  SKIPPED [2] tests\test_object_protocols.py:357 - requires multi-GPU
   100  SKIPPED [2] tests\test_utils.py:453 - CuPy is not installed
   100  SKIPPED [2] tests\test_utils.py:634 - CuPy is not installed
   100  SKIPPED [2] tests\test_utils.py:665 - PyTorch is not installed
   100  SKIPPED [2] tests\test_utils.py:702 - CuPy is not installed
   100  SKIPPED [3] tests\graph\test_capture_alloc.py:149 - auto_free_on_launch not supported on Windows
   100  SKIPPED [3] tests\test_utils.py - got empty parameter set for (in_arr, use_stream)
   100  SKIPPED [4] tests\test_utils.py:416 - CuPy is not installed
   100  SKIPPED [6] ..\cuda_bindings\cuda\bindings\_test_helpers\arch_check.py:55 - Unsupported call for device architecture AMPERE on device 'NVIDIA RTX ANNNN'
   100  SKIPPED [7] tests\test_module.py:346 - Test requires numba to be installed
   100  SKIPPED [8] tests\memory_ipc\test_errors.py:22 - Device does not support IPC
   100  SKIPPED [8] tests\memory_ipc\test_event_ipc.py:132 - Device does not support IPC
    73  SKIPPED [1] tests\test_graphics.py:62: Could not create GL context/buffer: TypeError - Argument 'itemsize' has incorrect type (expected int, got getset_descriptor)
     22  SKIPPED [1] tests\test_graphics.py:126: Could not create GL context/texture: CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using ::cuCtxFromGreenCtx API.
      7  SKIPPED [9] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      6  SKIPPED [4] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      4  SKIPPED [1] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      4  SKIPPED [2] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      4  SKIPPED [7] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      3  SKIPPED [10] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      3  SKIPPED [11] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      3  SKIPPED [12] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      3  SKIPPED [3] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      3  SKIPPED [8] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      2  SKIPPED [13] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      1  SKIPPED [14] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      1  SKIPPED [15] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)
      1  SKIPPED [6] tests\test_graphics.py:62: Could not create GL context/buffer: CUDAError: CUDA_ERROR_INVALID_CONTEXT (same full message as above)

Additional sanity check: grep -a OSError *.txt (no output)

@rwgk
Contributor Author

rwgk commented Mar 16, 2026

Note:

I did not rebuild between the two sets of test runs reported under the log directories below; everything was exactly identical except for the presence/absence of commit b611a87. This is reflected in all log files, e.g.:

/home/scratch.rgrossekunst_sw/logs_mirror/rdc-gitbash/logs/qa_tests_multi_pr1769_commit_001_b611a870/trial_001_2026-03-16+105613_log.txt

C:\Users\rgrossekunst\wrk\forked\cuda-python>git --no-pager log -n 1
commit b611a8705cded7a6b83ad9fb518198fb71503fcd
Author: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>
Date:   Mon Mar 16 10:36:36 2026 -0700

    Sync device before host access to managed buffers

    Guard host-side memset/memcmp in test helpers on CMA=0 by syncing the
    device before touching managed allocations.

    Made-with: Cursor

C:\Users\rgrossekunst\wrk\forked\cuda-python>git --no-pager status
On branch guard_host_managed-memory_access_on_CMA_zero
Your branch is up to date with 'origin/guard_host_managed-memory_access_on_CMA_zero'.

nothing to commit, working tree clean

/home/scratch.rgrossekunst_sw/logs_mirror/rdc-gitbash/logs/qa_tests_multi_cuda-python_main_at_3ed52171/trial_001_2026-03-16+131641_log.txt

C:\Users\rgrossekunst\wrk\forked\cuda-python>git --no-pager log -n 1
commit 3ed52171d2fdbbec08a393c00e1d7ae7e5b16d7d
Author: Andy Jost <ajost@nvidia.com>
Date:   Mon Mar 16 08:39:35 2026 -0700

    Infrastructure changes preparing for explicit graph construction (#1762)

C:\Users\rgrossekunst\wrk\forked\cuda-python>git --no-pager status
On branch main
Your branch is up to date with 'upstream/main'.

nothing to commit, working tree clean

@rwgk
Contributor Author

rwgk commented Mar 16, 2026

I don't know what changed, but I cannot reproduce the flakes anymore. All details are in the log files under /home/scratch.rgrossekunst_sw/logs_mirror/rdc-gitbash/logs. (See cuda-python-private issues 235 and 245 for pointers to log files with flakes.)

Closing this PR and #1576 for now. If we see the flakes again later, we can come back here.

@rwgk rwgk closed this Mar 16, 2026
@rwgk rwgk deleted the guard_host_managed-memory_access_on_CMA_zero branch March 16, 2026 22:30
@rwgk rwgk restored the guard_host_managed-memory_access_on_CMA_zero branch April 15, 2026 17:52
@rwgk rwgk reopened this Apr 15, 2026
@rwgk rwgk self-assigned this Apr 15, 2026
@rwgk rwgk added bug Something isn't working P0 High priority - Must do! test Improvements or additions to tests cuda.core Everything related to the cuda.core module labels Apr 15, 2026
@rwgk rwgk added this to the cuda.core v1.0.0 milestone Apr 15, 2026
@rwgk
Contributor Author

rwgk commented Apr 15, 2026

@rluo8 I reopened this PR after seeing your question regarding nvbug 5815123.

I hope we can use this for testing on your machine(s).

(I'm leaving this in Draft mode for now.)

@rwgk
Contributor Author

rwgk commented Apr 23, 2026

@rluo8 I had Cursor GPT-5.4 Extra High Fast systematically look at the logs you sent me offline (for my own reference: cuda-python-logs_2026-04-20+212854.zip). I'm copy-pasting the Cursor findings below.

I think the conclusion is sufficiently strong: this PR does not help. I'll close it again for now.

Caveat: I didn't comb through the logs myself. From recent experience I have sufficient confidence that the GPT-5.4 Extra High Fast results are reliable.

Thanks for trying it out. The results mean we have to look for other solutions.


Analysis of Rui cuda-python Windows B100 logs

Date: 2026-04-23

Source directory analyzed: /wrk/rui-cuda-python-logs

Assumption used here: files with suffix _buffer are the runs with PR 1769 applied, per Rui's Slack note:

The logs with suffix _buffer applied PR1769. The PR was applied based on main.

Short answer

From these logs, PR 1769 does not show a convincing improvement.

There is enough signal to say that PR 1769 does not materially fix Rui's flakiness, but there is not enough controlled signal to decide whether it has some smaller order-dependent effect.

The biggest reason for caution is that the test suite uses pytest-randomly, and the before/after runs were not run with the same recorded seeds. That means simple run-count comparisons are weak unless the actual failed-test sets also move in a consistent direction.

Main observations

The logs split naturally into two failure profiles:

  1. A small persistent profile with only a few failures.
  2. A large cascade profile with roughly 341-345 failed plus 102 errors on Python 3.11+, and 147 failed plus 4 errors on Python 3.10.

Those same two profiles appear both before and after PR 1769.

Separately, check_gpu_memory.log shows that this does not look like real memory exhaustion:

  • device memory allocations succeed up to 1 GiB
  • device-pool allocations succeed up to 1 GiB
  • pinned host allocations succeed up to 1 GiB
  • cuMemGetMemPool(MANAGED) fails with CUDA_ERROR_OUT_OF_MEMORY
  • cuGraphAddMemAllocNode fails with CUDA_ERROR_OUT_OF_MEMORY even for tiny allocations

So the driver symptom is unchanged across the observed test failures: this is the same managed/graph-memory path failing, not ordinary lack of VRAM.

Evidence that PR 1769 does not make a meaningful difference

1. The same two failure modes remain before and after

Across the baseline logs and the _buffer logs, the suite still alternates between:

  • a small failure set
  • a large order-dependent cascade

If PR 1769 were fixing the underlying issue, the expected pattern would be a clear reduction or elimination of one of those profiles. That is not what appears in these logs.

2. Python 3.11 is effectively unchanged

test_results_311.log:

  • Run totals: 3, 345, 3, 343, 3 failures

test_results_311_buffer.log:

  • Run totals: 3, 3, 343, 3, 345 failures

Interpretation:

  • before patch: 2/5 large-cascade runs
  • after patch: 2/5 large-cascade runs
  • the large failure/error set is the same
  • the small failure set is the same

This is strong evidence of no material change.

3. Python 3.12 is also effectively unchanged

test_results_312.log:

  • Run totals: 343, 341, 3, 3, 3

test_results_312_buffer.log:

  • Run totals: 3, 3, 3, 341, 3

Interpretation:

  • before patch: 2/5 large-cascade runs
  • after patch: 1/5 large-cascade runs
  • however, when the large cascade appears, the failed/error set is the same

This is at most a hint of order sensitivity, not evidence of a fix.

4. Python 3.13 does not improve, and may be slightly worse

test_results_313.log:

  • Run totals: 7, 342, 7, 342, 7

test_results_313_buffer.log:

  • Run totals: 7, 7, 344, 7, 344

Interpretation:

  • before patch: 2/5 large-cascade runs
  • after patch: 2/5 large-cascade runs
  • patched large runs are slightly larger (344 failed instead of 342)

That is inconsistent with a meaningful improvement.

5. The persistent baseline failures remain

The small profile remains after the patch. Depending on Python version, it includes the same baseline tests such as:

  • tests/test_memory.py::test_pinned_memory_resource_initialization
  • tests/test_memory.py::test_memory_resource_alloc_zero_bytes[PinnedMR]
  • tests/test_managed_memory_warning.py::test_default_pool_error_without_concurrent_access

On Python 3.13, the small profile also includes:

  • tests/graph/test_device_launch.py::test_device_launch_basic
  • tests/graph/test_device_launch.py::test_device_launch_multiple
  • tests/test_memory.py::test_non_managed_resources_report_not_managed[device]
  • tests/test_memory.py::test_non_managed_resources_report_not_managed[pinned]

These do not disappear with PR 1769.

Evidence that PR 1769 might make some difference

There is one weaker signal that could tempt a positive interpretation:

  • across Python 3.10 through 3.13, the number of large-cascade runs drops from 9/20 before to 6/20 after
  • mean failures per run drop from about 127 to about 106

That is the best argument that PR 1769 might be helping somewhat.

However, this is not strong evidence, for several reasons:

  1. The runs are not paired by identical random seeds.
  2. The direction is not consistent across versions.
  3. Python 3.10 is a major outlier and drives much of the apparent improvement.
  4. Python 3.13 shows no reduction in large-run frequency at all.

So this can just as easily be ordinary order noise from pytest-randomly.
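As a sanity check, the aggregate numbers quoted above (9/20 vs 6/20 large-cascade runs; mean failures dropping from about 127 to about 106) can be recomputed from the per-version run totals listed in this comment. The >100-failures cutoff used here to classify a run as a large cascade is my own assumption for this sketch, not something taken from the logs:

```python
# Per-version failure counts per run, transcribed from the log summaries above.
before = {
    "3.10": [3, 147, 147, 2, 147],
    "3.11": [3, 345, 3, 343, 3],
    "3.12": [343, 341, 3, 3, 3],
    "3.13": [7, 342, 7, 342, 7],
}
after = {
    "3.10": [345, 2, 2, 3, 3],
    "3.11": [3, 3, 343, 3, 345],
    "3.12": [3, 3, 3, 341, 3],
    "3.13": [7, 7, 344, 7, 344],
}

def summarize(runs_by_version, cascade_threshold=100):
    # Flatten all runs, count "large cascade" runs, and compute mean failures.
    all_runs = [n for runs in runs_by_version.values() for n in runs]
    large = sum(1 for n in all_runs if n > cascade_threshold)
    return large, len(all_runs), sum(all_runs) / len(all_runs)

print(summarize(before))  # (9, 20, 127.05)
print(summarize(after))   # (6, 20, 105.7)
```

The counts reproduce exactly, which confirms the quoted aggregates follow from the listed run totals; it does not make the before/after comparison any better controlled.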

Why Python 3.10 should not be over-interpreted

Python 3.10 is the one version where the before/after large profiles differ a lot:

test_results_310.log:

  • 3, 147, 147, 2, 147

test_results_310_buffer.log:

  • 345, 2, 2, 3, 3

That looks dramatic at first glance, but it does not support a clean “patch helps” story:

  • before patch: 3/5 large runs, but the large profile is smaller (147 failed + 4 errors)
  • after patch: 1/5 large runs, but that one large run is much worse (345 failed + 102 errors)

This is much more consistent with different randomized ordering reaching different fixture-triggered cascades than with an actual fix.

What the failed-test-set comparison shows

The strongest comparison is not the run count but the failed/error case sets.

Results:

  • Python 3.11: baseline and patched have the same small-profile set and the same large-profile set.
  • Python 3.12: baseline and patched have the same small-profile set and the same large-profile set.
  • Python 3.13: baseline and patched have the same small-profile set; the large profile is effectively the same, with two extra patched failures.
  • Python 3.10: profiles differ substantially, which is exactly why it looks like order/coverage variation rather than a stable fix.

This failed-set comparison is the main reason to conclude that PR 1769 does not materially change the observed behavior.

Role of pytest-randomly

Because pytest-randomly is in use, a run can be dominated by whether the order happens to hit a module-scoped or shared fixture path that triggers the graph/managed-memory failure.

That means:

  • a run with 3 failures and a run with 343 failures can still come from the same underlying defect
  • comparing raw totals without controlling order is unreliable
  • identical or near-identical failure sets across before/after conditions are much more trustworthy than simple counts

The logs only explicitly show Using --randomly-seed=... for the Python 3.10 logs. I did not find recorded seeds for the other versions in these log files, so I cannot pair baseline and patched runs seed-by-seed.
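Recovering the seed where it is recorded is a one-line scan over the log text; a minimal sketch against pytest-randomly's standard header line (the sample log text below is fabricated for illustration):

```python
import re

def find_randomly_seed(log_text):
    # pytest-randomly prints "Using --randomly-seed=<N>" near the top of a run.
    m = re.search(r"Using --randomly-seed=(\d+)", log_text)
    return int(m.group(1)) if m else None

sample = "platform win32 -- Python 3.10.11\nUsing --randomly-seed=123456\ncollected 2000 items"
print(find_randomly_seed(sample))  # 123456
print(find_randomly_seed("no seed recorded here"))  # None
```

Applying this across all log files would immediately show which runs, if any, can be paired seed-by-seed.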

Do we have enough signal to answer the core question?

Yes, for this practical question:

Does PR 1769 make a meaningful difference in Rui's logs?

Answer: No convincing evidence.

The logs do not support the claim that PR 1769 meaningfully fixes or clearly reduces the observed Windows B100 flakiness.

No, for this narrower question:

Does PR 1769 slightly reduce the probability of triggering the large cascade?

Answer: Not from these logs alone.

The sample is too weakly controlled:

  • only 5 runs per condition
  • randomized ordering
  • no seed pairing across before/after
  • mixed behavior across Python versions

Any small improvement claim would be speculative.

Bottom line

My conclusion from Rui's logs is:

  • PR 1769 does not make a convincing difference
  • PR 1769 also does not eliminate either failure profile
  • the strongest evidence points to no material fix
  • the remaining apparent improvements are explainable as pytest-randomly noise / order sensitivity

If a more decisive answer is needed, the right next experiment would be:

  1. choose a set of fixed --randomly-seed values
  2. run baseline and PR 1769 with the same seeds
  3. compare failed/error case sets seed-by-seed

That would turn this from suggestive log reading into a controlled before/after comparison.
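Step 3 of that experiment reduces to a per-seed set comparison of failed-test IDs. A minimal sketch (the test names below are hypothetical placeholders, not taken from the logs):

```python
def compare_failed_sets(baseline, patched):
    """For each seed present in both runs, report failures unique to each side."""
    report = {}
    for seed in sorted(baseline.keys() & patched.keys()):
        report[seed] = {
            "only_baseline": sorted(baseline[seed] - patched[seed]),
            "only_patched": sorted(patched[seed] - baseline[seed]),
        }
    return report

# Hypothetical example: same seed, one failure gone, one new failure appeared.
baseline = {42: {"tests/test_a.py::t1", "tests/test_b.py::t2"}}
patched = {42: {"tests/test_b.py::t2", "tests/test_c.py::t3"}}
print(compare_failed_sets(baseline, patched))
```

Empty `only_baseline`/`only_patched` lists for every seed would mean the patch changes nothing; a consistent non-empty `only_baseline` across seeds would be the controlled signal that it helps.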

@rwgk rwgk closed this Apr 23, 2026