Title
Memory diagnostic may report a GPU memory allocation error but still mark the test as PASS
DCGM Version
4.5
Component
DCGM / NVVS Memory diagnostic plugin
Summary
When the memory diagnostic cannot allocate enough GPU memory to meet its configured minimum allocation requirement, it can still complete with an overall PASS result even though it records/reports an error indicating a memory allocation failure (DCGM_FR_MEMORY_ALLOC).
The documentation also lists this as a valid error code.
This creates an inconsistency between what the diagnostic reports (an allocation error occurred) and the final test outcome (PASS).
Expected Behavior
If the diagnostic cannot allocate the minimum required amount of GPU memory to run the test as configured, it should produce a final outcome consistent with that failure, e.g.:
- FAIL (recommended when the test’s required conditions are not met; this is also what the DCGM documentation states)
In any case, it should not record an allocation error and still return PASS for the same run.
Actual Behavior
The diagnostic can:
- detect DCGM_FR_MEMORY_ALLOC (insufficient GPU memory could be allocated to satisfy the configured requirement), and
- still return a final PASS status for the memory test run (unless some other unrelated failure condition occurs).
Impact
- Users and automation may interpret the run as “healthy” based on PASS even though an error was logged.
- Error reporting becomes harder to trust because recorded errors may not correlate with the final result.
- This can mask real deployment/configuration issues (or memory pressure) that should be surfaced as non-PASS outcomes.
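Until the outcome is made consistent, automation that consumes diagnostic results could guard against this case itself. The following is a minimal sketch only: `DiagOutcome` and `inconsistent_passes` are hypothetical names, and the record layout is an assumption for illustration, not DCGM's actual output schema. It flags any test that reports PASS while also carrying a recorded error such as DCGM_FR_MEMORY_ALLOC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiagOutcome:
    """Hypothetical, simplified view of one test's result.

    The field names are illustrative; they do not mirror DCGM's real output.
    """
    name: str
    status: str                              # e.g. "PASS", "FAIL", "SKIP"
    error_codes: List[str] = field(default_factory=list)


def inconsistent_passes(outcomes: List[DiagOutcome]) -> List[DiagOutcome]:
    """Return outcomes that report PASS while also carrying recorded errors."""
    return [
        o for o in outcomes
        if o.status.upper() == "PASS" and o.error_codes
    ]


if __name__ == "__main__":
    # Made-up sample data for illustration; not captured from a real run.
    results = [
        DiagOutcome("memory", "PASS", ["DCGM_FR_MEMORY_ALLOC"]),
        DiagOutcome("pcie", "PASS"),
    ]
    for outcome in inconsistent_passes(results):
        print(f"{outcome.name}: PASS with errors {outcome.error_codes}")
```

A check like this treats "PASS with recorded errors" as a non-healthy state, which is the same invariant described under Expected Behavior.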
Reproduction (High-level)
- Configure the memory diagnostic to require a minimum fraction of GPU memory to be allocated.
- Run the diagnostic on a GPU where that minimum cannot be satisfied due to memory pressure/fragmentation/other allocations.
- Observe that the final test result is still PASS.