Memory diagnostic may detect a GPU memory allocation error but still mark the test as PASS #284

@aditigaur4

Description

Title

Memory diagnostic may report a GPU memory allocation error but still mark the test as PASS

DCGM Version

4.5

Component

DCGM / NVVS Memory diagnostic plugin

Summary

When the memory diagnostic cannot allocate enough GPU memory to meet its configured minimum allocation requirement, it can still complete with an overall PASS result even though it records/reports an error indicating a memory allocation failure (DCGM_FR_MEMORY_ALLOC).

The documentation also states that this is a valid error code.

This creates an inconsistency between what the diagnostic reports (an allocation error occurred) and the final test outcome (PASS).

Expected Behavior

If the diagnostic cannot allocate the minimum required amount of GPU memory to run the test as configured, it should produce a final outcome consistent with that failure, e.g.:

  • FAIL (recommended if the test’s required conditions are not met; this is also what the DCGM docs state)

In any case, it should not record an allocation error and still return PASS for the same run.
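
The expected behavior can be written down as a small rule. The following is a minimal sketch in Python, not DCGM code: `final_result` and `Result` are hypothetical names, and the error is represented by its name as a plain string rather than by the real DCGM constant.

```python
from enum import Enum
from typing import Iterable

# Error name as written in this issue; a plain string here, not the real DCGM constant.
MEMORY_ALLOC_ERROR = "DCGM_FR_MEMORY_ALLOC"


class Result(Enum):
    PASS = "PASS"
    FAIL = "FAIL"


def final_result(recorded_errors: Iterable[str]) -> Result:
    """Hypothetical rule: an allocation error recorded during the run rules out PASS."""
    if MEMORY_ALLOC_ERROR in set(recorded_errors):
        return Result.FAIL
    return Result.PASS
```

Under this rule, a run that recorded `DCGM_FR_MEMORY_ALLOC` can never end as PASS, which is the consistency this issue asks for.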

Actual Behavior

The diagnostic can:

  • detect DCGM_FR_MEMORY_ALLOC (insufficient GPU memory could be allocated to satisfy the configured requirement), and
  • still return a final PASS status for the memory test run (unless some other unrelated failure condition occurs).

Impact

  • Users and automation may interpret the run as “healthy” based on PASS even though an error was logged.
  • Error reporting becomes harder to trust because recorded errors may not correlate with the final result.
  • This can mask real deployment/configuration issues (or memory pressure) that should be surfaced as non-PASS outcomes.
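
Until the result and the recorded errors agree, automation may want to check both. Below is a rough sketch of such a guard, assuming the caller has already extracted the final result string and the list of recorded error names from the diagnostic output. How to extract them is not shown, and `is_healthy` is a hypothetical helper, not part of DCGM.

```python
from typing import Sequence


def is_healthy(result: str, recorded_errors: Sequence[str]) -> bool:
    """Treat a run as healthy only if it passed AND recorded no errors.

    Both arguments are assumed to be parsed from the diagnostic output
    by the caller; the parsing step is outside this sketch.
    """
    return result == "PASS" and len(recorded_errors) == 0
```

With this guard, a run that returns PASS but also logged `DCGM_FR_MEMORY_ALLOC` is flagged rather than silently accepted.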

Reproduction (High-level)

  • Configure the memory diagnostic to require a minimum fraction of GPU memory to be allocated.
  • Run the diagnostic on a GPU where that minimum cannot be satisfied due to memory pressure/fragmentation/other allocations.
  • Observe that the final test result is still PASS.
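
The condition that triggers the problem in the steps above is simple arithmetic. Here is a minimal sketch with made-up numbers. `meets_minimum` and its parameter names are hypothetical, since the actual plugin parameter name is not given in this issue.

```python
def meets_minimum(allocated_bytes: int, total_bytes: int, min_fraction: float) -> bool:
    """Return True if the allocated share of GPU memory reaches the configured minimum."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return allocated_bytes / total_bytes >= min_fraction


GIB = 1024 ** 3

# Illustrative values only: 40 GiB allocatable on an 80 GiB GPU, 75% required.
print(meets_minimum(40 * GIB, 80 * GIB, 0.75))  # False: 0.5 is below 0.75
```

In the scenario reported here, this check would be false, the allocation error would be recorded, and the run would still end as PASS.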
