Hang detection feature introduced in 4.4.2 floods logs #279

@bsteinb

We recently noticed that after updating to DCGM 4.4.2, /var/log/nv-hostengine.log grows to several gigabytes within a few days. The cause is the hang detection feature introduced in 4.4.2, which emits messages like these every 60 seconds:

2026-02-09 15:31:15.504 ERROR [5066:5080] Failed to compute current fingerprint for 24299/0 [/builds/dcgm/dcgm/common/HangDetect.cpp:193] [HangDetect::IsHungImpl]
2026-02-09 15:31:15.504 ERROR [5066:5080] Error checking hang state for process 24299: -14 [/builds/dcgm/dcgm/common/HangDetectMonitor.cpp:237] [HangDetectMonitor::CollectTaskUpdates]

This pair of messages is repeated every 60 seconds for each entry in a list of process IDs (I assume 24299 is a PID here) that grows over time. No processes exist under the reported PIDs. It seems that the mechanism leaks the processes it tries to track: when a tracked process disappears, the mechanism keeps complaining about it indefinitely.
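For anyone who wants to check this on their own hosts, here is a minimal sketch that extracts the reported PIDs from the log and tests whether they still exist. It assumes the log format shown above and a Linux /proc filesystem. The function names `reported_pids` and `stale_pids` are made up for this illustration and are not part of DCGM.

```python
import os
import re

# Matches the second line of each message pair quoted above.
HANG_ERROR = re.compile(r"Error checking hang state for process (\d+): (-?\d+)")


def reported_pids(lines):
    """Return the set of PIDs named in hang-state error lines."""
    pids = set()
    for line in lines:
        match = HANG_ERROR.search(line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def stale_pids(pids, proc_root="/proc"):
    """Return, sorted, the PIDs that have no entry under proc_root."""
    return sorted(
        pid for pid in pids
        if not os.path.exists(os.path.join(proc_root, str(pid)))
    )
```

To use it, open /var/log/nv-hostengine.log, pass the file object to `reported_pids`, and pass the result to `stale_pids`. Note that PIDs can be reused, so a PID that currently exists is not necessarily the process that was originally tracked.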

We found that we can trigger the leak by running DCGM diagnostics with dcgmi diag: -r 2 leaks one additional process ID, and -r 3 leaks two. This exacerbates the issue for us, since we regularly run DCGM diagnostics as part of automated health checking. By the time we noticed the issue, the list of processes that the mechanism complains about every minute had grown to about a thousand, which results in about 250 MB of log file growth per day. We do not know whether actual compute workloads trigger the issue as well.
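One way to watch the list grow, for example before and after a diagnostics run, is to count the distinct PIDs reported in each minute of the log. The sketch below assumes the timestamp and message format shown in the excerpt above. `pids_per_minute` is a hypothetical helper name, not a DCGM tool.

```python
import re
from collections import defaultdict

# Captures the timestamp truncated to the minute, and the reported PID.
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ ERROR .*"
    r"Error checking hang state for process (\d+):"
)


def pids_per_minute(lines):
    """Map each minute in the log to the number of distinct PIDs reported."""
    buckets = defaultdict(set)
    for line in lines:
        match = LINE.match(line)
        if match:
            buckets[match.group(1)].add(int(match.group(2)))
    return {minute: len(pids) for minute, pids in sorted(buckets.items())}
```

If the leak behaves as described, the count for a minute shortly after a diagnostics run should be higher than the count for a minute shortly before it.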

We reliably observe this issue on x86_64 on different GPUs from the Volta, Ampere, and Hopper generations. We have not seen the issue on Grace Hopper so far.

DCGM 4.5.2 still shows the issue.

We currently work around the issue by setting the environment variable DCGM_HANGDETECT_DISABLE, which disables the hang detection mechanism.
