Support optional subsystem ID matching in device filters - vGPU #167

Open
mattwittwer wants to merge 5 commits into NVIDIA:main from mattwittwer:mwittwer/add-vgpu-subsystem-id-matching

@mattwittwer

Summary

  • add optional subsystem ID matching for device-filter

Why

Some GPUs share the same PCI vendor/device ID but differ by subsystem ID, which can cause the wrong config block to match and fail validation. This allows config authors to disambiguate those devices without breaking existing configs.

Changes

  • update go-nvlib PCI discovery to read subsystem vendor/device IDs
  • extend DeviceID parsing and matching logic to optionally include subsystem IDs
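The optional matching described in the list above could look roughly like the sketch below. It is not the PR's implementation: the types `pciID` and `deviceFilter`, the functions `parseID` and `parseFilter`, and the `0xDDDDVVVV:0xSSSSVVVV` filter syntax are all assumptions made for illustration, and the ID values are placeholders. The point it shows is that a filter without a subsystem part keeps matching on device and vendor ID alone, so existing configs behave as before.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pciID packs a 16-bit device ID (high half) and a 16-bit vendor ID (low half).
type pciID uint32

// deviceFilter is a hypothetical filter. Subsystem is nil when the config
// author did not specify one.
type deviceFilter struct {
	Device    pciID
	Subsystem *pciID
}

// parseID parses a hexadecimal string such as "0x123410DE".
func parseID(s string) (pciID, error) {
	v, err := strconv.ParseUint(strings.TrimPrefix(strings.ToLower(s), "0x"), 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid PCI ID %q: %w", s, err)
	}
	return pciID(v), nil
}

// parseFilter accepts "0xDDDDVVVV" or "0xDDDDVVVV:0xSSSSVVVV" (assumed syntax).
func parseFilter(s string) (deviceFilter, error) {
	devPart, subPart, hasSub := strings.Cut(s, ":")
	dev, err := parseID(devPart)
	if err != nil {
		return deviceFilter{}, err
	}
	f := deviceFilter{Device: dev}
	if hasSub {
		sub, err := parseID(subPart)
		if err != nil {
			return deviceFilter{}, err
		}
		f.Subsystem = &sub
	}
	return f, nil
}

// matches reports whether a discovered device satisfies the filter. The
// subsystem is compared only when the filter specifies one.
func (f deviceFilter) matches(device, subsystem pciID) bool {
	if f.Device != device {
		return false
	}
	return f.Subsystem == nil || *f.Subsystem == subsystem
}

func main() {
	// Placeholder values: two boards sharing device/vendor 0x1234/0x10DE
	// but carrying different (made-up) subsystem IDs.
	boardA := [2]pciID{0x123410DE, 0x000110DE}
	boardB := [2]pciID{0x123410DE, 0x000210DE}

	for _, expr := range []string{"0x123410DE", "0x123410DE:0x000210DE"} {
		f, err := parseFilter(expr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s matches A=%t B=%t\n", expr,
			f.matches(boardA[0], boardA[1]), f.matches(boardB[0], boardB[1]))
	}
}
```

In this sketch the first filter matches both boards, while the second matches only board B.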

@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mattwittwer mattwittwer changed the title from "Support optional subsystem ID matching in device filters" to "Support optional subsystem ID matching in vGPU device filters" Mar 10, 2026
@mattwittwer mattwittwer changed the title from "Support optional subsystem ID matching in vGPU device filters" to "Support optional subsystem ID matching in device filters - vGPU" Mar 10, 2026
@karthikvetrivel
Member

Hey @mattwittwer! Did you get the chance to test this on two GPUs with identical PCI device IDs? Did you get the chance to test this change on an A2/A16?

The core logic looks sound and reasonable to me. Do you think you'd also be able to add some tests? Namely, to DeviceID construction, NewDeviceIDFromString, and most importantly, Matches()?

@mattwittwer
Author

> Hey @mattwittwer! Did you get the chance to test this on two GPUs with identical PCI device IDs? Did you get the chance to test this change on an A2/A16?
>
> The core logic looks sound and reasonable to me. Do you think you'd also be able to add some tests? Namely, to DeviceID construction, NewDeviceIDFromString, and most importantly, Matches()?

Hi @karthikvetrivel! I have not been able to try this out directly with A2 and A16 GPUs on the same cluster. The user who reported this issue was able to return their nodes to production by manually setting the labels.

Member

As @rajathagasthya said on the mig-parted PR, this change needs to be made in https://github.com/NVIDIA/go-nvlib and vendored here with a go mod dependency update.
