Create VGPU changes with VFIO Framework #139

Open
JunAr7112 wants to merge 12 commits into NVIDIA:main from JunAr7112:vfio_changes

@JunAr7112
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Nov 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 2 times, most recently from cbe892c to 3f31828 Compare November 10, 2025 17:09
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Contributor

@cdesiniotis cdesiniotis left a comment

Thanks @JunAr7112, this is a good start. As we iterate on this and get more familiar with the internals, it may be valuable to create a new internal/vgpu package that hides the vfio vs. mdev framework complexity. We need to think through what the right interface would be, but I imagine we will need methods for 1) getting all vGPU devices, 2) getting all parent devices (on top of which a vGPU device can be created), and 3) creating a vGPU device. The pkg/vgpu/config.go file, which is concerned with getting / setting a particular vGPU config, can then invoke these methods without having to know what vfio / mdev is.
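
The three operations listed above could be captured in an interface along these lines. This is only a sketch under stated assumptions: the names `Manager`, `Device`, `ParentDevice`, `fakeManager`, and `countByType` are hypothetical and not taken from the PR, and the in-memory implementation exists only to show how a caller could stay unaware of the vfio / mdev distinction.

```go
package main

import "fmt"

// Device is a hypothetical handle for an existing vGPU device.
type Device struct {
	Parent string // PCI address of the parent device
	Type   string // vGPU type name, e.g. "A100-4C"
}

// ParentDevice is a hypothetical handle for a device that can host vGPUs.
type ParentDevice struct {
	Address   string
	Available map[string]int // vGPU type name -> instances still creatable
}

// Manager hides whether the host uses the vfio or the mdev framework.
type Manager interface {
	GetAllDevices() ([]*Device, error)
	GetAllParentDevices() ([]*ParentDevice, error)
	CreateDevice(parent *ParentDevice, vgpuType string) (*Device, error)
}

// fakeManager is an in-memory stand-in used only for illustration.
type fakeManager struct {
	parents []*ParentDevice
	devices []*Device
}

func (m *fakeManager) GetAllDevices() ([]*Device, error) { return m.devices, nil }

func (m *fakeManager) GetAllParentDevices() ([]*ParentDevice, error) { return m.parents, nil }

func (m *fakeManager) CreateDevice(parent *ParentDevice, vgpuType string) (*Device, error) {
	if parent.Available[vgpuType] <= 0 {
		return nil, fmt.Errorf("vGPU type %s cannot be created on %s", vgpuType, parent.Address)
	}
	parent.Available[vgpuType]--
	d := &Device{Parent: parent.Address, Type: vgpuType}
	m.devices = append(m.devices, d)
	return d, nil
}

// countByType shows a caller that only depends on the Manager interface.
func countByType(m Manager) (map[string]int, error) {
	devices, err := m.GetAllDevices()
	if err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, d := range devices {
		counts[d.Type]++
	}
	return counts, nil
}

func main() {
	parent := &ParentDevice{Address: "0000:41:00.0", Available: map[string]int{"A100-4C": 2}}
	var m Manager = &fakeManager{parents: []*ParentDevice{parent}}
	if _, err := m.CreateDevice(parent, "A100-4C"); err != nil {
		fmt.Println("error:", err)
		return
	}
	counts, _ := countByType(m)
	fmt.Println(counts)
}
```

With this shape, a vfio-backed and an mdev-backed implementation could both satisfy `Manager`, and the config code would pick one at construction time.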

Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread pkg/vgpu/config.go Outdated
Comment thread cmd/nvidia-vgpu-dm/assert/config.go
Comment thread cmd/nvidia-vgpu-dm/apply/config.go
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 6 times, most recently from 5daf473 to 89a1e62 Compare November 25, 2025 21:03
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 9 times, most recently from 50e2185 to 58c7cc6 Compare December 9, 2025 22:31
Comment thread pkg/vgpu/config.go
Comment thread internal/vfio/vfio.go Outdated
Comment thread pkg/vgpu/config.go
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 3 times, most recently from d134551 to d49bf31 Compare January 16, 2026 00:25
Comment thread pkg/vgpu/config.go
if ret != nvml.SUCCESS {
continue
}
vgpuConfig[typeName]++
Contributor

Question -- does the name reported by NVML, e.g. vgpuTypeId.GetName(), align exactly with the name we were using before? (the names stored in vgpuDev.MDEVType)

Contributor Author

I added a helper function to ensure they align exactly.

Contributor

So the name reported by NVML contains the product prefix? For example, is NVML returning NVIDIA A100-4C as the type name for the A100-4C device?

Contributor Author

@JunAr7112 JunAr7112 Mar 20, 2026

Yes, I checked this manually earlier and added parseVGPUTypeName(rawName string) to ensure that we only get A100-4C. The name reported by NVML included a prefix.
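
The PR's actual parseVGPUTypeName is not shown in this thread. A minimal sketch of such a helper, assuming the only difference is a leading `NVIDIA ` product prefix (as in the `NVIDIA A100-4C` example above), might look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// parseVGPUTypeName strips an assumed "NVIDIA " product prefix from the raw
// name so that it matches the short form used in configs (e.g. "A100-4C").
// Names without the prefix are returned unchanged apart from trimmed spaces.
func parseVGPUTypeName(rawName string) string {
	return strings.TrimPrefix(strings.TrimSpace(rawName), "NVIDIA ")
}

func main() {
	fmt.Println(parseVGPUTypeName("NVIDIA A100-4C")) // A100-4C
	fmt.Println(parseVGPUTypeName("A100-4C"))        // A100-4C
}
```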

Comment thread internal/vfio/vfio.go Outdated
Comment thread internal/vfio/vfio.go Outdated
Comment thread internal/vfio/vfio.go Outdated
vfnum := 0
numVF := int(device.SriovInfo.PhysicalFunction.NumVFs)
for vfnum < numVF {
vfAddr := filepath.Join(HostPCIDevicesRoot, device.Address, "virtfn"+strconv.Itoa(vfnum), "nvidia")
Contributor

I don't think the nvidia directory should be included in this path. The "path to the VF" is simply /sys/bus/pci/devices/<BDF>/virtfn<N>. Other parts of the code are not intuitive to me because vfAddr includes the nvidia directory.

Contributor Author

local-agadiyar@ipp1-2284:/sys/bus/pci/devices/0000:41:00.0/virtfn0/nvidia$ cat current_vgpu_type
687
local-agadiyar@ipp1-2284:/sys/bus/pci/devices/0000:41:00.0/virtfn0/nvidia$ cat creatable_vgpu_types
ID : vGPU Name

The current_vgpu_type and creatable_vgpu_types files are located in the nvidia folder. Including it in vfAddr means we don't need to append nvidia to another path variable.

Comment thread internal/vfio/vfio.go Outdated
return nil, fmt.Errorf("virtual function %d at address %s does not exist", vfnum, vfAddr)
}
parentDevices = append(parentDevices, &ParentDevice{
NvidiaPCIDevice: device,
Contributor

Question: Shouldn't NvidiaPCIDevice represent the VF (IIUC device is currently a PF)? If so, then I don't see the need to have VirtualFunctionPath as a separate field. If we instead used the VF here, I think that would simplify the code in a few places and make this easier to read. Note, the nvpci.NvidiaPCIDevice type allows you to go from the VF to the backing PF via device.SriovInfo.VirtualFunction.PhysicalFunction.

Contributor Author

@JunAr7112 JunAr7112 Jan 20, 2026

I think the issue here is that nvpci doesn't have a built-in way to get virtual functions from the physical function. That is why I am storing the physical device (via nvdevices, err := m.nvlib.Nvpci.GetGPUs()) as well as the path to the virtual function.

Comment thread internal/vfio/vfio.go Outdated
}
devices := []*Device{}
for _, parentDevice := range parentDevices {
vgpuTypeNumberBytes, err := os.ReadFile(filepath.Join(parentDevice.VirtualFunctionPath, "current_vgpu_type"))
Contributor

As indicated in a prior comment, if parentDevice was just of type nvpci.NvidiaPCIDevice (and represented the VF), this code would be replaced by:

vgpuTypeNumberBytes, err := os.ReadFile(filepath.Join(parentDevice.NvidiaPCIDevice.Path, "current_vgpu_type"))

Contributor Author

See above comment.

Comment thread internal/vfio/vfio.go Outdated
Comment thread internal/vfio/vfio.go Outdated
// ParentDevice represents an NVIDIA parent PCI device.
type ParentDevice struct {
*nvpci.NvidiaPCIDevice
VirtualFunctionPath string
Contributor

Why is this field needed? Shouldn't NvidiaPCIDevice.Path represent the path to the virtual function?

Contributor Author

See above comment. nvpci.NvidiaPCIDevice is storing the physical function.

Comment thread internal/vfio/vfio.go Outdated
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 2 times, most recently from d2246a9 to 2a5d547 Compare January 20, 2026 17:59
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 3 times, most recently from cfffa9f to 6eaa198 Compare January 20, 2026 22:51
Comment thread internal/vgpu/api.go Outdated
Comment thread internal/vgpu/vfio.go Outdated
Comment thread internal/vgpu/vfio.go Outdated
Comment thread internal/vgpu/vfio.go Outdated
Comment thread internal/vgpu/vfio.go Outdated
Comment thread pkg/vgpu/config.go
Comment thread pkg/vgpu/config.go
}

type nvlibVGPUConfigManager struct {
nvlib nvlib.Interface
Contributor

Question -- does nvlib.Interface need to exist anymore?

Contributor Author

No, we are no longer using nvlib.Interface. I don't think we need it for any of the other projects either.

Comment thread pkg/vgpu/config.go Outdated
Comment on lines +131 to +148
found := false
for _, vgpuTypeId := range supportedVGPUs {
rawName, ret := vgpuTypeId.GetName()
if ret != nvml.SUCCESS {
continue
}
name := parseVGPUTypeName(rawName)
if name == key {
found = true
sanitizedConfig[key] = val
break
}
if name == strippedKey {
found = true
sanitizedConfig[strippedKey] = val
break
}
}
Contributor

To improve readability, what if we first constructed a map named supportedVgpuTypes, like so:

supportedVgpuTypes := map[string]bool{}
for _, vgpu := range supportedVGPUs {
  name, ret := vgpu.GetName()
  if ret != nvml.SUCCESS {
    continue
  }
  name = parseVGPUTypeName(name)
  supportedVgpuTypes[name] = true
}

Then this for loop would simplify to

Suggested change
- found := false
- for _, vgpuTypeId := range supportedVGPUs {
-   rawName, ret := vgpuTypeId.GetName()
-   if ret != nvml.SUCCESS {
-     continue
-   }
-   name := parseVGPUTypeName(rawName)
-   if name == key {
-     found = true
-     sanitizedConfig[key] = val
-     break
-   }
-   if name == strippedKey {
-     found = true
-     sanitizedConfig[strippedKey] = val
-     break
-   }
- }
+ if _, ok := supportedVgpuTypes[key]; ok {
+   sanitizedConfig[key] = val
+ } else if _, ok := supportedVgpuTypes[strippedKey]; ok {
+   sanitizedConfig[strippedKey] = val
+ } else {
+   return fmt.Errorf("vGPU type %s is not supported on GPU (index=%d, address=%s)", key, gpu, device.Address)
+ }
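
For reference, the map-based lookup can be exercised without NVML. The sketch below is a standalone approximation, not the PR's code: `sanitizeConfig` is a hypothetical wrapper, `rawNames` stands in for the names returned by NVML, `parseVGPUTypeName` is assumed to strip an `NVIDIA ` prefix, and `strippedKey` is assumed to be the requested key with that same prefix removed.

```go
package main

import (
	"fmt"
	"strings"
)

// parseVGPUTypeName is a stand-in for the PR's helper; it is assumed to
// strip a leading "NVIDIA " product prefix.
func parseVGPUTypeName(rawName string) string {
	return strings.TrimPrefix(rawName, "NVIDIA ")
}

// sanitizeConfig keeps only requested vGPU types that are supported.
// rawNames stands in for the type names that NVML would report.
func sanitizeConfig(requested map[string]int, rawNames []string) (map[string]int, error) {
	// First pass: build the set of supported type names once.
	supportedVgpuTypes := make(map[string]bool, len(rawNames))
	for _, raw := range rawNames {
		supportedVgpuTypes[parseVGPUTypeName(raw)] = true
	}

	// Second pass: validate each requested type with constant-time lookups.
	sanitizedConfig := make(map[string]int, len(requested))
	for key, val := range requested {
		strippedKey := parseVGPUTypeName(key) // assumption: key minus prefix
		switch {
		case supportedVgpuTypes[key]:
			sanitizedConfig[key] = val
		case supportedVgpuTypes[strippedKey]:
			sanitizedConfig[strippedKey] = val
		default:
			return nil, fmt.Errorf("vGPU type %s is not supported", key)
		}
	}
	return sanitizedConfig, nil
}

func main() {
	supported := []string{"NVIDIA A100-4C", "NVIDIA A100-8C"}
	cfg, err := sanitizeConfig(map[string]int{"NVIDIA A100-4C": 2}, supported)
	fmt.Println(cfg, err)
}
```

Building the set once replaces the nested loop and the `found` flag with two map lookups per requested type.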

Contributor Author

OK, I broke this into two for loops.

Comment thread internal/vgpu/mdev.go Outdated
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 2 times, most recently from 7bc7d3d to 7e25ec0 Compare March 24, 2026 22:01
JunAr7112 added 12 commits April 1, 2026 22:31
Signed-off-by: Arjun <agadiyar@nvidia.com>