Neuron scheduler extension initializes device usage map as [true] after operator uninstall/reinstall #1300

@shapirov103

Component

neuron-scheduler (Kubernetes Scheduler Extension)

Version

2.29.94.0 (also reproduced on 2.24.23.0)

Environment

Red Hat OpenShift (ROSA) 4.21, Kubernetes 1.34.4, inf2.8xlarge (1 Neuron device, 2 NeuronCores), neuron-device-plugin 2.29.94.0

Description

The neuron scheduler extension initializes its neuronDevUsageMap with all devices marked as [true] (allocated) after an operator uninstall/reinstall cycle. This prevents any pod requesting aws.amazon.com/neuron from being scheduled via the neuron-scheduler.

On a fresh first-time install, the scheduler works correctly and pods are scheduled without issues. However, after uninstalling the operator (including namespace deletion) and reinstalling it on the same node, the scheduler extension marks all devices as used even though no pod is consuming them.

The device plugin pod runs on the neuron node (required by its KMM nodeSelector) but has zero aws.amazon.com/neuron or aws.amazon.com/neuroncore resource requests. The scheduler extension appears to count it as consuming a device during its initial pod sync on reinstall.

Steps to reproduce

  1. Deploy the neuron operator with DeviceConfig on a cluster with an inf2.8xlarge node
  2. Schedule a pod with schedulerName: neuron-scheduler and limits: aws.amazon.com/neuron: "1". This works on the first install.
  3. Wait for the pod to complete
  4. Uninstall the operator (delete the DeviceConfig, CSV, namespace)
  5. Reinstall the operator with the same configuration
  6. Create the same pod. It stays Pending with: No available set of device of length 1 found. Current usage: [true]
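
For anyone reproducing steps 2 and 6, the test pod can be generated with a small script. This is only a sketch: the pod name, container name, and image below are placeholders that do not come from this report, and only the schedulerName and the aws.amazon.com/neuron limit reflect the steps above.

```python
import json


def neuron_test_pod(name="neuron-test", image="example.com/placeholder:latest"):
    """Build a Pod manifest shaped like the one in steps 2 and 6.

    `name` and `image` are hypothetical placeholders; substitute your own.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Route the pod through the Neuron scheduler extension.
            "schedulerName": "neuron-scheduler",
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Request one whole Neuron device, as in the report.
                    "resources": {"limits": {"aws.amazon.com/neuron": "1"}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(neuron_test_pod(), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -` (or `oc apply -f -` on OpenShift).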

Evidence from scheduler extension logs

Pod Creating Event: neuron-device-plugin-pv29d-jj2gv
...
neuronDevUsageMap for resource:aws.amazon.com/neuron in node: ip-10-2-123-197 is [true]
available devices not found: No available set of device of length 1 found. Current usage: [true]

Confirmed clean state at time of failure

  • Kubelet checkpoint: PodDeviceEntries: null (no allocations)
  • Node reports: allocatable aws.amazon.com/neuron: 1, capacity: 1
  • No pod in the cluster has aws.amazon.com/neuron in resource requests/limits
  • Device plugin logs: enableCustomSchedulerForPartialResourcesAllocation: false
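
The third check above (no pod requesting aws.amazon.com/neuron) can be repeated with a short script over the JSON that `kubectl get pods -A -o json` returns. This is a minimal sketch; the sample pod at the bottom is made up for illustration and only mimics a device plugin pod with no resource requests.

```python
RESOURCE = "aws.amazon.com/neuron"


def container_requests(container, resource):
    """True if the container names `resource` in its requests or limits."""
    res = container.get("resources") or {}
    return any(resource in (res.get(kind) or {}) for kind in ("requests", "limits"))


def pods_requesting(pod_list, resource=RESOURCE):
    """Return 'namespace/name' for every pod that requests `resource`.

    `pod_list` is the parsed output of `kubectl get pods -A -o json`.
    """
    hits = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
        if any(container_requests(c, resource) for c in containers):
            meta = pod.get("metadata", {})
            hits.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return hits


if __name__ == "__main__":
    # Illustrative input only: a pod with no Neuron requests or limits.
    sample = {
        "items": [
            {
                "metadata": {"namespace": "example-ns", "name": "device-plugin-example"},
                "spec": {"containers": [{"name": "plugin", "resources": {}}]},
            }
        ]
    }
    print(pods_requesting(sample))
```

An empty result corresponds to the clean state described above.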

Additional observations

  • First-time install on a fresh node works correctly
  • After uninstall/reinstall on the same node, the issue appears consistently
  • Terminating the EC2 instance and letting the machine pool replace it resolves the issue (fresh node)
  • Restarting the scheduler extension pod does not resolve the issue
  • Clearing the kubelet device checkpoint and restarting kubelet does not resolve the issue if the scheduler extension is already running
  • Using the default Kubernetes scheduler with aws.amazon.com/neuron resource requests works correctly in all cases

Expected behavior

The scheduler extension should correctly initialize its device usage map after an operator reinstall, marking devices as free when no pod is consuming them.
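
To make the expectation concrete, here is a rough model of the initial sync I would expect. It is not the scheduler extension's actual code, which I have not inspected; the function names and the first-free-device assignment are assumptions made purely for illustration.

```python
RESOURCE = "aws.amazon.com/neuron"
FINISHED_PHASES = {"Succeeded", "Failed"}


def requested_devices(pod, resource=RESOURCE):
    """Sum the `resource` limits across a pod's containers (0 if absent)."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        total += int(limits.get(resource, 0))
    return total


def build_usage_map(num_devices, pods, resource=RESOURCE):
    """Hypothetical initial sync: mark a device used only for live pods
    that actually request `resource`."""
    usage = [False] * num_devices
    for pod in pods:
        # Completed pods no longer hold a device.
        if pod.get("status", {}).get("phase") in FINISHED_PHASES:
            continue
        remaining = requested_devices(pod, resource)
        # Assumption: take the first free devices; real placement may differ.
        for i in range(num_devices):
            if remaining == 0:
                break
            if not usage[i]:
                usage[i] = True
                remaining -= 1
    return usage
```

Under this model, a node whose only pod is the device plugin (zero Neuron requests) ends up with a usage map of [False] rather than [true].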

Workaround

Remove schedulerName: neuron-scheduler from pod specs and use the default Kubernetes scheduler, which correctly handles aws.amazon.com/neuron extended resource allocation. The neuron scheduler is only needed for topology-aware scheduling on multi-device instances (for example, trn1.32xlarge). Alternatively, terminate the node and let the machine pool replace it with a fresh instance.
