Component
neuron-scheduler (Kubernetes Scheduler Extension)
Version
2.29.94.0 (also reproduced on 2.24.23.0)
Environment
Red Hat OpenShift (ROSA) 4.21, Kubernetes 1.34.4, inf2.8xlarge (1 Neuron device, 2 NeuronCores), neuron-device-plugin 2.29.94.0
Description
The neuron scheduler extension initializes its neuronDevUsageMap with all devices marked as [true] (allocated) after an operator uninstall/reinstall cycle. This prevents any pod requesting aws.amazon.com/neuron from being scheduled via the neuron-scheduler.
On a fresh first-time install, the scheduler works correctly and pods are scheduled without issues. However, after uninstalling the operator (including namespace deletion) and reinstalling it on the same node, the scheduler extension marks all devices as used even though no pod is consuming them.
The device plugin pod runs on the neuron node (required by its KMM nodeSelector) but has zero aws.amazon.com/neuron or aws.amazon.com/neuroncore resource requests. The scheduler extension appears to count it as consuming a device during its initial pod sync on reinstall.
Steps to reproduce
- Deploy the neuron operator with DeviceConfig on a cluster with an inf2.8xlarge node
- Schedule a pod with schedulerName: neuron-scheduler and limits: aws.amazon.com/neuron: "1" — this works
- Wait for the pod to complete
- Uninstall the operator (delete the DeviceConfig, CSV, namespace)
- Reinstall the operator with the same configuration
- Create the same pod — it stays Pending with:
No available set of device of length 1 found. Current usage: [true]
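The reproduction pod can be sketched as follows (pod name and image are placeholders; any image that exits cleanly works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: neuron-test   # hypothetical name
spec:
  schedulerName: neuron-scheduler   # routes scheduling through the neuron scheduler extension
  restartPolicy: Never
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest   # placeholder image
      command: ["sleep", "30"]
      resources:
        limits:
          aws.amazon.com/neuron: "1"   # requests one Neuron device
```

On a fresh install this pod schedules and runs; after the uninstall/reinstall cycle it stays Pending with the error above.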
Evidence from scheduler extension logs
Pod Creating Event: neuron-device-plugin-pv29d-jj2gv
...
neuronDevUsageMap for resource:aws.amazon.com/neuron in node: ip-10-2-123-197 is [true]
available devices not found: No available set of device of length 1 found. Current usage: [true]
Confirmed clean state at time of failure
- Kubelet checkpoint: PodDeviceEntries: null (no allocations)
- Node reports: allocatable aws.amazon.com/neuron: 1, capacity: 1
- No pod in the cluster has aws.amazon.com/neuron in resource requests/limits
- Device plugin logs: enableCustomSchedulerForPartialResourcesAllocation: false
Additional observations
- First-time install on a fresh node works correctly
- After uninstall/reinstall on the same node, the issue appears consistently
- Terminating the EC2 instance and letting the machine pool replace it resolves the issue (fresh node)
- Restarting the scheduler extension pod does not resolve the issue
- Clearing the kubelet device checkpoint and restarting kubelet does not resolve the issue if the scheduler extension is already running
- Using the default Kubernetes scheduler with aws.amazon.com/neuron resource requests works correctly in all cases
Expected behavior
The scheduler extension should correctly initialize its device usage map after an operator reinstall, marking devices as free when no pod is consuming them.
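For illustration, the intended initialization logic can be sketched as below. This is a minimal hypothetical sketch, not the actual neuron-scheduler code; Pod and buildUsageMap are invented names. The point is that a pod with no aws.amazon.com/neuron request (such as the device plugin pod) must not mark a device as used:

```go
package main

import "fmt"

// Pod is a minimal stand-in for a scheduled pod; Requests maps extended
// resource names to requested counts. Hypothetical types for illustration.
type Pod struct {
	Name     string
	Requests map[string]int
}

// buildUsageMap marks one device slot per requested Neuron device and
// leaves devices free for pods that request no aws.amazon.com/neuron
// resources at all.
func buildUsageMap(pods []Pod, deviceCount int) []bool {
	usage := make([]bool, deviceCount)
	next := 0
	for _, p := range pods {
		n := p.Requests["aws.amazon.com/neuron"]
		for i := 0; i < n && next < deviceCount; i++ {
			usage[next] = true
			next++
		}
	}
	return usage
}

func main() {
	// A device-plugin pod with no neuron requests must not consume a device.
	pods := []Pod{{Name: "neuron-device-plugin", Requests: map[string]int{}}}
	fmt.Println(buildUsageMap(pods, 1)) // prints [false]
}
```

Under this sketch, the observed behavior corresponds to the map being built as [true] despite no pod carrying a neuron request.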
Workaround
Remove schedulerName: neuron-scheduler from pod specs and use the default Kubernetes scheduler. The default scheduler correctly handles aws.amazon.com/neuron extended resource allocation. The neuron scheduler is only needed for topology-aware scheduling on multi-device instances (trn1.32xlarge). Alternatively, terminate the node and let the machine pool replace it with a fresh instance.
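The workaround pod spec is the same as the reproduction pod with schedulerName removed (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: neuron-test   # hypothetical name
spec:
  # No schedulerName: the default Kubernetes scheduler handles the
  # aws.amazon.com/neuron extended resource correctly.
  restartPolicy: Never
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest   # placeholder image
      command: ["sleep", "30"]
      resources:
        limits:
          aws.amazon.com/neuron: "1"
```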