Neuron scheduler extension initializes device usage map as [true] after operator uninstall/reinstall

## Component
neuron-scheduler (Kubernetes Scheduler Extension)

## Version
2.29.94.0 (also reproduced on 2.24.23.0)

## Environment
Red Hat OpenShift (ROSA) 4.21, Kubernetes 1.34.4, inf2.8xlarge (1 Neuron device, 2 NeuronCores), neuron-device-plugin 2.29.94.0

## Description

The neuron scheduler extension initializes its `neuronDevUsageMap` with all devices marked as `[true]` (allocated) after an operator uninstall/reinstall cycle. This prevents any pod requesting `aws.amazon.com/neuron` from being scheduled via the `neuron-scheduler`.

On a fresh first-time install, the scheduler works correctly and pods are scheduled without issues. However, after uninstalling the operator (including namespace deletion) and reinstalling it on the same node, the scheduler extension marks all devices as used even though no pod is consuming them.

The device plugin pod runs on the neuron node (required by its KMM nodeSelector) but has zero `aws.amazon.com/neuron` or `aws.amazon.com/neuroncore` resource requests. The scheduler extension appears to count it as consuming a device during its initial pod sync on reinstall.

## Steps to reproduce

1. Deploy the neuron operator with DeviceConfig on a cluster with an inf2.8xlarge node
2. Schedule a pod with `schedulerName: neuron-scheduler` and a resource limit of `aws.amazon.com/neuron: "1"`. This works.
3. Wait for the pod to complete
4. Uninstall the operator (delete the DeviceConfig, CSV, namespace)
5. Reinstall the operator with the same configuration
6. Create the same pod — it stays Pending with: `No available set of device of length 1 found. Current usage: [true]`
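
For reference, the pod from steps 2 and 6 can be sketched roughly as below. This is a minimal Python sketch that emits a JSON manifest. Only `schedulerName` and the `aws.amazon.com/neuron` limit come from this report. The pod name, container name, image, and `restartPolicy` are placeholders assumed for illustration.

```python
import json


def neuron_test_pod(name="neuron-test", image="registry.example.com/neuron-test:latest"):
    """Build a minimal pod manifest matching steps 2 and 6.

    The default name and image are hypothetical placeholders,
    not values taken from the report.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Route the pod through the neuron scheduler extension.
            "schedulerName": "neuron-scheduler",
            # Assumption: a run-to-completion pod, so step 3 can wait for it.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "resources": {"limits": {"aws.amazon.com/neuron": "1"}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(neuron_test_pod(), indent=2))
```

Assuming the script is saved as `pod.py` (a hypothetical file name), the output can be piped into `kubectl apply -f -`.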

## Evidence from scheduler extension logs

```
Pod Creating Event: neuron-device-plugin-pv29d-jj2gv
...
neuronDevUsageMap for resource:aws.amazon.com/neuron in node: ip-10-2-123-197 is [true]
available devices not found: No available set of device of length 1 found. Current usage: [true]
```

## Confirmed clean state at time of failure

- Kubelet checkpoint: `PodDeviceEntries: null` (no allocations)
- Node reports: `allocatable aws.amazon.com/neuron: 1`, `capacity: 1`
- No pod in the cluster has `aws.amazon.com/neuron` in resource requests/limits
- Device plugin logs: `enableCustomSchedulerForPartialResourcesAllocation: false`
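
The "no pod requests Neuron resources" check above can be repeated with a small helper like the following. It is a sketch, not tooling shipped with the operator: the function name is hypothetical, and it assumes its input is the parsed output of `kubectl get pods --all-namespaces -o json`.

```python
NEURON_RESOURCES = ("aws.amazon.com/neuron", "aws.amazon.com/neuroncore")


def pods_requesting_neuron(pod_list):
    """Return (namespace, pod, section, resource, quantity) tuples for every
    container that names a Neuron resource with a non-zero request or limit."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for container in containers:
            resources = container.get("resources") or {}
            for section in ("requests", "limits"):
                for key, qty in (resources.get(section) or {}).items():
                    if key in NEURON_RESOURCES and str(qty) != "0":
                        hits.append(
                            (meta.get("namespace"), meta.get("name"), section, key, str(qty))
                        )
    return hits
```

An empty result from `pods_requesting_neuron(json.load(sys.stdin))` corresponds to the clean state described in the list above.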

## Additional observations

- First-time install on a fresh node works correctly
- After uninstall/reinstall on the same node, the issue appears consistently
- Terminating the EC2 instance and letting the machine pool replace it resolves the issue (fresh node)
- Restarting the scheduler extension pod does not resolve the issue
- Clearing the kubelet device checkpoint and restarting kubelet does not resolve the issue if the scheduler extension is already running
- Using the default Kubernetes scheduler with `aws.amazon.com/neuron` resource requests works correctly in all cases

## Expected behavior

The scheduler extension should correctly initialize its device usage map after an operator reinstall, marking devices as free when no pod is consuming them.
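
To make the expectation concrete, here is a simplified Python model of the initial state I would expect. It is not the scheduler extension's actual code, which I have not inspected. The function names are hypothetical, and the model does not track which device index a pod holds.

```python
NEURON_RESOURCE = "aws.amazon.com/neuron"
TERMINAL_PHASES = {"Succeeded", "Failed"}


def neuron_limit(pod):
    """Sum the whole-device Neuron limits across a pod's containers."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        total += int(limits.get(NEURON_RESOURCE, 0))
    return total


def initial_usage_map(device_count, pods_on_node):
    """Return one boolean per device: True only for devices claimed by a
    live (non-terminal) pod that actually requests Neuron devices."""
    in_use = sum(
        neuron_limit(pod)
        for pod in pods_on_node
        if pod.get("status", {}).get("phase") not in TERMINAL_PHASES
    )
    # Simplification: device indices are not modeled, so the first
    # `in_use` slots are marked as allocated.
    return [index < in_use for index in range(device_count)]
```

Under this model, a node with one device whose only pod is the device plugin (no Neuron requests) yields `[False]`, not the `[true]` seen in the logs.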

## Workaround

Remove `schedulerName: neuron-scheduler` from pod specs and use the default Kubernetes scheduler, which correctly handles `aws.amazon.com/neuron` extended resource allocation. The neuron scheduler is only needed for topology-aware scheduling on multi-device instances (for example, trn1.32xlarge).

Alternatively, terminate the node and let the machine pool replace it with a fresh instance.

Neuron scheduler extension initializes device usage map as [true] after operator uninstall/reinstall #1300
