
[Feature] Fleet: Auto-upgrade is not retroactive and does not reconcile clusters to latest patch version #5741

@sphtd

Is your feature request related to a problem? Please describe.

Auto-upgrade in Azure Kubernetes Fleet Manager is currently event-driven rather than state-reconciling.
If an auto-upgrade profile is created after a Kubernetes patch version has already been released, no updaterun is triggered. As a result, clusters can remain on outdated patch versions indefinitely, until the next version release happens to produce an event.

Example:

Cluster version: 1.33.5
Available versions: 1.33.6, 1.33.7
Auto-upgrade profile created after 1.33.7 rollout
Result: no upgrade happens

Additionally:

If a cluster is stopped during its maintenance window, the upgrade is missed and is not retried
There is no mechanism to “catch up” to the latest allowed patch version

This leads to non-deterministic behavior and breaks the expectation of automatic patching.

Describe the solution you'd like

Auto-upgrade should behave as a desired-state reconciliation system, not a purely event-driven one.

Expected behavior:

When an auto-upgrade profile is created or updated:
  the system evaluates the current cluster version
  if the cluster is behind → trigger an updaterun immediately
The system should periodically reconcile (see the sketch after this list):
  if cluster version < latest allowed by policy → trigger an upgrade
If an upgrade fails (e.g. the cluster is stopped):
  retry within the maintenance window until successful
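
A minimal sketch of the requested reconciliation semantics, in plain Python. Everything here is illustrative: `Cluster`, `reconcile`, and the printed updaterun trigger are stand-ins, not Fleet APIs. The key property is that convergence depends only on observed state, never on having seen a release event:

```python
import time
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    version: str          # current Kubernetes patch version, e.g. "1.33.5"
    running: bool = True  # a stopped cluster cannot be upgraded

def parse(v: str) -> tuple[int, ...]:
    """'1.33.5' -> (1, 33, 5), so tuples compare in version order."""
    return tuple(int(p) for p in v.split("."))

def reconcile(cluster: Cluster, allowed: list[str]) -> None:
    """Converge one cluster to the latest patch version allowed by policy."""
    target = max(allowed, key=parse)
    if parse(cluster.version) >= parse(target):
        return  # already converged, nothing to do
    if not cluster.running:
        return  # missed window; the next pass retries automatically
    print(f"trigger updaterun: {cluster.name} {cluster.version} -> {target}")
    cluster.version = target  # stand-in for the real upgrade machinery

def reconcile_loop(clusters: list[Cluster], allowed: list[str],
                   passes: int = 3, interval: float = 1.0) -> None:
    """Periodic reconciliation: a cluster that is still behind is picked
    up again on every pass, so missed events and failed upgrades retry
    without any extra bookkeeping."""
    for _ in range(passes):
        for c in clusters:
            reconcile(c, allowed)
        time.sleep(interval)

# Mirrors the example above: the profile appears after 1.33.7 already shipped.
fleet = [Cluster("prod-a", "1.33.5"), Cluster("prod-b", "1.33.5", running=False)]
reconcile_loop(fleet, ["1.33.6", "1.33.7"])
```

Note that the stopped cluster (`prod-b`) simply stays behind and is retried on the next pass; once started, it converges with no special-case logic.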

Optional configuration:

reconcileOnCreate: true
catchUpToLatestPatch: true
retryUntilSuccess: true
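
Sketched as options on the profile, the knobs could gate triggering roughly like this. Field and function names mirror the proposed keys and are purely hypothetical, not an existing Fleet schema:

```python
from dataclasses import dataclass

@dataclass
class AutoUpgradeOptions:
    """Proposed knobs from above; names illustrative, not an existing Fleet API."""
    reconcile_on_create: bool = True       # evaluate versions when the profile is created/updated
    catch_up_to_latest_patch: bool = True  # converge to latest allowed patch, not just new releases
    retry_until_success: bool = True       # keep retrying inside the maintenance window

def should_trigger(opts: AutoUpgradeOptions, behind: bool,
                   profile_just_created: bool, prior_attempt_failed: bool) -> bool:
    """How the proposed knobs would gate an updaterun."""
    if profile_just_created:
        return behind and opts.reconcile_on_create
    if prior_attempt_failed:
        return behind and opts.retry_until_success
    return behind and opts.catch_up_to_latest_patch
```
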
Describe alternatives you've considered

Current workarounds:
Manually triggering upgrades via CLI/API
Changing upgrade policy to force re-evaluation
Building custom automation (pipelines/scripts) to detect and upgrade outdated clusters (see the sketch below)
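
For concreteness, the scripted workaround typically ends up looking like the sketch below: poll with the Azure CLI, compare versions, upgrade whatever is behind. Assumptions to note: the `kubernetesVersion` and `resourceGroup` fields as returned by `az aks list`, and a per-cluster `az aks upgrade` rather than a Fleet updaterun.

```python
import json
import subprocess

TARGET = "1.33.7"  # latest patch version allowed by policy

def az(*args: str):
    """Run an Azure CLI command and parse its JSON output."""
    out = subprocess.run(["az", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def parse(v: str) -> tuple[int, ...]:
    return tuple(int(p) for p in v.split("."))

# Detect: list clusters and find any behind the target patch version.
clusters = az("aks", "list")
behind = [c for c in clusters if parse(c["kubernetesVersion"]) < parse(TARGET)]

# Upgrade: manually do what auto-upgrade was expected to do. A Fleet setup
# would create an updaterun instead of calling `az aks upgrade` per cluster.
for c in behind:
    print(f'upgrading {c["name"]} -> {TARGET}')
    subprocess.run(["az", "aks", "upgrade",
                    "--resource-group", c["resourceGroup"],
                    "--name", c["name"],
                    "--kubernetes-version", TARGET,
                    "--yes"], check=True)
```

This is exactly the operational overhead the list above describes: the script must itself run on a schedule, handle stopped clusters, and be maintained as the CLI evolves.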

These approaches:
add operational overhead
defeat the purpose of “auto-upgrade”
introduce additional failure points

Additional context

Current behavior depends on:
timing of Kubernetes version rollout
timing of auto-upgrade profile creation
cluster availability during maintenance window

This makes the system non-deterministic:

two identical clusters may end up on different versions depending on timing

The system already supports manually triggering an updaterun, which shows that the capability exists; it is simply not used automatically for reconciliation.

Expected behavior for users:
clusters should converge to the latest allowed version defined by policy, regardless of when the profile was created or whether an event was missed.

Metadata

Labels: Fleet (Azure Kubernetes Fleet Manager, multi-cluster management), Upgrade, feature-request (Requested Features)
Status: Backlog
Milestone: none