Track preserved failed machines in MCS and MCD status #1092

Merged
gardener-prow[bot] merged 12 commits into gardener:master from thiyyakat:mcd-preservation
Apr 27, 2026
@thiyyakat (Member) commented Apr 6, 2026

What this PR does / why we need it:
This PR introduces two changes:

  1. It tracks the number of preserved failed machines of each MachineSet (MCS) and MachineDeployment (MCD) in a new field, preservedFailedReplicas, added to the MCS and MCD Status.
  2. It prevents the MCD from being marked as unhealthy by excluding preserved failed machines from failedMachine in the MCS and MCD Status. This also prevents shoot reconciliation from getting stuck.
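
The two changes above can be illustrated with a minimal sketch. It is not the PR's code: the `Machine` and `StatusCounts` types and the `summarize` helper are hypothetical stand-ins for the MCM API types, and the `Preserved` flag is an assumed simplification of however preservation is actually recorded. Under those assumptions, a failed machine that is preserved is counted separately and kept out of the failed list.

```go
package main

import "fmt"

// Machine is a simplified, hypothetical stand-in for an MCM machine object.
type Machine struct {
	Name      string
	Failed    bool // the machine is in the Failed phase
	Preserved bool // the machine is currently preserved (assumed flag)
}

// StatusCounts is a hypothetical subset of the MCS/MCD status.
type StatusCounts struct {
	PreservedFailedReplicas int32    // failed machines that are preserved
	FailedMachines          []string // failed machines that are NOT preserved
}

// summarize counts preserved failed machines separately and excludes them
// from the list of failed machines.
func summarize(machines []Machine) StatusCounts {
	var out StatusCounts
	for _, m := range machines {
		if !m.Failed {
			continue
		}
		if m.Preserved {
			out.PreservedFailedReplicas++
			continue
		}
		out.FailedMachines = append(out.FailedMachines, m.Name)
	}
	return out
}

func main() {
	counts := summarize([]Machine{
		{Name: "machine-a", Failed: true, Preserved: true},
		{Name: "machine-b", Failed: true},
		{Name: "machine-c"},
	})
	fmt.Printf("preservedFailedReplicas=%d failedMachines=%v\n",
		counts.PreservedFailedReplicas, counts.FailedMachines)
}
```

With this split, a health check that only looks at the failed list would no longer see preserved failed machines, which matches the intent described in point 2.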

Additionally, the PR modifies the sorting logic for ActiveMachines: when two machines are both preserved, one auto-preserved and the other explicitly preserved through an annotation by a user or operator, the explicitly preserved machine is de-prioritized for deletion.
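
A rough sketch of that tie-break, assuming a hypothetical `PreservedMachine` type and a hypothetical `sortForDeletion` helper (neither is taken from the MCM code base). It only orders machines that are already preserved, and places auto-preserved machines ahead of explicitly preserved ones so that they would be picked for deletion first.

```go
package main

import (
	"fmt"
	"sort"
)

// PreservedMachine is a hypothetical, simplified view of a preserved machine.
type PreservedMachine struct {
	Name                string
	ExplicitlyPreserved bool // true if preserved via annotation by a user/operator
}

// sortForDeletion orders preserved machines so that auto-preserved machines
// come first (deleted first) and explicitly preserved machines come last.
// The sort is stable, so the existing order is kept within each group.
func sortForDeletion(machines []PreservedMachine) {
	sort.SliceStable(machines, func(i, j int) bool {
		return !machines[i].ExplicitlyPreserved && machines[j].ExplicitlyPreserved
	})
}

func main() {
	machines := []PreservedMachine{
		{Name: "m1", ExplicitlyPreserved: true},
		{Name: "m2", ExplicitlyPreserved: false},
		{Name: "m3", ExplicitlyPreserved: true},
	}
	sortForDeletion(machines)
	for _, m := range machines {
		fmt.Println(m.Name, m.ExplicitlyPreserved)
	}
}
```

The real ActiveMachines ordering considers more criteria than this; the sketch isolates only the auto-preserved versus explicitly preserved comparison.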

The usage doc for preservation has been updated with a warning about the behaviour of DWD when the number of preserved failed machines exceeds the threshold set for DWD.

Which issue(s) this PR fixes:
Extends #1008

Special notes for your reviewer:
MCM unit and integration tests pass with the changes.
The changes were manually tested using the virtual provider. The log below shows the MCD status when it has 1 preserved failed replica:

status:
  conditions:
  - lastTransitionTime: "2026-04-06T07:20:38Z"
    lastUpdateTime: "2026-04-06T07:20:58Z"
    message: MachineSet "shoot--i749592--preserve-pr-preserve-pr-cpu-z1-5cc55" has
      successfully progressed.
    reason: NewMachineSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2026-04-06T11:27:16Z"
    lastUpdateTime: "2026-04-06T11:27:16Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  preservedFailedReplicas: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1

Release note:

Track preserved failed machines in the field `preservedFailedReplicas` in the Status of MCS and MCD, and prevent these machines from causing the MCD to be marked as unhealthy.

@thiyyakat thiyyakat requested a review from a team as a code owner April 6, 2026 11:22
@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 6, 2026
Comment thread pkg/controller/deployment_machineset_util.go Outdated
Comment thread pkg/controller/deployment_machineset_util.go Outdated
Comment thread pkg/controller/deployment_machineset_util.go Outdated
Comment thread pkg/controller/deployment_machineset_util.go
Comment thread pkg/controller/deployment_util.go Outdated
return totalAvailableReplicas
}

// GetPreservedFailedReplicaCountForMachineSets returns the number of available machines corresponding to the given machine sets.
Member

Docstring needs to be corrected
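
For illustration, one way the docstring could read once it describes preserved failed machines rather than available ones. Only the function name comes from the diff context above; the `MachineSet` and `MachineSetStatus` types here are trimmed-down hypothetical stand-ins, and the function body is an assumption, not the code from this PR.

```go
package main

import "fmt"

// MachineSetStatus is a hypothetical, trimmed-down status type.
type MachineSetStatus struct {
	PreservedFailedReplicas int32
}

// MachineSet is a hypothetical, trimmed-down machine set type.
type MachineSet struct {
	Status MachineSetStatus
}

// GetPreservedFailedReplicaCountForMachineSets returns the number of preserved
// failed machines corresponding to the given machine sets.
func GetPreservedFailedReplicaCountForMachineSets(machineSets []*MachineSet) int32 {
	total := int32(0)
	for _, ms := range machineSets {
		if ms != nil { // skip nil entries defensively (assumed behaviour)
			total += ms.Status.PreservedFailedReplicas
		}
	}
	return total
}

func main() {
	sets := []*MachineSet{
		{Status: MachineSetStatus{PreservedFailedReplicas: 1}},
		nil,
		{Status: MachineSetStatus{PreservedFailedReplicas: 2}},
	}
	fmt.Println(GetPreservedFailedReplicaCountForMachineSets(sets))
}
```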

Comment thread pkg/controller/deployment_sync.go Outdated
Comment thread pkg/apis/machine/v1alpha1/machinedeployment_types.go
@gardener-prow gardener-prow Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2026
@thiyyakat thiyyakat force-pushed the mcd-preservation branch 2 times, most recently from 3ef4b96 to a47f9a5 Compare April 14, 2026 13:20
@gardener-prow gardener-prow Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 16, 2026
@thiyyakat thiyyakat requested a review from gagan16k April 16, 2026 04:58
@thiyyakat thiyyakat added the kind/enhancement Enhancement, improvement, extension label Apr 17, 2026
@gardener-prow gardener-prow Bot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Apr 17, 2026
Comment thread pkg/controller/controller_utils.go Outdated
@gardener gardener deleted a comment from thiyyakat Apr 21, 2026
@thiyyakat thiyyakat requested a review from r4mek April 22, 2026 03:50
@r4mek (Contributor) commented Apr 22, 2026

/lgtm

@gardener-prow gardener-prow Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2026
@gardener-prow (Bot) commented Apr 22, 2026

LGTM label has been added.

Git tree hash: b6085293a87bf4863ad9b9ac74d7330449a41f91

@gardener-prow gardener-prow Bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm Indicates that a PR is ready to be merged. labels Apr 23, 2026
@gardener-prow gardener-prow Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2026
@gardener-prow gardener-prow Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2026
@gardener-prow (Bot) commented Apr 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: takoverflow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow (Bot) commented Apr 27, 2026

LGTM label has been added.

Git tree hash: a15d532b63e1e166b69f04c18388a3251ced2c0d

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants