HIVE-2391: vsphere zonal #2851

Open
2uasimojo wants to merge 2 commits into openshift:master from 2uasimojo:HIVE-2391/vsphere-zonal

Conversation

@2uasimojo
Member

@2uasimojo 2uasimojo commented Feb 10, 2026

Co-Authored-By: @dlom

Summary by CodeRabbit

  • New Features

    • Multi-vCenter support and new Infrastructure-based vSphere platform (including richer failure-domain/topology and machine-pool disk/zone options).
    • CLI: support for an installer platform JSON input and multi-vCenter flags.
  • Deprecations

    • Legacy per-field vSphere settings (vCenter, datacenter, datastore, folder, cluster, network, etc.) deprecated in favor of Infrastructure/VCenters.
  • Documentation

    • Updated docs and examples, secret format now supports a vcenters list and migration guidance included.

@openshift-ci openshift-ci bot requested review from dlom and suhanime February 10, 2026 22:10
@openshift-ci
Contributor

openshift-ci bot commented Feb 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2026
@2uasimojo 2uasimojo changed the title Hive 2391: vsphere zonal HIVE2391: vsphere zonal Feb 10, 2026
@2uasimojo 2uasimojo changed the title HIVE2391: vsphere zonal HIVE-2391: vsphere zonal Feb 10, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 10, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 10, 2026

@2uasimojo: This pull request references HIVE-2391 which is a valid jira issue.


@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 26.71480% with 203 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.24%. Comparing base (910c602) to head (d3042d8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| contrib/pkg/createcluster/create.go | 0.00% | 78 Missing ⚠️ |
| pkg/controller/utils/credentials.go | 0.00% | 33 Missing ⚠️ |
| .../clusterdeployment/clusterdeployment_controller.go | 0.00% | 20 Missing and 1 partial ⚠️ |
| contrib/pkg/deprovision/vsphere.go | 0.00% | 15 Missing ⚠️ |
| pkg/creds/vsphere/vsphere.go | 0.00% | 15 Missing ⚠️ |
| pkg/installmanager/installmanager.go | 0.00% | 13 Missing ⚠️ |
| ...g/controller/clusterpool/clusterpool_controller.go | 0.00% | 11 Missing and 1 partial ⚠️ |
| pkg/install/generate.go | 0.00% | 6 Missing ⚠️ |
| pkg/clusterresource/vsphere.go | 83.33% | 4 Missing ⚠️ |
| pkg/controller/utils/vsphereutils/vsphere.go | 77.77% | 2 Missing and 2 partials ⚠️ |

... and 1 more
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2851      +/-   ##
==========================================
- Coverage   50.31%   50.24%   -0.07%     
==========================================
  Files         280      281       +1     
  Lines       34314    34347      +33     
==========================================
- Hits        17264    17258       -6     
- Misses      15689    15733      +44     
+ Partials     1361     1356       -5     
| Files with missing lines | Coverage Δ |
|---|---|
| pkg/constants/constants.go | 100.00% <ø> (ø) |
| ...oller/clusterdeployment/installconfigvalidation.go | 100.00% <100.00%> (ø) |
| ...g/controller/machinepool/machinepool_controller.go | 64.33% <100.00%> (+0.52%) ⬆️ |
| pkg/controller/machinepool/vsphereactuator.go | 77.14% <100.00%> (+3.57%) ⬆️ |
| ...hift/hive/apis/hive/v1/clusterdeprovision_types.go | 0.00% <ø> (ø) |
| .../v1/clusterdeployment_validating_admission_hook.go | 86.29% <87.50%> (+1.56%) ⬆️ |
| pkg/clusterresource/vsphere.go | 92.95% <83.33%> (+2.83%) ⬆️ |
| pkg/controller/utils/vsphereutils/vsphere.go | 77.77% <77.77%> (ø) |
| pkg/install/generate.go | 45.56% <0.00%> (-0.29%) ⬇️ |
| ...g/controller/clusterpool/clusterpool_controller.go | 58.83% <0.00%> (-0.22%) ⬇️ |

... and 6 more

@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 29af27f to 5249404 Compare February 11, 2026 18:20
@2uasimojo
Member Author

@jianping-shu this passed e2e-vsphere, so I reckon it's probably ready for you to take another stab at it!

@2uasimojo
Member Author

/hold

Looks like I missed refactoring the preflight auth check for the new creds shape.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2026
@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 5249404 to 39cf13e Compare February 13, 2026 20:21
@2uasimojo
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2026
@dlom
Contributor

dlom commented Feb 18, 2026

The new multi-creds changes LGTM. My only concern (as noted in review comments) is that there are some additional fields (at least Folder, potentially more) on the machinepool that end users may want to use

@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 39cf13e to 896e851 Compare February 19, 2026 22:49
@2uasimojo
Member Author

/hold for QE

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2026
@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 896e851 to 7996e22 Compare February 26, 2026 15:21
@2uasimojo
Member Author

2uasimojo commented Feb 26, 2026

/test e2e security

e2e: infra flake
security: upstream bug with the packageURL check again (though possibly slightly different this time).

@2uasimojo
Member Author

It's acceptable to override the security job (buggy upstream tool) if necessary when this is ready to merge.

@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 7996e22 to 066a4e4 Compare February 27, 2026 16:46
@2uasimojo
Member Author

> I had a try on the latest commit and image

Well, I thought I had fixed it for hiveutil and clusterpool with this, but, rookie mistake, I'm assigning into the loop variable which is a local copy 🙄
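
For readers unfamiliar with this pitfall, here is a minimal sketch of the loop-variable mistake described above. The `vcenter` type and both function names are hypothetical stand-ins, not the actual Hive types.

```go
package main

import "fmt"

// vcenter is a hypothetical stand-in for a per-vCenter config entry.
type vcenter struct {
	Server   string
	Password string
}

// scrubByValue does NOT clear anything in the caller's slice:
// v is a copy of each element, so the assignment is lost.
func scrubByValue(vcs []vcenter) {
	for _, v := range vcs {
		v.Password = "" // modifies the local copy only
	}
}

// scrubByIndex clears the field on the slice elements themselves.
func scrubByIndex(vcs []vcenter) {
	for i := range vcs {
		vcs[i].Password = ""
	}
}

func main() {
	vcs := []vcenter{{Server: "vc1.example.com", Password: "secret"}}

	scrubByValue(vcs)
	fmt.Printf("after scrubByValue: %q\n", vcs[0].Password)

	scrubByIndex(vcs)
	fmt.Printf("after scrubByIndex: %q\n", vcs[0].Password)
}
```

Indexing into the slice (or ranging over a slice of pointers) is the usual way to make the write stick.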

I'm going to fix this... but your comment also made me think we should also try to prevent the user from handing us a CD with passwords populated. It's a bit awkward because we can't modify the schema to exclude those fields, or even add a docstring indicating they should not be used, without making our own copy. The best we can do is a webhook, which should at least prevent the sensitive data from getting into etcd.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2026
@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 183ef5e to de00368 Compare March 23, 2026 18:09
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2026
@2uasimojo
Member Author

This one is just a rebase. Fixes incoming...

@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from de00368 to afded4e Compare March 23, 2026 18:56
@2uasimojo
Member Author

Okay @jianping-shu, this rev should have fixed the following:

  • It should be impossible for you to get credentials into the CD or ClusterPool manifest produced by hiveutil with -o yaml when providing the creds via env vars.
  • If you manually inject the creds in the CD/ClusterPool before attempting to create the CR, you should be bounced by webhook.
  • Providing creds in the json object via hiveutil should hit one of the above two blockers -- offhand I'm not sure which.
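
As an illustration of the second bullet, the webhook-side check could look roughly like the sketch below. This is not the code in the PR: `platformSpec`, `vcenterSpec`, and `validateNoInlinePasswords` are hypothetical names, and the real validation operates on the upstream installer types.

```go
package main

import (
	"errors"
	"fmt"
)

// vcenterSpec and platformSpec are hypothetical, simplified stand-ins
// for the vSphere platform section of a CD/ClusterPool spec.
type vcenterSpec struct {
	Server   string
	Username string
	Password string
}

type platformSpec struct {
	DeprecatedPassword string
	VCenters           []vcenterSpec
}

// validateNoInlinePasswords rejects any spec that carries a password,
// so that sensitive data never gets persisted with the CR.
func validateNoInlinePasswords(p platformSpec) error {
	if p.DeprecatedPassword != "" {
		return errors.New("platform password must not be set in the spec; use the credentials secret")
	}
	for _, vc := range p.VCenters {
		if vc.Password != "" {
			return fmt.Errorf("vcenter %q: password must not be set in the spec; use the credentials secret", vc.Server)
		}
	}
	return nil
}

func main() {
	clean := platformSpec{VCenters: []vcenterSpec{{Server: "vc1.example.com"}}}
	dirty := platformSpec{VCenters: []vcenterSpec{{Server: "vc1.example.com", Password: "secret"}}}

	fmt.Println("clean:", validateNoInlinePasswords(clean))
	fmt.Println("dirty:", validateNoInlinePasswords(dirty))
}
```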

@jianping-shu
Contributor

@2uasimojo Here is the test result for today.

> It should be impossible for you to get credentials into the CD or ClusterPool manifest produced by hiveutil with -o yaml when providing the creds via env vars.

hiveutil doesn't support ClusterPool for vsphere/openstack so far, so there is no issue here.
I tested and updated "Case 3 - hiveutil" of OCP-84265; basically, the password is reset to "" for the applicable cases.
Considering that both username and password are REDACTED in metadata.json, do you think we should reset the username to "" too?

> If you manually inject the creds in the CD/ClusterPool before attempting to create the CR, you should be bounced by webhook.

Tested with "Case 3a - CD/CP yaml shall not contain password" of OCP-84265:
when CD.spec has a non-null password (single vCenter or multiple), the yaml is denied by the webhook logic.
But when CP.spec has a non-null password, the webhook logic didn't work and the yaml was still applied to the cluster successfully.

@2uasimojo
Member Author

> But when CP.spec has a non-null password, the webhook logic didn't work and the yaml was still applied to the cluster successfully.

Well that's pretty weird. The call stack should be:
Validate() calls
validateCreate() calls
validateClusterPoolPlatform() calls
validatePlatformConfiguration() calls
my shiny new code.

I even added unit tests, which pass as written, and fail correctly when I comment out the new code.

I'll see if I can reproduce.

@2uasimojo
Member Author

> I'll see if I can reproduce.

Yup. Reproduced no problem. It, uh, turns out we're not installing the VWHC for clusterpool at all 😬

I wonder how long that's been the case, and how much of our validation has drifted broken in the interim.

I'll track this down and get back to you. Tempted to say we should fix it in a separate PR, but stay tuned.

@2uasimojo
Member Author

> I'll track this down and get back to you. Tempted to say we should fix it in a separate PR, but stay tuned.

Yeah, it looks like we might never have had this piece, which means we've not been validating ClusterPools, ever.

I've opened https://redhat.atlassian.net/browse/HIVE-3131 for this. I would indeed like to track this as a separate effort due to the risk: since we haven't been running this code (other than in unit tests!) there is a chance it will reject some things it shouldn't, regressing existing consumers' ability to use ClusterPools. So while the code change won't be hard, the test effort needs to be somewhat comprehensive.

Unfortunately, since the issue in question here is important for security, I think it would be prudent to block this PR until the above is resolved.

}
metadata.VSphere.VCenters = vcenters
}
creds.ConfigureCreds[utils.GetClusterPlatform(cd)](dynClient, nil)

Contributor

this nil should be metadata, right?

Member Author

Good catch!

@jianping-shu this should have failed the cleanupFailedProvision test?

Contributor

I did test the following case with the commit from before last Friday. The CD was eventually deleted successfully.
OCP-87278 Case 3: on an old hive version, install a vsphere CD; start deleting the CD (hangs due to cred failure); upgrade hive to the zonal version; the CD is then deleted successfully with the default destroy.

map[string][]byte{
constants.UsernameSecretKey: []byte(vsphereUsername),
constants.PasswordSecretKey: []byte(vspherePassword),
"vCenters": vcenterCredsb,

Contributor

creds/vsphere/vsphere.go and controller/utils/credentials.go both look for this key as "vcenters", not "vCenters"

Member Author

Well how the heck were they ever working??

Good eye.
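
To make the mismatch concrete: Go map keys (like Secret data keys) are case-sensitive, so a writer using "vCenters" and a reader using "vcenters" never meet. A minimal sketch follows, assuming a hypothetical shared constant `vcentersKey`; which spelling is the intended one is for the PR to settle, not this example.

```go
package main

import "fmt"

// vcentersKey is a hypothetical shared constant; the spelling chosen
// here is illustrative only. Using one constant on both the write and
// read paths avoids a case mismatch.
const vcentersKey = "vcenters"

// lookupVCenters reads the vcenters blob from secret-like data.
func lookupVCenters(data map[string][]byte) ([]byte, bool) {
	v, ok := data[vcentersKey]
	return v, ok
}

func main() {
	// Written with a differently-cased literal: the reader misses it.
	mismatched := map[string][]byte{"vCenters": []byte("payload")}
	_, ok := lookupVCenters(mismatched)
	fmt.Println("found with mismatched key:", ok)

	// Written via the shared constant: the reader finds it.
	matched := map[string][]byte{vcentersKey: []byte("payload")}
	_, ok = lookupVCenters(matched)
	fmt.Println("found with shared constant:", ok)
}
```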

Comment on lines +113 to +115
if !cdVCenters.IsSuperset(credsVCenters) {
return false, fmt.Errorf("missing VSphere credentials for some configured VCenters: %q", sets.List(credsVCenters.Difference(cdVCenters)))
}

Contributor

Suggested change
- if !cdVCenters.IsSuperset(credsVCenters) {
-     return false, fmt.Errorf("missing VSphere credentials for some configured VCenters: %q", sets.List(credsVCenters.Difference(cdVCenters)))
- }
+ if !credsVCenters.IsSuperset(cdVCenters) {
+     return false, fmt.Errorf("missing VSphere credentials for some configured VCenters: %q", sets.List(cdVCenters.Difference(credsVCenters)))
+ }

I think these sets are backwards. As it stands, this fires when creds has vcenters not present on the CD

Member Author

Right you are.
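
The direction of the check is easy to get backwards, so here is a stdlib-only sketch of the intended semantics. It uses plain maps as a stand-in for the `sets` package used in the PR, and `missingCreds` is a hypothetical helper name: the error case is "configured on the CD but absent from the creds", while extra creds are harmless.

```go
package main

import (
	"fmt"
	"sort"
)

// missingCreds returns the configured vCenters for which no
// credentials were supplied (configured minus withCreds).
// Credentials for vCenters that are not configured are ignored.
func missingCreds(configured, withCreds []string) []string {
	have := make(map[string]struct{}, len(withCreds))
	for _, s := range withCreds {
		have[s] = struct{}{}
	}
	var missing []string
	for _, s := range configured {
		if _, ok := have[s]; !ok {
			missing = append(missing, s)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	configured := []string{"vc1.example.com", "vc2.example.com"}
	withCreds := []string{"vc1.example.com", "vc3.example.com"}

	if missing := missingCreds(configured, withCreds); len(missing) > 0 {
		fmt.Printf("missing VSphere credentials for some configured VCenters: %q\n", missing)
	}
}
```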

@@ -1325,15 +1342,7 @@ func (r *ReconcileClusterPool) createCloudBuilder(pool *hivev1.ClusterPool, logg
return nil, err

Contributor

This needs to return a new error (errors.New("vsphere certificates secret missing '.cacert' key")) rather than the existing err, which is always nil in this code path thanks to the if block immediately above.
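
A minimal sketch of the pattern being asked for, assuming a hypothetical helper `caCertFromSecret` (the real code lives inside createCloudBuilder and reads an actual Secret): when the key is absent, construct a fresh error instead of returning a stale, nil `err`.

```go
package main

import (
	"errors"
	"fmt"
)

// caCertFromSecret is a hypothetical helper that extracts the CA cert
// from secret-like data. A missing key yields a newly constructed
// error; returning an earlier, already-checked err here would return
// nil and hide the problem from the caller.
func caCertFromSecret(data map[string][]byte) ([]byte, error) {
	cert, ok := data[".cacert"]
	if !ok {
		return nil, errors.New("vsphere certificates secret missing '.cacert' key")
	}
	return cert, nil
}

func main() {
	_, err := caCertFromSecret(map[string][]byte{})
	fmt.Println("error:", err)
}
```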

}
// Scrub credentials
i := b.infrastructure.DeepCopy()
i.DeprecatedPassword = ""

Contributor

Thoughts on clearing usernames here too? (both the deprecated path and in the array of VCenters)

Member Author

Jianping brought up the same thing. I wasn't concerned about the username being sensitive; but I suppose for consistency with what we're doing elsewhere, it makes sense.

Oh, I just thought of a better reason: I think in some places we're only setting the username if it's unset. So there's a potential failure mode if the username is present but incorrect. (Though arguably that's a user error...)
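
A rough sketch of what scrubbing both usernames and passwords on a copy could look like. The `infrastructure` and `vcenter` types and the `DeprecatedUsername` field are hypothetical simplifications; the PR works on a DeepCopy of the real infrastructure object.

```go
package main

import "fmt"

// vcenter and infrastructure are hypothetical, simplified stand-ins.
type vcenter struct {
	Server   string
	Username string
	Password string
}

type infrastructure struct {
	DeprecatedUsername string
	DeprecatedPassword string
	VCenters           []vcenter
}

// scrubbed returns a copy of in with all usernames and passwords
// cleared. The VCenters slice is copied first so the caller's
// original object is left untouched.
func scrubbed(in infrastructure) infrastructure {
	out := in
	out.VCenters = make([]vcenter, len(in.VCenters))
	copy(out.VCenters, in.VCenters)

	out.DeprecatedUsername = ""
	out.DeprecatedPassword = ""
	for i := range out.VCenters {
		out.VCenters[i].Username = ""
		out.VCenters[i].Password = ""
	}
	return out
}

func main() {
	orig := infrastructure{
		DeprecatedUsername: "admin",
		DeprecatedPassword: "secret",
		VCenters:           []vcenter{{Server: "vc1.example.com", Username: "admin", Password: "secret"}},
	}
	clean := scrubbed(orig)
	fmt.Printf("scrubbed: %+v\n", clean)
	fmt.Printf("original still has password: %t\n", orig.VCenters[0].Password != "")
}
```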

// DeprecatedVCenter is the vSphere vCenter hostname.
// Deprecated: use VCenters instead.
// +optional
DeprecatedVCenter string `json:"vCenter"`

Contributor

Needs omitempty in the json tag (and VCenters too)

Member Author

Since neither is a pointer type and both have +optional, omitempty has no effect. I don't mind including the tag though.
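
Separately from the schema question (whether the field is treated as required), it may be worth noting what the tag does at plain `encoding/json` marshaling time for a non-pointer string. The sketch below uses two hypothetical structs and says nothing about how the CRD generator treats the tag.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Two hypothetical structs differing only in the omitempty option.
type withoutOmit struct {
	VCenter string `json:"vCenter"`
}

type withOmit struct {
	VCenter string `json:"vCenter,omitempty"`
}

// marshal is a small helper returning the JSON encoding as a string.
func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(withoutOmit{})) // {"vCenter":""}
	fmt.Println(marshal(withOmit{}))    // {}
}
```

So including the tag keeps empty deprecated fields out of serialized objects, even if it makes no difference to validation.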

}

for _, test := range tests {
t.Run(test.Name, func(t *testing.T) {

Contributor

Probably worth it to test this more vigorously, I bet the LLM could generate tons of theoretical test cases

Member Author

I'm a little meh about testing upstream functionality. Having a smoke test like this to make sure we're calling upstream correctly (or at all) is sane, but trying to cover all its paths seems like the responsibility of the repo in which it lives, no?

@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from afded4e to 870fdd5 Compare March 24, 2026 20:26
@2uasimojo
Member Author

Rebase only. Fixes pending.

Follow-on addressing review from openshift#2731.

MachinePool:
- Removed `Topology` override
- Restored ResourcePool and TagIDs overrides
- Removed `osImage` detected from arbitrary master; using whatever's
  passed through from FD Topology (which defaults sanely if unset).

Deprovision:
- Changed `--vsphere-vcenter` StringVar to `--vsphere-vcenters`
  StringSliceVar

Platform Creds:
- Redesigned to take `vcenters`, a list of vcenter
  server/username/password, matching (and unmarshaling into) the
  corresponding chunk of metadata.json.

Docs:
- Updated install-config sample to zonal shape.
- Documented new creds shape.
@2uasimojo 2uasimojo force-pushed the HIVE-2391/vsphere-zonal branch from 870fdd5 to d3042d8 Compare March 24, 2026 21:17
@2uasimojo
Member Author

Okay, fixes are in. Thanks for the review @dlom

@jianping-shu
Contributor

regression test in progress...

@2uasimojo
Member Author

/test e2e-vsphere

@openshift-ci
Contributor

openshift-ci bot commented Mar 25, 2026

@2uasimojo: all tests passed!


@2uasimojo
Member Author

> Unfortunately, since the issue in question here is important for security, I think it would be prudent to block this PR until the above is resolved.

That's merged.

Getting real close here. Currently just waiting on:

@jianping-shu
Contributor

jianping-shu commented Apr 1, 2026

@dlom I've finished the round of regression tests.
Here are 2 findings:

  1. For the vsphere credential secret, data.vCenters works but data.vcenters does NOT work.
    (1) data.vcenters worked for the previous commit but doesn't work for the latest commit.
    (2) In using-hive.md, the command sample still uses 'vcenters':
    oc create secret generic mycluster-vsphere-creds -n mynamespace --from-file=vcenters=/home/me/vsphere/vcenters.yaml
    (3) The vsphere credential secret generated by hiveutil uses 'vCenters':
- apiVersion: v1
  data:
    password: aaa
    username: bbb
    vCenters: ccc
  2. The admission validation doesn't work for clusterpool. This is OK since the HIVE-3131 change is NOT included in this PR yet.

I think we have 2 alternatives:
(A) Wait for Eric to review and address issue 1.
(B) If we want to merge the PR sooner, we could merge it now and create a follow-up PR to change the command in using-hive.md from 'vcenters' to 'vCenters' as a temporary solution.
Please check and comment.

cc @newtonheath @2uasimojo

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
