HIVE-3127: Add benchmark harness for resource/remoteclient optimization #2877

dlom wants to merge 1 commit into openshift:master
Conversation
The clustersync controller and others that depend on pkg/resource (Helper) and pkg/remoteclient have been repeatedly flagged by downstream users for:

- Slow reconciliation times
- Memory leaks during large SyncSet applications
- Excessive API server load (unnecessary round trips and discovery calls)

These packages are legacy code with significant behavioral dependencies from downstream consumers. We cannot rewrite them, but targeted optimizations (50-100 LOC changes) in specific hot paths could significantly improve performance without altering behavior.

Each benchmark runs against a real kube-apiserver (via envtest) and reports custom metrics for round-trip count and byte transfer, in addition to standard allocation and wall-time measurements.

Some preliminary findings:

- CreateOrUpdate SteadyState allocates ~79MB/op for a no-op -- 88x more than Apply for the same 2 round trips. Something isn't being cached properly.
- NewHelper construction costs 78MB and 205K allocs -- the dominant cost in most benchmarks. Caching the OpenAPI schema and REST mapper across operations would cut this dramatically.
- First Apply per package pays 2 extra discovery round trips (5 vs 3) due to disk-cache cold start. Pre-warming or sharing the discovery cache would eliminate the penalty.
- Hibernation takes 1 full second despite only 13 round trips and 1.3MB allocs. This is wall-clock bound, likely connection/TLS overhead in the BuildKubeClient path.

Assisted-by: Claude Opus 4.6
@dlom: This pull request references HIVE-3127, which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dlom

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
📝 Walkthrough

This PR introduces comprehensive benchmarking infrastructure for Hive, including envtest setup utilities, benchmark orchestration harnesses, HTTP round-trip tracking, and benchmark test suites for the `pkg/resource` and `pkg/remoteclient` packages.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes. This review requires careful attention to the benchmark infrastructure design.

🚥 Pre-merge checks: ✅ 3 passed.
```go
// BuilderOption configures optional behavior on a Builder.
type BuilderOption func(*builderOptions)
```
NTR: This new BuilderOption (and associated machinery) is the only new code in production. I tried to make this as non-invasive as possible: all the call sites that construct new Builders are unchanged, and this code no-ops in prod.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #2877      +/-   ##
==========================================
- Coverage   50.31%   49.80%    -0.51%
==========================================
  Files         280      288        +8
  Lines       34314    34666      +352
==========================================
+ Hits        17264    17266        +2
- Misses      15689    16037      +348
- Partials     1361     1363        +2
```
Actionable comments posted: 8
🧹 Nitpick comments (3)
test/benchutil/generators.go (1)
13-35: Consider de-duplicating payload/namespace defaults.
`GenerateConfigMap` and `GenerateSecret` repeat the same padding pattern and hardcoded namespace, which is a small maintainability hotspot.

♻️ Optional cleanup diff
```diff
 package benchutil

 import (
 	"strings"
@@
 	hivev1 "github.com/openshift/hive/apis/hive/v1"
 )

+const defaultNamespace = "default"
+
+func withPayload(base map[string]string, dataSize int) map[string]string {
+	if dataSize > 0 {
+		base["payload"] = strings.Repeat("x", dataSize)
+	}
+	return base
+}
+
 // GenerateConfigMap creates a ConfigMap with optional padding payload (0 = minimal).
 func GenerateConfigMap(name string, dataSize int) *corev1.ConfigMap {
-	data := map[string]string{"key": "value"}
-	if dataSize > 0 {
-		data["payload"] = strings.Repeat("x", dataSize)
-	}
+	data := withPayload(map[string]string{"key": "value"}, dataSize)
 	return &corev1.ConfigMap{
-		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
+		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: defaultNamespace},
 		Data: data,
 	}
 }
@@
 func GenerateSecret(name string, dataSize int) *corev1.Secret {
-	data := map[string]string{"password": "secret"}
-	if dataSize > 0 {
-		data["payload"] = strings.Repeat("x", dataSize)
-	}
+	data := withPayload(map[string]string{"password": "secret"}, dataSize)
 	return &corev1.Secret{
-		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
+		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: defaultNamespace},
 		StringData: data,
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchutil/generators.go` around lines 13-35, GenerateConfigMap and GenerateSecret duplicate the padding payload logic and hardcoded namespace; extract a small helper (e.g., buildPayloadMap(payloadSize int, base map[string]string) map[string]string) to return the base map plus optional "payload" and introduce a single defaultNamespace constant (e.g., defaultNamespace = "default") and/or a helper for ObjectMeta (e.g., newDefaultMeta(name string) metav1.ObjectMeta), then update GenerateConfigMap to call buildPayloadMap for Data and newDefaultMeta for ObjectMeta and update GenerateSecret to call the same buildPayloadMap for StringData and reuse newDefaultMeta; keep the function signatures the same.

test/benchutil/namespace.go (2)
35-37: Add a timeout to namespace creation calls.

At Line 35, `context.Background()` can hang benchmarks indefinitely if the API server stalls.

💡 Proposed fix
```diff
 import (
 	"context"
 	"fmt"
 	"sync"
 	"sync/atomic"
 	"testing"
+	"time"
@@
-	_, err := benchClientset.CoreV1().Namespaces().Create(context.Background(), &corev1.Namespace{
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	_, err := benchClientset.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
 		ObjectMeta: metav1.ObjectMeta{Name: name},
 	}, metav1.CreateOptions{})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchutil/namespace.go` around lines 35 - 37, Replace the use of context.Background() in the Namespace creation call (benchClientset.CoreV1().Namespaces().Create) with a cancellable context that has a deadline—use context.WithTimeout to create ctx, defer cancel(), and pass that ctx to Create so the call times out instead of hanging indefinitely; choose an appropriate timeout value for benchmarks and ensure cancel() is deferred immediately after creating the context to avoid leaks.
24-33: `sync.Once` ignores later `cfg` values.

At Line 26, the first config wins permanently; later calls with different `cfg` values will still use the original clientset.

💡 Guard against config mismatch
```diff
 var (
 	benchNSCounter  atomic.Int64
 	benchClientset  *kubernetes.Clientset
+	benchClientHost string
 	benchClientOnce sync.Once
 )
@@
 	benchClientOnce.Do(func() {
 		cs, err := kubernetes.NewForConfig(cfg)
 		if err != nil {
 			b.Fatalf("failed to create clientset: %v", err)
 		}
 		benchClientset = cs
+		benchClientHost = cfg.Host
 	})
+	if cfg.Host != benchClientHost {
+		b.Fatalf("BenchNamespace called with different config host: first=%s current=%s", benchClientHost, cfg.Host)
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchutil/namespace.go` around lines 24 - 33, BenchNamespace currently uses benchClientOnce so the first cfg wins forever; replace the sync.Once pattern with a small guarded init that allows different cfg values: add a package-level mutex (e.g., benchClientMu) and a stored lastCfg (e.g., benchClientCfg *rest.Config), then in BenchNamespace acquire the mutex, if benchClientset is nil or the new cfg differs from benchClientCfg (compare a stable identifier like cfg.Host or use reflect.DeepEqual), create a new kubernetes client via kubernetes.NewForConfig(cfg), assign benchClientset and benchClientCfg, and release the mutex; remove benchClientOnce usage so subsequent calls with different cfg will recreate the clientset.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hack/setup-envtest.sh`:
- Line 17: The ENVTEST_PATH assignment is capturing stderr into the variable
(ENVTEST_PATH=$(setup-envtest use "$K8S_VERSION" -p path 2>&1)), which pollutes
the path used later for KUBEBUILDER_ASSETS; remove the redirection so only
stdout is captured (ENVTEST_PATH=$(setup-envtest use "$K8S_VERSION" -p path)).
Locate the ENVTEST_PATH assignment and the subsequent export of
KUBEBUILDER_ASSETS and ensure no stderr is merged into ENVTEST_PATH; if you need
error output, redirect errors separately or check the command exit status and
log errors to stderr rather than into ENVTEST_PATH.
In `@Makefile`:
- Around line 393-397: Add the missing .PHONY declarations for the Makefile
targets "clean" and "test" to avoid checkmake warnings; similar to the existing
.PHONY: setup-envtest line, update the .PHONY list (or add a new .PHONY line) to
include clean and test so the Makefile treats those targets as phony rather than
files.
In `@test/benchmark/controller_sim_benchmark_test.go`:
- Around line 278-317: The benchmark leaves created MachineSets behind, causing
MachineSetList to grow across iterations; modify the Reconcile function to clean
up the created resource after the iteration by deleting the MachineSet you
created (use the toCreate or workerName with
remoteClient.Delete(context.Background(), toCreate) or remoteClient.Delete by
object key), or alternatively ensure fully isolated state per iteration by
rebuilding a fresh namespace/state before each iteration (use
s.NewBuilder().Build per-iteration and tear down afterwards). Target the
Reconcile closure, the toCreate variable and workerName, and add explicit
cleanup (remoteClient.Delete) or per-iteration state teardown to avoid
accumulating MachineSets between runs.
- Around line 435-442: The benchmark BenchmarkControllerReconcileUnreachable
currently calls brc.NewBuilder().UsePrimaryAPIURL().Build() which performs
network I/O and will fail the entire benchmark via b.Fatalf when the endpoint is
unreachable; change the setup to simulate an unreachable API by configuring the
builder with an APIURLOverride to a non-routable address (or otherwise inject an
unreachable URL) and do not call b.Fatalf on the Build() error—instead treat the
connection error as expected (e.g., verify the error type/contains expected text
and continue to exercise the reconcile path or skip the build failure), or
alternatively separate client construction from reconciliation measurement so
the reconcile logic is benchmarked independently of Build() failures; update
BenchmarkControllerReconcileUnreachable and the brc.NewBuilder() usage
accordingly.
In `@test/benchutil/client.go`:
- Around line 61-73: The newClient function creates an httpClient for
apiutil.NewDynamicRESTMapper but does not pass that same client into client.New,
causing a second HTTP client to be created and splitting connection/TLS state;
update newClient so the client.Options passed to client.New includes HTTPClient:
httpClient (in addition to Scheme and Mapper) so the same *http.Client is reused
by apiutil.NewDynamicRESTMapper and client.New.
In `@test/benchutil/harness.go`:
- Around line 27-33: The benchmark misreports FirstApply when SteadyState is
true but NewObjects returns a fixed set (so iterations after 0 are already
steady-state); update run (the function that orchestrates benchmark phases) to
ensure true "first-apply" behavior by either requiring per-iteration unique
objects or by recreating the environment and object slice for each iteration
during the first-apply phase: detect when SteadyState is true and NewObjects
returns a fixed-length slice (i.e., NewObjects does not depend on i/unique
naming), and in that case invoke Setup and NewObjects inside the b.N loop (or
otherwise reset the BenchEnv and re-create objects each iteration) before
calling Reconcile(b, state, objects, i); use the existing symbols SteadyState,
NewObjects, Setup, Reconcile and run to locate and implement this change so
FirstApply truly applies fresh objects for every iteration.
In `@test/benchutil/namespace.go`:
- Around line 26-31: Replace the abrupt os.Exit(1) in the benchClientOnce
initialization with a benchmark-friendly failure: when
kubernetes.NewForConfig(cfg) returns an error, call the benchmark's b.Fatalf (or
pass the testing.B into the setup) instead of fmt.Fprintf/os.Exit so the
benchmark harness reports the failure correctly; update the initialization flow
around benchClientOnce and the NewForConfig error handling to propagate or call
b.Fatalf in the error branch.
---
Nitpick comments:
In `@test/benchutil/generators.go`:
- Around line 13-35: GenerateConfigMap and GenerateSecret duplicate the padding
payload logic and hardcoded namespace; extract a small helper (e.g.,
buildPayloadMap(payloadSize int, base map[string]string) map[string]string) to
return the base map plus optional "payload" and introduce a single
defaultNamespace constant (e.g., defaultNamespace = "default") and/or a helper
for ObjectMeta (e.g., newDefaultMeta(name string) metav1.ObjectMeta), then
update GenerateConfigMap to call buildPayloadMap for Data and newDefaultMeta for
ObjectMeta and update GenerateSecret to call the same buildPayloadMap for
StringData and reuse newDefaultMeta; keep function signatures the same.
In `@test/benchutil/namespace.go`:
- Around line 35-37: Replace the use of context.Background() in the Namespace
creation call (benchClientset.CoreV1().Namespaces().Create) with a cancellable
context that has a deadline—use context.WithTimeout to create ctx, defer
cancel(), and pass that ctx to Create so the call times out instead of hanging
indefinitely; choose an appropriate timeout value for benchmarks and ensure
cancel() is deferred immediately after creating the context to avoid leaks.
- Around line 24-33: BenchNamespace currently uses benchClientOnce so the first
cfg wins forever; replace the sync.Once pattern with a small guarded init that
allows different cfg values: add a package-level mutex (e.g., benchClientMu) and
a stored lastCfg (e.g., benchClientCfg *rest.Config), then in BenchNamespace
acquire the mutex, if benchClientset is nil or the new cfg differs from
benchClientCfg (compare a stable identifier like cfg.Host or use
reflect.DeepEqual), create a new kubernetes client via
kubernetes.NewForConfig(cfg), assign benchClientset and benchClientCfg, and
release the mutex; remove benchClientOnce usage so subsequent calls with
different cfg will recreate the clientset.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f92bb83f-09df-47df-9e6b-5b69f051a7bb
📒 Files selected for processing (60)
- Makefile
- go.mod
- hack/benchmark-comparison.sh
- hack/setup-envtest.sh
- pkg/dependencymagnet/doc.go
- pkg/remoteclient/benchmark_test.go
- pkg/remoteclient/kubeconfig.go
- pkg/remoteclient/remoteclient.go
- pkg/remoteclient/remoteclient_benchmark_test.go
- pkg/resource/benchmark_test.go
- pkg/resource/helper_benchmark_test.go
- test/benchmark/README.md
- test/benchmark/benchmark_test.go
- test/benchmark/controller_sim_benchmark_test.go
- test/benchutil/benchenv.go
- test/benchutil/client.go
- test/benchutil/doc.go
- test/benchutil/envtest.go
- test/benchutil/generators.go
- test/benchutil/harness.go
- test/benchutil/helpers.go
- test/benchutil/namespace.go
- test/benchutil/roundtrip.go
- vendor/modules.txt
- vendor/sigs.k8s.io/controller-runtime/pkg/envtest/crd.go
- vendor/sigs.k8s.io/controller-runtime/pkg/envtest/doc.go
- vendor/sigs.k8s.io/controller-runtime/pkg/envtest/helper.go
- vendor/sigs.k8s.io/controller-runtime/pkg/envtest/server.go
- vendor/sigs.k8s.io/controller-runtime/pkg/envtest/webhook.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/flock/doc.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/flock/errors.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/flock/flock_other.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/flock/flock_unix.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/addr/manager.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/certs/tinyca.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/controlplane/apiserver.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/controlplane/auth.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/controlplane/etcd.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/controlplane/kubectl.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/controlplane/plane.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/process/arguments.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/process/bin_path_finder.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/process/procattr_other.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/process/procattr_unix.go
- vendor/sigs.k8s.io/controller-runtime/pkg/internal/testing/process/process.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/LICENSE
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/README.md
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/env/env.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/env/exit.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/env/helpers.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/main.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/remote/client.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/remote/http_client.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/remote/read_body.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/store/helpers.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/store/store.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/versions/parse.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/versions/platform.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/versions/version.go
- vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest/workflows/workflows.go
```sh
echo "using k8s version ${K8S_VERSION}"

echo "Fetching binaries (on first run this may take some time)"
ENVTEST_PATH=$(setup-envtest use "$K8S_VERSION" -p path 2>&1)
```
🧩 Analysis chain

🏁 Scripts executed:

```sh
# Find and read the target file
find . -name "setup-envtest.sh" -type f

# Read the file to see context around line 17
cat -n hack/setup-envtest.sh | head -40

# Check how ENVTEST_PATH is used in the file
grep -n "ENVTEST_PATH" hack/setup-envtest.sh

# Check for KUBEBUILDER_ASSETS references
grep -n "KUBEBUILDER_ASSETS" hack/setup-envtest.sh
```
Remove 2>&1 from ENVTEST_PATH capture to avoid stderr pollution.
At Line 17, 2>&1 captures both stderr and stdout into the variable. Any progress or error output from setup-envtest will be mixed into ENVTEST_PATH, yielding an invalid path when exported as KUBEBUILDER_ASSETS on Line 21.
💡 Proposed fix

```diff
-ENVTEST_PATH=$(setup-envtest use "$K8S_VERSION" -p path 2>&1)
+ENVTEST_PATH="$(setup-envtest use "$K8S_VERSION" -p path)"
+if [[ ! -d "$ENVTEST_PATH" ]]; then
+  echo "setup-envtest returned a non-directory path: $ENVTEST_PATH" >&2
+  exit 1
+fi
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-ENVTEST_PATH=$(setup-envtest use "$K8S_VERSION" -p path 2>&1)
+ENVTEST_PATH="$(setup-envtest use "$K8S_VERSION" -p path)"
+if [[ ! -d "$ENVTEST_PATH" ]]; then
+  echo "setup-envtest returned a non-directory path: $ENVTEST_PATH" >&2
+  exit 1
+fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hack/setup-envtest.sh` at line 17, The ENVTEST_PATH assignment is capturing
stderr into the variable (ENVTEST_PATH=$(setup-envtest use "$K8S_VERSION" -p
path 2>&1)), which pollutes the path used later for KUBEBUILDER_ASSETS; remove
the redirection so only stdout is captured (ENVTEST_PATH=$(setup-envtest use
"$K8S_VERSION" -p path)). Locate the ENVTEST_PATH assignment and the subsequent
export of KUBEBUILDER_ASSETS and ensure no stderr is merged into ENVTEST_PATH;
if you need error output, redirect errors separately or check the command exit
status and log errors to stderr rather than into ENVTEST_PATH.
```make
.PHONY: setup-envtest
# Downloads envtest binaries and prints the required export KUBEBUILDER_ASSETS=... command.
# Run the printed export command (or add it to your shell rc) before running benchmarks.
setup-envtest:
	hack/setup-envtest.sh
```
Add missing .PHONY declarations for clean and test.

At Line 393, static analysis reports missing required phony targets, which can cause avoidable checkmake warnings.
🔧 Suggested fix

```diff
-.PHONY: all
+.PHONY: all clean test
```

🧰 Tools
🪛 checkmake (0.2.2)
[warning] 393-393: Missing required phony target "clean"
(minphony)
[warning] 393-393: Missing required phony target "test"
(minphony)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Makefile` around lines 393 - 397, Add the missing .PHONY declarations for the
Makefile targets "clean" and "test" to avoid checkmake warnings; similar to the
existing .PHONY: setup-envtest line, update the .PHONY list (or add a new .PHONY
line) to include clean and test so the Makefile treats those targets as phony
rather than files.
```go
Reconcile: func(b *testing.B, s *firstApplyState, _ []client.Object, i int) {
	remoteClient, err := s.NewBuilder().Build()
	if err != nil {
		b.Fatalf("Build failed: %v", err)
	}
	workerName := fmt.Sprintf("worker-%d", i)

	ms := &machinev1beta1.MachineSet{}
	if err := remoteClient.Get(context.Background(), client.ObjectKey{
		Namespace: s.ns, Name: workerName,
	}, ms); err != nil && !apierrors.IsNotFound(err) {
		b.Fatalf("Get MachineSet failed: %v", err)
	}

	machineList := &machinev1beta1.MachineList{}
	if err := remoteClient.List(context.Background(), machineList, client.InNamespace(s.ns)); err != nil {
		b.Fatalf("List Machines failed: %v", err)
	}

	machineSetList := &machinev1beta1.MachineSetList{}
	if err := remoteClient.List(context.Background(), machineSetList, client.InNamespace(s.ns)); err != nil {
		b.Fatalf("List MachineSets failed: %v", err)
	}

	replicas := int32(3)
	toCreate := &machinev1beta1.MachineSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      workerName,
			Namespace: s.ns,
		},
		Spec: machinev1beta1.MachineSetSpec{
			Replicas: &replicas,
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"machine.openshift.io/cluster-api-machineset": workerName},
			},
		},
	}
	if err := remoteClient.Create(context.Background(), toCreate); err != nil {
		b.Fatalf("Create failed: %v", err)
	}
```
FirstApply keeps inflating the MachineSetList response.
Each iteration creates a new worker-%d MachineSet and leaves it behind, but the benchmark still lists all MachineSets in the namespace before the create. That makes later iterations more expensive than earlier ones, so the reported per-op numbers depend on b.N instead of a stable first-apply workload. Recreate isolated state per iteration, or remove the created MachineSet before the next loop.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchmark/controller_sim_benchmark_test.go` around lines 278 - 317, The
benchmark leaves created MachineSets behind, causing MachineSetList to grow
across iterations; modify the Reconcile function to clean up the created
resource after the iteration by deleting the MachineSet you created (use the
toCreate or workerName with remoteClient.Delete(context.Background(), toCreate)
or remoteClient.Delete by object key), or alternatively ensure fully isolated
state per iteration by rebuilding a fresh namespace/state before each iteration
(use s.NewBuilder().Build per-iteration and tear down afterwards). Target the
Reconcile closure, the toCreate variable and workerName, and add explicit
cleanup (remoteClient.Delete) or per-iteration state teardown to avoid
accumulating MachineSets between runs.
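The cleanup the comment asks for keeps per-iteration work constant. A toy stdlib-only sketch of the pattern (the `store` type and `reconcileOnce` function are illustrative stand-ins; in the real benchmark the delete would go through `remoteClient.Delete`):

```go
package main

import "fmt"

// store stands in for the remote namespace's MachineSet collection.
type store map[string]struct{}

// reconcileOnce models one benchmark iteration: it lists existing objects,
// creates a new one, and deletes it again so the next iteration sees the
// same state. Without the delete, List cost would grow with b.N.
func reconcileOnce(s store, i int) int {
	listed := len(s) // per-op List work; stays constant only with cleanup

	name := fmt.Sprintf("worker-%d", i)
	s[name] = struct{}{}  // Create
	defer delete(s, name) // cleanup, analogous to remoteClient.Delete

	return listed
}
```

With the deferred delete, `reconcileOnce` sees an empty "cluster" on every call; dropping it would make each iteration list one more leftover object than the last, which is exactly the b.N-dependent skew the review flags.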
```go
func BenchmarkControllerReconcileUnreachable(b *testing.B) {
	benchutil.ControllerHarness[*benchutil.BenchRemoteClient]{
		Setup: benchutil.SetupRemoteClient,
		Reconcile: func(b *testing.B, brc *benchutil.BenchRemoteClient, _ []client.Object, _ int) {
			if _, err := brc.NewBuilder().UsePrimaryAPIURL().Build(); err != nil {
				b.Fatalf("Build failed: %v", err)
			}
		},
```
🧩 Analysis chain

🏁 Scripts executed:

```sh
#!/bin/bash
set -euo pipefail
# Verify whether the remoteclient builder does any network I/O before the first request.
rg -nP -C6 --type go 'func\s+\([^)]*\)\s+UsePrimaryAPIURL\s*\(' test/benchutil pkg/remoteclient
rg -nP -C10 --type go 'func\s+\([^)]*\)\s+Build\s*\(' test/benchutil pkg/remoteclient
rg -nP -C10 --type go 'func\s+\([^)]*\)\s+RESTConfig\s*\(' test/benchutil pkg/remoteclient
rg -nP -C10 --type go 'func\s+\([^)]*\)\s+BuildKubeClient\s*\(' test/benchutil pkg/remoteclient

# Check the benchmark setup and BenchRemoteClient
rg -nA20 'type BenchRemoteClient' test/benchutil
rg -nA15 'func SetupRemoteClient' test/benchutil
rg -nB5 'BenchmarkControllerReconcileUnreachable' test/benchmark/controller_sim_benchmark_test.go

# Look at ControllerHarness structure and how it invokes Reconcile
rg -nA30 'type ControllerHarness' test/benchutil
rg -nA50 'func.*Run.*ControllerHarness' test/benchutil/harness.go

# Get more of the run function to see how Reconcile is invoked
sed -n '47,80p' test/benchutil/harness.go

# Check how NewBuilder works and what URLs are configured
rg -nB10 -A10 'func.*NewBuilder' test/benchutil
rg -nB5 -A15 'type BenchEnv' test/benchutil

# Check how primaryURL and URLs are used in remoteclient
rg -nB5 -A15 'primaryURL|secondaryURL|urlToUse' pkg/remoteclient/remoteclient.go | head -60

# Check how the URLs are constructed in RESTConfig and what they represent
rg -nB10 -A20 'func.*unadulteratedRESTConfig' pkg/remoteclient/remoteclient.go
rg -nB5 -A15 'urlToUse.*==' pkg/remoteclient/remoteclient.go

# Check what's in the BenchEnv's ClusterDeployment
rg -nB5 -A30 'func NewBenchEnv' test/benchutil

# Continue reading NewBenchEnv to see the ClusterDeployment setup
sed -n '37,130p' test/benchutil/benchenv.go

# Look at the full BenchmarkControllerReconcileUnreachable code to understand intent
sed -n '433,450p' test/benchmark/controller_sim_benchmark_test.go

# Look at other benchmark patterns to understand the intent and compare
sed -n '380,445p' test/benchmark/controller_sim_benchmark_test.go

# Check if there's an APIURLOverride set anywhere in the BenchEnv setup
rg -n 'APIURLOverride|APIServerIPOverride' test/benchutil
# Also check if the benchmark is trying to test an unreachable endpoint
rg -nB2 -A2 'unreachable\|Unreachable' test/benchmark/controller_sim_benchmark_test.go
```
Build() performs network I/O immediately and will fail on an unreachable endpoint, aborting the benchmark instead of measuring reconciliation behavior.
Build() is not lazy—it performs reachability verification via discovery.NewDiscoveryClientForConfig() and restmapper.GetAPIGroupResources(), so any connection error occurs during Build(), not later. When Build() fails for an unreachable endpoint, b.Fatalf() aborts the benchmark run rather than allowing measurement of reconciliation behavior under degraded conditions. Additionally, the benchmark's comment claims to "model unreachable," but the setup does not configure an unreachable endpoint (no APIURLOverride is set), so the benchmark connects to the actual working envtest server.
To benchmark reconciliation against an unreachable endpoint, the benchmark should either set an APIURLOverride to a non-existent address and treat the resulting connection error as expected (avoiding b.Fatalf()), or restructure to measure reconciliation logic separately from client-building failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchmark/controller_sim_benchmark_test.go` around lines 435 - 442, The
benchmark BenchmarkControllerReconcileUnreachable currently calls
brc.NewBuilder().UsePrimaryAPIURL().Build() which performs network I/O and will
fail the entire benchmark via b.Fatalf when the endpoint is unreachable; change
the setup to simulate an unreachable API by configuring the builder with an
APIURLOverride to a non-routable address (or otherwise inject an unreachable
URL) and do not call b.Fatalf on the Build() error—instead treat the connection
error as expected (e.g., verify the error type/contains expected text and
continue to exercise the reconcile path or skip the build failure), or
alternatively separate client construction from reconciliation measurement so
the reconcile logic is benchmarked independently of Build() failures; update
BenchmarkControllerReconcileUnreachable and the brc.NewBuilder() usage
accordingly.
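The restructuring suggested above, pointing the client at a non-routable address and treating the connection error as the expected outcome rather than a fatal one, can be sketched with plain `net` dialing. The `203.0.113.1` address is from the TEST-NET-3 range reserved for documentation (RFC 5737); the port and timeout values are arbitrary illustrations, not the benchmark's actual configuration:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialUnreachable connects to an address that should never answer and
// returns the connection error instead of treating it as fatal, so a
// benchmark can measure the failure path rather than aborting on it.
func dialUnreachable() error {
	// 203.0.113.1 (TEST-NET-3) is reserved for documentation and not
	// routed on the public internet.
	conn, err := net.DialTimeout("tcp", "203.0.113.1:6443", 200*time.Millisecond)
	if err != nil {
		return err // expected outcome for the "unreachable" scenario
	}
	conn.Close()
	return fmt.Errorf("unexpectedly reached endpoint")
}
```

In the benchmark, the analogous move would be checking that `Build()` returns an error of the expected kind and counting that as a successful iteration, instead of calling `b.Fatalf`.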
```go
func newClient(cfg *rest.Config) (client.Client, error) {
	httpClient, err := rest.HTTPClientFor(cfg)
	if err != nil {
		return nil, fmt.Errorf("failed to create HTTP client: %w", err)
	}
	mapper, err := apiutil.NewDynamicRESTMapper(cfg, httpClient)
	if err != nil {
		return nil, fmt.Errorf("failed to create REST mapper: %w", err)
	}
	return client.New(cfg, client.Options{
		Scheme: hivescheme.GetScheme(),
		Mapper: mapper,
	})
```
🧩 Analysis chain

🏁 Script executed:

```sh
# First, let's find the client package and understand the structure
find . -type f -name "*.go" | grep -E "(client|benchmark)" | head -20

# Look for the client.Options definition
rg "type Options struct" -A 15 --type go

# Check the client.New function signature
rg "func New\(" -B 2 -A 10 --type go | head -50

# Check the actual file content
head -100 test/benchutil/client.go

# Find and examine client.New function
rg "func New\(" vendor/sigs.k8s.io/controller-runtime/pkg/client/ -A 20 --type go

# Get more of the newClient implementation
rg "func newClient\(config \*rest.Config, options Options\)" vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go -A 50 --type go
```
Reuse the same http.Client in client.New.
The mapper uses the provided httpClient, but client.New creates a second HTTP client since HTTPClient is not set in the options. This splits connection pooling and TLS state, skewing benchmark results for cold-start wall time.
🔧 Suggested change

```diff
 return client.New(cfg, client.Options{
 	Scheme: hivescheme.GetScheme(),
 	Mapper: mapper,
+	HTTPClient: httpClient,
 })
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func newClient(cfg *rest.Config) (client.Client, error) {
	httpClient, err := rest.HTTPClientFor(cfg)
	if err != nil {
		return nil, fmt.Errorf("failed to create HTTP client: %w", err)
	}
	mapper, err := apiutil.NewDynamicRESTMapper(cfg, httpClient)
	if err != nil {
		return nil, fmt.Errorf("failed to create REST mapper: %w", err)
	}
	return client.New(cfg, client.Options{
		Scheme:     hivescheme.GetScheme(),
		Mapper:     mapper,
		HTTPClient: httpClient,
	})
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchutil/client.go` around lines 61 - 73, The newClient function
creates an httpClient for apiutil.NewDynamicRESTMapper but does not pass that
same client into client.New, causing a second HTTP client to be created and
splitting connection/TLS state; update newClient so the client.Options passed to
client.New includes HTTPClient: httpClient (in addition to Scheme and Mapper) so
the same *http.Client is reused by apiutil.NewDynamicRESTMapper and client.New.
```go
if os.Getenv("KUBEBUILDER_ASSETS") == "" {
	fmt.Fprintf(os.Stderr, `
ERROR: KUBEBUILDER_ASSETS environment variable not set

Benchmarks require envtest binaries (kube-apiserver, etcd).

To set up:
  $ make setup-envtest

Then export the path shown, or run:
  $ export KUBEBUILDER_ASSETS="$(hack/setup-envtest.sh | tail -1 | cut -d'=' -f2 | tr -d '"')"

See test/benchmark/README.md for details.
`)
	os.Exit(1)
```
🧩 Analysis chain

🏁 Script executed:

```sh
# First, let's look at the file structure and locate the file
find . -name "envtest.go" -path "*/benchutil/*" 2>/dev/null | head -20

# Get an overview of the file
wc -l test/benchutil/envtest.go

# Read the full file to understand the context
cat -n test/benchutil/envtest.go
```
Don't call os.Exit from the shared envtest bootstrap.
If testEnv.Start() succeeds (line 71) and a later step fails (lines 76-80), these exits bypass the normal cleanup path and leave the envtest child processes behind. Please change the function signature to return an error, then have callers fail via testing.TB methods instead.
This applies to all os.Exit calls within EnsureEnvTest and hiveCRDPaths (lines 42-57, 62-80, 96-105).
```go
NewObjects func(b *testing.B, ns string) []client.Object // optional; creates objects to reconcile
Setup      func(b *testing.B, env *BenchEnv) S           // creates per-benchmark state
// Reconcile runs one iteration. objects is the slice returned by NewObjects (nil if NewObjects is nil).
// When NewObjects returns b.N uniquely-named objects (e.g. via SingleTemplate), use objects[i].
// When NewObjects returns a fixed set to apply every iteration, iterate over the whole slice and ignore i.
Reconcile   func(b *testing.B, state S, objects []client.Object, i int)
SteadyState bool // when true, also run with pre-seeded objects
```
SteadyState plus a fixed object set makes FirstApply numbers misleading.
run(b, false) still reuses one env and one object slice across the full b.N loop. For benchmarks that return fixed-name objects and ignore i, only iteration 0 is a real first apply; the remaining b.N-1 iterations are already steady-state. That currently skews the FirstApply results in BenchmarkControllerReconcileClusterSync*. Either require per-iteration unique objects for first-apply runs (like SingleTemplate) or recreate/reset the env between iterations in the first-apply phase.
Also applies to: 39-41, 47-68
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchutil/harness.go` around lines 27 - 33, The benchmark misreports
FirstApply when SteadyState is true but NewObjects returns a fixed set (so
iterations after 0 are already steady-state); update run (the function that
orchestrates benchmark phases) to ensure true "first-apply" behavior by either
requiring per-iteration unique objects or by recreating the environment and
object slice for each iteration during the first-apply phase: detect when
SteadyState is true and NewObjects returns a fixed-length slice (i.e.,
NewObjects does not depend on i/unique naming), and in that case invoke Setup
and NewObjects inside the b.N loop (or otherwise reset the BenchEnv and
re-create objects each iteration) before calling Reconcile(b, state, objects,
i); use the existing symbols SteadyState, NewObjects, Setup, Reconcile and run
to locate and implement this change so FirstApply truly applies fresh objects
for every iteration.
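One way to implement the per-iteration-uniqueness option is sketched below, reduced to object names so it is self-contained (the real hook returns `[]client.Object`; the `generator` type and helper name are made up for this sketch):

```go
package main

import "fmt"

// generator mirrors the harness's NewObjects hook, reduced to object names.
type generator func(n int) []string

// uniquePerIteration wraps a fixed-set generator so that iteration i gets
// freshly named objects, keeping every first-apply iteration a true first
// apply (the same idea SingleTemplate uses with objects[i]).
func uniquePerIteration(gen generator, i int) []string {
	names := gen(1)
	out := make([]string, len(names))
	for j, name := range names {
		out[j] = fmt.Sprintf("%s-%d", name, i)
	}
	return out
}

func main() {
	fixed := func(int) []string { return []string{"apply-cm", "apply-secret"} }
	fmt.Println(uniquePerIteration(fixed, 0)) // [apply-cm-0 apply-secret-0]
	fmt.Println(uniquePerIteration(fixed, 1)) // [apply-cm-1 apply-secret-1]
}
```

The alternative (recreating the env between iterations) keeps object names stable but pays envtest startup cost inside the measured loop, so renaming is likely the cheaper fix.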
```go
benchClientOnce.Do(func() {
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to create clientset: %v\n", err)
		os.Exit(1)
	}
```
🧩 Analysis chain

🏁 Script executed:

```sh
cat -n test/benchutil/namespace.go
```
Avoid os.Exit(1) in benchmark code paths.
At line 30, this exits the entire test binary abruptly and bypasses normal benchmark failure handling; use b.Fatalf() instead to properly fail the benchmark.
💡 Proposed fix

```diff
 import (
 	"context"
 	"fmt"
-	"os"
 	"sync"
 	"sync/atomic"
 	"testing"
@@
 	benchClientOnce.Do(func() {
 		cs, err := kubernetes.NewForConfig(cfg)
 		if err != nil {
-			fmt.Fprintf(os.Stderr, "failed to create clientset: %v\n", err)
-			os.Exit(1)
+			b.Fatalf("failed to create clientset: %v", err)
 		}
 		benchClientset = cs
 	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchutil/namespace.go` around lines 26 - 31, Replace the abrupt
os.Exit(1) in the benchClientOnce initialization with a benchmark-friendly
failure: when kubernetes.NewForConfig(cfg) returns an error, call the
benchmark's b.Fatalf (or pass the testing.B into the setup) instead of
fmt.Fprintf/os.Exit so the benchmark harness reports the failure correctly;
update the initialization flow around benchClientOnce and the NewForConfig error
handling to propagate or call b.Fatalf in the error branch.
@dlom: The following tests failed, say

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
2uasimojo left a comment
This is really nice, Mark. Most of my comments are in the form of wishes :)
LMK when you're ready for another pass.
```sh
# Ensure KUBEBUILDER_ASSETS is set
if [ -z "$KUBEBUILDER_ASSETS" ]; then
  echo "Error: KUBEBUILDER_ASSETS environment variable not set"
  echo ""
  echo "Run: make setup-envtest"
  echo "Then: export KUBEBUILDER_ASSETS=\"\$(hack/setup-envtest.sh | tail -1 | cut -d'=' -f2 | tr -d '\"')\""
  exit 1
fi
```
What would be really cool is if the benchmark could optionally be run against an actual cluster. I've been trying to track down where/how KUBEBUILDER_ASSETS is actually consumed to spoof the cluster via envtest, but have so far been unsuccessful.
```go
if err != nil {
	return nil, err
}
if b.opts.transportWrapper != nil {
```
This seems like it should be DRYable if, instead of func(*builderOptions), our BuilderOptions (possibly renamed to RESTConfigOptions) was func(*whatever_type_cfg_is). Unless that type is different in the different calls?
```go
{"ConfigMap", benchutil.GenerateConfigMap("apply-cm", 0)},
{"Secret", benchutil.GenerateSecret("apply-secret", 0)},
{"Deployment", benchutil.GenerateDeployment("apply-deploy")},
{"ServiceAccount", benchutil.GenerateServiceAccount("apply-sa")},
```
I wonder if it would be instructive to include some OpenShift-specific types?
```go
{"ConfigMap", benchutil.GenerateConfigMap("rt-cm", 0)},
{"Secret", benchutil.GenerateSecret("rt-secret", 0)},
{"ServiceAccount", benchutil.GenerateServiceAccount("rt-sa")},
```
Curious why the list of resources isn't the same across the board.
In fact, seems like we could DRY the list via a wrapper that accepts the string prefix (here rt).
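The suggested wrapper could look like the sketch below, reduced to names so it is self-contained (in the real suite the second field would be a `client.Object` from the `benchutil` generators, and whether every benchmark wants all four kinds is an open question from the thread):

```go
package main

import "fmt"

// benchObject pairs a kind with a generated name; illustrative stand-in
// for the (string, client.Object) pairs used in the benchmarks.
type benchObject struct {
	Kind string
	Name string
}

// standardObjects is one shared resource list, parameterized by the
// per-benchmark name prefix ("apply", "rt", ...), so each benchmark
// no longer repeats its own copy of the table.
func standardObjects(prefix string) []benchObject {
	return []benchObject{
		{"ConfigMap", prefix + "-cm"},
		{"Secret", prefix + "-secret"},
		{"Deployment", prefix + "-deploy"},
		{"ServiceAccount", prefix + "-sa"},
	}
}

func main() {
	for _, o := range standardObjects("rt") {
		fmt.Printf("%s %s\n", o.Kind, o.Name)
	}
}
```

Benchmarks that intentionally want a subset (e.g. no Deployment) could filter the shared list rather than defining their own.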
```go
{"100B", 100},
{"1KB", 1024},
{"10KB", 10 * 1024},
{"100KB", 100 * 1024},
```
similar DRY comment for sizes
```
Key insight: Round-trip count changes indicate behavioral changes.
If your optimization changes round-trip count, it's not just a performance
improvement -- it's a logic change that needs careful review.
```
Perhaps the comparison script could flag these visually somehow.
```
Benchmarks run against a real kube-apiserver + etcd (via controller-runtime's envtest),
not mocks, giving accurate round-trip and byte measurements. Each benchmark uses
```
Yeah, no. envtest is not a "real" KAS. I'm still not clear whether it has object storage, but at a minimum, it's local, so the round-trip times are certainly not "accurate". I suppose it might be set up with sleeps to simulate network latency/throughput? Anyway, suggest softening this language a bit.
As mentioned earlier, a goal would be the ability to run the suite against an actual cluster, so that we could get accurate RT measurements.
(Later) Oh, perhaps you meant a count of the number of round trips?
```go
// NewBenchEnv creates a fully initialized benchmark environment.
func NewBenchEnv(b *testing.B) *BenchEnv {
	b.Helper()
	cfg := EnsureEnvTest(b)
```
Yeah, yeah, so here, couldn't we optionally load up a real KUBECONFIG instead? And the rest would Just Work™?
```go
func hiveCRDPaths() []string {
	out, err := exec.Command("go", "list", "-m", "-f", "{{.Dir}}", "github.com/openshift/hive").Output()
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to locate module root: %v\n", err)
		os.Exit(1)
	}
	root := strings.TrimSpace(string(out))
	crdDir := filepath.Join(root, "config", "crds")
	if _, err := os.Stat(crdDir); err != nil {
		fmt.Fprintf(os.Stderr, "CRD directory not found at %s: %v\n", crdDir, err)
		os.Exit(1)
	}
	return []string{crdDir}
}
```
This seems unnecessarily obfuscated. It's not really adding anything to just having the list of dirs hardcoded in a const at the top.
```
k8s.io/utils v0.0.0-20251002143259-bc988d571ff4
sigs.k8s.io/cluster-api-provider-azure v1.21.1-0.20250929163617-2c4eaa611a39
sigs.k8s.io/controller-runtime v0.22.3
sigs.k8s.io/controller-runtime/tools/setup-envtest v0.0.0-20240927101401-4381fa0aeee4
```
Odd that this didn't produce a go.sum delta?
xref: HIVE-3127
The clustersync controller and others that depend on pkg/resource (Helper) and pkg/remoteclient have been repeatedly flagged by downstream users for:

- Slow reconciliation times
- Memory leaks during large SyncSet applications
- Excessive API server load (unnecessary round trips and discovery calls)
These packages are legacy code with significant behavioral dependencies from downstream consumers. We cannot rewrite them, but targeted optimizations (50-100 LOC changes) in specific hot paths could significantly improve performance without altering behavior.
Each benchmark runs against a real kube-apiserver (via envtest) and reports custom metrics for round-trip count and byte transfer (in addition to standard allocation and wall-time measurements)
Some preliminary findings:

- CreateOrUpdate SteadyState allocates ~79MB/op for a no-op -- 88x more than Apply for the same 2 roundtrips. Something isn't being cached properly.
- NewHelper construction costs 78MB and 205K allocs -- the dominant cost in most benchmarks. Caching the OpenAPI schema and REST mapper across operations would cut this dramatically.
- First Apply per package pays 2 extra discovery roundtrips (5 vs 3) due to disk-cache cold start. Pre-warming or sharing the discovery cache would eliminate the penalty.
- Hibernation takes 1 full second despite only 13 roundtrips and 1.3MB allocs. This is wall-clock bound, likely connection/TLS overhead in the BuildKubeClient path.
Assisted-by: Claude Opus 4.6 (LLM wrote the original suite of benchmarks automatically, I have refined them extensively to meet my quality standards)
/assign @2uasimojo