OCPBUGS-45112: e2e: Add irqbalance crash test with 10 pods#1493
OCPBUGS-45112: e2e: Add irqbalance crash test with 10 pods#1493oblau wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughGinkgo suite title in the IRQBalance performance test was changed and a new Tier0 test was added to verify that Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 8 | ❌ 4❌ Failed checks (3 warnings, 1 inconclusive)
✅ Passed checks (8 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: oblau The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)
400-412: Make deployment name/cleanup more flake-resistant.Using a fixed name at Line 401 and strict
Getin defer (Line 410) can create avoidable failures in reruns or partial-cleanup cases. Prefer a unique name and delete with not-found tolerance.Proposed reliability diff
@@ - dp := deployments.Make( - "test-deployment", + dpName := fmt.Sprintf("test-deployment-%d", time.Now().UnixNano()) + dp := deployments.Make( + dpName, testutils.NamespaceTesting, deployments.WithPodTemplate(testpod), deployments.WithReplicas(int32(repCount)), ) @@ Expect(testclient.DataPlaneClient.Create(context.TODO(), dp)).To(Succeed()) testlog.Infof("Created deployment %s with %d replicas (irq-load-balancing disabled)", dp.Name, repCount) defer func() { - Expect(testclient.DataPlaneClient.Get(context.TODO(), client.ObjectKeyFromObject(dp), dp)).To(Succeed()) - Expect(testclient.DataPlaneClient.Delete(context.TODO(), dp)).To(Succeed()) + err := client.IgnoreNotFound(testclient.DataPlaneClient.Delete(context.TODO(), dp)) + Expect(err).ToNot(HaveOccurred()) }()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around lines 400 - 412, The deployment uses a fixed name and a strict Get in the defer which causes flakes; update the deployments.Make invocation that produces dp to use a unique name (e.g., append a random/timestamp suffix to the base "test-deployment") and simplify the defer cleanup to attempt deletion without requiring a prior Get and to tolerate NotFound errors (use client.IgnoreNotFound or check for apierrors.IsNotFound) when calling testclient.DataPlaneClient.Delete; reference the dp variable, deployments.Make call that creates the object, and the testclient.DataPlaneClient.Get/Delete invocations to locate and change the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 421-426: The current check only verifies systemd.ShowProperty(...,
"ActiveState", "irqbalance.service") after rollout which can miss transient
crashes; before starting the workload capture the restart count via
systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts",
targetNode) (store as preRestarts) and after pods are available re-query
NRestarts and assert it equals preRestarts; keep the existing ActiveState check
but add the NRestarts equality assertion using the same targetNode and
systemd.ShowProperty calls to ensure no crash+restart occurred.
---
Nitpick comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 400-412: The deployment uses a fixed name and a strict Get in the
defer which causes flakes; update the deployments.Make invocation that produces
dp to use a unique name (e.g., append a random/timestamp suffix to the base
"test-deployment") and simplify the defer cleanup to attempt deletion without
requiring a prior Get and to tolerate NotFound errors (use client.IgnoreNotFound
or check for apierrors.IsNotFound) when calling
testclient.DataPlaneClient.Delete; reference the dp variable, deployments.Make
call that creates the object, and the testclient.DataPlaneClient.Get/Delete
invocations to locate and change the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b663d0ec-2c8c-448a-923f-3fd80f47802b
📒 Files selected for processing (2)
test/e2e/performanceprofile/functests/1_performance/irqbalance.gotest/e2e/performanceprofile/functests/utils/deployments/deployments.go
f4909af to
373a658
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)
422-426:⚠️ Potential issue | 🟠 MajorFinal
ActiveStatealone still doesn't prove “no crash”.This can pass after a crash+restart cycle, so it does not fully validate OCPBUGS-45112. Capture
NRestartsbefore creating the Deployment and assert it is unchanged after the replicas become available, while keeping the finalActiveState=activecheck.Suggested hardening
annotations := map[string]string{irqLoadBalancingAnnotation: irqLoadBalancingDisable} testpod := getTestPodWithProfileAndAnnotations(profile, annotations, workloadCPUsPerPod) testpod.Spec.NodeName = targetNode.Name + + restartsBefore, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts", targetNode) + Expect(err).ToNot(HaveOccurred()) + restartsBefore = strings.TrimSpace(restartsBefore) @@ By("Verifying irqbalance.service is still active after workload is running") activeState, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "ActiveState", targetNode) Expect(err).ToNot(HaveOccurred()) testlog.Infof("irqbalance.service %s on node %s", strings.TrimSpace(activeState), targetNode.Name) Expect(strings.TrimSpace(activeState)).To(Equal("ActiveState=active"), "irqbalance must stay active while workload runs") + + restartsAfter, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts", targetNode) + Expect(err).ToNot(HaveOccurred()) + restartsAfter = strings.TrimSpace(restartsAfter) + Expect(restartsAfter).To(Equal(restartsBefore), "irqbalance restarted during workload; before=%s after=%s", restartsBefore, restartsAfter)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around lines 422 - 426, Before creating the workload capture the irqbalance restart count and later assert it didn't change: call systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts", targetNode) and store the initial value (e.g., initialNRestarts), then create the Deployment, wait for replicas to be available, re-read NRestarts with systemd.ShowProperty into finalNRestarts and Expect(finalNRestarts).To(Equal(initialNRestarts)) in addition to the existing ActiveState check that reads activeState; keep the existing log/test that verifies strings.TrimSpace(activeState) == "ActiveState=active".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 410-413: The cleanup currently calls
testclient.DataPlaneClient.Delete(dp) but returns immediately, which leaves pods
running; change the cleanup to perform a synced delete and wait for the
Deployment and its replica Pods to be removed before returning: after calling
testclient.DataPlaneClient.Delete(context.TODO(), dp) use a
foreground/propagation deletion intent (or explicitly poll Get on the Deployment
via testclient.DataPlaneClient.Get and loop until it returns NotFound), and
additionally poll/List Pods (by dp's labels or OwnerReference on Pod) until 0
replicas remain; update the deferred function that references dp and
testclient.DataPlaneClient.Delete to block until both the Deployment resource is
gone and its Pods have terminated.
---
Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 422-426: Before creating the workload capture the irqbalance
restart count and later assert it didn't change: call
systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts",
targetNode) and store the initial value (e.g., initialNRestarts), then create
the Deployment, wait for replicas to be available, re-read NRestarts with
systemd.ShowProperty into finalNRestarts and
Expect(finalNRestarts).To(Equal(initialNRestarts)) in addition to the existing
ActiveState check that reads activeState; keep the existing log/test that
verifies strings.TrimSpace(activeState) == "ActiveState=active".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: cd0c0e46-ac7c-4aec-9059-dcca0996effe
📒 Files selected for processing (2)
test/e2e/performanceprofile/functests/1_performance/irqbalance.gotest/e2e/performanceprofile/functests/utils/deployments/deployments.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/performanceprofile/functests/utils/deployments/deployments.go
373a658 to
4ad2e04
Compare
4ad2e04 to
c6a0443
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 386-394: The test title and assertion are inconsistent and the
test ID placeholder needs replacing: update the It(...) description so it
matches the exact equality check currently performed by Expect (e.g., change
"should have irqbalance StartLimitBurst >= 100" to "should have irqbalance
StartLimitBurst = 100") or alternatively change the Expect assertion in the
block using systemd.ShowProperty/ startLimitBurst to assert >= 100 if that was
intended; also replace the "[test_id:TODO]" tag with a real traceable identifier
(for example the bug or testcase ID like "test_id:OCPBUGS-45112" or your
suite-specific test case ID) so the It(...) declaration, the Expect(...) call,
and the test id are consistent (refer to the It, Expect, startLimitBurst, and
systemd.ShowProperty usages in this test).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 798091b2-d8a4-4cb5-960b-1b5483917bbc
📒 Files selected for processing (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go
c6a0443 to
2a142ce
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)
389-395:⚠️ Potential issue | 🟡 MinorTitle vs. assertion mismatch still unresolved.
Line 389 advertises
StartLimitBurst >= 100while Line 394 asserts exact equality to"StartLimitBurst=100". Either loosen the assertion to>= 100(parse thekey=valueand compare numerically) or tighten the title to= 100. Also,strings.TrimSpace(startLimitBurst)on Line 394 is redundant sincestartLimitBurstwas already trimmed on Line 392.Proposed tweak (if exact match is intended)
- It("[test_id:88711] should have irqbalance StartLimitBurst >= 100", Label(string(label.Tier0)), func() { + It("[test_id:88711] should have irqbalance StartLimitBurst = 100", Label(string(label.Tier0)), func() { startLimitBurst, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "StartLimitBurst", targetNode) Expect(err).ToNot(HaveOccurred()) startLimitBurst = strings.TrimSpace(startLimitBurst) testlog.Infof("irqbalance.service %s on node %s", startLimitBurst, targetNode.Name) - Expect(strings.TrimSpace(startLimitBurst)).To(Equal("StartLimitBurst=100"), + Expect(startLimitBurst).To(Equal("StartLimitBurst=100"), "irqbalance must have StartLimitBurst=100 (OCPBUGS-45112)") })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around lines 389 - 395, The test currently trims the property string twice and asserts exact equality while the title says ">= 100"; change the assertion to parse the returned startLimitBurst string (from systemd.ShowProperty for "irqbalance.service") by trimming once, splitting the "StartLimitBurst=VALUE" form to extract the numeric VALUE, convert it to an int, and assert VALUE >= 100; remove the redundant strings.TrimSpace call on Line 394 and update the Expect message to reflect the >= 100 requirement (or alternatively change the test title to "= 100" if you prefer exact equality).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 389-395: The test currently trims the property string twice and
asserts exact equality while the title says ">= 100"; change the assertion to
parse the returned startLimitBurst string (from systemd.ShowProperty for
"irqbalance.service") by trimming once, splitting the "StartLimitBurst=VALUE"
form to extract the numeric VALUE, convert it to an int, and assert VALUE >=
100; remove the redundant strings.TrimSpace call on Line 394 and update the
Expect message to reflect the >= 100 requirement (or alternatively change the
test title to "= 100" if you prefer exact equality).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7258a7f7-0d83-4e72-adad-2a54a354ebf2
📒 Files selected for processing (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go
|
/retest |
2a142ce to
5e9b294
Compare
| Expect(err).ToNot(HaveOccurred()) | ||
| startLimitBurst = strings.TrimSpace(startLimitBurst) | ||
| testlog.Infof("irqbalance.service %s on node %s", startLimitBurst, targetNode.Name) | ||
| Expect(strings.TrimSpace(startLimitBurst)).To(Equal("StartLimitBurst=100"), "irqbalance must have StartLimitBurst=100 (OCPBUGS-45112)") |
There was a problem hiding this comment.
It would be better to check it is 100 or more here. In case we change it in the future. That is why I did not want to depend on the config, but I agree the full e2e test is more fragile and tests systemd behavior that should be covered by systemd itself.
There was a problem hiding this comment.
Added util to see property's value using --value flag.
Now checking if the value is >= 100.
My worry is - would we not prefer this to fail if the number increases in the future?
If it increases its likely because 100 wasn't enough - so the test would not flag a system with "old" value of 100.
For example if the old test was for >= 5 we wouldn't know if the old value or the new were there.
Maybe i'm overthinking or missing something. would like to hear your thoughts.
There was a problem hiding this comment.
Well, that is an interesting point. Failing the test to make sure the test stays aligned with the code. However, we do not own the code in this case and so a cri-o change would suddently fail our tests. I am not sure what is better.
|
/verified |
|
@oblau: The DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified by oblau |
|
@oblau: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@oblau: This pull request references Jira Issue OCPBUGS-45112, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Automating OCPBUGS-45112 - Config test CRI-O restarts irqbalance on every guaranteed pod create/delete. The default systemd StartLimitBurst=5 was too low Fix was drop-in from cri-o raising StartLimitBurst to 100. Checking for this updated value added systemd.ShowPropertyValue to utils/systemd/systemd.go Signed-off-by: oblau <oblau@redhat.com>
5e9b294 to
68c910d
Compare
|
/retest |
2 similar comments
|
/retest |
|
/retest |
|
@oblau: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This pr Is test automation for OCPBUGS-45112
The bug:
The underlying cause was the default configuration:
StartLimitIntervalUSec=10sStartLimitBurst=5meaning over 5 restarts in a 10 sec window would causeActiveState=failedEach pod create/delete with
irq-load-balancing.crio.io = "disable"annotation causes irqbalance to restart.In addition restarts coalesce meaning this is timing dependent thus not every deployment with more than 5 pods actually causes over 5 restarts.
The fix:
drop-in sets
StartLimitBurst=100.Checking this irqbalance systemd property is enough - no functional test is required
Summary by CodeRabbit