OCPBUGS-45112: e2e: Add irqbalance crash test with 10 pods #1493

Open
oblau wants to merge 1 commit into openshift:main from oblau:automation/10-pods-crash-irqbalance

@oblau
Member

@oblau oblau commented Apr 14, 2026

This PR adds test automation for OCPBUGS-45112.

The bug:
The underlying cause was the default systemd configuration, StartLimitIntervalUSec=10s and StartLimitBurst=5: more than 5 restarts within a 10-second window would put the unit into ActiveState=failed.
Each create or delete of a pod with the irq-load-balancing.crio.io = "disable" annotation causes irqbalance to restart.
In addition, restarts coalesce, so the failure is timing dependent: not every deployment with more than 5 pods actually causes more than 5 restarts.

The fix:
A drop-in sets StartLimitBurst=100.
Checking this irqbalance systemd property is enough; no functional test is required.

Summary by CodeRabbit

  • Tests
    • Renamed a performance test suite for clearer identification of IRQBalance checks, improving test readability.
    • Added a new Tier0 end-to-end test that verifies the IRQBalance service startup limit burst is exactly 100, increasing coverage for system startup configuration and catching regressions early.

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Ginkgo suite title in the IRQBalance performance test was changed and a new Tier0 test was added to verify that irqbalance.service has a StartLimitBurst systemd property set exactly to StartLimitBurst=100.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| IRQBalance Performance Test (test/e2e/performanceprofile/functests/1_performance/irqbalance.go) | Updated suite title from "[performance] Checking IRQBalance settings" to "[performance] IRQBalance". Added a Tier0 test ([test_id:88711]) that calls systemd.ShowProperty on irqbalance.service, trims whitespace, logs the value, asserts no lookup error, and verifies the property equals StartLimitBurst=100. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 4

❌ Failed checks (3 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Line 391 contains Expect(err).ToNot(HaveOccurred()) without a meaningful failure message, while similar assertions throughout the file include diagnostic messages. | Add descriptive message to line 391 assertion: Expect(err).ToNot(HaveOccurred(), "failed to get irqbalance.service StartLimitBurst property from node %q", targetNode.Name) |
| Title check | ⚠️ Warning | The PR title mentions a crash test with 10 pods, but the actual changes only rename the test suite description and add a systemd property validation check for irqbalance.service StartLimitBurst, without implementing a 10-pod deployment test. | Update the title to accurately reflect the changes: 'e2e: Add irqbalance StartLimitBurst systemd property validation' or similar, or ensure the PR implementation matches the stated 10-pods crash test objective. |
| Microshift Test Compatibility | ❓ Inconclusive | Test file path test/e2e/performanceprofile/functests/1_performance/irqbalance.go could not be located in the repository to verify MicroShift API compatibility. | Provide access to the test file content or verify repository state to assess whether the test uses only standard Kubernetes APIs or includes MicroShift-incompatible resources. |
✅ Passed checks (8 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All Ginkgo test titles in the irqbalance.go file are stable and deterministic with no dynamic information that could change between runs. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The new test queries systemd properties on a single target node without requiring multiple nodes, pod scheduling, or cluster failover, making it SNO compatible. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | No topology-unfriendly scheduling constraints introduced. Test uses only simple node pinning via spec.NodeName without pod anti-affinity, topology spread constraints, or control-plane selectors. |
| Ote Binary Stdout Contract | ✅ Passed | The pull request does not violate the OTE Binary Stdout Contract. New test code is contained within an It() block where stdout is intercepted by Ginkgo framework. All logging uses testlog package writing to ginkgo.GinkgoWriter rather than directly to stdout. Helper functions with testlog calls are only invoked at test runtime from within test blocks, not at module initialization. Module-level code only instantiates Kubernetes clients without producing stdout output. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The test file contains no IPv4-specific assumptions or external connectivity requirements. |

@openshift-ci openshift-ci Bot requested review from MarSik and swatisehgal April 14, 2026 15:19
@openshift-ci
Contributor

openshift-ci Bot commented Apr 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: oblau
Once this PR has been reviewed and has the lgtm label, please assign marsik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)

400-412: Make deployment name/cleanup more flake-resistant.

Using a fixed name at Line 401 and strict Get in defer (Line 410) can create avoidable failures in reruns or partial-cleanup cases. Prefer a unique name and delete with not-found tolerance.

Proposed reliability diff
@@
-		dp := deployments.Make(
-			"test-deployment",
+		dpName := fmt.Sprintf("test-deployment-%d", time.Now().UnixNano())
+		dp := deployments.Make(
+			dpName,
 			testutils.NamespaceTesting,
 			deployments.WithPodTemplate(testpod),
 			deployments.WithReplicas(int32(repCount)),
 		)
@@
 		Expect(testclient.DataPlaneClient.Create(context.TODO(), dp)).To(Succeed())
 		testlog.Infof("Created deployment %s with %d replicas (irq-load-balancing disabled)", dp.Name, repCount)
 		defer func() {
-			Expect(testclient.DataPlaneClient.Get(context.TODO(), client.ObjectKeyFromObject(dp), dp)).To(Succeed())
-			Expect(testclient.DataPlaneClient.Delete(context.TODO(), dp)).To(Succeed())
+			err := client.IgnoreNotFound(testclient.DataPlaneClient.Delete(context.TODO(), dp))
+			Expect(err).ToNot(HaveOccurred())
 		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around
lines 400 - 412, The deployment uses a fixed name and a strict Get in the defer
which causes flakes; update the deployments.Make invocation that produces dp to
use a unique name (e.g., append a random/timestamp suffix to the base
"test-deployment") and simplify the defer cleanup to attempt deletion without
requiring a prior Get and to tolerate NotFound errors (use client.IgnoreNotFound
or check for apierrors.IsNotFound) when calling
testclient.DataPlaneClient.Delete; reference the dp variable, deployments.Make
call that creates the object, and the testclient.DataPlaneClient.Get/Delete
invocations to locate and change the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 421-426: The current check only verifies systemd.ShowProperty(...,
"ActiveState", "irqbalance.service") after rollout which can miss transient
crashes; before starting the workload capture the restart count via
systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts",
targetNode) (store as preRestarts) and after pods are available re-query
NRestarts and assert it equals preRestarts; keep the existing ActiveState check
but add the NRestarts equality assertion using the same targetNode and
systemd.ShowProperty calls to ensure no crash+restart occurred.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b663d0ec-2c8c-448a-923f-3fd80f47802b

📥 Commits

Reviewing files that changed from the base of the PR and between 3d98f7e and f4909af.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/utils/deployments/deployments.go

Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
@oblau oblau force-pushed the automation/10-pods-crash-irqbalance branch from f4909af to 373a658 Compare April 15, 2026 08:09
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)

422-426: ⚠️ Potential issue | 🟠 Major

Final ActiveState alone still doesn't prove “no crash”.

This can pass after a crash+restart cycle, so it does not fully validate OCPBUGS-45112. Capture NRestarts before creating the Deployment and assert it is unchanged after the replicas become available, while keeping the final ActiveState=active check.

Suggested hardening
 		annotations := map[string]string{irqLoadBalancingAnnotation: irqLoadBalancingDisable}
 		testpod := getTestPodWithProfileAndAnnotations(profile, annotations, workloadCPUsPerPod)
 		testpod.Spec.NodeName = targetNode.Name
+
+		restartsBefore, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts", targetNode)
+		Expect(err).ToNot(HaveOccurred())
+		restartsBefore = strings.TrimSpace(restartsBefore)
@@
 		By("Verifying irqbalance.service is still active after workload is running")
 		activeState, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "ActiveState", targetNode)
 		Expect(err).ToNot(HaveOccurred())
 		testlog.Infof("irqbalance.service %s on node %s", strings.TrimSpace(activeState), targetNode.Name)
 		Expect(strings.TrimSpace(activeState)).To(Equal("ActiveState=active"), "irqbalance must stay active while workload runs")
+
+		restartsAfter, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts", targetNode)
+		Expect(err).ToNot(HaveOccurred())
+		restartsAfter = strings.TrimSpace(restartsAfter)
+		Expect(restartsAfter).To(Equal(restartsBefore), "irqbalance restarted during workload; before=%s after=%s", restartsBefore, restartsAfter)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around
lines 422 - 426, Before creating the workload capture the irqbalance restart
count and later assert it didn't change: call
systemd.ShowProperty(context.TODO(), "irqbalance.service", "NRestarts",
targetNode) and store the initial value (e.g., initialNRestarts), then create
the Deployment, wait for replicas to be available, re-read NRestarts with
systemd.ShowProperty into finalNRestarts and
Expect(finalNRestarts).To(Equal(initialNRestarts)) in addition to the existing
ActiveState check that reads activeState; keep the existing log/test that
verifies strings.TrimSpace(activeState) == "ActiveState=active".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 410-413: The cleanup currently calls
testclient.DataPlaneClient.Delete(dp) but returns immediately, which leaves pods
running; change the cleanup to perform a synced delete and wait for the
Deployment and its replica Pods to be removed before returning: after calling
testclient.DataPlaneClient.Delete(context.TODO(), dp) use a
foreground/propagation deletion intent (or explicitly poll Get on the Deployment
via testclient.DataPlaneClient.Get and loop until it returns NotFound), and
additionally poll/List Pods (by dp's labels or OwnerReference on Pod) until 0
replicas remain; update the deferred function that references dp and
testclient.DataPlaneClient.Delete to block until both the Deployment resource is
gone and its Pods have terminated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: cd0c0e46-ac7c-4aec-9059-dcca0996effe

📥 Commits

Reviewing files that changed from the base of the PR and between f4909af and 373a658.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/utils/deployments/deployments.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/performanceprofile/functests/utils/deployments/deployments.go

Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
@oblau oblau force-pushed the automation/10-pods-crash-irqbalance branch from 373a658 to 4ad2e04 Compare April 15, 2026 12:20
Comment thread test/e2e/performanceprofile/functests/utils/deployments/deployments.go Outdated
Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
@oblau oblau force-pushed the automation/10-pods-crash-irqbalance branch from 4ad2e04 to c6a0443 Compare April 20, 2026 15:11
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 386-394: The test title and assertion are inconsistent and the
test ID placeholder needs replacing: update the It(...) description so it
matches the exact equality check currently performed by Expect (e.g., change
"should have irqbalance StartLimitBurst >= 100" to "should have irqbalance
StartLimitBurst = 100") or alternatively change the Expect assertion in the
block using systemd.ShowProperty/ startLimitBurst to assert >= 100 if that was
intended; also replace the "[test_id:TODO]" tag with a real traceable identifier
(for example the bug or testcase ID like "test_id:OCPBUGS-45112" or your
suite-specific test case ID) so the It(...) declaration, the Expect(...) call,
and the test id are consistent (refer to the It, Expect, startLimitBurst, and
systemd.ShowProperty usages in this test).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 798091b2-d8a4-4cb5-960b-1b5483917bbc

📥 Commits

Reviewing files that changed from the base of the PR and between 373a658 and c6a0443.

📒 Files selected for processing (1)
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go

Comment thread test/e2e/performanceprofile/functests/1_performance/irqbalance.go Outdated
@oblau oblau force-pushed the automation/10-pods-crash-irqbalance branch from c6a0443 to 2a142ce Compare April 20, 2026 16:16
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)

389-395: ⚠️ Potential issue | 🟡 Minor

Title vs. assertion mismatch still unresolved.

Line 389 advertises StartLimitBurst >= 100 while Line 394 asserts exact equality to "StartLimitBurst=100". Either loosen the assertion to >= 100 (parse the key=value and compare numerically) or tighten the title to = 100. Also, strings.TrimSpace(startLimitBurst) on Line 394 is redundant since startLimitBurst was already trimmed on Line 392.

Proposed tweak (if exact match is intended)
-	It("[test_id:88711] should have irqbalance StartLimitBurst >= 100", Label(string(label.Tier0)), func() {
+	It("[test_id:88711] should have irqbalance StartLimitBurst = 100", Label(string(label.Tier0)), func() {
 		startLimitBurst, err := systemd.ShowProperty(context.TODO(), "irqbalance.service", "StartLimitBurst", targetNode)
 		Expect(err).ToNot(HaveOccurred())
 		startLimitBurst = strings.TrimSpace(startLimitBurst)
 		testlog.Infof("irqbalance.service %s on node %s", startLimitBurst, targetNode.Name)
-		Expect(strings.TrimSpace(startLimitBurst)).To(Equal("StartLimitBurst=100"),
+		Expect(startLimitBurst).To(Equal("StartLimitBurst=100"),
 			"irqbalance must have StartLimitBurst=100 (OCPBUGS-45112)")
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around
lines 389 - 395, The test currently trims the property string twice and asserts
exact equality while the title says ">= 100"; change the assertion to parse the
returned startLimitBurst string (from systemd.ShowProperty for
"irqbalance.service") by trimming once, splitting the "StartLimitBurst=VALUE"
form to extract the numeric VALUE, convert it to an int, and assert VALUE >=
100; remove the redundant strings.TrimSpace call on Line 394 and update the
Expect message to reflect the >= 100 requirement (or alternatively change the
test title to "= 100" if you prefer exact equality).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7258a7f7-0d83-4e72-adad-2a54a354ebf2

📥 Commits

Reviewing files that changed from the base of the PR and between c6a0443 and 2a142ce.

📒 Files selected for processing (1)
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go

@oblau
Member Author

oblau commented Apr 21, 2026

/retest

@oblau oblau force-pushed the automation/10-pods-crash-irqbalance branch from 2a142ce to 5e9b294 Compare April 27, 2026 06:13
Expect(err).ToNot(HaveOccurred())
startLimitBurst = strings.TrimSpace(startLimitBurst)
testlog.Infof("irqbalance.service %s on node %s", startLimitBurst, targetNode.Name)
Expect(strings.TrimSpace(startLimitBurst)).To(Equal("StartLimitBurst=100"), "irqbalance must have StartLimitBurst=100 (OCPBUGS-45112)")
Contributor

It would be better to check it is 100 or more here. In case we change it in the future. That is why I did not want to depend on the config, but I agree the full e2e test is more fragile and tests systemd behavior that should be covered by systemd itself.

Member Author

Added a util that reads a property's value using the --value flag.
Now checking that the value is >= 100.
My worry is: would we not prefer this test to fail if the number increases in the future?
If it increases, it's likely because 100 wasn't enough, so the test would not flag a system that still has the "old" value of 100.

For example, if the old test had checked for >= 5, we wouldn't know whether the old value or the new one was in place.

Maybe I'm overthinking or missing something. I would like to hear your thoughts.

Contributor

Well, that is an interesting point: failing the test to make sure the test stays aligned with the code. However, we do not own the code in this case, so a cri-o change would suddenly fail our tests. I am not sure which is better.

@oblau
Member Author

oblau commented Apr 27, 2026

/verified

@openshift-ci-robot
Contributor

@oblau: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

Details

In response to this:

/verified

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@oblau
Member Author

oblau commented Apr 27, 2026

/verified by oblau

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 27, 2026
@openshift-ci-robot
Contributor

@oblau: This PR has been marked as verified by oblau.

Details

In response to this:

/verified by oblau


@oblau oblau changed the title e2e: Add irqbalance crash test with 10 pods OCPBUGS-45112: e2e: Add irqbalance crash test with 10 pods Apr 27, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 27, 2026
@openshift-ci-robot
Contributor

@oblau: This pull request references Jira Issue OCPBUGS-45112, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.19.0" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Done-Errata) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Automating OCPBUGS-45112 - Config test
CRI-O restarts irqbalance on every guaranteed pod create/delete.
The default systemd StartLimitBurst=5 was too low.
The fix was a drop-in from CRI-O raising StartLimitBurst to 100.
This test checks for the updated value.

Added systemd.ShowPropertyValue to utils/systemd/systemd.go.

Signed-off-by: oblau <oblau@redhat.com>
@oblau oblau force-pushed the automation/10-pods-crash-irqbalance branch from 5e9b294 to 68c910d Compare April 27, 2026 10:12
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 27, 2026
@oblau
Member Author

oblau commented Apr 28, 2026

/retest

2 similar comments
@oblau
Member Author

oblau commented Apr 28, 2026

/retest

@oblau
Member Author

oblau commented Apr 29, 2026

/retest

@openshift-ci
Contributor

openshift-ci Bot commented Apr 29, 2026

@oblau: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-hypershift-pao | 68c910d | link | true | /test e2e-hypershift-pao |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

5 participants