[flink][test] Fix OOM when startTaskManager in FlinkMetricsITCase #2864

Open
Prajwal-banakar wants to merge 1 commit into apache:main from Prajwal-banakar:Bug-fix-#2744

Conversation

@Prajwal-banakar
Contributor

Purpose

Linked issue: close #2744

Fixes an OutOfMemoryError: Could not allocate enough memory segments for NetworkBufferPool that occurred when TaskManagerRunner.startTaskManager was called in FlinkMetricsITCase (and its Flink-version subclasses Flink119MetricsITCase, Flink120MetricsITCase, etc.) during sequential IT case execution in the same JVM fork.

Brief change log

The root cause is that MiniClusterWithClientResource allocates JVM direct memory via NetworkBufferPool during before(), and this memory was not reliably released between test classes, exhausting the JVM direct memory budget for subsequent classes.
Three changes were made to FlinkMetricsITCase:

beforeAll: Wrap MINI_CLUSTER_EXTENSION.before() in a try/catch that explicitly calls MINI_CLUSTER_EXTENSION.after() on failure. JUnit 5 does not invoke @AfterAll when @BeforeAll throws, so without this, any direct memory partially allocated before the failure would never be freed.
afterAll: Wrap resource cleanup in a try/finally block so that MINI_CLUSTER_EXTENSION.after() is always called even if admin.close() or conn.close() throws.
buildTestConfig: Reduce the NetworkBufferPool size from the default 64MB to 32MB via taskmanager.memory.network.min/max. These tests do not exercise high-throughput network paths, so the smaller size is sufficient and reduces direct memory pressure when multiple IT cases run in the same JVM fork.
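
The three changes above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code in the PR: FakeCluster, LifecycleSketch, and networkMemoryOverrides are hypothetical stand-ins, the admin and conn clients are modeled as plain AutoCloseable values, and the configuration is shown as a plain map, whereas the real test presumably uses Flink's own cluster resource and configuration classes. Only the two taskmanager.memory.network.* keys and the 32 MB value come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a mini-cluster resource with before()/after() hooks. */
class FakeCluster {
    final List<String> events = new ArrayList<>();
    private final boolean failOnStart;

    FakeCluster(boolean failOnStart) {
        this.failOnStart = failOnStart;
    }

    void before() {
        events.add("before");
        if (failOnStart) {
            throw new IllegalStateException("simulated startup failure");
        }
    }

    void after() {
        events.add("after");
    }
}

/** Hypothetical helper class; the method names mirror the ones in the description. */
class LifecycleSketch {
    /** Setup: if before() throws, release whatever was allocated, then rethrow. */
    static void beforeAll(FakeCluster cluster) {
        try {
            cluster.before();
        } catch (RuntimeException e) {
            try {
                cluster.after();
            } catch (RuntimeException cleanupFailure) {
                e.addSuppressed(cleanupFailure); // keep the original failure as the primary one
            }
            throw e;
        }
    }

    /** Teardown: close the clients, but shut the cluster down even if a close() throws. */
    static void afterAll(FakeCluster cluster, AutoCloseable admin, AutoCloseable conn)
            throws Exception {
        try {
            admin.close();
            conn.close();
        } finally {
            cluster.after();
        }
    }

    /** Network memory pinned to 32 MB; shown as a plain map rather than a Flink Configuration. */
    static Map<String, String> networkMemoryOverrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("taskmanager.memory.network.min", "32mb");
        conf.put("taskmanager.memory.network.max", "32mb");
        return conf;
    }

    public static void main(String[] args) throws Exception {
        FakeCluster failing = new FakeCluster(true);
        try {
            beforeAll(failing);
        } catch (IllegalStateException expected) {
            System.out.println("failed start events: " + failing.events);
        }

        FakeCluster healthy = new FakeCluster(false);
        beforeAll(healthy);
        AutoCloseable brokenAdmin = () -> {
            throw new IllegalStateException("simulated close failure");
        };
        try {
            afterAll(healthy, brokenAdmin, () -> { });
        } catch (IllegalStateException expected) {
            System.out.println("teardown events: " + healthy.events);
        }
        System.out.println(networkMemoryOverrides());
    }
}
```

In both the failed-setup and failed-close paths of this sketch, after() is still recorded, which is the property the change relies on to return direct memory before the next test class starts.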

Tests

Flink118MetricsITCase — passes
Flink119MetricsITCase — passes
Flink120MetricsITCase — passes
Flink22MetricsITCase — passes
Full fluss-flink-1.20 module (mvn verify -pl fluss-flink/fluss-flink-1.20 -am) — BUILD SUCCESS (225 IT tests, 0 failures), confirming no regressions introduced by this change

API and Format

No API or storage format changes

Documentation

No new feature introduced. No documentation changes required.

@Prajwal-banakar
Contributor Author

CC @loserwang1024

Development

Successfully merging this pull request may close these issues.

[test] OutOfMemoryError when startTaskManager in Flink IT case
