Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example) #17749

timothy-e wants to merge 5 commits into apache:master
Conversation
…stamps (long.min for example) (apache#524)

### Notify

cc stripe-private-oss-forks/pinot-reviewers

### Summary

When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0, so the e2e lag showed as ~56 years (dating back to 1970). This fix adds an early return of 0 for invalid/missing timestamps, restoring the original behavior from upstream commit [bea67d04](apache@bea67d04).

I'm going to OSS this as well, but for the sake of not slowing down Pinot 1.5, I need to get it in here faster. Investigated more in https://docs.google.com/document/d/19EUPSq2xjEBiGHynGgZmTMZOIB7zmm2FwdzdC8FjqHg/edit?tab=t.0

After deployment in rad-canary, we can see the fix in action: the dedup table doesn't use the header, but the non-dedup ones do, and our upstream code sets a value of 1 hour ago, hence the 1-hour lag being shown. We don't see the 56 years anymore.

### Motivation

https://jira.corp.stripe.com/browse/STREAMANALYTICS-4191

### Testing

Deployed on rad-canary QA and the metric looks correct now. Also updated tests:

```
$ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false
[INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

### Rollout/monitoring/revert plan

s:pinot-rad-canary

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-01-30T21:51:39Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524
cc stripe-private-oss-forks/pinot-reviewers

https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524 fixed `getPartitionEndToEndIngestionDelayMs`, but `getPartitionIngestionDelayMs` has the same code pattern. We saw the [ingestion lag for a table hit 56.2 years](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/d/DltsoWuVk/pinot-components?orgId=1&var-pinot_cluster=rad-noir&from=1770735477256&to=1770757067256&viewPanel=88). Follow the same pattern to avoid returning the timestamp when `_firstStreamIngestionTimeMs` is too low.

```
mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false
```

```
[INFO] Results:
[INFO]
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15.716 s
[INFO] Finished at: 2026-02-10T17:31:51-05:00
[INFO] ------------------------------------------------------------------------
[INFO] 12 goals, 12 executed
```

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-02-11T14:37:30Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/528
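The arithmetic behind both bugs can be pictured with a minimal sketch. This is not the real `IngestionDelayTracker` code; the method names and the use of 0/`Long.MIN_VALUE` as "unset" markers are simplified stand-ins to show why an unguarded subtraction produces an epoch-sized lag.

```java
import java.time.Clock;

// Minimal sketch of the failure mode: NOT the actual Pinot implementation.
public class IngestionDelaySketch {
    private static final Clock CLOCK = Clock.systemUTC();

    // Buggy pattern: an unset timestamp defaults to 0, so "now - 0" is the full
    // epoch offset (~1.7 trillion ms, read by dashboards as ~56 years of lag).
    static long buggyDelayMs(long ingestionTimeMs) {
        return CLOCK.millis() - ingestionTimeMs;
    }

    // Fixed pattern: guard invalid/missing timestamps with an early return,
    // as in the first iteration of this fix (later revised to return null
    // plus a validity metric).
    static long fixedDelayMs(long ingestionTimeMs) {
        if (ingestionTimeMs <= 0) {
            return 0; // invalid or never-set timestamp: report no lag
        }
        return Math.max(0, CLOCK.millis() - ingestionTimeMs);
    }

    public static void main(String[] args) {
        System.out.println(buggyDelayMs(0));              // epoch-sized value
        System.out.println(fixedDelayMs(Long.MIN_VALUE)); // 0
    }
}
```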
@KKcorps @swaminathanmanish @9aman can you please take a look at this change? Thanks!
cc @noob-se7en |
Pull request overview
Fixes incorrect “time since epoch” ingestion lag values for partitions without valid timestamps by making ingestion delay getters return 0 when ingestion info is missing or timestamps are invalid (e.g., Long.MIN_VALUE), restoring prior intended behavior.
Changes:
- Return `0` from `getPartitionEndToEndIngestionDelayMs()` when `ingestionInfo` is missing or the first-stream ingestion timestamp is invalid.
- Return `0` from `getPartitionIngestionDelayMs()` when `ingestionInfo` is missing or the ingestion timestamp is invalid.
- Update unit tests to assert `0` delay for untracked/stopped partitions and for partitions with no `updateMetrics()` calls.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java | Adds early returns to prevent computing delays against an implicit 0 timestamp when ingestion info is missing/invalid. |
| pinot-core/src/test/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTrackerTest.java | Adjusts assertions to validate 0-delay behavior for untracked partitions and metrics sampling without `updateMetrics()`. |
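The updated test expectations can be pictured with a minimal stand-in: a partition that was never tracked, or whose tracking was stopped, should report a delay of 0 rather than "now minus epoch". The class below is a simplified model with plain asserts, not the real tracker or its TestNG suite.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the 0-delay-for-untracked-partitions behavior.
// NOT the real IngestionDelayTracker; method names are illustrative.
public class UntrackedPartitionSketch {
    private final Map<Integer, Long> ingestionTimeMsByPartition = new ConcurrentHashMap<>();

    long getPartitionIngestionDelayMs(int partitionId) {
        Long ingestionTimeMs = ingestionTimeMsByPartition.get(partitionId);
        if (ingestionTimeMs == null || ingestionTimeMs < 0) {
            return 0; // untracked or never-updated partitions report 0, not "now - 0"
        }
        return Math.max(0, System.currentTimeMillis() - ingestionTimeMs);
    }

    void update(int partitionId, long ingestionTimeMs) {
        ingestionTimeMsByPartition.put(partitionId, ingestionTimeMs);
    }

    void stopTracking(int partitionId) {
        ingestionTimeMsByPartition.remove(partitionId);
    }
}
```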
```java
if (ingestionInfo == null || ingestionInfo._ingestionTimeMs < 0) {
  return 0;
}
```
If the consumer is stuck and has not updated its ingestion state, this will incorrectly report the lag as zero.
Wouldn't that only happen if the consumer never created an IngestionInfo / never updated the _ingestionInfoMap? I feel like that's a separate problem with the consumer's bootstrapping, so trying to capture it in ingestion delay doesn't seem right to me. Open to suggestions though.
There have been cases where the consumer ends up in a deadlock or the consumer thread just dies.
With this PR's change the lag will just be reported as zero, and that seems like something we definitely shouldn't do. Worst case, we can just report what we were previously reporting: (current timestamp - 0).
What if we simply don't emit the metric if there is no ingestion info? You don't want 0 emitted when it is not explicitly 0, but we don't want 56 years emitted. Not emitting in these cases satisfies both requirements.
Sure. Any approach where the user can get alerted that something is wrong with ingestion will work.
If we are not reporting the metric, I believe that in order to catch this, users will have to add a new alert that fires when the metric is missing.
We could also expose a binary metric that gets enabled when ingestionInfo is null and we don't emit the original metric.
Thanks for the suggestion! I updated the PR to return null instead of 0 when invalid, and to expose a binary metric.
❌ 1 Tests Failed (1 ❄️ flaky test).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0, with the e2e lag showing as ~56 years (dating back to 1970). This fix adds an early null return for invalid/missing timestamps. The upstream commit bea67d04 previously returned 0 in such cases, but that's also not desirable, because cases where nothing is published can appear fine in the metrics.
A new metric is added to report whether ingestion data is valid, to handle cases where `firstStreamIngestionTimeMs` is never set properly.
A version of this (returning 0, not null) has been running internally on Stripe's Pinot clusters for a few weeks with no issues. Prior to this bug fix, we saw the 56-year ingestion lag in several cases (sometimes intermittently).
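The final shape described above can be sketched roughly as follows. The class, field, and method names here are simplified stand-ins (not Pinot's actual API): the getter returns null when there is nothing meaningful to report, callers skip metric emission on null, and a separate binary validity flag backs the new metric.

```java
import java.time.Clock;

// Illustrative sketch of the final approach: null for invalid data plus a
// binary "ingestion data valid" flag, instead of emitting 0 or ~56 years.
// Names are simplified stand-ins, not the real Pinot classes or metrics.
public class EndToEndDelaySketch {
    static final class IngestionInfo {
        final long firstStreamIngestionTimeMs;
        IngestionInfo(long firstStreamIngestionTimeMs) {
            this.firstStreamIngestionTimeMs = firstStreamIngestionTimeMs;
        }
    }

    private final Clock clock = Clock.systemUTC();
    private boolean ingestionDataValid = true; // backing state for the binary metric

    // Returns null when ingestion info is missing/invalid; callers skip emission.
    Long getPartitionEndToEndIngestionDelayMs(IngestionInfo info) {
        if (info == null || info.firstStreamIngestionTimeMs < 0) {
            ingestionDataValid = false; // flip the validity flag, emit no lag value
            return null;
        }
        ingestionDataValid = true;
        return Math.max(0, clock.millis() - info.firstStreamIngestionTimeMs);
    }

    boolean isIngestionDataValid() {
        return ingestionDataValid;
    }
}
```

This satisfies both review concerns: a stuck consumer no longer looks healthy (the validity metric flips and the lag series goes missing, which can be alerted on), and dashboards never see an epoch-sized value.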