Skip to content

Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example)#17749

Open
timothy-e wants to merge 5 commits intoapache:masterfrom
timothy-e:fix-ingestion-delay
Open

Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example)#17749
timothy-e wants to merge 5 commits intoapache:masterfrom
timothy-e:fix-ingestion-delay

Conversation

@timothy-e
Copy link

@timothy-e timothy-e commented Feb 23, 2026

When ingestionInfo is null or firstStreamIngestionTimeMs is negative (e.g., Long.MIN_VALUE), the method incorrectly computed clock.millis() - 0, returning ~1.7 trillion ms instead of 0 and e2e lag showing as ~56 years (1970).

This fix adds an early null return for invalid/missing timestamps. The upstream commit bea67d04 previously returned 0 in such cases, but that's also not desirable because cases where nothing is published can appear in the metrics to be fine.

A new metric is added to report if ingestion data is valid, to handle cases where firstStreamIngestionTimeMs is never set properly.

$ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false

[INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

A version of this (returning 0, not null) has been running internally on Stripe's Pinot clusters for a few weeks with no issues - prior to this bug fix, we saw 56 year ingestion lag in several cases (sometimes intermittently).

priyen-stripe and others added 2 commits February 23, 2026 15:26
…stamps (long.min for example) (apache#524)

### Notify
cc stripe-private-oss-forks/pinot-reviewers

### Summary
When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0 and e2e lag showing as ~56 years (1970).

This fix adds an early return of 0 for invalid/missing timestamps, restoring the original behavior from upstream commit [bea67d04](apache@bea67d04).

I'm going to OSS this also, but for the sake of not slowing down pinot 1.5; need to get this in here faster

investigated more in https://docs.google.com/document/d/19EUPSq2xjEBiGHynGgZmTMZOIB7zmm2FwdzdC8FjqHg/edit?tab=t.0

but after deployment in rad-canary, we can see the fix in action:
![image](https://git.corp.stripe.com/user-attachments/assets/f60b44d9-4379-42a0-8cb0-a359c6276b86)

i've made it show the dedup table doesnt use the header, but the non-dedup ones do, and our upstream code sets a value of 1 hour ago, hence the 1 hour lag being shown

we don't see the 56 years anymore prior:
![image](https://git.corp.stripe.com/user-attachments/assets/b0db590b-d1b3-4033-b54e-68290aa2ee3c)



### Motivation
https://jira.corp.stripe.com/browse/STREAMANALYTICS-4191

### Testing
deployed on rad-canary QA and metric looks correct now

also updated tests
```
```
$ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false

[INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```
```

### Rollout/monitoring/revert plan
s:pinot-rad-canary


Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-01-30T21:51:39Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524
cc stripe-private-oss-forks/pinot-reviewers

https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524 fixed getPartitionEndToEndIngestionDelayMs, but getPartitionIngestionDelayMs has the same code pattern.

We saw the [ingestion lag for a table hit 56.2 years](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/d/DltsoWuVk/pinot-components?orgId=1&var-pinot_cluster=rad-noir&from=1770735477256&to=1770757067256&viewPanel=88).

Follow the same pattern to avoid returning the timestamp if the _firstStreamIngestionTimeMs is too low.

```
mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false
```
```
[INFO] Results:
[INFO]
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  15.716 s
[INFO] Finished at: 2026-02-10T17:31:51-05:00
[INFO] ------------------------------------------------------------------------
[INFO] 12 goals, 12 executed
```

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-02-11T14:37:30Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/528
@timothy-e
Copy link
Author

@KKcorps @swaminathanmanish @9aman can you please TAL at this change? Thanks!

@yashmayya
Copy link
Contributor

cc @noob-se7en

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes incorrect “time since epoch” ingestion lag values for partitions without valid timestamps by making ingestion delay getters return 0 when ingestion info is missing or timestamps are invalid (e.g., Long.MIN_VALUE), restoring prior intended behavior.

Changes:

  • Return 0 from getPartitionEndToEndIngestionDelayMs() when ingestionInfo is missing or the first-stream ingestion timestamp is invalid.
  • Return 0 from getPartitionIngestionDelayMs() when ingestionInfo is missing or the ingestion timestamp is invalid.
  • Update unit tests to assert 0 delay for untracked/stopped partitions and for partitions with no updateMetrics() calls.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java Adds early returns to prevent computing delays against an implicit 0 timestamp when ingestion info is missing/invalid.
pinot-core/src/test/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTrackerTest.java Adjusts assertions to validate 0 delay behavior for untracked partitions and metrics sampling without updateMetrics().

Comment on lines +577 to +578
if (ingestionInfo == null || ingestionInfo._ingestionTimeMs < 0) {
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If consumer is stuck and has not updated its ingestion state, This will incorrectly report that delay lag as zero.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that only happen if the consumer never created an IngestionInfo / updated the _ingestionInfoMap? I feel like that's a seperate problem with the consumer's bootstrapping, so trying to capture that in ingestion delay doesn't seem right to me. Open to suggestions though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has been cases where consumer ends up in deadlock or consumer thread just died.

As per this Pr's change the lag will be just reported as zero, and that to me seems definitely we shouldnt do. Worst case we can just report what we were previously doing -> (current timestamp - 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we simply don't emit the metric if there is no ingestion info? You don't want 0 emitted when it is not explicitly 0, but we don't want 56 years emitted. Not emitting in these cases satisfies both requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Any approach where user can get alerted that there is something wrong with ingestion will work.
Incase we are not reporting metric I believe inorder to catch this users will have to add some new alert which can catch metric is missing.
We can also expose a binary metric that gets enabled when ingestionInfo is null and we dont emit the original metric.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I updated the PR to return null instead of 0 when invalid, and expose a binary metric

@codecov-commenter
Copy link

codecov-commenter commented Feb 23, 2026

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
10809 1 10808 73
View the top 2 failed test(s) by shortest run time
org.apache.pinot.controller.api.PinotTableRestletResourceTest::testTableTasksCleanupWithNonActiveTasks
Stack Traces | 0.292s run time
org.apache.pinot.common.exception.HttpErrorStatusException: Got error status code: 500 (Internal Server Error) with reason: &quot;Failed to delete job: TaskQueue_SegmentGenerationAndPushTask_Task_SegmentGenerationAndPushTask_5888f0ff-c8b0-48f4-bd30-5e2ec5317721_1771883163334 from queue: TaskQueue_SegmentGenerationAndPushTask&quot; while sending request: /tables/testTableTasksCleanup to controller: runnervmwffz4.r4talzohrrqu3kjbgbocduguig.ex.internal.cloudapp.net, version: Unknown
org.apache.pinot.controller.api.PinotTableRestletResourceTest::testTableTasksCleanupWithNonActiveTasks
Stack Traces | 0.349s run time
org.apache.pinot.common.exception.HttpErrorStatusException: Got error status code: 500 (Internal Server Error) with reason: "Failed to delete job: TaskQueue_SegmentGenerationAndPushTask_Task_SegmentGenerationAndPushTask_5888f0ff-c8b0-48f4-bd30-5e2ec5317721_1771883163334 from queue: TaskQueue_SegmentGenerationAndPushTask" while sending request: /tables/testTableTasksCleanup to controller: runnervmwffz4.r4talzohrrqu3kjbgbocduguig.ex.internal.cloudapp.net, version: Unknown
View the full list of 1 ❄️ flaky test(s)
org.apache.pinot.integration.tests.ExactlyOnceKafkaRealtimeClusterIntegrationTest::setUp

Flake rate in main: 100.00% (Passed 0 times, Failed 24 times)

Stack Traces | 1258s run time
Failed to load 115545 documents; current count=57801 for table=mytable expected [115545] but found [57801]

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

timothy-e and others added 2 commits February 24, 2026 10:00
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@timothy-e timothy-e changed the title Fix getPartition(EndToEnd)IngestionDelayMs to return 0 for invalid timestamps (long.min for example) Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example) Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants