Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example) #17749

timothy-e wants to merge 5 commits into apache:master
Conversation
…stamps (long.min for example) (apache#524)

### Notify

cc stripe-private-oss-forks/pinot-reviewers

### Summary

When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0, so the e2e lag showed as ~56 years (dating back to 1970). This fix adds an early return of 0 for invalid/missing timestamps, restoring the original behavior from upstream commit [bea67d04](apache@bea67d04).

I'm going to OSS this as well, but for the sake of not slowing down Pinot 1.5, I need to get it in here faster. Investigated more in https://docs.google.com/document/d/19EUPSq2xjEBiGHynGgZmTMZOIB7zmm2FwdzdC8FjqHg/edit?tab=t.0

After deployment in rad-canary, we can see the fix in action: the dedup table doesn't use the header, but the non-dedup ones do, and our upstream code sets a value of 1 hour ago, hence the 1-hour lag being shown. We don't see the 56 years anymore.

### Motivation

https://jira.corp.stripe.com/browse/STREAMANALYTICS-4191

### Testing

Deployed on rad-canary QA and the metric looks correct now. Also updated tests:

```
$ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false
[INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

### Rollout/monitoring/revert plan

s:pinot-rad-canary

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-01-30T21:51:39Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524
cc stripe-private-oss-forks/pinot-reviewers

https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524 fixed `getPartitionEndToEndIngestionDelayMs`, but `getPartitionIngestionDelayMs` has the same code pattern. We saw the [ingestion lag for a table hit 56.2 years](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/d/DltsoWuVk/pinot-components?orgId=1&var-pinot_cluster=rad-noir&from=1770735477256&to=1770757067256&viewPanel=88). Follow the same pattern to avoid returning the timestamp when `_firstStreamIngestionTimeMs` is too low.

```
mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false
```

```
[INFO] Results:
[INFO]
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15.716 s
[INFO] Finished at: 2026-02-10T17:31:51-05:00
[INFO] ------------------------------------------------------------------------
[INFO] 12 goals, 12 executed
```

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-02-11T14:37:30Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/528
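The arithmetic behind both bugs can be pictured with a minimal sketch. This is not the real `IngestionDelayTracker` code; the method names and the use of 0/`Long.MIN_VALUE` as "unset" markers are simplified stand-ins to show why an unguarded subtraction produces an epoch-sized lag.

```java
import java.time.Clock;

// Minimal sketch of the failure mode: NOT the actual Pinot implementation.
public class IngestionDelaySketch {
    private static final Clock CLOCK = Clock.systemUTC();

    // Buggy pattern: an unset timestamp defaults to 0, so "now - 0" is the full
    // epoch offset (~1.7 trillion ms, read by dashboards as ~56 years of lag).
    static long buggyDelayMs(long ingestionTimeMs) {
        return CLOCK.millis() - ingestionTimeMs;
    }

    // Fixed pattern: guard invalid/missing timestamps with an early return,
    // as in the first iteration of this fix (later revised to return null
    // plus a validity metric).
    static long fixedDelayMs(long ingestionTimeMs) {
        if (ingestionTimeMs <= 0) {
            return 0; // invalid or never-set timestamp: report no lag
        }
        return Math.max(0, CLOCK.millis() - ingestionTimeMs);
    }

    public static void main(String[] args) {
        System.out.println(buggyDelayMs(0));              // epoch-sized value
        System.out.println(fixedDelayMs(Long.MIN_VALUE)); // 0
    }
}
```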
@KKcorps @swaminathanmanish @9aman can you please take a look at this change? Thanks!
cc @noob-se7en |
Pull request overview
Fixes incorrect “time since epoch” ingestion lag values for partitions without valid timestamps by making ingestion delay getters return 0 when ingestion info is missing or timestamps are invalid (e.g., Long.MIN_VALUE), restoring prior intended behavior.
Changes:
- Return `0` from `getPartitionEndToEndIngestionDelayMs()` when `ingestionInfo` is missing or the first-stream ingestion timestamp is invalid.
- Return `0` from `getPartitionIngestionDelayMs()` when `ingestionInfo` is missing or the ingestion timestamp is invalid.
- Update unit tests to assert `0` delay for untracked/stopped partitions and for partitions with no `updateMetrics()` calls.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java | Adds early returns to prevent computing delays against an implicit 0 timestamp when ingestion info is missing/invalid. |
| pinot-core/src/test/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTrackerTest.java | Adjusts assertions to validate 0-delay behavior for untracked partitions and metrics sampling without `updateMetrics()`. |
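The updated test expectations can be pictured with a minimal stand-in: a partition that was never tracked, or whose tracking was stopped, should report a delay of 0 rather than "now minus epoch". The class below is a simplified model with plain asserts, not the real tracker or its TestNG suite.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the 0-delay-for-untracked-partitions behavior.
// NOT the real IngestionDelayTracker; method names are illustrative.
public class UntrackedPartitionSketch {
    private final Map<Integer, Long> ingestionTimeMsByPartition = new ConcurrentHashMap<>();

    long getPartitionIngestionDelayMs(int partitionId) {
        Long ingestionTimeMs = ingestionTimeMsByPartition.get(partitionId);
        if (ingestionTimeMs == null || ingestionTimeMs < 0) {
            return 0; // untracked or never-updated partitions report 0, not "now - 0"
        }
        return Math.max(0, System.currentTimeMillis() - ingestionTimeMs);
    }

    void update(int partitionId, long ingestionTimeMs) {
        ingestionTimeMsByPartition.put(partitionId, ingestionTimeMs);
    }

    void stopTracking(int partitionId) {
        ingestionTimeMsByPartition.remove(partitionId);
    }
}
```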
```java
if (ingestionInfo == null || ingestionInfo._ingestionTimeMs < 0) {
  return 0;
}
```
If the consumer is stuck and has not updated its ingestion state, this will incorrectly report the lag as zero.
Wouldn't that only happen if the consumer never created an IngestionInfo / never updated the _ingestionInfoMap? I feel like that's a separate problem with the consumer's bootstrapping, so trying to capture it in ingestion delay doesn't seem right to me. Open to suggestions though.
There have been cases where the consumer ends up in a deadlock or the consumer thread just dies.
With this PR's change the lag will just be reported as zero, and that seems like something we definitely shouldn't do. Worst case, we can just report what we were previously reporting: (current timestamp - 0).
What if we simply don't emit the metric if there is no ingestion info? You don't want 0 emitted when it is not explicitly 0, but we don't want 56 years emitted. Not emitting in these cases satisfies both requirements.
Sure. Any approach where the user can get alerted that something is wrong with ingestion will work.
If we are not reporting the metric, I believe that in order to catch this, users will have to add a new alert that fires when the metric is missing.
We could also expose a binary metric that gets enabled when ingestionInfo is null and we don't emit the original metric.
Thanks for the suggestion! I updated the PR to return null instead of 0 when invalid, and to expose a binary metric.
❌ 1 Tests Failed (1 ❄️ flaky test).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0, with the e2e lag showing as ~56 years (dating back to 1970). This fix adds an early null return for invalid/missing timestamps. The upstream commit bea67d04 previously returned 0 in such cases, but that's also not desirable, because cases where nothing is published can appear fine in the metrics.
A new metric is added to report whether ingestion data is valid, to handle cases where `firstStreamIngestionTimeMs` is never set properly.
A version of this (returning 0, not null) has been running internally on Stripe's Pinot clusters for a few weeks with no issues. Prior to this bug fix, we saw the 56-year ingestion lag in several cases (sometimes intermittently).
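The final shape described above can be sketched roughly as follows. The class, field, and method names here are simplified stand-ins (not Pinot's actual API): the getter returns null when there is nothing meaningful to report, callers skip metric emission on null, and a separate binary validity flag backs the new metric.

```java
import java.time.Clock;

// Illustrative sketch of the final approach: null for invalid data plus a
// binary "ingestion data valid" flag, instead of emitting 0 or ~56 years.
// Names are simplified stand-ins, not the real Pinot classes or metrics.
public class EndToEndDelaySketch {
    static final class IngestionInfo {
        final long firstStreamIngestionTimeMs;
        IngestionInfo(long firstStreamIngestionTimeMs) {
            this.firstStreamIngestionTimeMs = firstStreamIngestionTimeMs;
        }
    }

    private final Clock clock = Clock.systemUTC();
    private boolean ingestionDataValid = true; // backing state for the binary metric

    // Returns null when ingestion info is missing/invalid; callers skip emission.
    Long getPartitionEndToEndIngestionDelayMs(IngestionInfo info) {
        if (info == null || info.firstStreamIngestionTimeMs < 0) {
            ingestionDataValid = false; // flip the validity flag, emit no lag value
            return null;
        }
        ingestionDataValid = true;
        return Math.max(0, clock.millis() - info.firstStreamIngestionTimeMs);
    }

    boolean isIngestionDataValid() {
        return ingestionDataValid;
    }
}
```

This satisfies both review concerns: a stuck consumer no longer looks healthy (the validity metric flips and the lag series goes missing, which can be alerted on), and dashboards never see an epoch-sized value.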