Here are results from the benchmark when we use just the delta to check for duplicates:
| Records | Delta % | deltaBounded | snapshotFullScan (baseline) | Speedup |
|---------|---------|--------------|-----------------------------|---------|
| 1M | 0.1% | 0.079 ms | 30.0 ms | ~380x |
| 1M | 1% | 0.801 ms | 30.2 ms | ~38x |
| 1M | 10% | 9.747 ms | 30.0 ms | ~3x |
| 5M | 0.1% | 0.593 ms | 249.1 ms | ~420x |
| 5M | 1% | 8.905 ms | 294.6 ms | ~33x |
| 5M | 10% | 130.499 ms | 276.5 ms | ~2x |
And this tracks: the smaller the delta %, the fewer ordinals there are to check, and those gains are multiplied by the size of the dataset.
There is one implication to call out about doing duplicate detection with deltas: delta detection won't find pre-existing duplicates. The snapshot path (when previousOrdinals is empty) does detect them, but if DuplicateDataDetectionValidator were added after a few cycles had already run and duplicates existed, those duplicates wouldn't necessarily be detected until a snapshot cycle or a producer restart.
It would be good to support both modes for the duplicate data detector, especially as the incremental validator rolls out; we should allow fallback to the old impl. The old impl would also remain useful in the few cases where even snapshots need to be validated for duplicates.
Curious to try another approach: could the duplicate detection happen in the validator impl (the only place it's needed) rather than in the primary key index? The key thing to figure out with that impl would be how to update the index so that, when the validator is invoked, the index corresponds to the prior read state engine. But implemented that way, it would be a cleaner separation of concerns. Here's why:
The index's job is to answer "given a key, what ordinal matches?" — it already does this well with getMatch(Object... keys). The delta-aware optimization is really a validation concern, not an indexing concern.
The validator could do this:
1. Get the PopulatedOrdinalListener (or equivalent) to find new ordinals
2. For each new ordinal, derive its key using HollowPrimaryKeyValueDeriver
3. Call the index's existing getMatch() — if the returned ordinal differs from the new ordinal, it's a duplicate
4. Cap at maxDuplicateKeys and stop early
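The steps above can be sketched roughly as follows. This is a simplified illustration, not Hollow code: the index, key deriver, and ordinal listener are stood in for by plain maps and sets, and all names here (`findDuplicateKeys`, `getMatch`, `keyByOrdinal`) are hypothetical.

```java
import java.util.*;

// Sketch of delta-bounded duplicate detection done at the validator level.
// All types are simplified stand-ins for the Hollow classes mentioned above
// (PopulatedOrdinalListener, HollowPrimaryKeyValueDeriver, the primary key
// index); method and variable names are illustrative.
public class DeltaDuplicateCheckSketch {

    // Stand-in for the index's existing lookup: key -> matching ordinal.
    static int getMatch(Map<String, Integer> index, String key) {
        return index.getOrDefault(key, -1);
    }

    // Validator-side check: only inspect ordinals added in this delta.
    static List<String> findDuplicateKeys(
            Map<String, Integer> index,        // prior-state index (key -> ordinal)
            Map<Integer, String> keyByOrdinal, // stand-in for the key deriver
            Set<Integer> newOrdinals,          // step 1: from the ordinal listener
            int maxDuplicateKeys) {
        List<String> duplicates = new ArrayList<>();
        for (int ordinal : newOrdinals) {
            String key = keyByOrdinal.get(ordinal); // step 2: derive the key
            int match = getMatch(index, key);       // step 3: existing lookup
            if (match != -1 && match != ordinal) {  // same key, different ordinal
                duplicates.add(key);
                if (duplicates.size() >= maxDuplicateKeys)
                    break;                          // step 4: cap and stop early
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new HashMap<>(Map.of("A", 0, "B", 1));
        Map<Integer, String> keyByOrdinal = Map.of(2, "A", 3, "C");
        // Ordinal 2 reuses key "A" (already at ordinal 0) -> duplicate.
        List<String> dups = findDuplicateKeys(index, keyByOrdinal, Set.of(2, 3), 10);
        System.out.println(dups); // prints [A]
    }
}
```

The work is proportional to the number of new ordinals rather than the full record count, which is where the deltaBounded numbers in the table come from.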
This has several advantages:
- Index stays simple — getDuplicateKeys() (full scan) is the only duplicate detection method on the index, easy to reason about correctness
- Delta optimization lives where the policy decision is — the validator decides whether to check all records or only new ones, rather than burying that decision inside the index
- No new state coupling — the index doesn't need to reach into PopulatedOrdinalListener for something that isn't core to indexing
- Testability — validator logic is tested at the validator level, index tests stay focused on lookup correctness
Full output: