Speed up duplicate data validator#802

Open
AustinWheel wants to merge 5 commits into master from awheeler/duplicate-data-validator-speedup
Conversation

AustinWheel (Collaborator) commented Mar 10, 2026

Here are results from the benchmark when we use just the delta to check for duplicates:

| Records | Delta % | deltaBounded | snapshotFullScan (baseline) | Speedup |
|--------:|--------:|-------------:|----------------------------:|--------:|
| 1M      | 0.1%    | 0.079 ms     | 30.0 ms                     | ~380x   |
| 1M      | 1%      | 0.801 ms     | 30.2 ms                     | ~38x    |
| 1M      | 10%     | 9.747 ms     | 30.0 ms                     | ~3x     |
| 5M      | 0.1%    | 0.593 ms     | 249.1 ms                    | ~420x   |
| 5M      | 1%      | 8.905 ms     | 294.6 ms                    | ~33x    |
| 5M      | 10%     | 130.499 ms   | 276.5 ms                    | ~2x     |

This tracks: the smaller the delta percentage, the fewer ordinals there are to check, and those gains are multiplied by the size of the dataset.
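To make the delta-bounded idea concrete, here is a minimal sketch of how the set of new ordinals can be derived from the before/after populated-ordinal bitsets. Hollow's `PopulatedOrdinalListener` exposes such bitsets; the code below is a self-contained stdlib illustration, not Hollow's actual implementation.

```java
import java.util.BitSet;

public class DeltaOrdinals {
    // Keep only the ordinals that became populated in this cycle.
    // (Stand-in for diffing PopulatedOrdinalListener's populated vs previous ordinals.)
    static BitSet newOrdinals(BitSet populated, BitSet previous) {
        BitSet delta = (BitSet) populated.clone();
        delta.andNot(previous); // drop everything that existed before the delta
        return delta;
    }

    public static void main(String[] args) {
        BitSet previous = new BitSet();
        previous.set(0, 1000);               // 1000 pre-existing records
        BitSet populated = (BitSet) previous.clone();
        populated.set(1000, 1010);           // 10 new records arrive in the delta
        System.out.println(newOrdinals(populated, previous).cardinality()); // 10
    }
}
```

With a 0.1% delta, the validator only iterates that small bitset instead of every populated ordinal, which is where the ~400x speedups at 5M records come from.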

Full output:

```
Benchmark                                                  (deltaFraction)  (numRecords)  Mode  Cnt    Score    Error  Units
DuplicateDataDetectionValidatorBenchmark.deltaBounded                0.001       1000000  avgt   20    0.079 ±  0.001  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                0.001       5000000  avgt   20    0.593 ±  0.018  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                 0.01       1000000  avgt   20    0.801 ±  0.041  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                 0.01       5000000  avgt   20    8.905 ±  0.362  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                  0.1       1000000  avgt   20    9.747 ±  0.412  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                  0.1       5000000  avgt   20  130.499 ±  4.523  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded             0.001       1000000  avgt   20   32.039 ±  0.459  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded             0.001       5000000  avgt   20  322.196 ± 13.048  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded              0.01       1000000  avgt   20   32.298 ±  0.339  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded              0.01       5000000  avgt   20  308.809 ± 12.248  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded               0.1       1000000  avgt   20   32.572 ±  0.738  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded               0.1       5000000  avgt   20  290.514 ± 11.442  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan            0.001       1000000  avgt   20   29.882 ±  0.185  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan            0.001       5000000  avgt   20  249.115 ± 20.437  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan             0.01       1000000  avgt   20   30.235 ±  0.639  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan             0.01       5000000  avgt   20  294.560 ± 11.769  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan              0.1       1000000  avgt   20   30.029 ±  0.303  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan              0.1       5000000  avgt   20  276.480 ± 15.275  ms/op
```

@AustinWheel AustinWheel marked this pull request as ready for review March 10, 2026 21:17
@AustinWheel AustinWheel requested a review from spjegan March 10, 2026 21:17
AustinWheel (Collaborator, Author) commented Mar 10, 2026

One implication to call out about doing duplicate detection with deltas: delta-based detection won't find pre-existing duplicates. The snapshot path (when previousOrdinals is empty) does detect them, but if DuplicateDataDetectionValidator is added after a few cycles have already run and duplicates exist, those duplicates won't be detected until a snapshot cycle or a producer restart.
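A toy illustration of that caveat, assuming two pre-existing records sharing key "x" and one new record "y" arriving in the delta (names and structure are illustrative, not Hollow's API): the delta-bounded check never revisits the old pair, while a full scan finds it.

```java
import java.util.*;

public class PreexistingDuplicates {
    public static void main(String[] args) {
        List<String> keys = List.of("x", "x", "y");  // ordinals 0..2; 0 and 1 collide
        Set<Integer> newOrdinals = Set.of(2);         // only "y" arrived in this delta

        // Delta-bounded: only new ordinals are checked against the rest.
        boolean deltaFound = false;
        for (int o : newOrdinals)
            for (int p = 0; p < keys.size(); p++)
                if (p != o && keys.get(p).equals(keys.get(o)))
                    deltaFound = true;

        // Full scan: any repeated key anywhere is caught.
        boolean fullScanFound = new HashSet<>(keys).size() < keys.size();

        System.out.println(deltaFound);    // false: the old "x"/"x" pair is missed
        System.out.println(fullScanFound); // true
    }
}
```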

Comment thread on hollow-perf/build.gradle (Outdated)

```gradle
dependencies {
    implementation project(':hollow')
    implementation project(':hollow-test')
```
Is this used?

Sunjeet (Collaborator) commented Mar 11, 2026

It would be good to support both modes for the duplicate data detector, especially as the incremental validator rolls out; we should allow fallback to the old implementation. The old implementation would also remain useful in the few cases where even snapshots need to validate for duplicates.
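One way the dual-mode suggestion could look, sketched with hypothetical names (the enum, class, and `choose` method are illustrative, not part of Hollow): fall back to the full scan whenever there is no prior state to diff against, or when the caller explicitly asks for the old behavior.

```java
public class DuplicateDetectionModes {
    // Hypothetical mode enum; not Hollow's actual API.
    enum Mode { DELTA_BOUNDED, FULL_SCAN }

    // Full scan when there are no previous ordinals (first cycle, snapshot,
    // producer restart) or when the old implementation is forced.
    static Mode choose(boolean hasPreviousOrdinals, boolean forceFullScan) {
        if (forceFullScan || !hasPreviousOrdinals) return Mode.FULL_SCAN;
        return Mode.DELTA_BOUNDED;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));  // DELTA_BOUNDED
        System.out.println(choose(false, false)); // FULL_SCAN
        System.out.println(choose(true, true));   // FULL_SCAN
    }
}
```

A `forceFullScan`-style switch would also cover the case mentioned above where even snapshot cycles need to validate for duplicates.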

Sunjeet (Collaborator) commented Mar 11, 2026

Curious to try another approach: could the duplicate detection happen in the validator impl (the only place it's needed) vs. in the primary key index? The key thing to figure out with that impl would be how to handle updating the index so that, when the validator is invoked, the index corresponds to the prior read state engine. But implemented that way it would be a cleaner separation of concerns. Here's why:

The index's job is to answer "given a key, what ordinal matches?", and it already does this well with getMatch(Object... keys). The delta-aware optimization is really a validation concern, not an indexing concern.

The validator could do this:

1. Get the PopulatedOrdinalListener (or equivalent) to find new ordinals
2. For each new ordinal, derive its key using HollowPrimaryKeyValueDeriver
3. Call the index's existing getMatch(); if the returned ordinal differs from the new ordinal, it's a duplicate
4. Cap at maxDuplicateKeys and stop early

This has several advantages:

- Index stays simple: getDuplicateKeys() (full scan) is the only duplicate detection method on the index, so its correctness is easy to reason about
- Delta optimization lives where the policy decision is: the validator decides whether to check all records or only new ones, rather than burying that decision inside the index
- No new state coupling: the index doesn't need to reach into PopulatedOrdinalListener for something that isn't core to indexing
- Testability: validator logic is tested at the validator level, and index tests stay focused on lookup correctness
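The four steps above can be sketched as follows. This is a self-contained simulation: a `Map` stands in for the primary key index's `getMatch()`, and key derivation (`HollowPrimaryKeyValueDeriver` in the real code) is reduced to a map lookup, so none of Hollow's actual types appear here.

```java
import java.util.*;

public class ValidatorSketch {
    // Check only the new ordinals against an index reflecting prior state.
    static List<String> findDuplicates(Map<Integer, String> keyByOrdinal,
                                       Map<String, Integer> index, // stand-in for getMatch()
                                       Set<Integer> newOrdinals,
                                       int maxDuplicateKeys) {
        List<String> duplicates = new ArrayList<>();
        for (int ordinal : newOrdinals) {
            String key = keyByOrdinal.get(ordinal);  // step 2: derive the record's key
            Integer match = index.get(key);          // step 3: existing lookup
            if (match != null && match != ordinal) { // different ordinal => duplicate key
                duplicates.add(key);
                if (duplicates.size() >= maxDuplicateKeys) break; // step 4: cap, stop early
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<Integer, String> keyByOrdinal = Map.of(0, "a", 1, "b", 2, "a");
        Map<String, Integer> index = Map.of("a", 0, "b", 1); // prior state: ordinals 0 and 1
        // New ordinal 2 reuses key "a", so it is reported as a duplicate.
        System.out.println(findDuplicates(keyByOrdinal, index, Set.of(2), 10)); // [a]
    }
}
```

The open question from above remains: the index consulted in step 3 has to correspond to the prior read state engine at the moment the validator runs.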
