Speed up duplicate data validator#802

Open
AustinWheel wants to merge 5 commits into master from awheeler/duplicate-data-validator-speedup
Conversation

AustinWheel (Collaborator) commented Mar 10, 2026

Here are results from the benchmark when we use just the delta to check for duplicates:

| Records | Delta % | deltaBounded | snapshotFullScan (baseline) | Speedup |
|--------:|--------:|-------------:|----------------------------:|--------:|
| 1M      | 0.1%    | 0.079 ms     | 30.0 ms                     | ~380x   |
| 1M      | 1%      | 0.801 ms     | 30.2 ms                     | ~38x    |
| 1M      | 10%     | 9.747 ms     | 30.0 ms                     | ~3x     |
| 5M      | 0.1%    | 0.593 ms     | 249.1 ms                    | ~420x   |
| 5M      | 1%      | 8.905 ms     | 294.6 ms                    | ~33x    |
| 5M      | 10%     | 130.499 ms   | 276.5 ms                    | ~2x     |

This tracks: the smaller the delta percentage, the fewer ordinals there are to check, and those gains are multiplied by the size of the dataset.
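To make the delta-bounded idea concrete, here is a minimal sketch of how the set of new ordinals can be derived from the before/after populated-ordinal bitsets. Hollow's `PopulatedOrdinalListener` exposes such bitsets; the code below is a self-contained stdlib illustration, not Hollow's actual implementation.

```java
import java.util.BitSet;

public class DeltaOrdinals {
    // Keep only the ordinals that became populated in this cycle.
    // (Stand-in for diffing PopulatedOrdinalListener's populated vs previous ordinals.)
    static BitSet newOrdinals(BitSet populated, BitSet previous) {
        BitSet delta = (BitSet) populated.clone();
        delta.andNot(previous); // drop everything that existed before the delta
        return delta;
    }

    public static void main(String[] args) {
        BitSet previous = new BitSet();
        previous.set(0, 1000);               // 1000 pre-existing records
        BitSet populated = (BitSet) previous.clone();
        populated.set(1000, 1010);           // 10 new records arrive in the delta
        System.out.println(newOrdinals(populated, previous).cardinality()); // 10
    }
}
```

With a 0.1% delta, the validator only iterates that small bitset instead of every populated ordinal, which is where the ~400x speedups at 5M records come from.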

Full output:

```
Benchmark                                                  (deltaFraction)  (numRecords)  Mode  Cnt    Score    Error  Units
DuplicateDataDetectionValidatorBenchmark.deltaBounded                0.001       1000000  avgt   20    0.079 ±  0.001  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                0.001       5000000  avgt   20    0.593 ±  0.018  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                 0.01       1000000  avgt   20    0.801 ±  0.041  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                 0.01       5000000  avgt   20    8.905 ±  0.362  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                  0.1       1000000  avgt   20    9.747 ±  0.412  ms/op
DuplicateDataDetectionValidatorBenchmark.deltaBounded                  0.1       5000000  avgt   20  130.499 ±  4.523  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded             0.001       1000000  avgt   20   32.039 ±  0.459  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded             0.001       5000000  avgt   20  322.196 ± 13.048  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded              0.01       1000000  avgt   20   32.298 ±  0.339  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded              0.01       5000000  avgt   20  308.809 ± 12.248  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded               0.1       1000000  avgt   20   32.572 ±  0.738  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotBounded               0.1       5000000  avgt   20  290.514 ± 11.442  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan            0.001       1000000  avgt   20   29.882 ±  0.185  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan            0.001       5000000  avgt   20  249.115 ± 20.437  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan             0.01       1000000  avgt   20   30.235 ±  0.639  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan             0.01       5000000  avgt   20  294.560 ± 11.769  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan              0.1       1000000  avgt   20   30.029 ±  0.303  ms/op
DuplicateDataDetectionValidatorBenchmark.snapshotFullScan              0.1       5000000  avgt   20  276.480 ± 15.275  ms/op
```

@AustinWheel AustinWheel marked this pull request as ready for review March 10, 2026 21:17
@AustinWheel AustinWheel requested a review from spjegan March 10, 2026 21:17
AustinWheel (Collaborator, Author) commented Mar 10, 2026

One implication to call out about doing duplicate detection with deltas: delta-based detection won't find pre-existing duplicates. The snapshot path (when previousOrdinals is empty) does detect them, but if DuplicateDataDetectionValidator is added after a few cycles have already run and duplicates exist, those duplicates won't be detected until a snapshot cycle or a producer restart.
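A toy illustration of that caveat, assuming two pre-existing records sharing key "x" and one new record "y" arriving in the delta (names and structure are illustrative, not Hollow's API): the delta-bounded check never revisits the old pair, while a full scan finds it.

```java
import java.util.*;

public class PreexistingDuplicates {
    public static void main(String[] args) {
        List<String> keys = List.of("x", "x", "y");  // ordinals 0..2; 0 and 1 collide
        Set<Integer> newOrdinals = Set.of(2);         // only "y" arrived in this delta

        // Delta-bounded: only new ordinals are checked against the rest.
        boolean deltaFound = false;
        for (int o : newOrdinals)
            for (int p = 0; p < keys.size(); p++)
                if (p != o && keys.get(p).equals(keys.get(o)))
                    deltaFound = true;

        // Full scan: any repeated key anywhere is caught.
        boolean fullScanFound = new HashSet<>(keys).size() < keys.size();

        System.out.println(deltaFound);    // false: the old "x"/"x" pair is missed
        System.out.println(fullScanFound); // true
    }
}
```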

Comment thread on hollow-perf/build.gradle (Outdated)

```gradle
dependencies {
    implementation project(':hollow')
    implementation project(':hollow-test')
```
Is this used?

Sunjeet (Collaborator) commented Mar 11, 2026

It would be good to support both modes for the duplicate data detector, especially as the incremental validator rolls out; we should allow fallback to the old implementation. The old implementation would also remain useful in the few cases where even snapshots need to validate for duplicates.
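One way the dual-mode suggestion could look, sketched with hypothetical names (the enum, class, and `choose` method are illustrative, not part of Hollow): fall back to the full scan whenever there is no prior state to diff against, or when the caller explicitly asks for the old behavior.

```java
public class DuplicateDetectionModes {
    // Hypothetical mode enum; not Hollow's actual API.
    enum Mode { DELTA_BOUNDED, FULL_SCAN }

    // Full scan when there are no previous ordinals (first cycle, snapshot,
    // producer restart) or when the old implementation is forced.
    static Mode choose(boolean hasPreviousOrdinals, boolean forceFullScan) {
        if (forceFullScan || !hasPreviousOrdinals) return Mode.FULL_SCAN;
        return Mode.DELTA_BOUNDED;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));  // DELTA_BOUNDED
        System.out.println(choose(false, false)); // FULL_SCAN
        System.out.println(choose(true, true));   // FULL_SCAN
    }
}
```

A `forceFullScan`-style switch would also cover the case mentioned above where even snapshot cycles need to validate for duplicates.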

Sunjeet (Collaborator) commented Mar 11, 2026

Curious to try another approach: could the duplicate detection happen in the validator impl (the only place it's needed) vs. in the primary key index? The key thing to figure out with that impl would be how to handle updating the index so that, when the validator is invoked, the index corresponds to the prior read state engine. But implemented that way it would be a cleaner separation of concerns. Here's why:

The index's job is to answer "given a key, what ordinal matches?", and it already does this well with getMatch(Object... keys). The delta-aware optimization is really a validation concern, not an indexing concern.

The validator could do this:

1. Get the PopulatedOrdinalListener (or equivalent) to find new ordinals
2. For each new ordinal, derive its key using HollowPrimaryKeyValueDeriver
3. Call the index's existing getMatch(); if the returned ordinal differs from the new ordinal, it's a duplicate
4. Cap at maxDuplicateKeys and stop early

This has several advantages:

- Index stays simple: getDuplicateKeys() (full scan) is the only duplicate detection method on the index, so its correctness is easy to reason about
- Delta optimization lives where the policy decision is: the validator decides whether to check all records or only new ones, rather than burying that decision inside the index
- No new state coupling: the index doesn't need to reach into PopulatedOrdinalListener for something that isn't core to indexing
- Testability: validator logic is tested at the validator level, and index tests stay focused on lookup correctness
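The four steps above can be sketched as follows. This is a self-contained simulation: a `Map` stands in for the primary key index's `getMatch()`, and key derivation (`HollowPrimaryKeyValueDeriver` in the real code) is reduced to a map lookup, so none of Hollow's actual types appear here.

```java
import java.util.*;

public class ValidatorSketch {
    // Check only the new ordinals against an index reflecting prior state.
    static List<String> findDuplicates(Map<Integer, String> keyByOrdinal,
                                       Map<String, Integer> index, // stand-in for getMatch()
                                       Set<Integer> newOrdinals,
                                       int maxDuplicateKeys) {
        List<String> duplicates = new ArrayList<>();
        for (int ordinal : newOrdinals) {
            String key = keyByOrdinal.get(ordinal);  // step 2: derive the record's key
            Integer match = index.get(key);          // step 3: existing lookup
            if (match != null && match != ordinal) { // different ordinal => duplicate key
                duplicates.add(key);
                if (duplicates.size() >= maxDuplicateKeys) break; // step 4: cap, stop early
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<Integer, String> keyByOrdinal = Map.of(0, "a", 1, "b", 2, "a");
        Map<String, Integer> index = Map.of("a", 0, "b", 1); // prior state: ordinals 0 and 1
        // New ordinal 2 reuses key "a", so it is reported as a duplicate.
        System.out.println(findDuplicates(keyByOrdinal, index, Set.of(2), 10)); // [a]
    }
}
```

The open question from above remains: the index consulted in step 3 has to correspond to the prior read state engine at the moment the validator runs.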
