feat(ipc): Avoid quadratic dictionary delta accumulation in IPC reader#9776
feat(ipc): Avoid quadratic dictionary delta accumulation in IPC reader#9776pchintar wants to merge 1 commit intoapache:mainfrom
Conversation
|
Thanks @pchintar Do you have any benchmarks that show the performance improvement for this PR? |
|
Thanks for the response @alamb . I haven’t run dedicated benchmarks yet. This change targets a specific accumulation pattern (repeated delta dictionary updates), which may not be directly covered by existing benchmarks and so may require a custom setup. Due to system/hardware constraints I wasn’t able to put that together, but the change removes a quadratic copy pattern and replaces it with a linear one, so the expected improvement must follow from that. I’m happy to follow up with a microbenchmark if needed. |
|
I understand it improves the code in this case; However given limited review bandwidth we are much more likely to have time to review improvements if they come with benchmarks / some other evidence to help us understand how mucht hey will help |
Which issue does this PR close?
Rationale for this change
Delta dictionaries are currently handled in the reader as:
This executes for every
DictionaryBatchwithisDelta = true, even before anyRecordBatchconsumes the dictionary. So, for a sequence of delta updates:Total work:
where
Nis the number of accumulated dictionary elements.What changes are included in this PR?
Defer dictionary materialization until it is required for record batch decoding.
Maintain:
Behavior:
Non-delta dictionary
Delta dictionary
Before
RecordBatchdecodeconcatenate:
update the materialized dictionary map
clear the pending list
Complexity Improvement
Before / Current Behavior
Every time a delta dictionary arrives, we rebuild the full dictionary:
That means:
After / New Behavior
We just store the incoming entries and delay building the full dictionary:
Then only when a record batch actually needs it:
Are these changes tested?
Modified components:
arrow-ipc/src/reader.rsStreamReaderFileDecoderarrow-ipc/src/reader/stream.rsStreamDecoderTests:
Added:
arrow-ipc/src/tests/delta_dictionary.rstest_multiple_deltas_before_record_batchEnsures correct behavior when multiple delta updates occur before record batch decoding
Benchmarks and examples updated to handle the
&mut selfrequirement introduced byFileDecoder::read_record_batchAre there any user-facing changes?
Internal only. No public API or IPC format changes.