feat(arrow-cast): fast path for Dictionary->View cast for large types and cross cast#9768
Open
Abhisheklearn12 wants to merge 2 commits intoapache:mainfrom
Open
feat(arrow-cast): fast path for Dictionary->View cast for large types and cross cast#9768Abhisheklearn12 wants to merge 2 commits intoapache:mainfrom
Abhisheklearn12 wants to merge 2 commits intoapache:mainfrom
Conversation
Contributor
Author
|
Hi @Jefffrey, I’d love to get your feedback whenever you have time. Very appreciate it! |
Jefffrey
reviewed
Apr 21, 2026
Contributor
Jefffrey
left a comment
There was a problem hiding this comment.
I tried commenting out the new match arms and running the newly added tests here and hit a failure:
failures:
---- cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict stdout ----
thread 'cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict' (6965562) panicked at arrow-cast/src/cast/mod.rs:7379:9:
expected CastError, got InvalidArgumentError("Encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 5")
failures:
cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict
test result: FAILED. 342 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01sWe should look into this as this PR is meant to only add fast paths, so change in behaviour seems a little odd 🤔
| array.values().as_string::<i32>(), | ||
| ), | ||
| // Cross cast: Binary -> Utf8View requires UTF-8 validation of the dictionary values. | ||
| (Binary, Utf8View) => binary_dict_to_string_view::<K>( |
Contributor
There was a problem hiding this comment.
I feel this arm specifically should be benchmarked as it introduces new logic compared to the other arms
| // If the buffer is too large, fall back to the general path. | ||
| (LargeUtf8, Utf8View) => { | ||
| let values = array.values().as_string::<i64>(); | ||
| if values.values().len() < u32::MAX as usize { |
Contributor
There was a problem hiding this comment.
This check reads a little odd to me as usually this could mean unpack_dictionary may also fail if offsets don't fit?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
unpack_dictionaryhandled all Dictionary→View casts correctly but incurred an unnecessary copy of the values buffer on every cast. For Dictionary arrays with many repeated values (the common use case), this copies data for every logical row rather than once.A fast path already existed for
Utf8->Utf8ViewandBinary->BinaryViewviaview_from_dict_values, which reuses the values buffer zero-copy and only writes 16-byte view structs per row. This PR extends that to the remaining cases called out in the TODO comments.What changes are included in this PR?
(LargeUtf8, Utf8View)fast path indictionary_cast: reuses the values buffer zero-copy when i64 offsets fit inu32(buffer < 4 GiB), falls back tounpack_dictionarywhen the buffer is too large(LargeBinary, BinaryView)fast path with the same offset-fit check(Utf8, BinaryView)cross cast fast path: UTF-8 strings are always valid binary so the buffer is reused unconditionally(Binary, Utf8View)cross cast via newbinary_dict_to_string_view: validates UTF-8 of dictionary values and reuses the buffer zero-copy when all valid; respectsCastOptions::safe, nullifies rows pointing to invalid dictionary values whensafe=true, returnsCastErrorwhensafe=falseAre these changes tested?
Yes. Added 6 tests in
arrow-cast/src/cast/mod.rs:test_dict_large_utf8_to_utf8view->LargeUtf8->Utf8Viewfast path, including null keys and values longer than 12 bytes (buffered views)test_dict_large_binary_to_binary_view->LargeBinary->BinaryViewfast path, including null keystest_dict_utf8_to_binary_view->Utf8->BinaryViewcross casttest_dict_binary_to_utf8view_valid->Binary->Utf8Viewwhen all dictionary values are valid UTF-8 (zero-copy fast path)test_dict_binary_to_utf8view_invalid_utf8_strict->Binary->Utf8viewwith invalid UTF-8 andsafe=falsereturnsCastErrortest_dict_binary_to_utf8view_invalid_utf8_safe->Binary->Utf8Viewwith invalid UTF-8 andsafe=truenullifies every row whose key points to an invalid dictionary value, preserving valid rowsAre there any user-facing changes?
Yes. Casting
Dictionary<_, LargeUtf8>->Utf8View,Dictionary<_, LargeBinary>->BinaryView,Dictionary<_, Utf8>->BinaryView, andDictionary<_, Binary>->Utf8Viewis now significantly faster for large arrays with repeated values. The dictionary values buffer is reused without copying instead of being fully unpacked row-by-row.