feat(arrow-cast): fast path for Dictionary->View cast for large types and cross cast #9768

Open
Abhisheklearn12 wants to merge 2 commits into apache:main from
Abhisheklearn12:feat/dict-view-fast-path-8985

Contributor

@Abhisheklearn12 Abhisheklearn12 commented Apr 19, 2026

Which issue does this PR close?

Rationale for this change

unpack_dictionary handles all Dictionary→View casts correctly, but it incurs an unnecessary copy of the values buffer on every cast. For Dictionary arrays with many repeated values (the common use case), this copies the data for every logical row instead of once.

A fast path already existed for Utf8->Utf8View and Binary->BinaryView via view_from_dict_values, which reuses the values buffer zero-copy and only writes 16-byte view structs per row. This PR extends that to the remaining cases called out in the TODO comments.
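To illustrate why writing only view structs is cheap, here is a minimal std-only sketch of packing 16-byte views over a shared values buffer. It follows the Arrow view layout (4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus buffer index and offset). `encode_view` and `views_for_keys` are hypothetical names for illustration; they are not the arrow-rs API and not the code in this PR.

```rust
/// Pack one 16-byte view (little-endian fields).
/// Hypothetical helper for illustration, not an arrow-rs function.
/// Assumes the caller has already checked that `value.len()` fits in u32.
fn encode_view(value: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    let len = value.len() as u32;
    view[0..4].copy_from_slice(&len.to_le_bytes());
    if value.len() <= 12 {
        // Short values are stored inline; the remaining bytes stay zero.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values keep a 4-byte prefix and point into the shared buffer.
        view[4..8].copy_from_slice(&value[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Build one view per row from dictionary keys. The bytes in `values` are
/// never copied for long values; only the 16-byte views are written.
/// Null keys become all-zero views (validity is tracked elsewhere).
fn views_for_keys(values: &[u8], offsets: &[u32], keys: &[Option<usize>]) -> Vec<[u8; 16]> {
    keys.iter()
        .map(|key| match key {
            Some(k) => {
                let start = offsets[*k] as usize;
                let end = offsets[*k + 1] as usize;
                encode_view(&values[start..end], 0, offsets[*k])
            }
            None => [0u8; 16],
        })
        .collect()
}
```

Rows that share a key produce identical views pointing at the same bytes, which is where the saving over row-by-row unpacking comes from.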

What changes are included in this PR?

  • Add (LargeUtf8, Utf8View) fast path in dictionary_cast: reuses the values buffer zero-copy when the i64 offsets fit in u32 (buffer < 4 GiB) and falls back to unpack_dictionary when the buffer is too large
  • Add (LargeBinary, BinaryView) fast path with the same offset-fit check
  • Add (Utf8, BinaryView) cross cast fast path: UTF-8 strings are always valid binary so the buffer is reused unconditionally
  • Add (Binary, Utf8View) cross cast via new binary_dict_to_string_view: validates UTF-8 of the dictionary values and reuses the buffer zero-copy when all are valid. It respects CastOptions::safe: with safe=true, rows pointing to invalid dictionary values are nullified; with safe=false, a CastError is returned
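The offset-fit condition for the Large* arms can be sketched as follows. This is an assumption-laden illustration: the PR itself compares the values buffer length against u32::MAX, whereas this hypothetical `narrow_offsets` helper checks the offsets directly. The intent is the same, since a view can only address a u32 offset.

```rust
/// Narrow i64 offsets to u32 when every offset fits.
/// Returns None when any offset is out of range, signalling that the caller
/// should take the general (copying) path instead.
/// Hypothetical helper for illustration, not part of arrow-rs.
fn narrow_offsets(offsets: &[i64]) -> Option<Vec<u32>> {
    // Collecting an iterator of Option<u32> into Option<Vec<u32>>
    // short-circuits to None on the first offset that does not fit.
    offsets.iter().map(|&o| u32::try_from(o).ok()).collect()
}
```

Because offsets are non-decreasing, checking only the last one would also work; checking all of them keeps the sketch safe for malformed input.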

Are these changes tested?

Yes. Added 6 tests in arrow-cast/src/cast/mod.rs:

  • test_dict_large_utf8_to_utf8view -> LargeUtf8->Utf8View fast path, including null keys and values longer than 12 bytes (buffered views)
  • test_dict_large_binary_to_binary_view -> LargeBinary->BinaryView fast path, including null keys
  • test_dict_utf8_to_binary_view -> Utf8->BinaryView cross cast
  • test_dict_binary_to_utf8view_valid -> Binary->Utf8View when all dictionary values are valid UTF-8 (zero-copy fast path)
  • test_dict_binary_to_utf8view_invalid_utf8_strict -> Binary->Utf8View with invalid UTF-8 and safe=false returns CastError
  • test_dict_binary_to_utf8view_invalid_utf8_safe -> Binary->Utf8View with invalid UTF-8 and safe=true nullifies every row whose key points to an invalid dictionary value, preserving valid rows
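The safe and strict behaviours exercised by the last three tests can be modelled with a small std-only sketch. `cast_rows` is a hypothetical stand-in, not the PR's binary_dict_to_string_view, and it assumes that strict mode rejects any invalid dictionary value even if no key references it; the PR may differ on that point.

```rust
/// Model of a Binary -> Utf8 dictionary cast over plain slices.
/// Each distinct dictionary value is validated once, not once per row.
/// Hypothetical helper for illustration; the error type is a plain String.
fn cast_rows<'a>(
    dict_values: &[&'a [u8]],
    keys: &[Option<usize>],
    safe: bool,
) -> Result<Vec<Option<&'a str>>, String> {
    let mut decoded: Vec<Option<&'a str>> = Vec::with_capacity(dict_values.len());
    for (i, &bytes) in dict_values.iter().enumerate() {
        match std::str::from_utf8(bytes) {
            Ok(s) => decoded.push(Some(s)),
            // safe = true: remember the value as invalid, null its rows later.
            Err(_) if safe => decoded.push(None),
            // safe = false: fail the whole cast.
            Err(e) => return Err(format!("invalid UTF-8 in dictionary value {}: {}", i, e)),
        }
    }
    // A row is null if its key is null or its dictionary value was invalid.
    Ok(keys.iter().map(|k| k.and_then(|k| decoded[k])).collect())
}
```

With safe=true, only rows whose key points at an invalid value become null, and rows pointing at valid values are preserved.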

Are there any user-facing changes?

Yes. Casting Dictionary<_, LargeUtf8>->Utf8View, Dictionary<_, LargeBinary>->BinaryView, Dictionary<_, Utf8>->BinaryView, and Dictionary<_, Binary>->Utf8View is now significantly faster for large arrays with repeated values. The dictionary values buffer is reused without copying instead of being fully unpacked row-by-row.

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Apr 19, 2026
@Abhisheklearn12
Contributor Author

Hi @Jefffrey, I'd love to get your feedback whenever you have time. I'd really appreciate it!

Contributor

@Jefffrey Jefffrey left a comment

I tried commenting out the new match arms and running the newly added tests here and hit a failure:

failures:

---- cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict stdout ----

thread 'cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict' (6965562) panicked at arrow-cast/src/cast/mod.rs:7379:9:
expected CastError, got InvalidArgumentError("Encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 5")


failures:
    cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict

test result: FAILED. 342 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

We should look into this, as this PR is meant to only add fast paths, so a change in behaviour seems a little odd 🤔

array.values().as_string::<i32>(),
),
// Cross cast: Binary -> Utf8View requires UTF-8 validation of the dictionary values.
(Binary, Utf8View) => binary_dict_to_string_view::<K>(
Contributor

I feel this arm specifically should be benchmarked as it introduces new logic compared to the other arms

// If the buffer is too large, fall back to the general path.
(LargeUtf8, Utf8View) => {
let values = array.values().as_string::<i64>();
if values.values().len() < u32::MAX as usize {
Contributor

This check reads a little odd to me as usually this could mean unpack_dictionary may also fail if offsets don't fit?

Labels

arrow (Changes to the arrow crate), performance

Development

Successfully merging this pull request may close these issues.

Fast path for Dictionary -> View cast for large types & cross cast

3 participants