feat(arrow-cast): fast path for Dictionary->View cast for large types and cross cast by Abhisheklearn12 · Pull Request #9768 · apache/arrow-rs

Abhisheklearn12 · 2026-04-19T12:06:43Z

Which issue does this PR close?

Closes Fast path for Dictionary -> View cast for large types & cross cast #8985

Rationale for this change

unpack_dictionary handled all Dictionary→View casts correctly but incurred an unnecessary copy of the values buffer on every cast. For Dictionary arrays with many repeated values (the common use case), this copies data for every logical row rather than once.

A fast path already existed for Utf8->Utf8View and Binary->BinaryView via view_from_dict_values, which reuses the values buffer zero-copy and only writes 16-byte view structs per row. This PR extends that to the remaining cases called out in the TODO comments.
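To make the "16-byte view structs" concrete, here is a minimal std-only sketch of the view layout described in the Arrow columnar format: a 4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index, and offset. The names `make_view` and `views_from_dict` are hypothetical and are not the arrow-rs functions; the sketch only illustrates why one view is written per row while the values buffer stays untouched.

```rust
/// Encode one 16-byte view for `data[offset..offset + len]`.
/// Hypothetical helper: mirrors the Arrow view layout, not the arrow-rs API.
fn make_view(data: &[u8], buffer_index: u32, offset: u32, len: u32) -> [u8; 16] {
    let start = offset as usize;
    let bytes = &data[start..start + len as usize];
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Short values are stored inline, zero-padded to 12 bytes.
        view[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Long values keep a 4-byte prefix and point into the shared buffer.
        view[4..8].copy_from_slice(&bytes[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Write one view per logical row; the dictionary values buffer is only read.
/// Null keys are modelled as `None` and get an all-zero view.
fn views_from_dict(values: &[u8], offsets: &[u32], keys: &[Option<usize>]) -> Vec<[u8; 16]> {
    keys.iter()
        .map(|&k| match k {
            Some(k) => make_view(values, 0, offsets[k], offsets[k + 1] - offsets[k]),
            None => [0u8; 16],
        })
        .collect()
}

fn main() {
    // Build a two-entry dictionary: one short value, one longer than 12 bytes.
    let dict = ["hi", "a considerably longer value"];
    let mut values = Vec::new();
    let mut offsets = vec![0u32];
    for s in dict {
        values.extend_from_slice(s.as_bytes());
        offsets.push(values.len() as u32);
    }
    let keys = [Some(1), Some(0), None, Some(1)];
    let views = views_from_dict(&values, &offsets, &keys);
    // Rows 0 and 3 share the same dictionary entry, so their views are identical.
    assert_eq!(views[0], views[3]);
}
```

Repeated keys produce identical views that reference the same bytes, which is where the saving over a row-by-row copy comes from.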

What changes are included in this PR?

Add (LargeUtf8, Utf8View) fast path in dictionary_cast: reuses the values buffer zero-copy when i64 offsets fit in u32 (buffer < 4 GiB), falls back to unpack_dictionary when the buffer is too large
Add (LargeBinary, BinaryView) fast path with the same offset-fit check
Add (Utf8, BinaryView) cross cast fast path: UTF-8 strings are always valid binary so the buffer is reused unconditionally
Add (Binary, Utf8View) cross cast via new binary_dict_to_string_view: validates UTF-8 of the dictionary values and reuses the buffer zero-copy when all are valid; respects CastOptions::safe: nullifies rows pointing to invalid dictionary values when safe=true, returns CastError when safe=false
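The safe / strict behaviour described for the (Binary, Utf8View) arm can be sketched with plain slices. This is a std-only model, not the code in this PR: `cast_binary_dict_to_strings` is a hypothetical name, the error is a plain `String` rather than an `ArrowError`, and it assumes that in strict mode any invalid dictionary value is an error even if no key references it.

```rust
/// Hypothetical model of a Dictionary<_, Binary> -> Utf8View cast.
/// Each distinct dictionary value is validated once, then keys are resolved.
fn cast_binary_dict_to_strings<'a>(
    dict_values: &[&'a [u8]],
    keys: &[Option<usize>],
    safe: bool,
) -> Result<Vec<Option<&'a str>>, String> {
    let mut validated: Vec<Option<&'a str>> = Vec::with_capacity(dict_values.len());
    for &v in dict_values {
        match std::str::from_utf8(v) {
            Ok(s) => validated.push(Some(s)),
            // safe=true: remember the value as invalid so its rows become null.
            Err(_) if safe => validated.push(None),
            // safe=false: fail the whole cast (assumption: even if unreferenced).
            Err(e) => return Err(format!("invalid UTF-8 in dictionary value: {e}")),
        }
    }
    // Null keys stay null; keys pointing at invalid values become null.
    Ok(keys.iter().map(|&k| k.and_then(|i| validated[i])).collect())
}

fn main() {
    let values: [&[u8]; 3] = [b"ok", b"\xff\xfe", b"fine"];
    let keys = [Some(0), Some(1), None, Some(2), Some(1)];

    let rows = cast_binary_dict_to_strings(&values, &keys, true).unwrap();
    assert_eq!(rows, vec![Some("ok"), None, None, Some("fine"), None]);

    assert!(cast_binary_dict_to_strings(&values, &keys, false).is_err());
}
```

Validation cost scales with the number of dictionary values rather than the number of rows, which is the point of doing it before resolving keys.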

Are these changes tested?

Yes. Added 6 tests in arrow-cast/src/cast/mod.rs:

test_dict_large_utf8_to_utf8view -> LargeUtf8->Utf8View fast path, including null keys and values longer than 12 bytes (buffered views)
test_dict_large_binary_to_binary_view -> LargeBinary->BinaryView fast path, including null keys
test_dict_utf8_to_binary_view -> Utf8->BinaryView cross cast
test_dict_binary_to_utf8view_valid -> Binary->Utf8View when all dictionary values are valid UTF-8 (zero-copy fast path)
test_dict_binary_to_utf8view_invalid_utf8_strict -> Binary->Utf8View with invalid UTF-8 and safe=false returns CastError
test_dict_binary_to_utf8view_invalid_utf8_safe -> Binary->Utf8View with invalid UTF-8 and safe=true nullifies every row whose key points to an invalid dictionary value, preserving valid rows

Are there any user-facing changes?

Yes. Casting Dictionary<_, LargeUtf8>->Utf8View, Dictionary<_, LargeBinary>->BinaryView, Dictionary<_, Utf8>->BinaryView, and Dictionary<_, Binary>->Utf8View is now significantly faster for large arrays with repeated values. The dictionary values buffer is reused without copying instead of being fully unpacked row-by-row.

Abhisheklearn12 · 2026-04-20T05:27:11Z

Hi @Jefffrey, I’d love to get your feedback whenever you have time. I’d really appreciate it!

Jefffrey

I tried commenting out the new match arms and running the newly added tests here and hit a failure:

failures:

---- cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict stdout ----

thread 'cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict' (6965562) panicked at arrow-cast/src/cast/mod.rs:7379:9:
expected CastError, got InvalidArgumentError("Encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 5")


failures:
    cast::tests::test_dict_binary_to_utf8view_invalid_utf8_strict

test result: FAILED. 342 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

We should look into this as this PR is meant to only add fast paths, so change in behaviour seems a little odd 🤔
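For context on the message in the failure above: the text after "Encountered non UTF-8 data: " matches the `Display` output of `std::str::Utf8Error`. The sketch below reproduces that text with std only. The input bytes are an assumption (five valid bytes followed by one invalid byte), not the actual test data, and `utf8_error_message` is a hypothetical helper.

```rust
/// Return the std error text for invalid UTF-8, or None if the bytes are valid.
fn utf8_error_message(bytes: &[u8]) -> Option<String> {
    std::str::from_utf8(bytes).err().map(|e| e.to_string())
}

fn main() {
    // Assumed input: "hello" followed by 0xFF, which is never valid in UTF-8.
    let err = std::str::from_utf8(b"hello\xff").unwrap_err();
    assert_eq!(err.valid_up_to(), 5);
    assert_eq!(err.error_len(), Some(1));

    // Prints: invalid utf-8 sequence of 1 bytes from index 5
    println!("{}", utf8_error_message(b"hello\xff").unwrap());
}
```

This suggests, without confirming it, that the existing path surfaces the std validation error as an `InvalidArgumentError`, whereas the new arm reports a `CastError`.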

Jefffrey · 2026-04-21T05:52:16Z

+            array.values().as_string::<i32>(),
+        ),
+        // Cross cast: Binary -> Utf8View requires UTF-8 validation of the dictionary values.
+        (Binary, Utf8View) => binary_dict_to_string_view::<K>(


I feel this arm specifically should be benchmarked as it introduces new logic compared to the other arms

Jefffrey · 2026-04-21T05:52:56Z

+        // If the buffer is too large, fall back to the general path.
+        (LargeUtf8, Utf8View) => {
+            let values = array.values().as_string::<i64>();
+            if values.values().len() < u32::MAX as usize {


This check reads a little odd to me as usually this could mean unpack_dictionary may also fail if offsets don't fit?
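As background for the check being discussed: view offsets are 32-bit, so i64 offsets from a Large* array can only be reused if each one fits in a u32. A minimal std-only sketch of that narrowing follows. `narrow_offsets` is a hypothetical helper rather than code from this PR, and it checks the offsets themselves instead of the buffer length used in the diff.

```rust
/// Narrow i64 offsets to u32, returning None if any offset does not fit.
/// Hypothetical helper; offsets are assumed to be non-negative and increasing,
/// so in practice checking the last offset would be enough.
fn narrow_offsets(offsets: &[i64]) -> Option<Vec<u32>> {
    offsets.iter().map(|&o| u32::try_from(o).ok()).collect()
}

fn main() {
    assert_eq!(narrow_offsets(&[0, 3, 10]), Some(vec![0, 3, 10]));
    // One past u32::MAX cannot be represented in a view offset.
    assert_eq!(narrow_offsets(&[0, u32::MAX as i64 + 1]), None);
}
```

Whether the fallback path can handle the case where this returns `None` is exactly the open question in the comment above; the sketch does not answer it.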

feat(arrow-cast): fast path for Dictionary->View cast for large types and cross cast

cdfedb4

github-actions bot added the arrow (Changes to the arrow crate) label Apr 19, 2026

fix: remove needless borrows in invalid UTF-8 test cases

ef75fcf

Jefffrey reviewed Apr 21, 2026

alamb added the performance label Apr 21, 2026

Abhisheklearn12 wants to merge 2 commits into apache:main from Abhisheklearn12:feat/dict-view-fast-path-8985

3 participants
