perf: Optimize `string_to_array` for scalar args by neilconway · Pull Request #21131 · apache/datafusion

neilconway · 2026-03-23T21:08:09Z

Which issue does this PR close?

Closes Optimize string_to_array for scalar args #21129.

Rationale for this change

When the delimiter (and null string, if supplied) are scalars, we can implement string_to_array more efficiently. In particular, we can construct a memmem::Finder and use it to search for delimiters more efficiently.

This PR implements this optimization; it also fixes a place where we were allocating an intermediate String for every character when the delimiter is NULL. (This isn't a common case but worth fixing.)

Benchmarks (M4 Max):

  single_char_delim/5:    34.8 µs  (was  61.1 µs)  -43%
  single_char_delim/20:  145.1 µs  (was 220.7 µs)  -34%
  single_char_delim/100: 679.4 µs  (was   1.04 ms) -35%

  multi_char_delim/5:    41.7 µs  (was  56.7 µs)  -27%
  multi_char_delim/20:  158.9 µs  (was 185.1 µs)  -14%
  multi_char_delim/100: 731.4 µs  (was 858.3 µs)  -15%

  with_null_str/5:    43.1 µs  (was  68.7 µs)  -37%
  with_null_str/20:  179.3 µs  (was 244.3 µs)  -27%
  with_null_str/100: 895.8 µs  (was   1.16 ms) -23%

  null_delim/5:    17.4 µs  (was  64.1 µs)  -73%
  null_delim/20:   63.0 µs  (was 233.4 µs)  -73%
  null_delim/100: 280.2 µs  (was   1.12 ms) -75%

  columnar_delim/5:    65.2 µs  (was  60.2 µs)  +8%
  columnar_delim/20:  217.2 µs  (was 224.1 µs)  -3%
  columnar_delim/100:   1.02 ms  (was   1.05 ms) -3%

What changes are included in this PR?

Add benchmark for string_to_array
Implement optimizations described above
Refactor columnar (fallback) path to get rid of a lot of type dispatch boilerplate
Improve SLT test coverage for the "columnar string, scalar other-args" case

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

neilconway · 2026-03-23T21:09:15Z

    Ok(())
 }

-/// String_to_array SQL function


I had to move some code around to group the string_to_array functions together in the file, sorry for the noisy diff.

neilconway · 2026-03-23T21:11:51Z

-                        list_builder.append(true);
-                    }
-                    (Some(string), None) => {
-                        string.chars().map(|c| c.to_string()).for_each(|c| {


This was the inefficient NULL delimiter handling mentioned in the PR summary.

comphead · 2026-03-26T21:20:39Z

+fn get_str_value(array: &ArrayRef, i: usize) -> Option<&str> {
+    if array.is_null(i) {
+        return None;
+    }
+    match array.data_type() {
+        Utf8 => Some(array.as_string::<i32>().value(i)),
+        LargeUtf8 => Some(array.as_string::<i64>().value(i)),
+        Utf8View => Some(array.as_string_view().value(i)),
+        other => {
+            debug_assert!(false, "unexpected type in get_str_value: {other:?}");
+            None
+        }
+    }
+}


Suggested change

fn get_str_value(array: &ArrayRef, i: usize) -> Option<&str> {

if array.is_null(i) {

return None;

}

match array.data_type() {

Utf8 => Some(array.as_string::<i32>().value(i)),

LargeUtf8 => Some(array.as_string::<i64>().value(i)),

Utf8View => Some(array.as_string_view().value(i)),

other => {

debug_assert!(false, "unexpected type in get_str_value: {other:?}");

None

}

}

}

#[inline]

fn get_str_value<O: OffsetSizeTrait>(

array: &GenericStringArray<O>,

i: usize,

) -> Option<&str> {

if array.is_null(i) {

None

} else {

Some(array.value(i))

}

}

Thanks for the suggestion! However, GenericStringArray doesn't handle StringViewArray; the current approach was overall the cleanest way I could think of, but let me know if you know of a better way to do this.

comphead

Thanks @neilconway I left some comments but i didn't check its performace

neilconway · 2026-03-26T22:27:33Z

@comphead Thanks for the review!

comphead

Thanks @neilconway sorry, I lost this ticket

## Which issue does this PR close? - Closes apache#21129. ## Rationale for this change When the delimiter (and null string, if supplied) are scalars, we can implement `string_to_array` more efficiently. In particular, we can construct a `memmem::Finder` and use it to search for delimiters more efficiently. This PR implements this optimization; it also fixes a place where we were allocating an intermediate `String` for every character when the delimiter is `NULL`. (This isn't a common case but worth fixing.) Benchmarks (M4 Max): ``` single_char_delim/5: 34.8 µs (was 61.1 µs) -43% single_char_delim/20: 145.1 µs (was 220.7 µs) -34% single_char_delim/100: 679.4 µs (was 1.04 ms) -35% multi_char_delim/5: 41.7 µs (was 56.7 µs) -27% multi_char_delim/20: 158.9 µs (was 185.1 µs) -14% multi_char_delim/100: 731.4 µs (was 858.3 µs) -15% with_null_str/5: 43.1 µs (was 68.7 µs) -37% with_null_str/20: 179.3 µs (was 244.3 µs) -27% with_null_str/100: 895.8 µs (was 1.16 ms) -23% null_delim/5: 17.4 µs (was 64.1 µs) -73% null_delim/20: 63.0 µs (was 233.4 µs) -73% null_delim/100: 280.2 µs (was 1.12 ms) -75% columnar_delim/5: 65.2 µs (was 60.2 µs) +8% columnar_delim/20: 217.2 µs (was 224.1 µs) -3% columnar_delim/100: 1.02 ms (was 1.05 ms) -3% ``` ## What changes are included in this PR? * Add benchmark for `string_to_array` * Implement optimizations described above * Refactor columnar (fallback) path to get rid of a lot of type dispatch boilerplate * Improve SLT test coverage for the "columnar string, scalar other-args" case ## Are these changes tested? Yes. ## Are there any user-facing changes? No.

neilconway added 2 commits March 23, 2026 15:17

Add benchmark for string_to_array

c53a9dd

Optimize string_to_array

828223e

github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 23, 2026

neilconway commented Mar 23, 2026

View reviewed changes

neilconway added 3 commits March 23, 2026 17:12

cargo fmt

a389480

Merge branch 'main' into neilc/optimize-string-to-array

f0524b5

Fix TOML formatting

0c8d00f

comphead reviewed Mar 26, 2026

View reviewed changes

Comment thread datafusion/functions-nested/src/string.rs

comphead reviewed Mar 26, 2026

View reviewed changes

Comment thread datafusion/functions-nested/src/string.rs Outdated

comphead reviewed Mar 26, 2026

View reviewed changes

neilconway added 2 commits March 26, 2026 18:19

Refactor per code review suggestion

b849dc0

cargo fmt

fa1b678

comphead approved these changes Mar 31, 2026

View reviewed changes

comphead added this pull request to the merge queue Mar 31, 2026

Merged via the queue into apache:main with commit d87d8f6 Mar 31, 2026
33 checks passed

neilconway deleted the neilc/optimize-string-to-array branch March 31, 2026 16:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Optimize `string_to_array` for scalar args#21131

perf: Optimize `string_to_array` for scalar args#21131
comphead merged 7 commits intoapache:mainfrom
neilconway:neilc/optimize-string-to-array

neilconway commented Mar 23, 2026

Uh oh!

neilconway Mar 23, 2026

Uh oh!

neilconway Mar 23, 2026

Uh oh!

Uh oh!

Uh oh!

comphead Mar 26, 2026

Uh oh!

neilconway Mar 26, 2026

Uh oh!

comphead left a comment

Uh oh!

neilconway commented Mar 26, 2026

Uh oh!

comphead left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

neilconway commented Mar 23, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

neilconway Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

neilconway Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

comphead Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

neilconway Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

comphead left a comment

Choose a reason for hiding this comment

Uh oh!

neilconway commented Mar 26, 2026

Uh oh!

comphead left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants