Skip to content

perf: Optimize string_to_array for scalar args#21131

Merged
comphead merged 7 commits intoapache:mainfrom
neilconway:neilc/optimize-string-to-array
Mar 31, 2026
Merged

perf: Optimize string_to_array for scalar args#21131
comphead merged 7 commits intoapache:mainfrom
neilconway:neilc/optimize-string-to-array

Conversation

@neilconway
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

When the delimiter (and null string, if supplied) are scalars, we can implement string_to_array more efficiently. In particular, we can construct a memmem::Finder and use it to search for delimiters more efficiently.

This PR implements this optimization; it also fixes a place where we were allocating an intermediate String for every character when the delimiter is NULL. (This isn't a common case but worth fixing.)

Benchmarks (M4 Max):

  single_char_delim/5:    34.8 µs  (was  61.1 µs)  -43%
  single_char_delim/20:  145.1 µs  (was 220.7 µs)  -34%
  single_char_delim/100: 679.4 µs  (was   1.04 ms) -35%

  multi_char_delim/5:    41.7 µs  (was  56.7 µs)  -27%
  multi_char_delim/20:  158.9 µs  (was 185.1 µs)  -14%
  multi_char_delim/100: 731.4 µs  (was 858.3 µs)  -15%

  with_null_str/5:    43.1 µs  (was  68.7 µs)  -37%
  with_null_str/20:  179.3 µs  (was 244.3 µs)  -27%
  with_null_str/100: 895.8 µs  (was   1.16 ms) -23%

  null_delim/5:    17.4 µs  (was  64.1 µs)  -73%
  null_delim/20:   63.0 µs  (was 233.4 µs)  -73%
  null_delim/100: 280.2 µs  (was   1.12 ms) -75%

  columnar_delim/5:    65.2 µs  (was  60.2 µs)  +8%
  columnar_delim/20:  217.2 µs  (was 224.1 µs)  -3%
  columnar_delim/100:   1.02 ms  (was   1.05 ms) -3%

What changes are included in this PR?

  • Add benchmark for string_to_array
  • Implement optimizations described above
  • Refactor columnar (fallback) path to get rid of a lot of type dispatch boilerplate
  • Improve SLT test coverage for the "columnar string, scalar other-args" case

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 23, 2026
Ok(())
}

/// String_to_array SQL function
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to move some code around to group the string_to_array functions together in the file, sorry for the noisy diff.

list_builder.append(true);
}
(Some(string), None) => {
string.chars().map(|c| c.to_string()).for_each(|c| {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the inefficient NULL delimiter handling mentioned in the PR summary.

Comment thread datafusion/functions-nested/src/string.rs
Comment thread datafusion/functions-nested/src/string.rs Outdated
Comment on lines +514 to +527
fn get_str_value(array: &ArrayRef, i: usize) -> Option<&str> {
if array.is_null(i) {
return None;
}
match array.data_type() {
Utf8 => Some(array.as_string::<i32>().value(i)),
LargeUtf8 => Some(array.as_string::<i64>().value(i)),
Utf8View => Some(array.as_string_view().value(i)),
other => {
debug_assert!(false, "unexpected type in get_str_value: {other:?}");
None
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn get_str_value(array: &ArrayRef, i: usize) -> Option<&str> {
if array.is_null(i) {
return None;
}
match array.data_type() {
Utf8 => Some(array.as_string::<i32>().value(i)),
LargeUtf8 => Some(array.as_string::<i64>().value(i)),
Utf8View => Some(array.as_string_view().value(i)),
other => {
debug_assert!(false, "unexpected type in get_str_value: {other:?}");
None
}
}
}
#[inline]
fn get_str_value<O: OffsetSizeTrait>(
array: &GenericStringArray<O>,
i: usize,
) -> Option<&str> {
if array.is_null(i) {
None
} else {
Some(array.value(i))
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! However, GenericStringArray doesn't handle StringViewArray; the current approach was overall the cleanest way I could think of, but let me know if you know of a better way to do this.

Copy link
Copy Markdown
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @neilconway I left some comments but i didn't check its performace

@neilconway
Copy link
Copy Markdown
Contributor Author

@comphead Thanks for the review!

Copy link
Copy Markdown
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @neilconway sorry, I lost this ticket

@comphead comphead added this pull request to the merge queue Mar 31, 2026
Merged via the queue into apache:main with commit d87d8f6 Mar 31, 2026
33 checks passed
@neilconway neilconway deleted the neilc/optimize-string-to-array branch March 31, 2026 16:27
Rich-T-kid pushed a commit to Rich-T-kid/datafusion that referenced this pull request Apr 21, 2026
## Which issue does this PR close?

- Closes apache#21129.

## Rationale for this change

When the delimiter (and null string, if supplied) are scalars, we can
implement `string_to_array` more efficiently. In particular, we can
construct a `memmem::Finder` and use it to search for delimiters more
efficiently.

This PR implements this optimization; it also fixes a place where we
were allocating an intermediate `String` for every character when the
delimiter is `NULL`. (This isn't a common case but worth fixing.)

Benchmarks (M4 Max):

```
  single_char_delim/5:    34.8 µs  (was  61.1 µs)  -43%
  single_char_delim/20:  145.1 µs  (was 220.7 µs)  -34%
  single_char_delim/100: 679.4 µs  (was   1.04 ms) -35%

  multi_char_delim/5:    41.7 µs  (was  56.7 µs)  -27%
  multi_char_delim/20:  158.9 µs  (was 185.1 µs)  -14%
  multi_char_delim/100: 731.4 µs  (was 858.3 µs)  -15%

  with_null_str/5:    43.1 µs  (was  68.7 µs)  -37%
  with_null_str/20:  179.3 µs  (was 244.3 µs)  -27%
  with_null_str/100: 895.8 µs  (was   1.16 ms) -23%

  null_delim/5:    17.4 µs  (was  64.1 µs)  -73%
  null_delim/20:   63.0 µs  (was 233.4 µs)  -73%
  null_delim/100: 280.2 µs  (was   1.12 ms) -75%

  columnar_delim/5:    65.2 µs  (was  60.2 µs)  +8%
  columnar_delim/20:  217.2 µs  (was 224.1 µs)  -3%
  columnar_delim/100:   1.02 ms  (was   1.05 ms) -3%
```

## What changes are included in this PR?

* Add benchmark for `string_to_array`
* Implement optimizations described above
* Refactor columnar (fallback) path to get rid of a lot of type dispatch
boilerplate
* Improve SLT test coverage for the "columnar string, scalar other-args"
case

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize string_to_array for scalar args

2 participants