perf: Optimize split_part, support Utf8View #21119

Merged
alamb merged 7 commits into apache:main from neilconway:neilc/optimize-split-part on Mar 27, 2026

@neilconway neilconway commented Mar 23, 2026

Which issue does this PR close?

  • Closes #21117.
  • Closes #21118.

Rationale for this change

split_part currently accepts Utf8View but always returns Utf8. When given Utf8View input, it should instead return Utf8View output.

While we're at it, optimize split_part for single-character delimiters (the common case): str::split(&str) is significantly slower than str::split(char) for single-character ASCII delimiters, because the former uses a general string matching algorithm but the latter uses memchr::memchr.
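The fast path described above can be sketched roughly as follows. This is a minimal illustration and not the merged implementation: it assumes `n` is a 0-based index and relies only on the standard library's `str::split`, which accepts both `char` and `&str` patterns.

```rust
/// Returns the `n`-th (0-based, an assumption of this sketch) piece of
/// `string` when split by `delimiter`, or `None` if there is no such piece.
fn split_nth<'a>(string: &'a str, delimiter: &str, n: usize) -> Option<&'a str> {
    // `len()` counts bytes, so a 1-byte delimiter is exactly one ASCII character.
    if delimiter.len() == 1 {
        let ch = delimiter.as_bytes()[0] as char;
        // `char` pattern: uses the specialized single-character search.
        string.split(ch).nth(n)
    } else {
        // `&str` pattern: uses the general substring search.
        string.split(delimiter).nth(n)
    }
}
```

With this sketch, `split_nth("a,b,c", ",", 1)` returns `Some("b")`, while a multi-character delimiter such as `"::"` falls through to the general path.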

Benchmark results (M4 Max):

  • utf8_single_char/pos_first: 142 µs → 104 µs (-26%)
  • utf8_single_char/pos_middle: 389 µs → 365 µs (-6%)
  • utf8_single_char/pos_negative: 154 µs → 109 µs (-29%)
  • utf8_multi_char/pos_middle: 356 µs → 361 µs (~0%, noise)
  • utf8view_single_char/pos_first: 143 µs → 111 µs (-22%)
  • utf8_long_strings/pos_middle: 2568 µs → 1984 µs (-23%)
  • utf8view_long_parts/pos_middle: 998 µs → 470 µs (-53%)

What changes are included in this PR?

  • Revise split_part benchmarks to reduce redundancy and improve Utf8View coverage
  • Support Utf8View -> Utf8View in split_part
  • Refactor split_part to clean up some redundant code
  • Optimize split_part for single-character delimiters
  • Add SLT test coverage for split_part with Utf8View input
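The return-type change in the list above can be pictured with a small model. `StringType`, `return_type_before`, and `return_type_after` are hypothetical names used only for illustration; they are not DataFusion's API, and the assumption that `LargeUtf8` was already preserved is not stated in this PR.

```rust
/// Hypothetical stand-in for the string variants of Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Behavior the PR describes as current: `Utf8View` input yields `Utf8` output.
/// Other types are assumed (for this sketch) to pass through unchanged.
fn return_type_before(input: StringType) -> StringType {
    match input {
        StringType::Utf8View => StringType::Utf8,
        other => other,
    }
}

/// Behavior after the PR: the output type mirrors the input type.
fn return_type_after(input: StringType) -> StringType {
    input
}
```

The only case that differs between the two functions is `Utf8View`, which is the case the PR targets.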

Are these changes tested?

Yes. New tests and benchmarks added.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Mar 23, 2026
@neilconway neilconway (Contributor, Author) commented

split_part can be optimized further; probably scalar specialization would be a nice win. But I'd like to get this PR in first to make it easier to review.

Comment thread datafusion/functions/src/string/split_part.rs Outdated
Comment thread datafusion/functions/benches/split_part.rs
@alamb alamb left a comment

Thanks @neilconway and @martin-g

/// Finds the nth split part of `string` by `delimiter`.
#[inline]
fn split_nth<'a>(string: &'a str, delimiter: &str, n: usize) -> Option<&'a str> {
    if delimiter.len() == 1 {

As a follow on, we can probably hoist this check out of the loop (so call it once per batch rather than once per string) and see if that makes things any faster (sometimes it allows the compiler to optimize things better)
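One way to picture the hoisting suggested here is sketched below. It assumes the delimiter is the same for every row in the batch (for example, a scalar argument), which is what makes a once-per-batch check possible; `split_part_batch` is a hypothetical helper, not a function from the PR.

```rust
/// Hypothetical batch helper: every row shares one delimiter, so the
/// single-character check runs once per batch instead of once per string.
fn split_part_batch<'a>(
    strings: &[&'a str],
    delimiter: &str,
    n: usize,
) -> Vec<Option<&'a str>> {
    if delimiter.len() == 1 {
        let ch = delimiter.as_bytes()[0] as char;
        // This loop only ever sees the `char` pattern.
        strings.iter().map(|&s| s.split(ch).nth(n)).collect()
    } else {
        // This loop only ever sees the `&str` pattern.
        strings.iter().map(|&s| s.split(delimiter).nth(n)).collect()
    }
}
```

Each branch contains a loop with a single pattern type, which may give the compiler more room to optimize than a branch inside the per-row call. Whether this is measurably faster would need benchmarking.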

match args[1].data_type() {
    DataType::Utf8View => split_part_impl(
        $str_arr,
        &args[1].as_string_view(),

As a follow on, we can probably make this even faster with a special, no copy implementation for Utf8View (reuse the same string buffers, but just adjust the views)
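The no-copy idea can be illustrated with a deliberately simplified model. The `View` struct below is a hypothetical stand-in (a byte range into one shared buffer); Arrow's actual `Utf8View` layout is more involved, and this sketch also assumes every view starts and ends on a UTF-8 character boundary.

```rust
/// Simplified stand-in for a string-view entry: a byte range into a shared buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// For each input view, returns a narrower view covering the `n`-th (0-based)
/// part. The result points into the same buffer; no string bytes are copied.
fn split_part_views(
    buffer: &str,
    views: &[View],
    delimiter: char,
    n: usize,
) -> Vec<Option<View>> {
    views
        .iter()
        .map(|v| {
            let s = &buffer[v.offset..v.offset + v.len];
            s.split(delimiter).nth(n).map(|part| View {
                // `part` is a subslice of `s`, so the address difference is
                // its byte offset within `s`.
                offset: v.offset + (part.as_ptr() as usize - s.as_ptr() as usize),
                len: part.len(),
            })
        })
        .collect()
}
```

The output vector holds only adjusted offsets and lengths, so the cost per row is independent of how long the selected part is.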

@alamb alamb added this pull request to the merge queue Mar 27, 2026
@alamb alamb added the performance Make DataFusion faster label Mar 27, 2026
Merged via the queue into apache:main with commit 9f893a4 Mar 27, 2026
31 checks passed
@neilconway neilconway deleted the neilc/optimize-split-part branch March 29, 2026 13:19
Rich-T-kid pushed a commit to Rich-T-kid/datafusion that referenced this pull request Apr 21, 2026
## Which issue does this PR close?

- Closes apache#21117.
- Closes apache#21118.


Labels

  • functions: Changes to functions implementation
  • performance: Make DataFusion faster
  • sqllogictest: SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

  • Optimize split_part for single-character delimiters
  • split_part should preserve Utf8View input

3 participants