perf: Optimize `split_part`, support `Utf8View` by neilconway · Pull Request #21119 · apache/datafusion

neilconway · 2026-03-23T15:55:09Z

Which issue does this PR close?

Closes #21117.
Closes #21118.

Rationale for this change

split_part currently accepts Utf8View but always returns Utf8. When given Utf8View input, it should instead return Utf8View output.

While we're at it, optimize split_part for single-character delimiters (the common case): str::split(&str) is significantly slower than str::split(char) for single-character ASCII delimiters, because the former uses a general string matching algorithm but the latter uses memchr::memchr.

Benchmark results (M4 Max):

utf8_single_char/pos_first: 142 µs → 104 µs (-26%)
utf8_single_char/pos_middle: 389 µs → 365 µs (-6%)
utf8_single_char/pos_negative: 154 µs → 109 µs (-29%)
utf8_multi_char/pos_middle: 356 µs → 361 µs (~0%, noise)
utf8view_single_char/pos_first: 143 µs → 111 µs (-22%)
utf8_long_strings/pos_middle: 2568 µs → 1984 µs (-23%)
utf8view_long_parts/pos_middle: 998 µs → 470 µs (-53%)
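The single-character fast path described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name and signature follow the `split_nth` helper quoted in the review below, but the body, the 0-based meaning of `n`, and the lack of empty-delimiter handling are assumptions made for the example.

```rust
/// Returns the `n`th (0-based) part of `string` when split by `delimiter`.
///
/// Sketch only: an empty delimiter is not given any special treatment here.
fn split_nth<'a>(string: &'a str, delimiter: &str, n: usize) -> Option<&'a str> {
    if delimiter.len() == 1 {
        // A one-byte `&str` is always a single ASCII character, so it can be
        // converted to a `char` and searched for with the faster char pattern.
        let ch = delimiter.as_bytes()[0] as char;
        string.split(ch).nth(n)
    } else {
        // Multi-byte delimiters fall back to general substring search.
        string.split(delimiter).nth(n)
    }
}
```

For instance, `split_nth("a,b,c", ",", 1)` takes the `char` path and `split_nth("a::b::c", "::", 2)` takes the `&str` path; both paths return the same parts, so only the search strategy differs.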

What changes are included in this PR?

Revise split_part benchmarks to reduce redundancy and improve Utf8View coverage
Support Utf8View -> Utf8View in split_part
Refactor split_part to clean up some redundant code
Optimize split_part for single-character delimiters
Add SLT test coverage for split_part with Utf8View input

Are these changes tested?

Yes. New tests and benchmarks added.

Are there any user-facing changes?

No.

neilconway · 2026-03-23T15:58:17Z

split_part can be optimized further; probably scalar specialization would be a nice win. But I'd like to get this PR in first to make it easier to review.

alamb

Thanks @neilconway and @martin-g

alamb · 2026-03-27T17:45:15Z

+/// Finds the nth split part of `string` by `delimiter`.
+#[inline]
+fn split_nth<'a>(string: &'a str, delimiter: &str, n: usize) -> Option<&'a str> {
+    if delimiter.len() == 1 {


As a follow-up, we can probably hoist this check out of the loop (so it runs once per batch rather than once per string) and see if that makes things any faster. Sometimes that lets the compiler optimize the loop better.
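A rough sketch of what hoisting the check could look like. The helper name `split_part_batch` is hypothetical, and the sketch assumes the delimiter is the same for every row in the batch (for example, a scalar argument); with a per-row delimiter array the check cannot be hoisted this simply.

```rust
/// Hypothetical helper: extracts the `n`th (0-based) part of every string in
/// a batch. Assumes one delimiter is shared by all rows.
fn split_part_batch<'a>(
    strings: &[&'a str],
    delimiter: &str,
    n: usize,
) -> Vec<Option<&'a str>> {
    // The length check runs once per batch; each branch then contains a
    // tight loop specialized for one pattern type.
    if delimiter.len() == 1 {
        let ch = delimiter.as_bytes()[0] as char;
        strings.iter().map(|&s| s.split(ch).nth(n)).collect()
    } else {
        strings.iter().map(|&s| s.split(delimiter).nth(n)).collect()
    }
}
```

Whether this is measurably faster than checking per string would need to be benchmarked; the branch is cheap and well predicted, so the gain, if any, would come from the compiler specializing each loop.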

alamb · 2026-03-27T17:46:39Z

+                match args[1].data_type() {
+                    DataType::Utf8View => split_part_impl(
+                        $str_arr,
+                        &args[1].as_string_view(),


As a follow-up, we can probably make this even faster with a special zero-copy implementation for Utf8View (reuse the same string buffers and just adjust the views).
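The idea of adjusting views instead of copying bytes can be illustrated with a simplified model. The `View` struct and `split_part_view` function below are hypothetical and use only the standard library; Arrow's actual `Utf8View` layout (which, among other things, inlines short strings) is more involved, so this shows the principle rather than an implementation.

```rust
/// Simplified stand-in for a string view: a byte range into a shared buffer.
/// This is a model for illustration, not Arrow's real view layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct View {
    offset: usize,
    len: usize,
}

/// Returns a view of the `n`th (0-based) part of the string that `view`
/// refers to. No bytes are copied out of `buffer`; only the range changes.
fn split_part_view(buffer: &str, view: View, delimiter: char, n: usize) -> Option<View> {
    let s = &buffer[view.offset..view.offset + view.len];
    let part = s.split(delimiter).nth(n)?;
    // `part` is a subslice of `s`, so the pointer difference is its offset
    // within `s`.
    let start = part.as_ptr() as usize - s.as_ptr() as usize;
    Some(View {
        offset: view.offset + start,
        len: part.len(),
    })
}
```

Because the output view points into the same buffer as the input, the cost per row is the delimiter search plus writing one small struct, regardless of how long the selected part is.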

## Which issue does this PR close?

- Closes apache#21117.
- Closes apache#21118.

## Rationale for this change

`split_part` currently accepts `Utf8View` but always returns `Utf8`. When given `Utf8View` input, it should instead return `Utf8View` output.

While we're at it, optimize `split_part` for single-character delimiters (the common case): `str::split(&str)` is significantly slower than `str::split(char)` for single-character ASCII delimiters, because the former uses a general string matching algorithm but the latter uses `memchr::memchr`.

Benchmark results (M4 Max):

- `utf8_single_char/pos_first`: 142 µs → 104 µs (-26%)
- `utf8_single_char/pos_middle`: 389 µs → 365 µs (-6%)
- `utf8_single_char/pos_negative`: 154 µs → 109 µs (-29%)
- `utf8_multi_char/pos_middle`: 356 µs → 361 µs (~0%, noise)
- `utf8view_single_char/pos_first`: 143 µs → 111 µs (-22%)
- `utf8_long_strings/pos_middle`: 2568 µs → 1984 µs (-23%)
- `utf8view_long_parts/pos_middle`: 998 µs → 470 µs (-53%)

## What changes are included in this PR?

* Revise `split_part` benchmarks to reduce redundancy and improve `Utf8View` coverage
* Support `Utf8View` -> `Utf8View` in `split_part`
* Refactor `split_part` to cleanup some redundant code
* Optimize `split_part` for single-character delimiters
* Add SLT test coverage for `split_part` with `Utf8View` input

## Are these changes tested?

Yes. New tests and benchmarks added.

## Are there any user-facing changes?

No.

neilconway added 3 commits March 23, 2026 11:44

Revise benchmarks for split_part

5467531

split_part: Optimize, cleanup, support utf8view

d044aff

Fix clippy

bad1a74

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Mar 23, 2026

martin-g reviewed Mar 25, 2026

Comment thread datafusion/functions/src/string/split_part.rs Outdated

Comment thread datafusion/functions/benches/split_part.rs

neilconway added 4 commits March 25, 2026 09:29

with_capacity for GenericStringBuilder, per review comments

a2cc083

Merge remote-tracking branch 'origin/main' into neilc/optimize-split-part

592c210

Add another benchmark scenario, per review comments

65833ac

Tweak new benchmark scenario

6c97764

martin-g approved these changes Mar 25, 2026

alamb approved these changes Mar 27, 2026

alamb added this pull request to the merge queue Mar 27, 2026

alamb added the performance (Make DataFusion faster) label Mar 27, 2026

Merged via the queue into apache:main with commit 9f893a4 Mar 27, 2026
31 checks passed

neilconway mentioned this pull request Mar 27, 2026

Optimize split_part for scalar args #21204

Closed

neilconway deleted the neilc/optimize-split-part branch March 29, 2026 13:19

neilconway mentioned this pull request Apr 2, 2026

perf: Optimize split_part for scalar args #21238

Merged

alamb merged 7 commits into apache:main from neilconway:neilc/optimize-split-part


3 participants
