perf: Optimize `split_part` for scalar args by neilconway · Pull Request #21238 · apache/datafusion

neilconway · 2026-03-29T16:22:08Z

Which issue does this PR close?

Closes #21204 (Optimize split_part for scalar args).

Rationale for this change

In practice, `split_part(string, delimiter, position)` is often invoked with constant values for `delimiter` and `position`. We can take advantage of that to hoist some conditional branches out of the per-row hot loop; more importantly, we can switch from using `str::split` to building a `memchr::memmem::Finder` and using it for each row. Building a `Finder` is relatively expensive but it's a clear win when we can amortize that one-time cost over an entire input batch.
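A minimal, dependency-free sketch of the branch-hoisting idea described above. It is not the code from this PR: the names `split_part_scalar`, `nth_part`, and `nth_part_from_end` are hypothetical, the empty-delimiter behavior is an assumption modeled on PostgreSQL-style semantics, and the standard library's `str::find` / `str::rfind` stand in for the prebuilt `memchr::memmem::Finder` that the PR reuses across rows.

```rust
/// Returns the part at 0-based index `n`, counted from the left.
/// Each `find` call marks where a prebuilt searcher would be reused.
fn nth_part<'a>(row: &'a str, delimiter: &str, n: usize) -> &'a str {
    let mut start = 0;
    // Skip the first `n` parts; running out of delimiters means "no such part".
    for _ in 0..n {
        match row[start..].find(delimiter) {
            Some(i) => start += i + delimiter.len(),
            None => return "",
        }
    }
    match row[start..].find(delimiter) {
        Some(i) => &row[start..start + i],
        None => &row[start..],
    }
}

/// Returns the part at 0-based index `n`, counted from the right.
fn nth_part_from_end<'a>(row: &'a str, delimiter: &str, n: usize) -> &'a str {
    let mut end = row.len();
    for _ in 0..n {
        match row[..end].rfind(delimiter) {
            Some(i) => end = i,
            None => return "",
        }
    }
    match row[..end].rfind(delimiter) {
        Some(i) => &row[i + delimiter.len()..end],
        None => &row[..end],
    }
}

/// Hypothetical batch entry point: `delimiter` and `position` are scalars,
/// so every check that depends only on them runs once, before the row loop.
fn split_part_scalar<'a>(
    rows: &[&'a str],
    delimiter: &str,
    position: i64,
) -> Result<Vec<&'a str>, String> {
    if position == 0 {
        return Err("field position must not be zero".to_string());
    }
    // 0-based index of the wanted part, counted from whichever end applies.
    let n = (position.unsigned_abs() - 1) as usize;

    if delimiter.is_empty() {
        // Assumption: with an empty delimiter the whole string is the only part.
        let keep = n == 0;
        return Ok(rows.iter().map(|&row| if keep { row } else { "" }).collect());
    }

    // The sign of `position` is tested once here, not once per row.
    let parts = if position > 0 {
        rows.iter().map(|&row| nth_part(row, delimiter, n)).collect()
    } else {
        rows.iter().map(|&row| nth_part_from_end(row, delimiter, n)).collect()
    };
    Ok(parts)
}
```

For example, `split_part_scalar(&["a,b,c", "x,y", "solo"], ",", 2)` yields `["b", "y", ""]` in this sketch. The one-time work here is only the argument validation and direction choice; in the PR the same structure also amortizes the cost of constructing the `Finder`.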

Benchmarks (M4 Max):

scalar_utf8_single_char/pos_first: 105 µs → 41 µs, -61%
scalar_utf8_single_char/pos_middle: 358 µs → 97 µs, -73%
scalar_utf8_single_char/pos_negative: 110 µs → 46 µs, -58%
scalar_utf8_multi_char/pos_middle: 355 µs → 132 µs, -63%
scalar_utf8_long_strings/pos_middle: 1.97 ms → 1.11 ms, -43%
scalar_utf8view_long_parts/pos_middle: 467 µs → 169 µs, -63%
array_utf8_single_char/pos_middle: 351 µs → 357 µs, no change
array_utf8_multi_char/pos_middle: 366 µs → 357 µs, -2.6%

What changes are included in this PR?

Add benchmarks for split_part with scalar delimiter and position
Add new fast-path for split_part with scalar delimiter and position
Add SLT tests for split_part with scalar delimiter and position

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

neilconway · 2026-04-02T13:03:58Z

@martin-g Any interest in reviewing this PR? It's a follow-on to the initial split_part work that was done in #21119

martin-g · 2026-04-02T13:37:26Z

I'll review it later! Thanks for the ping!

neilconway · 2026-04-02T13:38:30Z

> I'll review it later! Thanks for the ping!

Amazing, thank you!


alamb · 2026-04-06T16:11:43Z

+            string_array.as_string_view(),
+            delimiter,
+            position,
+            StringViewBuilder::with_capacity(string_array.len()),


I think this implementation still copies strings for StringView -- however, you can probably just adjust the view portions if you want to avoid a copy

As another PR perhaps

neilconway · 2026-04-06

Yep! I wanted to land this first, I'll take a look at avoiding copies for StringView shortly. I filed #21410 for this.
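To illustrate the "adjust the view portions" idea from the review comment above, here is a simplified sketch. It does not use Arrow's real `StringViewArray` layout or API (real views are 16-byte entries that can also inline short strings); the `Views` type and `first_part_views` function are hypothetical stand-ins that model a column as one shared buffer plus `(offset, len)` pairs.

```rust
/// Hypothetical, simplified string-view column: a shared buffer plus
/// `(offset, len)` views into it. Not Arrow's actual representation.
struct Views<'a> {
    buffer: &'a str,
    views: Vec<(usize, usize)>,
}

impl<'a> Views<'a> {
    /// Resolves view `i` to the string slice it points at.
    fn get(&self, i: usize) -> &'a str {
        let (offset, len) = self.views[i];
        &self.buffer[offset..offset + len]
    }
}

/// Produces a column holding the first part of each row.
/// No string bytes are copied: each output view is the input view
/// narrowed to end at the first delimiter, over the same buffer.
fn first_part_views<'a>(input: &Views<'a>, delimiter: &str) -> Views<'a> {
    let views = input
        .views
        .iter()
        .map(|&(offset, len)| {
            let row = &input.buffer[offset..offset + len];
            // Assumption: an empty delimiter leaves the row as a single part.
            let part_len = if delimiter.is_empty() {
                len
            } else {
                row.find(delimiter).unwrap_or(len)
            };
            (offset, part_len)
        })
        .collect();
    Views { buffer: input.buffer, views }
}
```

Extracting a later part would adjust the offset as well as the length; the point of the sketch is only that the output can reference the input's bytes instead of appending them to a new builder.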

alamb · 2026-04-06T16:12:05Z

Thanks @martin-g and @neilconway


perf: Optimize split_part for scalar args

a439ef7

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Mar 29, 2026

neilconway added 2 commits March 29, 2026 12:24

cargo fmt

62733bc

Fix clippy

bef9f5f

martin-g reviewed Apr 2, 2026


neilconway added 3 commits April 2, 2026 20:04

Merge remote-tracking branch 'origin/main' into neilc/optimize-split-…

0b2ac9f

…part-scalar

Fixes, per code review

c4f4a5e

Make builder capacity more conservative, add comment

b5f05b4

martin-g approved these changes Apr 5, 2026


alamb reviewed Apr 6, 2026


alamb added the performance (Make DataFusion faster) label Apr 6, 2026

alamb added this pull request to the merge queue Apr 6, 2026

Merged via the queue into apache:main with commit 7fa7fe0 Apr 6, 2026
31 checks passed

neilconway mentioned this pull request Apr 6, 2026

Optimize split_part to avoid copying via StringView #21410

Closed

neilconway deleted the neilc/optimize-split-part-scalar branch April 6, 2026 16:19

alamb merged 6 commits into apache:main from neilconway:neilc/optimize-split-part-scalar
