Support Dictionary Arrays in MIN/MAX Aggregates by kosiew · Pull Request #21315 · apache/datafusion

kosiew · 2026-04-02T09:19:10Z

Which issue does this PR close?

Closes Dictionary coercion on min/max #21150.

Rationale for this change

The existing implementation of min/max does not correctly support dictionary-encoded arrays. Previously, dictionary arrays were handled by directly evaluating their underlying values array, which is semantically incorrect because:

It may include unreferenced values that do not appear in the logical dataset
It ignores nulls in the key array
It does not preserve dictionary key semantics in scalar results

This leads to incorrect aggregation results for dictionary types.

This PR introduces a logical row-based evaluation for dictionary arrays and ensures scalar comparisons correctly unwrap and rewrap dictionary values when needed.

What changes are included in this PR?

Add logical row-based min/max computation (scalar_row_extreme) for:
- Dictionary arrays
- Struct, List, LargeList, and FixedSizeList types
Replace previous dictionary handling that operated on values() with correct row-wise evaluation
Introduce requires_logical_row_scan to centralize fallback logic for complex types
Enhance scalar comparison logic:
- Unwrap dictionary scalars before comparison
- Rewrap results when both inputs are dictionaries with matching key types
- Validate key type compatibility
Improve error messaging for incompatible scalar comparisons
Remove obsolete min_max_batch_generic implementation

Are these changes tested?

Yes. Comprehensive tests have been added to validate correctness across multiple scenarios:

Basic dictionary min/max behavior
Handling of null keys and null values
Ignoring unreferenced dictionary values
Multi-batch aggregation behavior
Float dictionary handling including NaN and infinities

These tests ensure correctness and guard against regressions.

Are there any user-facing changes?

Yes. The behavior of min and max on dictionary-encoded arrays is now:

Correct and semantically aligned with logical row values
Consistent with other data types

Previously incorrect results may now differ, which is a correctness fix rather than a breaking API change.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Return ScalarValue::Dictionary(...) in dictionary batches instead of unwrapping to inner scalars. Enhance min_max! logic to safely handle dictionary-vs-dictionary and dictionary-vs-non-dictionary comparisons. Add regression tests for raw-dictionary covering no-coercion, null-containing, and multi-batch scenarios.

Centralize dictionary batch handling for min/max operations. Streamline min_max_batch_generic to initialize from the first non-null element. Implement shared setup/assert helpers in dictionary tests to reduce repetition while preserving test coverage.

Refactor dictionary min/max flow by removing the wrap macro arm, making re-wrapping explicit through a private helper. This separates the "choose inner winner" from the "wrap as dictionary" step for easier auditing. In `datafusion/functions-aggregate/src/min_max.rs`, update `string_dictionary_batch` to accept slices instead of owned Vecs, and introduce a small `evaluate_dictionary_accumulator` helper to streamline min/max assertions with a shared accumulator execution path, reducing repeated setup.

Update min_max.rs to ensure dictionary batches iterate actual array rows, comparing referenced scalar values. Unreferenced dictionary entries no longer affect MIN/MAX, and referenced null values are correctly skipped. Expanded tests to cover these changes and updated expectations Added regression tests for unreferenced and referenced null dictionary values.

Consolidate row-wise min/max scan logic into a single helper in min_max.rs to ensure consistency between dictionary and generic complex-type paths. Add regression test for the float dictionary handling NaN and -inf cases, validating ordering semantics across batches.

Remove the no-op dictionary macro and single-use wrapper. Collapse dictionary handling into a normalized path and seed scalar_batch_extreme from the first non-null value. Unify row-wise batch dispatch behind a shared predicate. Apply formatting adjustments in min_max.rs as per cargo fmt.

Jefffrey · 2026-04-16T09:57:27Z

    }};
 }

+fn scalar_batch_extreme(values: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> {


I'm curious why this was renamed and refactored from min_max_batch_generic when the functionality seems equivalent? And to me this new version is less readable than the original

The refactor was mainly to fix dictionary semantics, not just to rename the helper. The old min_max_batch_generic logic was equivalent for Struct/List-like types, but it was not equivalent for Dictionary: the old DataType::Dictionary(_, _) => max_batch(values.as_any_dictionary().values()) / min_batch(...) path scanned the dictionary value buffer rather than the logical rows. That means it could consider unreferenced dictionary values and ignore null key positions, which is exactly what the new dictionary tests are guarding against.

scalar_batch_extreme uses ScalarValue::try_from_array(values, index) so it compares the logical row values instead of the backing dictionary values array. However, I agree the current naming can be improved. I'll rename it to scalar_row_extreme and add a short doc comment explaining why dictionaries need this path.

None of this sounds accurate; the previous min_max_batch_generic already usesScalarValue::try_from_array. We can safely use the existing code of min_max_batch_generic for dictionary inputs without modification.

You’re right.

min_max_batch_generic was already doing logical-row extraction correctly. The actual problem was only that dictionary arrays were not routed through it. I’ll simplify the change accordingly.

Jefffrey · 2026-04-16T10:01:19Z

+
+                match lhs_key_type.zip(rhs_key_type) {
+                    Some((key_type, _)) => {
+                        ScalarValue::Dictionary(Box::new(key_type.clone()), Box::new(result))


Is there significance here to rhs key type always being ignored?

In the intended accumulator flow there should not be any significance, because both sides should already have the same dictionary datatype: the accumulator state is created from the aggregate input type, and each batch result should come back with that same dictionary key type. In that happy path, preserving either side’s key type produces the same result.

But, the current code makes that assumption too implicitly.

Right now we unwrap both dictionary scalars, compare the inner values, and then re-wrap using the left key type whenever both sides are dictionaries. If the key types ever diverged unexpectedly, this would silently ignore the right-hand key type instead of surfacing an error. I think the safer behavior is to validate that both key types match and return an error if they do not, then re-wrap using the validated key type.

So now we require the key types to be exact same? This doesn't seem to follow the new logic introduced that allows logically equivalent values to be compared 🤔

It seems like we're saying we are allowed to compare Float32 with Dictionary(Int32, Float32); but we cannot compare Dictionary(Int8, Float32) with Dictionary(Int32, Float32)?

Thanks for pointing out the inconsistency in the current framing.

I think the safest way to resolve it is not to broaden the “logically equivalent values” story, but to narrow the documented contract back to the cases the accumulator actually expects. In normal min/max aggregation, both dictionary scalars should come from the same aggregate input type, so differing dictionary key types are not a supported user-facing scenario so much as an unexpected invariant break.

So I would avoid presenting this as general mixed-type comparison support. Instead, I’d document that dictionary unwrapping exists to preserve logical value comparison for the expected accumulator flow, and I’d keep mismatched dictionary key types explicit as an internal-error case unless we intentionally decide to support cross-key-type dictionaries more broadly.

…ation - Renamed the helper function from `scalar_batch_extreme` to `scalar_row_extreme` - Added a doc comment explaining that this helper scans logical rows and the necessity of this for dictionary arrays (noting that comparing dictionary.values() is not semantically correct). - Updated both call sites to use the new helper name in min/max batch logic at min_max.rs

- Renamed the predicate from `is_row_wise_batch_type` to `requires_logical_row_scan` at `min_max.rs - Added a comment clarifying the strategy for handling primitive, string, and binary types using specialized kernels, while this set falls back to scalar row-by-row logical comparison. - Updated both call sites to reflect the new name at `min_max.rs`

- Updated macro comment to explicitly document dictionary unwrapping behavior, stating that min/max operates on logically compatible scalar values (min_max.rs) - Clarified catch-all error arm wording to emphasize logical incompatibility instead of enum-variant mismatch, changing it to "logically incompatible scalar values" (min_max.rs)

…andling - Implemented explicit dictionary key-type validation for dictionary scalars in the macro dictionary branch - Introduced an error for mismatched dictionary key types instead of silently preserving the left key type - Updated macro comments to clarify that key types are validated before rewrapping

- Renamed the generic row-scan helper back to `min_max_batch_generic` at `min_max.rs:461`. - Removed the `requires_logical_row_scan` indirection. - Updated match arms at `min_max.rs:803` and `min_max.rs:855` to explicitly route `Struct`, `List`, `LargeList`, `FixedSizeList`, and `Dictionary` through `min_max_batch_generic`, including dictionary reuse as requested.

- Improved comparison behavior between dictionaries: - Same key type dictionaries now compare inner logical values and rewrap the result. - Different key type dictionaries raise an explicit internal error. - Comparisons between dictionaries and non-dictionary types now check inner logical values directly. - Updated and tightened the macro comment to clarify mixed-type dictionary support limitations. - Added focused unit tests for: - Dictionary vs scalar comparison - Same-key dictionary rewrapping - Mismatched dictionary key types - Incompatible dictionary and plain scalar comparisons

- Resolved non-local return issue in macro by moving scalar comparison logic to `min_max_scalar(...) -> Result<ScalarValue>`. - Maintained `min_max!` as a thin wrapper for existing call sites. - Updated dictionary error-path tests to directly call `min_max_scalar`, preserving behavior while enhancing testability for error cases.

…ionary(Int8, Utf8) - Removed redundant `dictionary_inner_scalar_min_max` helper and invoked `min_max_scalar(...)` directly in the dictionary match arms to streamline code. - Added end-to-end aggregate test for `Dictionary(Int8, Utf8)` via `test_min_max_dictionary_int8_keys`. - Introduced a generic test helper for building string dictionaries with various key types, reducing setup duplication.

- Updated dictionary bindings to use explicit names: - `lhs_dict_key_type` - `rhs_dict_key_type` - `lhs_dict_value` - `rhs_dict_value` - Renamed mixed-arm bindings to better distinguish between scalar and dictionary-inner values: - `rhs_scalar` - `lhs_scalar` - `rhs_dict_value` - `lhs_dict_value` This improves consistency in reading each arm by role without introducing new errors.

…l matching - Removed the call to min_max_scalar within min_max function. - Implemented internal matching using choose_min_max!($OP). - Updated return values to provide Ok with min or max based on Ordering result. - Added unreachable!() for Ordering::Equal case for error handling.

… handling

Modified the import statements in the `min_max.rs` file to enhance readability by grouping related types together. This change organizes the code structure and follows standard conventions.

Jefffrey · 2026-04-21T13:24:45Z

+/// Finds the min/max by scanning logical rows via `ScalarValue::try_from_array`.
+///
+/// This path is required for dictionary arrays because comparing
+/// `dictionary.values()` is not semantically correct: it can include
+/// unreferenced values and ignore null key positions.
+fn min_max_batch_generic(values: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> {
+    let mut index = 0;


I still do not understand why this function is being changed. The provided reasoning makes no sense considering the original version already works as intended. Especially considering it is not the responsibility of this function that dictionary.values() isn't semantically correct; it is the responsibility of the caller, which is not being fixed by refactoring this function. Can we please revert the changes to this function.

it is not the responsibility of this function that dictionary.values() isn't semantically correct

I amended the comment and locked this down with a new test.

..why this function is being changed....the original version already works as intended.

The changed min_max_batch_generic shape is a secondary improvement (independent of the bug fix): it first finds the first non-null row and only then enters comparison.

The two-phase version is easier to follow because it separates setup from comparison.

Phase 1 finds the first non-null value.

Phase 2 compares remaining non-null values against that baseline.

After phase 1, extreme is always non-null, so the loop only needs to:

Skip null current values.

Compare/update for non-null values.

The old single-loop mixed these concerns in every iteration (current null, extreme null, or compare), which made the logic more branchy and harder to read.

…g responsibility - Clarified caller responsibility for dictionary routing in the helper docs. - Maintained caller-side correctness routing in min_max.rs at lines 820 and 906. - Added regression coverage with a new test in min_max.rs at line 996 to ensure min/max are computed from logical rows instead of raw dictionary values, including unreferenced entries and null-key positions.

… DictionaryArray - Removed redundant commentary in `min_max.rs` regarding dictionary arrays. - Improved handling of dictionary array values by explicitly extracting and processing raw values. - Added assertions in unit tests to validate the results when operating on raw values from the dictionary.

…ues from DictionaryArray" This reverts commit f1bffe32b5f7493725138bdab04b6f58e8db651a.

- Added `AsArray` import for `DictionaryArray` to facilitate raw values extraction. - Updated tests to include assertions for raw minimum values from the dictionary array, improving test coverage and validation accuracy.

kosiew added 4 commits April 2, 2026 16:32

github-actions Bot added core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions labels Apr 2, 2026

kosiew force-pushed the dictionary-coercion-21150 branch from 08a20c4 to 5002677 Compare April 2, 2026 13:46

github-actions Bot added sqllogictest SQL Logic Tests (.slt) and removed logical-expr Logical plan and expressions labels Apr 2, 2026

kosiew changed the title ~~Support dictionary-encoded arrays in MIN/MAX aggregates and preserve dictionary types~~ Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion Apr 2, 2026

kosiew force-pushed the dictionary-coercion-21150 branch from f2330b9 to b4938c1 Compare April 2, 2026 14:37

github-actions Bot added logical-expr Logical plan and expressions and removed sqllogictest SQL Logic Tests (.slt) labels Apr 2, 2026

kosiew changed the title ~~Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion~~ Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering Apr 2, 2026

kosiew mentioned this pull request Apr 3, 2026

TEST - troubleshoot #21315 #21334

Closed

github-actions Bot removed the logical-expr Logical plan and expressions label Apr 3, 2026

kosiew added 2 commits April 7, 2026 11:47

kosiew force-pushed the dictionary-coercion-21150 branch from 0e7a98d to dad6e02 Compare April 7, 2026 03:56

github-actions Bot removed the core Core DataFusion crate label Apr 7, 2026

kosiew marked this pull request as ready for review April 7, 2026 07:23

Jefffrey changed the title ~~Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering~~ Support Dictionary Arrays in MIN/MAX Aggregates Apr 16, 2026

Jefffrey reviewed Apr 16, 2026

View reviewed changes

kosiew added 4 commits April 17, 2026 18:51

kosiew marked this pull request as draft April 17, 2026 12:42

kosiew marked this pull request as ready for review April 17, 2026 13:50

kosiew requested a review from Jefffrey April 17, 2026 14:16

kosiew added 7 commits April 21, 2026 17:25

feat: reintroduce min_max_batch_generic function for dictionary array…

77a518e

… handling

kosiew force-pushed the dictionary-coercion-21150 branch from 8ccd93b to 77a518e Compare April 21, 2026 11:55

feat: reorder imports in min_max.rs for improved clarity

0b8592d

Modified the import statements in the `min_max.rs` file to enhance readability by grouping related types together. This change organizes the code structure and follows standard conventions.

Jefffrey reviewed Apr 21, 2026

View reviewed changes

kosiew added 4 commits April 22, 2026 13:36

Revert "feat(min_max): update min_max_batch_generic to handle raw val…

2669a30

…ues from DictionaryArray" This reverts commit f1bffe32b5f7493725138bdab04b6f58e8db651a.

kosiew force-pushed the dictionary-coercion-21150 branch from c6104a2 to e716c92 Compare April 22, 2026 06:15

kosiew requested a review from Jefffrey April 22, 2026 06:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Dictionary Arrays in MIN/MAX Aggregates#21315

Support Dictionary Arrays in MIN/MAX Aggregates#21315
kosiew wants to merge 22 commits intoapache:mainfrom
kosiew:dictionary-coercion-21150

kosiew commented Apr 2, 2026 •

edited

Loading

Uh oh!

Jefffrey Apr 16, 2026

Uh oh!

kosiew Apr 17, 2026

Uh oh!

Jefffrey Apr 18, 2026

Uh oh!

kosiew Apr 21, 2026

Uh oh!

Uh oh!

Uh oh!

Jefffrey Apr 16, 2026

Uh oh!

kosiew Apr 17, 2026

Uh oh!

Jefffrey Apr 18, 2026

Uh oh!

kosiew Apr 21, 2026

Uh oh!

Jefffrey Apr 21, 2026

Uh oh!

kosiew Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kosiew commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

LLM-generated code disclosure

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kosiew commented Apr 2, 2026 •

edited

Loading