Skip to content

Support Dictionary Arrays in MIN/MAX Aggregates#21315

Open
kosiew wants to merge 22 commits intoapache:mainfrom
kosiew:dictionary-coercion-21150
Open

Support Dictionary Arrays in MIN/MAX Aggregates#21315
kosiew wants to merge 22 commits intoapache:mainfrom
kosiew:dictionary-coercion-21150

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Apr 2, 2026

Which issue does this PR close?


Rationale for this change

The existing implementation of min/max does not correctly support dictionary-encoded arrays. Previously, dictionary arrays were handled by directly evaluating their underlying values array, which is semantically incorrect because:

  • It may include unreferenced values that do not appear in the logical dataset
  • It ignores nulls in the key array
  • It does not preserve dictionary key semantics in scalar results

This leads to incorrect aggregation results for dictionary types.

This PR introduces a logical row-based evaluation for dictionary arrays and ensures scalar comparisons correctly unwrap and rewrap dictionary values when needed.


What changes are included in this PR?

  • Add logical row-based min/max computation (scalar_row_extreme) for:

    • Dictionary arrays
    • Struct, List, LargeList, and FixedSizeList types
  • Replace previous dictionary handling that operated on values() with correct row-wise evaluation

  • Introduce requires_logical_row_scan to centralize fallback logic for complex types

  • Enhance scalar comparison logic:

    • Unwrap dictionary scalars before comparison
    • Rewrap results when both inputs are dictionaries with matching key types
    • Validate key type compatibility
  • Improve error messaging for incompatible scalar comparisons

  • Remove obsolete min_max_batch_generic implementation


Are these changes tested?

Yes. Comprehensive tests have been added to validate correctness across multiple scenarios:

  • Basic dictionary min/max behavior
  • Handling of null keys and null values
  • Ignoring unreferenced dictionary values
  • Multi-batch aggregation behavior
  • Float dictionary handling including NaN and infinities

These tests ensure correctness and guard against regressions.


Are there any user-facing changes?

Yes. The behavior of min and max on dictionary-encoded arrays is now:

  • Correct and semantically aligned with logical row values
  • Consistent with other data types

Previously incorrect results may now differ, which is a correctness fix rather than a breaking API change.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 4 commits April 2, 2026 16:32
Return ScalarValue::Dictionary(...) in dictionary batches instead of
unwrapping to inner scalars. Enhance min_max! logic to safely handle
dictionary-vs-dictionary and dictionary-vs-non-dictionary comparisons.
Add regression tests for raw-dictionary covering no-coercion,
null-containing, and multi-batch scenarios.
Centralize dictionary batch handling for min/max operations.
Streamline min_max_batch_generic to initialize from the first
non-null element. Implement shared setup/assert helpers in
dictionary tests to reduce repetition while preserving test
coverage.
Refactor dictionary min/max flow by removing the wrap macro arm,
making re-wrapping explicit through a private helper. This
separates the "choose inner winner" from the "wrap as
dictionary" step for easier auditing.

In `datafusion/functions-aggregate/src/min_max.rs`, update
`string_dictionary_batch` to accept slices instead of owned
Vecs, and introduce a small `evaluate_dictionary_accumulator`
helper to streamline min/max assertions with a shared
accumulator execution path, reducing repeated setup.
Update min_max.rs to ensure dictionary batches iterate
actual array rows, comparing referenced scalar values.
Unreferenced dictionary entries no longer affect MIN/MAX,
and referenced null values are correctly skipped.
Expanded tests to cover these changes and updated
expectations
Added regression tests for unreferenced and referenced
null dictionary values.
@github-actions github-actions Bot added core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions labels Apr 2, 2026
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from 08a20c4 to 5002677 Compare April 2, 2026 13:46
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) and removed logical-expr Logical plan and expressions labels Apr 2, 2026
@kosiew kosiew changed the title Support dictionary-encoded arrays in MIN/MAX aggregates and preserve dictionary types Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion Apr 2, 2026
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from f2330b9 to b4938c1 Compare April 2, 2026 14:37
@github-actions github-actions Bot added logical-expr Logical plan and expressions and removed sqllogictest SQL Logic Tests (.slt) labels Apr 2, 2026
@kosiew kosiew changed the title Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering Apr 2, 2026
@github-actions github-actions Bot removed the logical-expr Logical plan and expressions label Apr 3, 2026
kosiew added 2 commits April 7, 2026 11:47
Consolidate row-wise min/max scan logic into a single helper
in min_max.rs to ensure consistency between dictionary
and generic complex-type paths. Add regression test for
the float dictionary handling NaN and -inf cases,
validating ordering semantics across batches.
Remove the no-op dictionary macro and single-use wrapper.
Collapse dictionary handling into a normalized path and seed
scalar_batch_extreme from the first non-null value.
Unify row-wise batch dispatch behind a shared predicate.
Apply formatting adjustments in min_max.rs as per cargo fmt.
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from 0e7a98d to dad6e02 Compare April 7, 2026 03:56
@github-actions github-actions Bot removed the core Core DataFusion crate label Apr 7, 2026
@kosiew kosiew marked this pull request as ready for review April 7, 2026 07:23
@Jefffrey Jefffrey changed the title Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering Support Dictionary Arrays in MIN/MAX Aggregates Apr 16, 2026
}};
}

fn scalar_batch_extreme(values: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this was renamed and refactored from min_max_batch_generic when the functionality seems equivalent? And to me this new version is less readable than the original

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactor was mainly to fix dictionary semantics, not just to rename the helper. The old min_max_batch_generic logic was equivalent for Struct/List-like types, but it was not equivalent for Dictionary: the old DataType::Dictionary(_, _) => max_batch(values.as_any_dictionary().values()) / min_batch(...) path scanned the dictionary value buffer rather than the logical rows. That means it could consider unreferenced dictionary values and ignore null key positions, which is exactly what the new dictionary tests are guarding against.

scalar_batch_extreme uses ScalarValue::try_from_array(values, index) so it compares the logical row values instead of the backing dictionary values array. However, I agree the current naming can be improved. I'll rename it to scalar_row_extreme and add a short doc comment explaining why dictionaries need this path.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this sounds accurate; the previous min_max_batch_generic already usesScalarValue::try_from_array. We can safely use the existing code of min_max_batch_generic for dictionary inputs without modification.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right.

min_max_batch_generic was already doing logical-row extraction correctly. The actual problem was only that dictionary arrays were not routed through it. I’ll simplify the change accordingly.

Comment thread datafusion/functions-aggregate-common/src/min_max.rs Outdated
Comment thread datafusion/functions-aggregate-common/src/min_max.rs Outdated

match lhs_key_type.zip(rhs_key_type) {
Some((key_type, _)) => {
ScalarValue::Dictionary(Box::new(key_type.clone()), Box::new(result))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there significance here to rhs key type always being ignored?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the intended accumulator flow there should not be any significance, because both sides should already have the same dictionary datatype: the accumulator state is created from the aggregate input type, and each batch result should come back with that same dictionary key type. In that happy path, preserving either side’s key type produces the same result.

But, the current code makes that assumption too implicitly.

Right now we unwrap both dictionary scalars, compare the inner values, and then re-wrap using the left key type whenever both sides are dictionaries. If the key types ever diverged unexpectedly, this would silently ignore the right-hand key type instead of surfacing an error. I think the safer behavior is to validate that both key types match and return an error if they do not, then re-wrap using the validated key type.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now we require the key types to be exact same? This doesn't seem to follow the new logic introduced that allows logically equivalent values to be compared 🤔

It seems like we're saying we are allowed to compare Float32 with Dictionary(Int32, Float32); but we cannot compare Dictionary(Int8, Float32) with Dictionary(Int32, Float32)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out the inconsistency in the current framing.

I think the safest way to resolve it is not to broaden the “logically equivalent values” story, but to narrow the documented contract back to the cases the accumulator actually expects. In normal min/max aggregation, both dictionary scalars should come from the same aggregate input type, so differing dictionary key types are not a supported user-facing scenario so much as an unexpected invariant break.

So I would avoid presenting this as general mixed-type comparison support. Instead, I’d document that dictionary unwrapping exists to preserve logical value comparison for the expected accumulator flow, and I’d keep mismatched dictionary key types explicit as an internal-error case unless we intentionally decide to support cross-key-type dictionaries more broadly.

kosiew added 4 commits April 17, 2026 18:51
…ation

- Renamed the helper function from `scalar_batch_extreme` to `scalar_row_extreme`
- Added a doc comment explaining that this helper scans logical rows and the necessity of this for dictionary arrays (noting that comparing dictionary.values() is not semantically correct).
- Updated both call sites to use the new helper name in min/max batch logic at min_max.rs
- Renamed the predicate from `is_row_wise_batch_type` to `requires_logical_row_scan` at `min_max.rs
- Added a comment clarifying the strategy for handling primitive, string, and binary types using specialized kernels, while this set falls back to scalar row-by-row logical comparison.
- Updated both call sites to reflect the new name at `min_max.rs`
- Updated macro comment to explicitly document dictionary unwrapping behavior, stating that min/max operates on logically compatible scalar values (min_max.rs)
- Clarified catch-all error arm wording to emphasize logical incompatibility instead of enum-variant mismatch, changing it to "logically incompatible scalar values" (min_max.rs)
…andling

- Implemented explicit dictionary key-type validation for dictionary scalars in the macro dictionary branch
- Introduced an error for mismatched dictionary key types instead of silently preserving the left key type
- Updated macro comments to clarify that key types are validated before rewrapping
@kosiew kosiew marked this pull request as draft April 17, 2026 12:42
@kosiew kosiew marked this pull request as ready for review April 17, 2026 13:50
@kosiew kosiew requested a review from Jefffrey April 17, 2026 14:16
kosiew added 7 commits April 21, 2026 17:25
- Renamed the generic row-scan helper back to `min_max_batch_generic` at `min_max.rs:461`.
- Removed the `requires_logical_row_scan` indirection.
- Updated match arms at `min_max.rs:803` and `min_max.rs:855` to explicitly route `Struct`, `List`, `LargeList`, `FixedSizeList`, and `Dictionary` through `min_max_batch_generic`, including dictionary reuse as requested.
- Improved comparison behavior between dictionaries:
  - Same key type dictionaries now compare inner logical values and rewrap the result.
  - Different key type dictionaries raise an explicit internal error.
  - Comparisons between dictionaries and non-dictionary types now check inner logical values directly.
- Updated and tightened the macro comment to clarify mixed-type dictionary support limitations.
- Added focused unit tests for:
  - Dictionary vs scalar comparison
  - Same-key dictionary rewrapping
  - Mismatched dictionary key types
  - Incompatible dictionary and plain scalar comparisons
- Resolved non-local return issue in macro by moving scalar comparison logic to `min_max_scalar(...) -> Result<ScalarValue>`.
- Maintained `min_max!` as a thin wrapper for existing call sites.
- Updated dictionary error-path tests to directly call `min_max_scalar`, preserving behavior while enhancing testability for error cases.
…ionary(Int8, Utf8)

- Removed redundant `dictionary_inner_scalar_min_max` helper and invoked `min_max_scalar(...)` directly in the dictionary match arms to streamline code.
- Added end-to-end aggregate test for `Dictionary(Int8, Utf8)` via `test_min_max_dictionary_int8_keys`.
- Introduced a generic test helper for building string dictionaries with various key types, reducing setup duplication.
- Updated dictionary bindings to use explicit names:
  - `lhs_dict_key_type`
  - `rhs_dict_key_type`
  - `lhs_dict_value`
  - `rhs_dict_value`

- Renamed mixed-arm bindings to better distinguish between scalar and dictionary-inner values:
  - `rhs_scalar`
  - `lhs_scalar`
  - `rhs_dict_value`
  - `lhs_dict_value`

This improves consistency in reading each arm by role without introducing new errors.
…l matching

- Removed the call to min_max_scalar within min_max function.
- Implemented internal matching using choose_min_max!($OP).
- Updated return values to provide Ok with min or max based on Ordering result.
- Added unreachable!() for Ordering::Equal case for error handling.
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from 8ccd93b to 77a518e Compare April 21, 2026 11:55
Modified the import statements in the `min_max.rs` file to enhance readability by grouping related types together. This change organizes the code structure and follows standard conventions.
Comment on lines +825 to +831
/// Finds the min/max by scanning logical rows via `ScalarValue::try_from_array`.
///
/// This path is required for dictionary arrays because comparing
/// `dictionary.values()` is not semantically correct: it can include
/// unreferenced values and ignore null key positions.
fn min_max_batch_generic(values: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> {
let mut index = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still do not understand why this function is being changed. The provided reasoning makes no sense considering the original version already works as intended. Especially considering it is not the responsibility of this function that dictionary.values() isn't semantically correct; it is the responsibility of the caller, which is not being fixed by refactoring this function. Can we please revert the changes to this function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not the responsibility of this function that dictionary.values() isn't semantically correct

I amended the comment and locked this down with a new test.

..why this function is being changed....the original version already works as intended.

The changed min_max_batch_generic shape is a secondary improvement (independent of the bug fix): it first finds the first non-null row and only then enters comparison.

The two-phase version is easier to follow because it separates setup from comparison.

  1. Phase 1 finds the first non-null value.
  2. Phase 2 compares remaining non-null values against that baseline.

After phase 1, extreme is always non-null, so the loop only needs to:

  1. Skip null current values.
  2. Compare/update for non-null values.

The old single-loop mixed these concerns in every iteration (current null, extreme null, or compare), which made the logic more branchy and harder to read.

kosiew added 4 commits April 22, 2026 13:36
…g responsibility

- Clarified caller responsibility for dictionary routing in the helper docs.
- Maintained caller-side correctness routing in min_max.rs at lines 820 and 906.
- Added regression coverage with a new test in min_max.rs at line 996 to ensure min/max are computed from logical rows instead of raw dictionary values, including unreferenced entries and null-key positions.
… DictionaryArray

- Removed redundant commentary in `min_max.rs` regarding dictionary arrays.
- Improved handling of dictionary array values by explicitly extracting and processing raw values.
- Added assertions in unit tests to validate the results when operating on raw values from the dictionary.
…ues from DictionaryArray"

This reverts commit f1bffe32b5f7493725138bdab04b6f58e8db651a.
- Added `AsArray` import for `DictionaryArray` to facilitate raw values extraction.
- Updated tests to include assertions for raw minimum values from the dictionary array, improving test coverage and validation accuracy.
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from c6104a2 to e716c92 Compare April 22, 2026 06:15
@kosiew kosiew requested a review from Jefffrey April 22, 2026 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dictionary coercion on min/max

3 participants