feat: add approx_top_k aggregate function #20968

sesteves wants to merge 3 commits into apache:main
Conversation
(force-pushed from 0a2f5f8 to 1465c70)
(force-pushed from 7d12109 to aa7bc47)
cc @Dandandan @martin-g Should be good for another look

(force-pushed from aa7bc47 to b7a06ea)

Thanks for the review @Dandandan @martin-g, I believe I've addressed all of your feedback.

Unfortunately, after the squash+force push I will have to review the whole PR again ...

Sorry about that, the squash slipped through accidentally!

No problem!

Hi @martin-g and @Dandandan! Friendly ping on this when you have a moment. I know you mentioned wanting a second person to review - let me know if there is someone specific I should tag, or if there is anything else I can do from my side to help get this across the finish line.
```rust
#[derive(Debug, Clone)]
struct SpaceSavingSummary {
    counters: Vec<Counter>,
    counter_map: HashMap<Vec<u8>, usize>,
```
`Vec<u8>`: this is quite wasteful to store in the map.
We can store the hash inline in the map using the `HashTable` API instead: `HashTable<(u64, usize)>`.
We can then remove the hash from the counters and look up the item using the `usize` index via `counter_map.find`.
(Look up the pattern in other places like aggregation / join ...)
This saves many allocations / expensive rehashing / memory usage (for larger k, lots of replacements ...).
Good call! Switched to `HashTable<(u64, usize)>` following the same pattern as `group_values/row.rs` and `join_hash_map.rs`. No key bytes cloned anymore, rebuild is zero-allocation, and double-hashing is gone.
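The lookup pattern discussed in this thread can be sketched with std collections only. This is a simplified stand-in for hashbrown's `HashTable` (which is what the PR actually uses); `Counter`, `Summary`, `hash_item`, `insert`, and `find` are illustrative names, not the PR's code. The key idea survives the simplification: the map never owns a copy of the item bytes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Illustrative counter type: the item bytes live only here, never as a
// second copy inside a map key.
struct Counter {
    item: Vec<u8>,
    count: u64,
}

// The map stores only hash -> counter indices; collisions are resolved by
// comparing against the bytes already owned by the counter, so no key bytes
// are cloned. hashbrown's HashTable<(u64, usize)> achieves the same effect
// with open addressing and no inner Vec.
struct Summary {
    counters: Vec<Counter>,
    index: HashMap<u64, Vec<usize>>,
}

fn hash_item(item: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    item.hash(&mut h);
    h.finish()
}

impl Summary {
    fn insert(&mut self, item: &[u8]) {
        if let Some(i) = self.find(item) {
            self.counters[i].count = self.counters[i].count.saturating_add(1);
            return;
        }
        // New item: allocate one Counter and record only its index.
        let h = hash_item(item);
        let idx = self.counters.len();
        self.counters.push(Counter { item: item.to_vec(), count: 1 });
        self.index.entry(h).or_default().push(idx);
    }

    fn find(&self, item: &[u8]) -> Option<usize> {
        let h = hash_item(item);
        self.index
            .get(&h)?
            .iter()
            .copied()
            .find(|&i| self.counters[i].item == item)
    }
}
```

With this layout, rebuilding the index after a counter re-sort only re-inserts `(hash, index)` pairs and allocates no key bytes, which is the saving the review comment is after.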
Add a new approx_top_k(expression, k) aggregate function that returns
the approximate top-k most frequent values with their estimated counts,
using the Filtered Space-Saving algorithm.
The implementation uses a capacity multiplier of 3 (matching ClickHouse's
default) and includes an alpha map for improved accuracy by filtering
low-frequency noise before it enters the main summary.
Return type is List(Struct({value: T, count: UInt64})) ordered by count
descending, where T matches the input column type.
Closes apache#20967
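The core Space-Saving eviction step described above can be sketched as follows. This is a minimal illustration with hypothetical names, using a linear scan instead of a hash index and omitting the alpha-map filter and the capacity multiplier; it is not the PR's implementation.

```rust
// One Space-Saving update: at most `capacity` counters are kept; a new item
// replaces the minimum counter and inherits its count as the error bound.
#[derive(Debug)]
struct Counter {
    item: String,
    count: u64,
    error: u64, // overestimation bound inherited on eviction
}

fn space_saving_update(counters: &mut Vec<Counter>, capacity: usize, item: &str) {
    // Existing item: just bump its count.
    if let Some(c) = counters.iter_mut().find(|c| c.item == item) {
        c.count = c.count.saturating_add(1);
        return;
    }
    // Room left: start a fresh counter.
    if counters.len() < capacity {
        counters.push(Counter { item: item.to_string(), count: 1, error: 0 });
        return;
    }
    // Full: evict the minimum counter. The newcomer's count may be
    // overestimated by up to the evicted count, recorded as `error`.
    let min = counters
        .iter_mut()
        .min_by_key(|c| c.count)
        .expect("capacity > 0");
    min.error = min.count;
    min.count = min.count.saturating_add(1);
    min.item = item.to_string();
}
```

The alpha map in the Filtered variant sits in front of this step: items whose filtered estimate cannot beat the current minimum never enter the summary at all.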
- Replace HashMap<Vec<u8>, usize> with hashbrown::HashTable<(u64, usize)> to eliminate key cloning, double-hashing, and O(n) rebuild allocations
- Fix merge() to use Parallel Space-Saving m1/m2 algorithm (min counter count correction) matching ClickHouse's SpaceSaving::merge()
- Fix serialize() to select top-N counters and fold evicted ones into alpha_map when requested_capacity < counters.len() < target_capacity
- Harden from_bytes() with validation: requested_capacity > 0, num_counters <= requested_capacity, alpha_map_size is power-of-two, and overflow-safe bounds checking throughout
- Replace all raw += on count/error/alpha with saturating_add
- Track item_heap_bytes incrementally for O(1) size() accounting
- Pre-allocate serialize() output buffer
- Add TIMEZONE_WILDCARD timestamp variants to signature
- Remove unused k and data_type from state_fields/state/merge_batch
- Replace all manual DataFusionError construction with exec_err!/plan_err!
- Replace .unwrap() downcasts in update_batch with proper error returns
- Fix test_accumulator_large_utf8_input to use DataType::LargeUtf8
- Update #[user_doc] to document approximate counts, NULL handling, supported types, and return shape
- Use with_timezone() instead of arrow::compute::cast for timestamps
- Adapt to upstream API changes (remove as_any from AggregateUDFImpl, use direct downcast_ref for Literal)
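The Parallel Space-Saving m1/m2 merge correction mentioned above can be sketched roughly as follows. This is my reading of the idea, not the PR's or ClickHouse's code; the types are deliberately simplified to `(count, error)` pairs, and the real merge also re-sorts and truncates back to capacity afterwards.

```rust
use std::collections::HashMap;

// Simplified parallel Space-Saving merge. m1/m2 are each side's minimum
// count when that side was at capacity (0 otherwise). An item missing from
// one side may have occurred up to that side's minimum count without being
// tracked, so that minimum is folded into both the count and the error.
fn merge(
    left: &HashMap<String, (u64, u64)>,
    right: &HashMap<String, (u64, u64)>,
    m1: u64, // left's minimum count if left was full, else 0
    m2: u64, // right's minimum count if right was full, else 0
) -> HashMap<String, (u64, u64)> {
    let mut out = HashMap::new();
    for (item, &(c, e)) in left {
        match right.get(item) {
            // Present on both sides: counts and errors simply add.
            Some(&(c2, e2)) => {
                out.insert(item.clone(), (c.saturating_add(c2), e.saturating_add(e2)));
            }
            // Only on the left: the right side may have seen it up to m2 times.
            None => {
                out.insert(item.clone(), (c.saturating_add(m2), e.saturating_add(m2)));
            }
        }
    }
    for (item, &(c, e)) in right {
        // Only on the right: symmetric correction with the left's minimum.
        if !left.contains_key(item) {
            out.insert(item.clone(), (c.saturating_add(m1), e.saturating_add(m1)));
        }
    }
    out
}
```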
(force-pushed from b7a06ea to 3d1f008)
Starts to look good! Can you also remove the testing submodule commit / update? And fix remaining CI issues?
- Fix testing submodule pointer (reverted to match upstream main)
- Fix Cargo.toml formatting (taplo)
- Fix unresolved rustdoc link: [`size`] -> `size()`
Which issue does this PR close?
Closes #20967.
Rationale
Finding the most frequently occurring values in a column is a very common analytical pattern: top error codes, most popular products, most active users, frequent URL paths, etc. The exact approach (`GROUP BY value ORDER BY COUNT(*) DESC LIMIT k`) requires materializing all distinct groups, which is memory-intensive and slow on high-cardinality columns. `approx_top_k` solves this with a streaming approximation algorithm that operates in bounded memory, making it practical for large-scale analytics.

This function is already available in ClickHouse (`topK`) and via extensions in PostgreSQL and Druid, and is a natural addition to DataFusion's existing family of approximate aggregate functions (`approx_distinct`, `approx_median`, `approx_percentile_cont`).

What changes are included in this PR?
Core implementation (`datafusion/functions-aggregate/src/approx_top_k.rs`)

- `SpaceSavingSummary`: implements the Filtered Space-Saving algorithm (Metwally et al., 2005), with `CAPACITY_MULTIPLIER = 3` (matching ClickHouse's default), so the summary tracks `k × 3` counters internally
- `ApproxTopK`: `AggregateUDFImpl` with `#[user_doc]` documentation, supporting any hashable scalar type
- `ApproxTopKAccumulator`: accumulator with `update_batch`, `merge_batch`, `evaluate` (returns `List(Struct({value: T, count: UInt64}))` ordered by count descending), and `state` (serialized summary + data type)
- 9 unit tests covering basic operations, eviction, alpha map, merge, serialization, accumulator update/evaluate, merge_batch, and a distributed merge simulation

Registration (`datafusion/functions-aggregate/src/lib.rs`)

- `expr_fn` re-export, and registration in `all_default_aggregate_functions()`

SQL logic tests (`datafusion/sqllogictest/test_files/approx_top_k.slt`)

Documentation

- `docs/source/user-guide/sql/aggregate_functions.md`: TOC entry + full documentation section
- `docs/source/user-guide/expressions.md`: quick-reference table entry

Integration tests

- `datafusion/proto/tests/cases/roundtrip_logical_plan.rs`: `approx_top_k(col("a"), lit(3))`
- `datafusion/core/tests/dataframe/dataframe_functions.rs`: `test_fn_approx_top_k` with `insta::assert_snapshot!`

Are these changes tested?
Yes:

- Unit tests in `approx_top_k.rs` (algorithm correctness, serialization, accumulator lifecycle, distributed merge)
- SQL logic tests in `approx_top_k.slt` (end-to-end SQL behavior including edge cases and errors)

Are there any user-facing changes?
Yes: a new `approx_top_k` aggregate function is available in SQL and the DataFrame API.
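As a rough illustration of the SQL shape (the table and column names here are hypothetical, and the actual counts are approximate by design):

```sql
-- Approximate top 3 most frequent status codes with estimated counts.
-- Returns a List(Struct({value, count})) ordered by count descending.
SELECT approx_top_k(status_code, 3) FROM requests;
```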