feat: support fixed length chunked dictionary for rangebitmap by fafacao86 · Pull Request #167 · alibaba/paimon-cpp

fafacao86 · 2026-03-05T06:40:41Z

Purpose

Linked issue: close #146

Tests

UT tests in src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

Functional tests: single chunk, multi chunk dictionary read and write. different types dictionary read and write.
Edge cases: empty dictionary, chunk_size_limit set to 0.
Floating point: java compatible ordering -infinity < -0.0 < +0.0 < +infinity < NaN == NaN
Parameterized tests for float/double, using random generated data and different combination of chunk size and cardinality.

API and Format

Documentation

I think there won't be precision errors with float/double, because we will not do any + - * / calculations on floats.
We only take the existing data of a float type column, and do ==, >, < on them, there is no rounding happening.
But there is a difference regarding float/double ordering about +0.0,-0.0,NaN. I did some special handling in KeyFactory for this. Feel free to correct me @lxy-9602 .
Java implementation of ChunkedDictionray uses java.util.Comparator which follows the ordering of
-infinity < -0.0 < +0.0 < +infinity < NaN == NaN.

fafacao86 · 2026-03-05T06:43:18Z

I ran the generate_coverage.sh locally
common/file_index/rangebitmap/dictionary
98.5%98.5%
98.5 % 747 / 758 96.6 % 170 / 176
common/file_index/rangebitmap/utils
100.0%
100.0 % 145 / 145 100.0 % 21 / 21

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp

src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

lxy-9602 · 2026-03-05T10:47:18Z

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

+
+    // Try to add unsorted key should fail
+    auto result = appender->AppendSorted(Literal(15), 2);
+    EXPECT_FALSE(result.ok());


Consider use ASSERT_NOK_WITH_MSG which can check error message.

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h

lxy-9602 · 2026-03-05T10:55:13Z

Great job! The code style and test coverage are both excellent — thank you for the high-quality contribution.

fafacao86 · 2026-03-05T11:56:15Z

Great job! The code style and test coverage are both excellent — thank you for the high-quality contribution.

😸

lucasfang · 2026-03-06T00:54:21Z

Nice work!

Copilot

Pull request overview

This PR adds a fixed-length, chunked dictionary implementation used by the RangeBitmap file index, including Java-compatible float/double ordering and literal (de)serialization utilities, along with unit tests.

Changes:

Implement chunked dictionary core types (Chunk, FixedLengthChunk, ChunkedDictionary) and KeyFactory implementations for fixed-length field types.
Add literal serialization utilities for fixed-length primitives and strings.
Extend DataInputStream to support ReadValue<double>() and add comprehensive UT coverage for the new dictionary.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
src/paimon/common/io/data_input_stream.cpp	Adds explicit template instantiation for `ReadValue<double>()`.
src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h	Declares literal SerDe helpers for rangebitmap dictionary.
src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.cpp	Implements literal serialization/deserialization and size helpers.
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h	Adds key factory interfaces and fixed-length factories (incl. float/double custom compare).
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp	Implements factory creation, chunk creation/mmap, and Java-compatible float/double ordering.
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.h	Defines a fixed-length chunk supporting lazy key loading and serialization.
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.cpp	Implements fixed-length chunk read/write behavior and key access.
src/paimon/common/file_index/rangebitmap/dictionary/dictionary.h	Introduces dictionary interface and appender contract.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h	Declares chunked dictionary read/write API and appender state.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp	Implements dictionary binary search, lazy chunk loading, and serialization format.
src/paimon/common/file_index/rangebitmap/dictionary/chunk.h	Adds chunk abstraction with binary-search helpers.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp	Adds UT coverage for dictionary read/write, edge cases, and float/double ordering behaviors.
src/paimon/common/file_index/CMakeLists.txt	Adds new rangebitmap sources to `paimon_file_index` library build.
src/paimon/CMakeLists.txt	Registers the new rangebitmap dictionary UT in the test target list.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

Copilot · 2026-03-06T00:59:17Z

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

+    // Create dictionary from bytes
+    auto input_stream = std::make_shared<ByteArrayInputStream>(bytes->data(), bytes->size());
+    ASSERT_OK_AND_ASSIGN(auto dict,
+                         ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_));


TestEmptyDictionary serializes using a FieldType::FLOAT key factory, but deserializes with FieldType::INT. This makes the test less meaningful and could hide type-specific issues. Use the same field type in ChunkedDictionary::Create as was used to build/serialize the dictionary.

Suggested change

ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_));

ChunkedDictionary::Create(FieldType::FLOAT, input_stream, 0, pool_));

Copilot · 2026-03-06T00:59:18Z

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

+    ::testing::Combine(::testing::Values(1, 16, 64, 128, 1024),                // chunk size limit
+                       ::testing::Values(1, 5, 20, 100, 666, 8888, 222222)));  // cardinality


The parameterized test matrix includes very large cardinalities (e.g., 222,222) across multiple chunk sizes, which can make unit test runtime and memory usage disproportionately high and flaky in CI. Consider reducing the maximum cardinality and/or shrinking the parameter combinations (or gating the large cases behind a separate stress/perf test).

Suggested change

::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit

::testing::Values(1, 5, 20, 100, 666, 8888, 222222))); // cardinality

::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit

::testing::Values(1, 5, 20, 100, 666, 8888))); // cardinality

src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h

Copilot · 2026-03-06T00:59:18Z

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp

+        }
+    }
+    PAIMON_ASSIGN_OR_RAISE(std::shared_ptr<Chunk> prev_chunk, GetChunk(low - 1));
+    return prev_chunk->Find(code);
+}


ChunkedDictionary::Find(int32_t code) can call GetChunk(low - 1) with low == 0 (e.g., empty dictionary or code smaller than the first chunk code), which will produce an invalid chunk index error unrelated to the requested code. Handle size_ == 0 and the low == 0 case explicitly (return a clear invalid-code status).

Addressed. This makes Invalid Result Message clearer.

Copilot · 2026-03-06T00:59:18Z

src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp

+                           KeyFactory::Create(field_type));
+    auto result = std::unique_ptr<ChunkedDictionary>(new ChunkedDictionary(
+        input_stream, factory_shared, size, offsets_length, chunks_length,
+        static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool));


ChunkedDictionary::Create computes body_offset as static_cast<int32_t>(offset + header_length + sizeof(int32_t)), which can overflow for large files/offsets and defeats the use of int64_t offsets elsewhere. Keep this as int64_t (no narrowing cast).

Suggested change

static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool));

offset + static_cast<int64_t>(header_length) + static_cast<int64_t>(sizeof(int32_t)), pool));

lengths in rangebitmap headers are all int32. I think there is no need to cast to int64.

src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.cpp

lxy-9602 reviewed Mar 5, 2026

View reviewed changes

feat: support fixed length chunked dictionary for rangebitmap

2e3ee30

fafacao86 force-pushed the rangebitmap-dictionary branch from d915551 to 2e3ee30 Compare March 5, 2026 11:55

lucasfang requested a review from Copilot March 6, 2026 00:52

Copilot started reviewing on behalf of lucasfang March 6, 2026 00:52 View session

Merge branch 'main' into rangebitmap-dictionary

6429275

Copilot AI reviewed Mar 6, 2026

View reviewed changes

fix: copilot review

f275797

	ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_));
	ChunkedDictionary::Create(FieldType::FLOAT, input_stream, 0, pool_));

		::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit
		::testing::Values(1, 5, 20, 100, 666, 8888, 222222))); // cardinality

	static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool));
	offset + static_cast<int64_t>(header_length) + static_cast<int64_t>(sizeof(int32_t)), pool));

Conversation

fafacao86 commented Mar 5, 2026

Purpose

Tests

API and Format

Documentation

Uh oh!

fafacao86 commented Mar 5, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lxy-9602 Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lxy-9602 commented Mar 5, 2026

Uh oh!

fafacao86 commented Mar 5, 2026

Uh oh!

lucasfang commented Mar 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

fafacao86 Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

fafacao86 Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants