feat: support fixed length chunked dictionary for rangebitmap#167
feat: support fixed length chunked dictionary for rangebitmap#167fafacao86 wants to merge 3 commits intoalibaba:mainfrom
Conversation
|
I ran the generate_coverage.sh locally |
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp
Show resolved
Hide resolved
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp
Show resolved
Hide resolved
|
|
||
| // Try to add unsorted key should fail | ||
| auto result = appender->AppendSorted(Literal(15), 2); | ||
| EXPECT_FALSE(result.ok()); |
There was a problem hiding this comment.
Consider use ASSERT_NOK_WITH_MSG which can check error message.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h
Show resolved
Hide resolved
|
Great job! The code style and test coverage are both excellent — thank you for the high-quality contribution. |
d915551 to
2e3ee30
Compare
😸 |
|
Nice work! |
There was a problem hiding this comment.
Pull request overview
This PR adds a fixed-length, chunked dictionary implementation used by the RangeBitmap file index, including Java-compatible float/double ordering and literal (de)serialization utilities, along with unit tests.
Changes:
- Implement chunked dictionary core types (
Chunk,FixedLengthChunk,ChunkedDictionary) andKeyFactoryimplementations for fixed-length field types. - Add literal serialization utilities for fixed-length primitives and strings.
- Extend
DataInputStreamto supportReadValue<double>()and add comprehensive UT coverage for the new dictionary.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/common/io/data_input_stream.cpp | Adds explicit template instantiation for ReadValue<double>(). |
| src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h | Declares literal SerDe helpers for rangebitmap dictionary. |
| src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.cpp | Implements literal serialization/deserialization and size helpers. |
| src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h | Adds key factory interfaces and fixed-length factories (incl. float/double custom compare). |
| src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp | Implements factory creation, chunk creation/mmap, and Java-compatible float/double ordering. |
| src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.h | Defines a fixed-length chunk supporting lazy key loading and serialization. |
| src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.cpp | Implements fixed-length chunk read/write behavior and key access. |
| src/paimon/common/file_index/rangebitmap/dictionary/dictionary.h | Introduces dictionary interface and appender contract. |
| src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h | Declares chunked dictionary read/write API and appender state. |
| src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp | Implements dictionary binary search, lazy chunk loading, and serialization format. |
| src/paimon/common/file_index/rangebitmap/dictionary/chunk.h | Adds chunk abstraction with binary-search helpers. |
| src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp | Adds UT coverage for dictionary read/write, edge cases, and float/double ordering behaviors. |
| src/paimon/common/file_index/CMakeLists.txt | Adds new rangebitmap sources to paimon_file_index library build. |
| src/paimon/CMakeLists.txt | Registers the new rangebitmap dictionary UT in the test target list. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp
Show resolved
Hide resolved
| // Create dictionary from bytes | ||
| auto input_stream = std::make_shared<ByteArrayInputStream>(bytes->data(), bytes->size()); | ||
| ASSERT_OK_AND_ASSIGN(auto dict, | ||
| ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_)); |
There was a problem hiding this comment.
TestEmptyDictionary serializes using a FieldType::FLOAT key factory, but deserializes with FieldType::INT. This makes the test less meaningful and could hide type-specific issues. Use the same field type in ChunkedDictionary::Create as was used to build/serialize the dictionary.
| ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_)); | |
| ChunkedDictionary::Create(FieldType::FLOAT, input_stream, 0, pool_)); |
| ::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit | ||
| ::testing::Values(1, 5, 20, 100, 666, 8888, 222222))); // cardinality |
There was a problem hiding this comment.
The parameterized test matrix includes very large cardinalities (e.g., 222,222) across multiple chunk sizes, which can make unit test runtime and memory usage disproportionately high and flaky in CI. Consider reducing the maximum cardinality and/or shrinking the parameter combinations (or gating the large cases behind a separate stress/perf test).
| ::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit | |
| ::testing::Values(1, 5, 20, 100, 666, 8888, 222222))); // cardinality | |
| ::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit | |
| ::testing::Values(1, 5, 20, 100, 666, 8888))); // cardinality |
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h
Show resolved
Hide resolved
| } | ||
| } | ||
| PAIMON_ASSIGN_OR_RAISE(std::shared_ptr<Chunk> prev_chunk, GetChunk(low - 1)); | ||
| return prev_chunk->Find(code); | ||
| } |
There was a problem hiding this comment.
ChunkedDictionary::Find(int32_t code) can call GetChunk(low - 1) with low == 0 (e.g., empty dictionary or code smaller than the first chunk code), which will produce an invalid chunk index error unrelated to the requested code. Handle size_ == 0 and the low == 0 case explicitly (return a clear invalid-code status).
There was a problem hiding this comment.
Addressed. This makes Invalid Result Message clearer.
| KeyFactory::Create(field_type)); | ||
| auto result = std::unique_ptr<ChunkedDictionary>(new ChunkedDictionary( | ||
| input_stream, factory_shared, size, offsets_length, chunks_length, | ||
| static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool)); |
There was a problem hiding this comment.
ChunkedDictionary::Create computes body_offset as static_cast<int32_t>(offset + header_length + sizeof(int32_t)), which can overflow for large files/offsets and defeats the use of int64_t offsets elsewhere. Keep this as int64_t (no narrowing cast).
| static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool)); | |
| offset + static_cast<int64_t>(header_length) + static_cast<int64_t>(sizeof(int32_t)), pool)); |
There was a problem hiding this comment.
lengths in rangebitmap headers are all int32. I think there is no need to cast to int64.
Purpose
Linked issue: close #146
Tests
UT tests in
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp-infinity < -0.0 < +0.0 < +infinity < NaN == NaNAPI and Format
Documentation
I think there won't be precision errors with float/double, because we will not do any
+ - * /calculations on floats.We only take the existing data of a float type column, and do
==,>,<on them, there is no rounding happening.But there is a difference regarding float/double ordering about
+0.0,-0.0,NaN. I did some special handling in KeyFactory for this. Feel free to correct me @lxy-9602 .Java implementation of ChunkedDictionray uses java.util.Comparator which follows the ordering of
-infinity < -0.0 < +0.0 < +infinity < NaN == NaN.