
feat: support fixed length chunked dictionary for rangebitmap#167

Open
fafacao86 wants to merge 3 commits into alibaba:main from fafacao86:rangebitmap-dictionary

@fafacao86

Purpose

Linked issue: close #146

Tests

Unit tests in src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp

  1. Functional tests: reading and writing single-chunk and multi-chunk dictionaries, and dictionaries of different key types.
  2. Edge cases: empty dictionary, chunk_size_limit set to 0.
  3. Floating point: Java-compatible ordering, -infinity < -0.0 < +0.0 < +infinity < NaN == NaN.
  4. Parameterized tests for float/double, using randomly generated data and different combinations of chunk size and cardinality.

API and Format

Documentation

I don't expect precision errors with float/double, because we never perform arithmetic (+, -, *, /) on the values.
We only take the existing data of a float-type column and compare it with ==, >, and <, so no rounding happens.
There is, however, a difference in float/double ordering for +0.0, -0.0, and NaN, which I handle specially in KeyFactory. Feel free to correct me, @lxy-9602.
The Java implementation of ChunkedDictionary uses java.util.Comparator, which follows the ordering
-infinity < -0.0 < +0.0 < +infinity < NaN == NaN.
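
A minimal sketch of a comparator that yields the ordering described above, in the style of Java's Float.compare. The names CanonicalBits, JavaCompatibleCompare, and DemoOrdering are hypothetical and are not taken from this PR's KeyFactory; the sketch assumes 32-bit IEEE 754 floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: raw IEEE 754 bits, with every NaN collapsed to one
// canonical pattern so that all NaNs compare equal to each other.
inline int32_t CanonicalBits(float v) {
    if (std::isnan(v)) {
        return 0x7fc00000;
    }
    int32_t bits = 0;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits;
}

// Hypothetical three-way compare: returns -1, 0, or 1.
inline int JavaCompatibleCompare(float a, float b) {
    if (a < b) return -1;
    if (a > b) return 1;
    // Reached for equal values, for -0.0 vs +0.0, and whenever a NaN is involved.
    const int32_t x = CanonicalBits(a);
    const int32_t y = CanonicalBits(b);
    return x == y ? 0 : (x < y ? -1 : 1);
}

// Walks the ordering chain quoted in the PR description.
inline void DemoOrdering() {
    const float inf = std::numeric_limits<float>::infinity();
    const float nan = std::numeric_limits<float>::quiet_NaN();
    assert(JavaCompatibleCompare(-inf, -0.0f) < 0);
    assert(JavaCompatibleCompare(-0.0f, 0.0f) < 0);
    assert(JavaCompatibleCompare(0.0f, inf) < 0);
    assert(JavaCompatibleCompare(inf, nan) < 0);
    assert(JavaCompatibleCompare(nan, nan) == 0);
}
```

The bit-level fallback is what separates -0.0 from +0.0 and places NaN last; plain operator< alone treats the two zeros as equal and NaN as unordered.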

@fafacao86
Author

I ran generate_coverage.sh locally:
common/file_index/rangebitmap/dictionary: 98.5 % (747 / 758), 96.6 % (170 / 176)
common/file_index/rangebitmap/utils: 100.0 % (145 / 145), 100.0 % (21 / 21)


// Try to add unsorted key should fail
auto result = appender->AppendSorted(Literal(15), 2);
EXPECT_FALSE(result.ok());
Collaborator


Consider using ASSERT_NOK_WITH_MSG, which can also check the error message.

@lxy-9602
Collaborator

lxy-9602 commented Mar 5, 2026

Great job! The code style and test coverage are both excellent — thank you for the high-quality contribution.

@fafacao86 fafacao86 force-pushed the rangebitmap-dictionary branch from d915551 to 2e3ee30 Compare March 5, 2026 11:55
@fafacao86
Author

Great job! The code style and test coverage are both excellent — thank you for the high-quality contribution.

😸

@lucasfang
Collaborator

Nice work!


Copilot AI left a comment


Pull request overview

This PR adds a fixed-length, chunked dictionary implementation used by the RangeBitmap file index, including Java-compatible float/double ordering and literal (de)serialization utilities, along with unit tests.

Changes:

  • Implement chunked dictionary core types (Chunk, FixedLengthChunk, ChunkedDictionary) and KeyFactory implementations for fixed-length field types.
  • Add literal serialization utilities for fixed-length primitives and strings.
  • Extend DataInputStream to support ReadValue<double>() and add comprehensive UT coverage for the new dictionary.
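
To illustrate the general idea behind a fixed-length chunked dictionary, here is a minimal in-memory sketch. SketchChunk and SketchChunkedDictionary are hypothetical names; this is not the PR's Chunk, FixedLengthChunk, or ChunkedDictionary, it ignores serialization and lazy loading, and returning -1 for a missing key is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory model of one chunk.
struct SketchChunk {
    int32_t first_code;         // dictionary code of keys[0]
    std::vector<int32_t> keys;  // sorted, fixed-width keys
};

class SketchChunkedDictionary {
 public:
    // Splits sorted, distinct keys into chunks of at most keys_per_chunk keys.
    SketchChunkedDictionary(const std::vector<int32_t>& sorted_keys, size_t keys_per_chunk) {
        assert(keys_per_chunk > 0);
        for (size_t i = 0; i < sorted_keys.size(); i += keys_per_chunk) {
            const size_t end = std::min(sorted_keys.size(), i + keys_per_chunk);
            auto first = sorted_keys.begin() + static_cast<std::ptrdiff_t>(i);
            auto last = sorted_keys.begin() + static_cast<std::ptrdiff_t>(end);
            chunks_.push_back({static_cast<int32_t>(i), std::vector<int32_t>(first, last)});
        }
    }

    // Returns the code (global rank) of key, or -1 if the key is absent.
    int32_t Find(int32_t key) const {
        // First chunk whose first key is greater than key; the candidate chunk precedes it.
        auto it = std::upper_bound(
            chunks_.begin(), chunks_.end(), key,
            [](int32_t k, const SketchChunk& c) { return k < c.keys.front(); });
        if (it == chunks_.begin()) return -1;  // empty, or key below every chunk
        const SketchChunk& chunk = *(it - 1);
        auto pos = std::lower_bound(chunk.keys.begin(), chunk.keys.end(), key);
        if (pos == chunk.keys.end() || *pos != key) return -1;
        return chunk.first_code + static_cast<int32_t>(pos - chunk.keys.begin());
    }

    size_t chunk_count() const { return chunks_.size(); }

 private:
    std::vector<SketchChunk> chunks_;
};
```

A lookup touches only the per-chunk first keys plus one chunk, which is what makes loading chunks on demand worthwhile.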

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/paimon/common/io/data_input_stream.cpp Adds explicit template instantiation for ReadValue<double>().
src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h Declares literal SerDe helpers for rangebitmap dictionary.
src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.cpp Implements literal serialization/deserialization and size helpers.
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h Adds key factory interfaces and fixed-length factories (incl. float/double custom compare).
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp Implements factory creation, chunk creation/mmap, and Java-compatible float/double ordering.
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.h Defines a fixed-length chunk supporting lazy key loading and serialization.
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.cpp Implements fixed-length chunk read/write behavior and key access.
src/paimon/common/file_index/rangebitmap/dictionary/dictionary.h Introduces dictionary interface and appender contract.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h Declares chunked dictionary read/write API and appender state.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp Implements dictionary binary search, lazy chunk loading, and serialization format.
src/paimon/common/file_index/rangebitmap/dictionary/chunk.h Adds chunk abstraction with binary-search helpers.
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary_test.cpp Adds UT coverage for dictionary read/write, edge cases, and float/double ordering behaviors.
src/paimon/common/file_index/CMakeLists.txt Adds new rangebitmap sources to paimon_file_index library build.
src/paimon/CMakeLists.txt Registers the new rangebitmap dictionary UT in the test target list.


// Create dictionary from bytes
auto input_stream = std::make_shared<ByteArrayInputStream>(bytes->data(), bytes->size());
ASSERT_OK_AND_ASSIGN(auto dict,
ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_));

Copilot AI Mar 6, 2026


TestEmptyDictionary serializes using a FieldType::FLOAT key factory, but deserializes with FieldType::INT. This makes the test less meaningful and could hide type-specific issues. Use the same field type in ChunkedDictionary::Create as was used to build/serialize the dictionary.

Suggested change
- ChunkedDictionary::Create(FieldType::INT, input_stream, 0, pool_));
+ ChunkedDictionary::Create(FieldType::FLOAT, input_stream, 0, pool_));

Comment on lines +625 to +626
::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit
::testing::Values(1, 5, 20, 100, 666, 8888, 222222))); // cardinality

Copilot AI Mar 6, 2026


The parameterized test matrix includes very large cardinalities (e.g., 222,222) across multiple chunk sizes, which can make unit test runtime and memory usage disproportionately high and flaky in CI. Consider reducing the maximum cardinality and/or shrinking the parameter combinations (or gating the large cases behind a separate stress/perf test).

Suggested change
- ::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit
- ::testing::Values(1, 5, 20, 100, 666, 8888, 222222))); // cardinality
+ ::testing::Combine(::testing::Values(1, 16, 64, 128, 1024), // chunk size limit
+ ::testing::Values(1, 5, 20, 100, 666, 8888))); // cardinality
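
A rough way to gauge the effect of this suggestion is to count the keys generated across the whole matrix. The sketch below assumes each test instance generates exactly as many keys as its cardinality parameter, which is an assumption about the test and not something stated in the PR; kChunkSizeLimits, kCardinalities, and TotalKeys are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Parameter values copied from the review snippet above.
constexpr std::array<int64_t, 5> kChunkSizeLimits = {1, 16, 64, 128, 1024};
constexpr std::array<int64_t, 7> kCardinalities = {1, 5, 20, 100, 666, 8888, 222222};

// Total keys across the full Combine() matrix for one key type,
// optionally leaving out the largest cardinality.
inline int64_t TotalKeys(bool include_largest) {
    int64_t keys_per_chunk_size = 0;
    for (int64_t cardinality : kCardinalities) {
        if (!include_largest && cardinality == 222222) {
            continue;
        }
        keys_per_chunk_size += cardinality;
    }
    // Every cardinality is paired with every chunk size limit.
    return keys_per_chunk_size * static_cast<int64_t>(kChunkSizeLimits.size());
}
```

Under that assumption the full matrix handles 1,159,510 keys per key type and the reduced one 48,400, so the single 222222 case accounts for most of the work.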

Comment on lines +74 to +78
}
}
PAIMON_ASSIGN_OR_RAISE(std::shared_ptr<Chunk> prev_chunk, GetChunk(low - 1));
return prev_chunk->Find(code);
}

Copilot AI Mar 6, 2026


ChunkedDictionary::Find(int32_t code) can call GetChunk(low - 1) with low == 0 (e.g., empty dictionary or code smaller than the first chunk code), which will produce an invalid chunk index error unrelated to the requested code. Handle size_ == 0 and the low == 0 case explicitly (return a clear invalid-code status).
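
The guard being requested can be sketched on a simplified model where each chunk is represented only by its first code. ChunkIndexForCode is a hypothetical helper, not a function from this PR, and std::optional stands in for the project's status type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper. chunk_first_codes[i] is the first code stored in chunk i,
// in ascending order. Returns the index of the chunk that could contain `code`,
// or std::nullopt when no chunk can.
inline std::optional<size_t> ChunkIndexForCode(const std::vector<int32_t>& chunk_first_codes,
                                               int32_t code) {
    if (chunk_first_codes.empty()) {
        return std::nullopt;  // empty dictionary: nothing to look up
    }
    // First chunk whose first code is greater than `code`.
    auto it = std::upper_bound(chunk_first_codes.begin(), chunk_first_codes.end(), code);
    if (it == chunk_first_codes.begin()) {
        return std::nullopt;  // the "low == 0" case: code precedes the first chunk
    }
    return static_cast<size_t>(it - chunk_first_codes.begin()) - 1;
}
```

Handling both early returns explicitly lets the caller report an invalid code directly, without ever asking for chunk index -1.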

Author


Addressed. This makes the Invalid result message clearer.

KeyFactory::Create(field_type));
auto result = std::unique_ptr<ChunkedDictionary>(new ChunkedDictionary(
input_stream, factory_shared, size, offsets_length, chunks_length,
static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool));

Copilot AI Mar 6, 2026


ChunkedDictionary::Create computes body_offset as static_cast<int32_t>(offset + header_length + sizeof(int32_t)), which can overflow for large files/offsets and defeats the use of int64_t offsets elsewhere. Keep this as int64_t (no narrowing cast).

Suggested change
- static_cast<int32_t>(offset + header_length + sizeof(int32_t)), pool));
+ offset + static_cast<int64_t>(header_length) + static_cast<int64_t>(sizeof(int32_t)), pool));
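
For illustration, the sketch below computes the body offset in 64 bits and narrows it through a range check, so an out-of-range value is reported instead of silently wrapped. BodyOffset and NarrowToInt32 are hypothetical helpers, and whether such a check is needed depends on how large offsets can get in practice.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical: body offset computed entirely in 64-bit arithmetic.
inline int64_t BodyOffset(int64_t offset, int32_t header_length) {
    return offset + static_cast<int64_t>(header_length) +
           static_cast<int64_t>(sizeof(int32_t));
}

// Hypothetical checked narrowing: std::nullopt when the value does not fit in int32_t.
inline std::optional<int32_t> NarrowToInt32(int64_t value) {
    if (value < std::numeric_limits<int32_t>::min() ||
        value > std::numeric_limits<int32_t>::max()) {
        return std::nullopt;
    }
    return static_cast<int32_t>(value);
}
```

With a small offset the narrowed value matches the 64-bit one; with an offset of 2^31 or more, NarrowToInt32 returns std::nullopt.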

Author


Lengths in rangebitmap headers are all int32, so I think there is no need to cast to int64.


Development

Successfully merging this pull request may close these issues.

[Feature] Support RangeBitmap File Index

4 participants