[Feature] Support for VARIANT data type #2873

@XuQianJin-Stars

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Background

Semi-structured data (e.g., JSON) is increasingly common in modern data pipelines. Many query engines and storage systems (such as Apache Spark, Apache Iceberg, and Apache Paimon) have adopted a VARIANT data type to efficiently represent and query semi-structured data using a compact binary encoding, rather than storing raw JSON strings.

Currently, Fluss treats VARIANT internally as plain byte[], which has several limitations:

  1. Loss of semantic structure: A single byte[] conflates the variant's value and metadata (string dictionary) into one opaque blob. Downstream consumers must know the internal wire format ([4-byte value length][value bytes][metadata bytes]) to decode it correctly.
  2. Inconsistent API: All other complex types in Fluss (e.g., InternalArray, InternalMap, InternalRow) have dedicated first-class types in the row infrastructure, while VARIANT does not.
  3. Poor interoperability with lake formats: When writing to lake formats (Paimon, Iceberg, Lance), the VARIANT data must be split into separate value and metadata components. Using byte[] forces every integration point to re-implement the split/merge logic.
  4. No alignment with industry standards: Apache Paimon has already introduced a full Variant interface with value() and metadata() accessors, following the Variant Binary Encoding spec. Fluss should align with this design for ecosystem consistency.

Use Case

  • Users ingesting JSON or semi-structured data into Fluss tables should benefit from efficient binary encoding and per-path access without full deserialization.
  • Lake connector writers (Paimon, Iceberg, Lance) need structured access to value and metadata separately.
  • A first-class Variant type enables future optimizations like predicate pushdown on variant paths.

Solution

Proposed Design

Introduce a first-class Variant interface and GenericVariant implementation throughout Fluss's row infrastructure, following the same pattern as Apache Paimon's Variant design.

1. Core Types

  • Variant interface (fluss-common/.../row/Variant.java)

    • byte[] value() — returns the binary-encoded variant value (header + data)
    • byte[] metadata() — returns the string dictionary (version + deduplicated object key names)
    • long sizeInBytes() — total byte size
    • Variant copy() — deep copy
    • Static helpers: bytesToVariant(byte[]) and variantToBytes(Variant) for backward-compatible wire format conversion
  • GenericVariant class (fluss-common/.../row/GenericVariant.java)

    • Implements Variant and Serializable
    • Stores two byte[] fields: value and metadata
    • Proper equals(), hashCode(), toString()
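As a rough illustration of the two core types listed above, here is a minimal sketch. It is not the actual Fluss code: the method names come from this proposal, while the package-private visibility, the constructor shape, and the choice to compute sizeInBytes() as value length plus metadata length are assumptions made for the example. The static wire-format helpers are left out here.

```java
import java.io.Serializable;
import java.util.Arrays;

/** Sketch of the proposed interface; method names follow the proposal text. */
interface Variant {
    byte[] value();      // binary-encoded variant value (header + data)
    byte[] metadata();   // string dictionary (version + deduplicated key names)
    long sizeInBytes();  // assumption: value length plus metadata length
    Variant copy();      // deep copy
}

/** Sketch of the proposed implementation backed by two byte arrays. */
final class GenericVariant implements Variant, Serializable {
    private static final long serialVersionUID = 1L;

    private final byte[] value;
    private final byte[] metadata;

    GenericVariant(byte[] value, byte[] metadata) {
        this.value = value;
        this.metadata = metadata;
    }

    @Override
    public byte[] value() {
        return value;
    }

    @Override
    public byte[] metadata() {
        return metadata;
    }

    @Override
    public long sizeInBytes() {
        return (long) value.length + metadata.length;
    }

    @Override
    public Variant copy() {
        // Clone both arrays so the copy shares no state with the original.
        return new GenericVariant(value.clone(), metadata.clone());
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof GenericVariant)) {
            return false;
        }
        GenericVariant that = (GenericVariant) o;
        // Compare array contents, not references.
        return Arrays.equals(value, that.value) && Arrays.equals(metadata, that.metadata);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(value) + Arrays.hashCode(metadata);
    }

    @Override
    public String toString() {
        return "GenericVariant{value=" + Arrays.toString(value)
                + ", metadata=" + Arrays.toString(metadata) + "}";
    }
}
```

Content-based equals() and hashCode() matter here because byte[] only has identity equality by default, so two variants holding the same bytes would otherwise compare as different.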

2. Row Infrastructure Changes

Layer                             | Change
----------------------------------|-------------------------------------------------------------
DataGetters                       | Add Variant getVariant(int pos)
BinaryWriter                      | Add writeVariant(int pos, Variant value)
All InternalRow implementations   | Implement getVariant() in GenericRow, BinaryRow, CompactedRow, IndexedRow, ProjectedRow, PaddingRow, ColumnarRow, etc.
All InternalArray implementations | Implement getVariant() in GenericArray, BinaryArray, ColumnarArray
Readers/Writers                   | CompactedRowReader/Writer, IndexedRowReader/Writer: add readVariant()/writeVariant(Variant)
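To make the accessor change concrete, the sketch below shows how an object-backed row might expose getVariant(int pos). All type names here (SketchVariant, SketchDataGetters, SketchGenericRow) are hypothetical stand-ins invented for this example. They are not the real Fluss classes, and the real implementations may differ.

```java
/** Hypothetical stand-in for the proposed Variant type. */
interface SketchVariant {
    byte[] value();
    byte[] metadata();
}

/** Hypothetical stand-in for the getter interface that would gain getVariant. */
interface SketchDataGetters {
    boolean isNullAt(int pos);

    SketchVariant getVariant(int pos);
}

/** A row backed by an Object[], loosely analogous to a generic row. */
final class SketchGenericRow implements SketchDataGetters {
    private final Object[] fields;

    SketchGenericRow(Object... fields) {
        this.fields = fields;
    }

    @Override
    public boolean isNullAt(int pos) {
        return fields[pos] == null;
    }

    @Override
    public SketchVariant getVariant(int pos) {
        // The field is stored as a first-class object, so no byte[] decoding is needed.
        return (SketchVariant) fields[pos];
    }
}
```

A binary-backed row would instead read the stored bytes at the given position and convert them to a Variant, which is where the wire-format helpers come in.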

3. Binary Storage Format (Backward Compatible)

The on-wire format remains unchanged for compatibility: [4-byte value length][value bytes][metadata bytes].
Variant.variantToBytes() and Variant.bytesToVariant() handle the conversion between this layout and the Variant type.
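The conversion could look roughly like the sketch below, using the [4-byte value length][value bytes][metadata bytes] layout mentioned under Motivation. Two things here are assumptions: this issue does not state the byte order of the length prefix (little-endian is assumed), and the helpers work on raw byte arrays instead of a Variant object so that the example stays self-contained. VariantWireFormatSketch is a hypothetical class name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class VariantWireFormatSketch {
    // Assumption: the byte order of the length prefix is not specified in the issue.
    static final ByteOrder LENGTH_ORDER = ByteOrder.LITTLE_ENDIAN;

    private VariantWireFormatSketch() {}

    /** Joins value and metadata as [4-byte value length][value bytes][metadata bytes]. */
    static byte[] variantToBytes(byte[] value, byte[] metadata) {
        ByteBuffer buf =
                ByteBuffer.allocate(4 + value.length + metadata.length).order(LENGTH_ORDER);
        buf.putInt(value.length);
        buf.put(value);
        buf.put(metadata);
        return buf.array();
    }

    /** Splits a wire blob into a two-element array: {value, metadata}. */
    static byte[][] bytesToVariant(byte[] bytes) {
        if (bytes.length < 4) {
            throw new IllegalArgumentException("Blob too short for the length prefix");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(LENGTH_ORDER);
        int valueLength = buf.getInt();
        if (valueLength < 0 || valueLength > buf.remaining()) {
            throw new IllegalArgumentException("Invalid value length: " + valueLength);
        }
        byte[] value = new byte[valueLength];
        buf.get(value);
        // Everything after the value bytes is treated as metadata.
        byte[] metadata = new byte[buf.remaining()];
        buf.get(metadata);
        return new byte[][] {value, metadata};
    }
}
```

Because the metadata length is implied by whatever remains after the value bytes, the layout needs only one length prefix. The bounds check guards against a corrupt prefix before any array is allocated.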

4. Integration Points

  • Lake connectors (Paimon, Iceberg, Lance): Encoders/decoders use Variant directly instead of raw byte[]
  • Flink bridge: FlussRowToFlinkRowConverter converts Variant to byte[] for Flink compatibility
  • Client converters: PojoToRowConverter / RowToPojoConverter support both byte[] and Variant inputs
  • Utilities: InternalRowUtils, TypeUtils, PartitionUtils updated accordingly

5. References

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
