[fix][cp] Read unannotated array #455

Draft
guhaiyan0221 wants to merge 3 commits into bytedance:main from guhaiyan0221:fix_cp_unannotated_array

Collaborator

@guhaiyan0221 guhaiyan0221 commented Mar 31, 2026

What problem does this PR solve?

Issue Number: close #191

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Summary:
When the element type is a scalar type, a check in `convertType` requires that the requested type not be an array. However, an unannotated array in Parquet is a repeated field that is not explicitly marked with the LIST logical type, so a scalar element can legitimately be requested as an array. To enable reading unannotated arrays, this PR instead verifies that their element types are compatible.

Follow-up for facebookincubator/velox#13620.

Corresponding PR: facebookincubator/velox#13864
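
For readers unfamiliar with the case, here is a minimal sketch of the rule the summary describes. It is not the code in `ParquetReader.cpp`: `SimpleType`, `ParquetLeaf`, `scalarCompatible`, and `isRequestedTypeCompatible` are hypothetical names invented for illustration, and "compatible" is assumed to mean an exact match of scalar kinds.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical, simplified type model; NOT the reader's real Type classes.
enum class Kind { kInteger, kBigint, kVarchar, kArray };

struct SimpleType {
  Kind kind;
  std::shared_ptr<const SimpleType> element; // set only when kind == kArray
};

using TypePtr = std::shared_ptr<const SimpleType>;

inline TypePtr scalar(Kind kind) {
  return std::make_shared<SimpleType>(SimpleType{kind, nullptr});
}

inline TypePtr arrayOf(TypePtr element) {
  return std::make_shared<SimpleType>(
      SimpleType{Kind::kArray, std::move(element)});
}

// Hypothetical description of a primitive Parquet column.
// An unannotated array looks like this in a Parquet schema:
//   message m { repeated int32 values; }
// i.e. a repeated field with no enclosing LIST-annotated group.
struct ParquetLeaf {
  Kind physicalKind;        // scalar kind stored in the file
  bool unannotatedRepeated; // REPEATED and not inside a LIST group
};

// Assumption: two scalar kinds are compatible only when they are equal.
inline bool scalarCompatible(Kind fileKind, Kind requestedKind) {
  return fileKind == requestedKind;
}

// Instead of rejecting every ARRAY request for a scalar leaf, accept it
// when the leaf is an unannotated array and the element kinds agree.
inline bool isRequestedTypeCompatible(
    const ParquetLeaf& leaf,
    const TypePtr& requested) {
  if (leaf.unannotatedRepeated) {
    return requested->kind == Kind::kArray && requested->element != nullptr &&
        scalarCompatible(leaf.physicalKind, requested->element->kind);
  }
  // Ordinary scalar leaf: the requested type must not be an array.
  return requested->kind != Kind::kArray &&
      scalarCompatible(leaf.physicalKind, requested->kind);
}
```

Under these assumptions, a `repeated int32` leaf accepts a request for an array of integers, while an array of strings or a plain integer is rejected. A non-repeated leaf keeps the original behavior and rejects any array request.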

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:
- Fixed a crash in `substr` when input is null.
- Optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

Comment thread bolt/dwio/parquet/reader/ParquetReader.cpp
@guhaiyan0221 guhaiyan0221 force-pushed the fix_cp_unannotated_array branch from fb4b7d5 to 73810c5 Compare April 15, 2026 11:21
Collaborator

@Weixin-Xu Weixin-Xu left a comment

LGTM

rui-mo and others added 2 commits April 16, 2026 20:51
Summary:
When the element type is a scalar type, a check in `convertType` requires that
the requested type not be an array. However, an unannotated array in Parquet
is a repeated field that is not explicitly marked with the LIST logical type,
so a scalar element can legitimately be requested as an array. To enable
reading unannotated arrays, this PR instead verifies that their element types
are compatible.

Follow-up for facebookincubator/velox#13620.

Corresponding PR: facebookincubator/velox#13864
@guhaiyan0221 guhaiyan0221 force-pushed the fix_cp_unannotated_array branch from 73810c5 to dc8a22d Compare April 16, 2026 12:51
@guhaiyan0221 guhaiyan0221 marked this pull request as draft April 16, 2026 14:20
3 participants