Skip files outside partition structure in hive-partitioned listing tables#21756
Merged
zhuqi-lucas merged 2 commits intoapache:mainfrom Apr 22, 2026
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Updates hive-partitioned listing table file discovery to ignore files that don’t conform to the expected col=value/ partition directory structure, preventing crashes when partition columns are referenced.
Changes:
- Change
try_into_partitioned_fileto returnOk(None)for paths that don’t parse as valid hive partition paths. - Update
pruned_partition_listto skip such files usingtry_filter_map. - Add unit tests covering valid partitions and several “skip” scenarios (root file, wrong partition name, partial partitions).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
datafusion/catalog-listing/src/helpers.rs |
Skips non-partition-structured files during partitioned listing; adds unit tests. |
datafusion/catalog-listing/Cargo.toml |
Adds chrono as a dev-dependency for new tests constructing ObjectMeta. |
Comments suppressed due to low confidence (1)
datafusion/catalog-listing/src/helpers.rs:369
parse_partitions_for_pathcan returnSomewith fewer values thanpartition_colswhen the file path has fewer matchingkey=valuesegments than the number of partition columns (e.g. a root-level file namedyear=2024while expectingyear,month). In that case this code will build aPartitionedFilewith a shorterpartition_valuesvector, and laterfilter_partitionswill error when building aRecordBatchwith a schema containing more fields than arrays. Consider validatingparsed.len() == partition_cols.len()(and skipping/logging otherwise) before constructingpartition_values.
let partition_values = parsed
.into_iter()
.zip(partition_cols)
.map(|(parsed, (_, datatype))| {
ScalarValue::try_from_string(parsed.to_string(), datatype)
})
.collect::<Result<Vec<_>>>()?;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
2a47d54 to
d2c88c7
Compare
d2c88c7 to
2066b74
Compare
…bles When a hive-partitioned listing table contains files in the root directory (not inside any `partition_col=value/` path), these files have no partition values. Previously `try_into_partitioned_file` included them with empty `partition_values`, causing downstream errors like `Unable to get field named "partition_col"` when queries reference partition columns. This is a common scenario when a table transitions from non-partitioned to hive-partitioned storage — the original root file may still exist alongside the new partition directories. The fix returns `Ok(None)` for files that don't match the partition structure, and the caller skips them via `try_filter_map`. Tests: - `test_try_into_partitioned_file_valid_partition` — normal case - `test_try_into_partitioned_file_root_file_skipped` — root file - `test_try_into_partitioned_file_wrong_partition_name` — wrong col name - `test_try_into_partitioned_file_multiple_partitions` — multi-level - `test_try_into_partitioned_file_partial_partition_skipped` — incomplete
2066b74 to
fa905d0
Compare
xudong963
approved these changes
Apr 21, 2026
alamb
approved these changes
Apr 22, 2026
Contributor
alamb
left a comment
There was a problem hiding this comment.
Thanks @zhuqi-lucas and @xudong963
Contributor
Author
|
Thanks @xudong963 and @alamb for review! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #21755
Rationale
When a hive-partitioned listing table has files in the root directory (not inside any
partition_col=value/path), queries that reference partition columns crash withUnable to get field named "partition_col".This is a common scenario when a table transitions from non-partitioned to hive-partitioned storage — the original root file may still exist alongside the new partition directories.
What changes are included in this PR?
try_into_partitioned_filenow returnsOk(None)for files that don't match the partition structure (whereparse_partitions_for_pathreturnsNone). The caller skips them viatry_filter_map.Previously,
Nonefromparse_partitions_for_pathwas converted to emptypartition_valuesvia.into_iter().flatten(), causing downstream errors.Are these changes tested?
Yes, 5 unit tests added:
test_try_into_partitioned_file_valid_partition— normal casetest_try_into_partitioned_file_root_file_skipped— root file skippedtest_try_into_partitioned_file_wrong_partition_name— wrong partition col nametest_try_into_partitioned_file_multiple_partitions— multi-level partitionstest_try_into_partitioned_file_partial_partition_skipped— incomplete partition pathAre there any user-facing changes?
Files outside the hive partition structure are now silently skipped instead of causing query failures. This may change
COUNT(*)results if such files previously contributed rows (with empty partition values).