Skip files outside partition structure in hive-partitioned listing tables #21756

Merged
zhuqi-lucas merged 2 commits into apache:main from zhuqi-lucas:fix/skip-non-partition-files
Apr 22, 2026

@zhuqi-lucas
Contributor

Which issue does this PR close?

Closes #21755

Rationale

When a hive-partitioned listing table has files in the root directory (not inside any `partition_col=value/` path), queries that reference partition columns fail with the error `Unable to get field named "partition_col"`.

This is a common scenario when a table transitions from non-partitioned to hive-partitioned storage — the original root file may still exist alongside the new partition directories.

What changes are included in this PR?

`try_into_partitioned_file` now returns `Ok(None)` for files that don't match the partition structure (where `parse_partitions_for_path` returns `None`). The caller skips them via `try_filter_map`.
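
As a rough illustration of the skip behavior described above, here is a minimal std-only sketch. It is not the DataFusion code: `parse_partitions`, `try_into_partitioned`, and `list_partitioned` are hypothetical stand-ins that operate on plain relative path strings, and `filter_map` over `Result::transpose` is used as a synchronous analogue of a fallible stream filter such as `try_filter_map`.

```rust
/// Hypothetical, simplified stand-in for `parse_partitions_for_path`.
/// `rel_path` is the file path relative to the table root. Returns `None`
/// as soon as a path segment is not `<expected_col>=<value>`.
fn parse_partitions<'a>(rel_path: &'a str, partition_cols: &[&str]) -> Option<Vec<&'a str>> {
    let mut values = Vec::new();
    for (segment, expected) in rel_path.split('/').zip(partition_cols) {
        match segment.split_once('=') {
            Some((name, value)) if name == *expected => values.push(value),
            _ => return None,
        }
    }
    Some(values)
}

/// Hypothetical stand-in for `try_into_partitioned_file`.
/// `Ok(None)` means "this file is outside the partition structure, skip it".
fn try_into_partitioned<'a>(
    rel_path: &'a str,
    partition_cols: &[&str],
) -> Result<Option<(&'a str, Vec<&'a str>)>, String> {
    match parse_partitions(rel_path, partition_cols) {
        Some(values) => Ok(Some((rel_path, values))),
        None => Ok(None),
    }
}

/// Keeps only files that fit the partition layout and propagates errors.
/// `Ok(None)` items are dropped, `Ok(Some(_))` items are kept, and the first
/// `Err(_)` aborts the collection.
fn list_partitioned<'a>(
    paths: &[&'a str],
    partition_cols: &[&str],
) -> Result<Vec<(&'a str, Vec<&'a str>)>, String> {
    paths
        .iter()
        .copied()
        .filter_map(|p| try_into_partitioned(p, partition_cols).transpose())
        .collect()
}
```

With a single partition column `year`, a listing of `year=2024/a.parquet`, `legacy.parquet`, and `month=01/b.parquet` would keep only the first file in this sketch; the other two yield `Ok(None)` and are dropped without an error.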

Previously, `None` from `parse_partitions_for_path` was converted to empty `partition_values` via `.into_iter().flatten()`, causing downstream errors.
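
To see why the earlier shape lost the "no match" signal, the following small sketch isolates the `Option` flattening step. `flatten_parsed` is a hypothetical helper written only for this illustration; the real code went on to convert each value to a `ScalarValue`.

```rust
/// Hypothetical helper mirroring the pre-fix shape: iterating an `Option`
/// yields zero or one `Vec`, and `flatten` then yields that Vec's elements.
/// A `None` parse result therefore becomes an empty list of values,
/// indistinguishable from "parsed successfully, zero partition columns".
fn flatten_parsed(parsed: Option<Vec<&str>>) -> Vec<&str> {
    parsed.into_iter().flatten().collect()
}
```

In this sketch, `flatten_parsed(None)` returns an empty vector rather than signalling a mismatch, which matches the empty `partition_values` described above.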

Are these changes tested?

Yes, 5 unit tests added:

  • test_try_into_partitioned_file_valid_partition — normal case
  • test_try_into_partitioned_file_root_file_skipped — root file skipped
  • test_try_into_partitioned_file_wrong_partition_name — wrong partition col name
  • test_try_into_partitioned_file_multiple_partitions — multi-level partitions
  • test_try_into_partitioned_file_partial_partition_skipped — incomplete partition path

Are there any user-facing changes?

Files outside the hive partition structure are now silently skipped instead of causing query failures. This may change COUNT(*) results if such files previously contributed rows (with empty partition values).

Copilot AI review requested due to automatic review settings April 21, 2026 07:16
@github-actions (bot) added the catalog label (Related to the catalog crate) Apr 21, 2026
Contributor

Copilot AI left a comment

Pull request overview

Updates hive-partitioned listing table file discovery to ignore files that don’t conform to the expected col=value/ partition directory structure, preventing crashes when partition columns are referenced.

Changes:

  • Change try_into_partitioned_file to return Ok(None) for paths that don’t parse as valid hive partition paths.
  • Update pruned_partition_list to skip such files using try_filter_map.
  • Add unit tests covering valid partitions and several “skip” scenarios (root file, wrong partition name, partial partitions).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `datafusion/catalog-listing/src/helpers.rs` | Skips non-partition-structured files during partitioned listing; adds unit tests. |
| `datafusion/catalog-listing/Cargo.toml` | Adds `chrono` as a dev-dependency for new tests constructing `ObjectMeta`. |

Comments suppressed due to low confidence (1)

datafusion/catalog-listing/src/helpers.rs:369

  • parse_partitions_for_path can return Some with fewer values than partition_cols when the file path has fewer matching key=value segments than the number of partition columns (e.g. a root-level file named year=2024 while expecting year,month). In that case this code will build a PartitionedFile with a shorter partition_values vector, and later filter_partitions will error when building a RecordBatch with a schema containing more fields than arrays. Consider validating parsed.len() == partition_cols.len() (and skipping/logging otherwise) before constructing partition_values.
    let partition_values = parsed
        .into_iter()
        .zip(partition_cols)
        .map(|(parsed, (_, datatype))| {
            ScalarValue::try_from_string(parsed.to_string(), datatype)
        })
        .collect::<Result<Vec<_>>>()?;
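
The length check suggested in this comment could look roughly like the sketch below. It is an illustration of the reviewer's idea only, with a hypothetical `validated_values` helper working on plain string slices; the source does not say whether the merged change validates lengths this way.

```rust
/// Hypothetical helper: keep the parsed values only when there is exactly
/// one value per partition column. A shorter (or missing) parse result
/// becomes `None`, which a caller could treat as "skip this file".
fn validated_values<'a>(
    parsed: Option<Vec<&'a str>>,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    parsed.filter(|values| values.len() == partition_cols.len())
}
```

For the example in the comment (one parsed value for `year` while `year, month` are expected), this sketch returns `None` instead of producing a short `partition_values` vector.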

Comment thread datafusion/catalog-listing/src/helpers.rs Outdated
@zhuqi-lucas force-pushed the fix/skip-non-partition-files branch 2 times, most recently from 2a47d54 to d2c88c7 April 21, 2026 07:29
@github-actions (bot) added the sqllogictest label (SQL Logic Tests (.slt)) Apr 21, 2026
@zhuqi-lucas force-pushed the fix/skip-non-partition-files branch from d2c88c7 to 2066b74 April 21, 2026 07:43
…bles

When a hive-partitioned listing table contains files in the root
directory (not inside any `partition_col=value/` path), these files
have no partition values. Previously `try_into_partitioned_file`
included them with empty `partition_values`, causing downstream
errors like `Unable to get field named "partition_col"` when queries
reference partition columns.

This is a common scenario when a table transitions from
non-partitioned to hive-partitioned storage — the original root
file may still exist alongside the new partition directories.

The fix returns `Ok(None)` for files that don't match the partition
structure, and the caller skips them via `try_filter_map`.

Tests:
- `test_try_into_partitioned_file_valid_partition` — normal case
- `test_try_into_partitioned_file_root_file_skipped` — root file
- `test_try_into_partitioned_file_wrong_partition_name` — wrong col name
- `test_try_into_partitioned_file_multiple_partitions` — multi-level
- `test_try_into_partitioned_file_partial_partition_skipped` — incomplete
@zhuqi-lucas force-pushed the fix/skip-non-partition-files branch from 2066b74 to fa905d0 April 21, 2026 07:46
Contributor

@alamb alamb left a comment

Comment thread datafusion/sqllogictest/test_files/listing_table_partitions.slt
@zhuqi-lucas
Contributor Author

Thanks @xudong963 and @alamb for the review!

@zhuqi-lucas zhuqi-lucas added this pull request to the merge queue Apr 22, 2026
Merged via the queue into apache:main with commit 5d508d3 Apr 22, 2026
34 checks passed
@zhuqi-lucas zhuqi-lucas deleted the fix/skip-non-partition-files branch April 22, 2026 03:31

Labels

  • catalog: Related to the catalog crate
  • sqllogictest: SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Hive-partitioned listing table crashes when root directory contains non-partitioned files

4 participants