Add a memory bound FileStatisticsCache for the Listing Table #20047
Open
mkleen wants to merge 61 commits into apache:main from
kosiew
requested changes
Feb 4, 2026
Contributor
Author
@kosiew Thank you for the feedback!
Contributor
Author
@kosiew Anything else needed to get this merged? Another approval maybe?
martin-g
reviewed
Feb 10, 2026
impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
Member
This won't be accurate.
let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();
// this should be true because all `a`s point to the same object in memory
// but the current implementation does not detect this and counts them separately
assert_eq!(a4.heap_size(), a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size());

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been "sized" and to ignore any Arcs that point to an address "sized" earlier.
Contributor
Author
Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102 . I would suggest doing a follow-up here as well. We are planning to restructure the whole heap size estimation anyway.
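For readers following this thread, the deduplication idea raised above could look roughly like the sketch below. It is not code from this PR: `SeenAllocations`, `TrackedHeapSize`, and `total_heap_size` are hypothetical names introduced only for illustration, and the trait signature differs from `DFHeapSize` because it threads a tracker through every call.

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical tracker of Arc allocations that were already counted.
#[derive(Default)]
struct SeenAllocations {
    addrs: HashSet<usize>,
}

/// Hypothetical heap-size trait that receives the tracker on every call.
trait TrackedHeapSize {
    fn heap_size(&self, seen: &mut SeenAllocations) -> usize;
}

impl TrackedHeapSize for i32 {
    fn heap_size(&self, _seen: &mut SeenAllocations) -> usize {
        0 // plain integers own no heap memory
    }
}

impl<T: TrackedHeapSize> TrackedHeapSize for Vec<T> {
    fn heap_size(&self, seen: &mut SeenAllocations) -> usize {
        // Buffer owned by the Vec plus whatever the elements own.
        let mut total = self.capacity() * size_of::<T>();
        for item in self {
            total += item.heap_size(seen);
        }
        total
    }
}

impl<T: TrackedHeapSize> TrackedHeapSize for Arc<T> {
    fn heap_size(&self, seen: &mut SeenAllocations) -> usize {
        // The address of the shared value identifies the allocation.
        let addr = Arc::as_ptr(self) as usize;
        if !seen.addrs.insert(addr) {
            return 0; // this allocation was counted through another Arc
        }
        // Strong and weak counts live next to the value on the heap.
        2 * size_of::<usize>() + size_of::<T>() + (**self).heap_size(seen)
    }
}

/// Sums the heap size of several values, counting shared allocations once.
fn total_heap_size<T: TrackedHeapSize>(values: &[T]) -> usize {
    let mut seen = SeenAllocations::default();
    let mut total = 0;
    for value in values {
        total += value.heap_size(&mut seen);
    }
    total
}
```

With this shape, passing four clones of the same `Arc` to `total_heap_size` yields the same result as passing one. The cost is that every implementation has to carry the tracker, which is one reason a follow-up redesign may be the better place for it.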
Contributor
Author
@martin-g Thanks for this great review! I am on it.
Which issue does this PR close?
This change introduces a default FileStatisticsCache implementation for the Listing-Table with a size limit, implementing the steps outlined in #19052 (comment):

- Add heap size estimation for file statistics and the relevant data types used in caching (this is temporary until "Add a crate for HeapSize trait" arrow-rs#9138 is resolved)
- Redesign DefaultFileStatisticsCache to use an LruQueue to make it memory-bound, following "Adds memory-bound DefaultListFilesCache" #18855
- Introduce a size limit and use it together with the heap size to limit the memory usage of the cache
- Move FileStatisticsCache creation into CacheManager, making it session-scoped and shared across statements and tables
- Disable caching in some of the SQL-logic tests where the change altered the output, because the cache is now session-scoped rather than query-scoped

Closes "Add a default FileStatisticsCache implementation for the ListingTable" #19217
Closes "Add limit to DefaultFileStatisticsCache" #19052

Rationale for this change
See above.
What changes are included in this PR?
See above.
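To make the combination of LRU ordering, per-entry heap size, and a size limit concrete, here is a minimal standalone sketch. It is not the PR's implementation: `BoundedCache` and its methods are hypothetical, a `VecDeque` stands in for DataFusion's `LruQueue`, and the caller passes in the entry size instead of deriving it from a heap-size trait.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical memory-bound cache: evicts least recently used entries
/// until the summed entry sizes fit within `limit` bytes.
struct BoundedCache<V> {
    limit: usize,
    used: usize,
    entries: HashMap<String, (V, usize)>,
    order: VecDeque<String>, // front = least recently used
}

impl<V> BoundedCache<V> {
    fn new(limit: usize) -> Self {
        Self {
            limit,
            used: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns the cached value and marks it as most recently used.
    fn get(&mut self, key: &str) -> Option<&V> {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
        self.entries.get(key).map(|(value, _)| value)
    }

    /// Inserts an entry of `size` bytes; returns false if it can never fit.
    fn put(&mut self, key: &str, value: V, size: usize) -> bool {
        // Drop any previous entry under the same key first.
        if let Some((_, old_size)) = self.entries.remove(key) {
            self.used -= old_size;
            self.order.retain(|k| k.as_str() != key);
        }
        if size > self.limit {
            return false;
        }
        // Evict from the least recently used end until the entry fits.
        while self.used + size > self.limit {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some((_, victim_size)) = self.entries.remove(&victim) {
                        self.used -= victim_size;
                    }
                }
                None => break,
            }
        }
        self.entries.insert(key.to_string(), (value, size));
        self.order.push_back(key.to_string());
        self.used += size;
        true
    }
}
```

The linear scan in `get` keeps the sketch short. A production cache would use a structure with constant-time reordering, which is the role the `LruQueue` plays in the description above.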
Are these changes tested?
Yes.
Are there any user-facing changes?
A new runtime setting: datafusion.runtime.file_statistics.cache_limit