add fread file connection support #7422

Merged
aitap merged 44 commits into master from fread_connections
Mar 12, 2026
ben-schwen commented Nov 8, 2025

Closes #561
Also closes #4329, which gets a nice workaround: wrap the file to read in file() and go through the connection interface.

The connection is spilled to a temporary file, since this seems to cover almost all cases. The nrows parameter is respected, so peeking at a file does not require spilling all of it.
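
To make the spill-and-peek idea concrete, here is a minimal base-R sketch. It is not the code in this PR, which works at the C level. `spill_to_tempfile`, `max_lines` and `chunk_lines` are hypothetical names chosen for illustration, and the sketch copies text lines rather than raw bytes to stay short.

```r
# Hypothetical helper (not part of data.table): copy a connection's contents
# to a temporary file, stopping early once `max_lines` lines have been copied.
spill_to_tempfile <- function(con, max_lines = Inf, chunk_lines = 1000L) {
  if (!isOpen(con)) {
    open(con, "rt")
    on.exit(close(con), add = TRUE)   # close only what we opened ourselves
  }
  out_path <- tempfile(fileext = ".csv")
  out <- file(out_path, "wt")
  on.exit(close(out), add = TRUE)     # flush and close the spill file on exit
  copied <- 0
  while (copied < max_lines) {
    want <- min(chunk_lines, max_lines - copied)
    lines <- readLines(con, n = want)
    if (length(lines) == 0L) break    # connection exhausted
    writeLines(lines, out)
    copied <- copied + length(lines)
  }
  out_path
}

# Example data: 100 rows in a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:100, b = letters[rep(1:4, 25)]),
          csv_path, row.names = FALSE)

# Peek: spill only the header plus 10 data rows, then parse the small file
peek_path <- spill_to_tempfile(file(csv_path), max_lines = 11)
peek <- read.csv(peek_path)

# Full read: spill everything
full <- read.csv(spill_to_tempfile(file(csv_path)))
```

The point of the early stop is that the cost of a peek scales with the rows requested, not with the size of the source behind the connection.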

We can't magically make file connections faster since we do not have random access like with mmap.

For the benchmark below, note that my tempdir already points to my SSD (so we can't see a big difference), and I don't have an HDD on this PC.

[Figure: fread_con_vs_readtable benchmark plot]

Extending this to 1e8 rows instead of 1e7 and turning on verbose output also shows that, for large files, half of the time is spent spilling to disk.

Read 100000000 rows x 4 columns from 3.916GiB (4205299167 bytes) file in 00:02.151 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : int32     '7'
         2 : float64   '9'
         1 : string    'E'
=============================
   2.158s ( 50%) Spill connection to tempfile (3.916GiB)
   0.000s (  0%) Memory map 3.916GiB file
   0.002s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   0.413s ( 10%) Allocation of 109975410 rows x 4 cols (2.868GiB) of which 100000000 ( 91%) rows used
   1.736s ( 40%) Reading 4010 chunks (0 swept) of 1.000MiB (each chunk 24937 rows) using 10 threads
   +    0.575s ( 13%) Parse to row-major thread buffers (grown 0 times)
   +    0.759s ( 18%) Transpose
   +    0.402s (  9%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   4.309s        Total
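
As a quick sanity check on the "half of the time" reading, the numbers in the log above can be plugged into a few lines of R. The variable names are made up for this sketch; the values are copied from the verbose output.

```r
# Values copied from the verbose output above
spill_s    <- 2.158        # "Spill connection to tempfile"
total_s    <- 4.309        # "Total"
file_bytes <- 4205299167   # file size reported by fread

# Share of the total wall clock time spent spilling (about 50, as in the log)
share_pct <- 100 * spill_s / total_s

# Effective spill throughput in GiB per second (roughly 1.8)
gib_per_s <- file_bytes / 2^30 / spill_s
```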
Details
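library(data.table)
library(atime)
set.seed(123)
N = 1e7
# Synthetic data: one integer, two double and one character column
test_df <- data.frame(
    a = sample(1:1000, N, replace=TRUE),
    b = rnorm(N),
    c = sample(letters, N, replace=TRUE),
    d = runif(N)
)
# Write the data to a temporary CSV file
f = tempfile(fileext = '.csv')
fwrite(test_df, f)

# Input sizes from 1e2 up to N rows, evenly spaced on a log10 scale
Nseq = 10^seq(2, log10(N), .25)
# Inside atime(), N takes each value of Nseq in turn
read = atime(N = Nseq, seconds.limit=1,
    fread_con = fread(file(f), nrows = N),
    # same as above, but spill to a RAM-backed tmpfs (Linux)
    fread_con_RAM = fread(file(f), nrows = N, tmpdir = "/dev/shm"),
    readtable = read.table(file(f), header=TRUE, sep=',', nrows = N),
    # baseline: read from the file path directly
    fread = fread(f, nrows = N)
)

plot(read)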

github-actions bot commented Nov 8, 2025

No obvious timing issues in HEAD=fread_connections
[Figure: Comparison Plot]

Generated via commit fd866b7

Download link for the artifact containing the test results: ↓ atime-results.zip

| Task | Duration |
|---|---|
| R setup and installing dependencies | 3 minutes and 14 seconds |
| Installing different package versions | 23 seconds |
| Running and plotting the test cases | 4 minutes and 0 seconds |

codecov bot commented Nov 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.04%. Comparing base (7c410cd) to head (fd866b7).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7422   +/-   ##
=======================================
  Coverage   99.03%   99.04%           
=======================================
  Files          87       87           
  Lines       16930    17029   +99     
=======================================
+ Hits        16767    16866   +99     
  Misses        163      163           


aitap commented Nov 8, 2025 via email

ben-schwen commented

I guess there are some more cool kids on the CRAN block using R_GetConnection (although not too many)

aitap left a comment

Should be good after fixing the potential resource leak caused by errors in R_ReadConnection.
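
The leak discussed here lives in the C code around R_ReadConnection, which is not shown in this thread. As an R-level analogy only, the sketch below shows the general pattern of registering cleanup before any step that can fail, so that an error mid-read leaves neither an open connection nor a stray tempfile. `copy_bytes` and `read_via_spill` are hypothetical helpers, not data.table functions.

```r
# Hypothetical helper: copy raw bytes from an open binary connection to `path`.
copy_bytes <- function(con, path, chunk = 65536L) {
  out <- file(path, "wb")
  on.exit(close(out), add = TRUE)      # closed even if readBin() errors
  repeat {
    buf <- readBin(con, what = "raw", n = chunk)
    if (length(buf) == 0L) break       # end of input
    writeBin(buf, out)
  }
  invisible(path)
}

# Hypothetical helper: spill `con` to a tempfile, hand the path to `reader`,
# and always clean up, whether or not `reader` (or the copy) fails.
read_via_spill <- function(con, reader) {
  spill_path <- tempfile()
  on.exit(unlink(spill_path), add = TRUE)  # registered before anything can fail
  if (!isOpen(con)) {
    open(con, "rb")                        # readBin() needs a binary connection
    on.exit(close(con), add = TRUE)
  }
  copy_bytes(con, spill_path)
  reader(spill_path)
}

src <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), src)

# Normal case: the reader succeeds
ok <- read_via_spill(file(src), read.csv)

# Failure case: the reader errors after seeing the spill file
seen_path <- NULL
failed <- tryCatch(
  read_via_spill(file(src), function(p) { seen_path <<- p; stop("parse error") }),
  error = function(e) "error caught"
)
spill_left_behind <- file.exists(seen_path)
```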

ben-schwen requested a review from aitap on March 12, 2026, 12:15
aitap left a comment

I don't like having to introduce the binary_reopener as another entry point to support, but there doesn't seem to be any other way to proceed given the current API. Thank you very much!

aitap merged commit d86e1f5 into master on Mar 12, 2026
15 checks passed
aitap deleted the fread_connections branch on March 12, 2026, 20:43
Development

Successfully merging this pull request may close these issues.

fread tries to map memory for the entire file when using nrows
[R-Forge #4931] Support file connections for fread

3 participants