Python Relational API - Generated SQL query embeds data #372
Description
What happens?
Using the Python relational API sometimes causes the dataset to be downloaded and the data to be embedded into the generated SQL, which is then executed. This causes a major slowdown, which I assume comes from generating the massive query.
On my modestly sized data set (about 3 million rows) running the following takes about 1 sec:
df = conn.sql("SELECT * FROM demand_history WHERE year(interval_date) = 2025").pl()
Doing this instead, which I expected to be exactly equivalent, takes about 40 seconds:
df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()
I've attached the top 18 lines of the output of `conn.table("demand_history").filter("year(interval_date) = 2025").sql_query()`. The entire query is 500 MB. sql_query.txt
Unfortunately, it only seems to happen with a large amount of data, so I haven't been able to produce a minimal example, and I haven't narrowed down the exact conditions that cause it. If I instead do a `between date1 and date2` filter for a short period (a few days), I don't see this behavior and printing the SQL query looks exactly as I'd expect, but doing it for a longer period (a few months) triggers the unexpected behavior.
I also don't know whether this is a DuckLake bug or a DuckDB bug, but I was never able to reproduce it using just DuckDB. This is with Azure file storage using the Azure extension (3918e45), so that may also be the culprit.
To Reproduce
connection.sql:
t0 = time.perf_counter()
df = conn.sql("select * from demand_history where year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s") # Elapsed: 1.61s
connection.table:
with open("sql_query.txt", "a") as f:
print(conn.table("demand_history").filter("year(interval_date) = 2025").sql_query(), file=f)
t0 = time.perf_counter()
df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s") # Elapsed: 35.19s (faster than usual but point stands)
Apologies for not providing something more reproducible but I'm struggling to produce a simple reproducible example. Let me know if there is any more info that would be helpful.
OS:
Ubuntu 22.04.5 LTS x86_64
DuckDB Version:
1.4.4
DuckLake Version:
3f1b372
DuckDB Client:
Python (3.12.3)
Hardware:
No response
Full Name:
Ethan Spencer
Affiliation:
ZEN Energy
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
No - I cannot easily share my data sets due to their large size
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
Yes, I have