The tpch extension implements the data generator and queries for the
TPC-H benchmark Version 3.0.1, and provides an easy way to run all 22 queries in order.
make install compiles the TPC-H dbgen tools from source, so you need a C toolchain:
# Debian/Ubuntu
sudo apt-get install build-essential
# RHEL/Rocky/CentOS/Fedora
sudo dnf install gcc makeThen:
cd pg_tpch
make installNote: The TPC-H reference data (
ref_data/) and specification files are excluded from the git repository to keep it small. They are not needed to build or run the extension. If you need them for result validation, re-extract from the official TPC-H zip:unzip /path/to/TPC-H-TOOL.zip 'TPC-H V3.0.1/ref_data/*' 'TPC-H V3.0.1/specification.*' mv 'TPC-H V3.0.1/ref_data' 'TPC-H V3.0.1/specification.'* TPC-H-V3.0.1/ rm -r 'TPC-H V3.0.1'
Default PostgreSQL settings (shared_buffers = 128MB, work_mem = 4MB, etc.) are far too
conservative for analytical workloads. gen_pg_conf.py auto-detects your hardware (CPU,
RAM, disk type) and generates an optimized tpch_postgres.conf with recommended settings
synthesized from PostgreSQL Wiki
and EDB OLAP tuning guide,
combined with benchmark experience. The output is a starting point — adjust to your workload.
python3 gen_pg_conf.py # writes tpch_postgres.conf
python3 gen_pg_conf.py --dry-run # preview without writingApply it and restart:
echo "include = '$(pwd)/tpch_postgres.conf'" >> $(psql -tA -c "SHOW config_file")
pg_ctl restart -D $(psql -tA -c "SHOW data_directory")Key parameters tuned: shared_buffers (25% RAM), effective_cache_size (75% RAM),
work_mem, max_parallel_workers_per_gather, random_page_cost (SSD vs HDD),
max_wal_size, jit, and more. See the generated file for details.
CREATE EXTENSION tpch;
CALL tpch.run(); -- SF=1, single-threaded (default)run() executes the full pipeline: schema → data generation → load → query generation → benchmark.
-- For larger scale factors:
SELECT tpch.config('data_dir', '/data/tpch_tmp'); -- optional: set data dir (default: /tmp/tpch_data)
CALL tpch.run(100, 8); -- SF=100 (~100 GB), 8 parallel workersCREATE EXTENSION tpch;
SELECT tpch.gen_schema(); -- 1. create 8 TPC-H tables
SELECT tpch.gen_data(1, 8); -- 2. generate SF-1 (~1 GB) .tbl files, 8 parallel workers
SELECT tpch.load_data(8); -- 3. load .tbl files into tables, 8 parallel workers
SELECT tpch.gen_query(); -- 4. generate 22 queries (scale auto-detected from gen_data)
SELECT tpch.bench(); -- 5. run all 22 queries
SELECT tpch.clean_data(); -- 6. (optional) delete .tbl files to free disk spaceCheck the latest results:
SELECT * FROM tpch.bench_summary;Built and tested on PostgreSQL 19devel. Older versions should also work. If not, please create an issue.
| Function | Returns | Description |
|---|---|---|
tpch.config(key) |
TEXT | Get config value |
tpch.config(key, value) |
TEXT | Set config value |
tpch.info() |
TABLE | Show all resolved paths and scale factor |
tpch.run(scale, parallel=1) |
— | Full pipeline: gen_schema → gen_data → load_data → gen_query → bench |
tpch.gen_schema() |
TEXT | Create 8 TPC-H tables under tpch schema |
tpch.gen_data(scale, parallel=1) |
TEXT | Generate .tbl files via dbgen |
tpch.load_data(workers=4) |
TEXT | Load .tbl files into tables and analyze |
tpch.clean_data() |
TEXT | Delete .tbl files from data_dir to free disk space |
tpch.gen_query(scale=auto) |
TEXT | Generate 22 queries; scale auto-detected from gen_data, default 1 |
tpch.show(qid) |
TEXT | Return query text |
tpch.exec(qid) |
TEXT | Execute one query, save result to tpch.bench_results |
tpch.bench(mode) |
TEXT | Run or explain all 22 queries, update bench_summary |
tpch.explain(qid, opts) |
SETOF TEXT | EXPLAIN a single query |
Runs the complete benchmark pipeline in one call.
CALL tpch.run(); -- SF=1, single-threaded
CALL tpch.run(100, 8); -- SF=100, 8 dbgen workersparallel controls both data generation (dbgen workers) and loading (table-COPY workers, capped
internally at LEAST(parallel, 8) since TPC-H has only 8 tables).
Show the query 6's text.
SELECT tpch.show(6);
show
----------------------------------------------
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1994-01-01'
and l_shipdate < date '1994-01-01' + interval '1' year
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;See the plan of query 6.
SELECT * FROM tpch.explain(6, 'COSTS OFF');
explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate
-> Seq Scan on lineitem
Filter: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.05) AND (l_discount <= 0.07) AND (l_quantity < '24'::numeric))
(3 rows)Run all 22 queries and record results.
SELECT tpch.bench(); -- execute all 22 queries
SELECT tpch.bench('EXPLAIN'); -- explain all 22 queries
SELECT tpch.bench('EXPLAIN (ANALYZE, COSTS OFF)'); -- explain with optionsOutput saved to results_dir:
- Per-query:
query1.out...query22.out(orquery1_explain.out...query22_explain.out) - Summary:
summary.csv— query_id, status, duration_ms, rows_returned
tpch.bench_summary is updated after each run with the latest results.
| Table | Populated by | Description |
|---|---|---|
tpch.config |
CREATE EXTENSION |
Configuration |
tpch.query |
gen_query() |
22 generated query texts |
tpch.bench_summary |
bench() |
Latest run: query_id, status, duration_ms, rows_returned (updated each run) |
tpch.bench_results |
exec() / bench() |
All historical results (appended each run) |
| 8 data tables | gen_schema() + gen_data() |
lineitem, orders, customer, part, supplier, partsupp, nation, region |
| Directory | Contents |
|---|---|
query_dir |
query1.sql ... query22.sql from gen_query() |
results_dir |
Per-query .out files and summary.csv from bench() |
data_dir |
Temporary .tbl files from gen_data() (default: /tmp/tpch_data) |
Check all resolved paths:
SELECT * FROM tpch.info();Everything works out of the box. All directories except data_dir are auto-detected under the extension install path. Optional overrides:
SELECT tpch.config('data_dir', '/data/tpch');
SELECT tpch.config('results_dir', '/data/results');
SELECT tpch.config('query_dir', '/data/queries');Disk space warning:
gen_data()writes raw.tblfiles todata_dirbefore loading them into PostgreSQL. The.tblfiles are roughly the same size as the loaded data (~1 GB per scale factor). Make suredata_dirhas enough free space — at least 2× the scale factor in GB to account for both the.tblfiles and the database storage. The defaultdata_diris/tmp/tpch_data, which may be too small for large scale factors. Set it to a partition with sufficient space before runninggen_data():SELECT tpch.config('data_dir', '/data/tpch_tmp'); SELECT tpch.gen_data(100, 8); -- SF=100 needs ~100 GB in data_dir
gen_query() automatically patches raw qgen output:
- Interval precision —
interval '90' day (3)→interval '90' day(strip unsupported precision qualifier) - Standalone LIMIT — relocates
LIMIT Nfrom separate line into the query body - Control directives — strips qgen directives (
:x,:o,:n, etc.) and carriage returns - Query 15 (Top Supplier) —
CREATE VIEW/DROP VIEWhandled as DDL in explain and bench modes
PostgreSQL License. See LICENSE file.
Disclaimer: TPC-H is a trademark of the Transaction Processing Performance Council (TPC). This extension is not an official TPC benchmark implementation. Any results produced should be referred to as "derived from" or "based on" TPC-H. See NOTICE file.