Performance: parallel file I/O and optimized reading #44

Open

wojtek wants to merge 6 commits into zeux:master from wojtek:performance-improvements

Conversation


@wojtek wojtek commented Feb 3, 2026

Summary

  • Add parallel file I/O with read-ahead buffering for index building
  • Use Windows FILE_FLAG_SEQUENTIAL_SCAN for better prefetching
  • Pre-allocate file buffers to avoid O(n²) reallocation
  • Minor optimizations: memchr for line finding, unordered_map for regex cache

Benchmark Results (UE5.6.1 Engine - 186k files, 2GB)

Scenario      Baseline  Improved  Speedup
Cold Build    4.47s     2.72s     1.64x (39% faster)
Incremental   1.15s     1.14s     ~1.0x (no change)

Incremental updates only scan file metadata, so no improvement is expected there.

Changes

  1. fileutil_win.cpp - readFileOptimized() with FILE_FLAG_SEQUENTIAL_SCAN
  2. build.cpp - Parallel read-ahead with multiple reader threads
  3. stringutil.hpp - Use memchr for faster line-end finding
  4. blockingqueue.hpp - notify_one instead of notify_all
  5. project.cpp - unordered_map for regex cache

Use the Windows API directly with the FILE_FLAG_SEQUENTIAL_SCAN hint for better
prefetching when reading files sequentially. This improves I/O throughput
during index building.

The POSIX implementation returns empty to allow fallback to standard file
reading.
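A sketch of that shape (the function name readFileOptimized comes from the PR's change list, but the exact signature and error handling here are assumptions):

```cpp
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Sketch: on Windows, open with FILE_FLAG_SEQUENTIAL_SCAN so the OS
// read-ahead cache prefetches aggressively for a front-to-back read.
// Elsewhere, return an empty vector to signal the caller to fall back
// to the standard file-reading path.
std::vector<char> readFileOptimized(const std::string& path)
{
#ifdef _WIN32
    HANDLE h = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE) return {};

    LARGE_INTEGER size;
    if (!GetFileSizeEx(h, &size)) { CloseHandle(h); return {}; }

    std::vector<char> data(static_cast<size_t>(size.QuadPart));
    DWORD read = 0;
    if (!data.empty() &&
        !ReadFile(h, data.data(), static_cast<DWORD>(data.size()), &read, nullptr))
        data.clear();
    CloseHandle(h);
    return data;
#else
    (void)path;
    return {}; // POSIX: empty result means "use the regular reader"
#endif
}
```

The single ReadFile call keeps the sketch short; files over 4 GB would need a read loop.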

Replace the manual scanning loop with memchr(), which is typically optimized
with SIMD instructions for faster scanning.
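A minimal sketch of what that looks like (the name findLineEnd matches the change described for stringutil.hpp; this exact signature is an assumption):

```cpp
#include <cstring>

// memchr scans for '\n' using a vectorized byte search in most C libraries,
// which beats a hand-written char-by-char loop. Returns the index of the
// line end, or `size` if the buffer contains no newline.
size_t findLineEnd(const char* data, size_t size)
{
    const void* p = memchr(data, '\n', size);
    return p ? static_cast<size_t>(static_cast<const char*>(p) - data) : size;
}
```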

Only one waiting producer needs to be woken when space becomes available,
reducing unnecessary thread wake-ups.
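A stripped-down bounded queue illustrating the idea (qgrep's blockingqueue.hpp differs in detail; this is a sketch, and notify_one is only strictly sufficient when every item occupies one slot):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal bounded queue: pop() wakes at most one blocked producer, since a
// single freed slot can admit only one pending push.
template <typename T>
class BlockingQueue
{
public:
    explicit BlockingQueue(size_t limit): limit(limit) {}

    void push(T value)
    {
        std::unique_lock<std::mutex> lock(mutex);
        notFull.wait(lock, [&]{ return items.size() < limit; });
        items.push_back(std::move(value));
        notEmpty.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(mutex);
        notEmpty.wait(lock, [&]{ return !items.empty(); });
        T value = std::move(items.front());
        items.pop_front();
        notFull.notify_one(); // one slot freed -> wake one producer
        return value;
    }

private:
    std::mutex mutex;
    std::condition_variable notFull, notEmpty;
    std::deque<T> items;
    size_t limit;
};
```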

Replace std::map with std::unordered_map for O(1) average lookup
instead of O(log n) when caching compiled regex patterns.
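A hypothetical cache of this shape, keyed by pattern string (the actual cache in project.cpp may differ):

```cpp
#include <regex>
#include <string>
#include <unordered_map>

// Illustrative regex cache: std::unordered_map gives O(1) average lookup
// per query versus std::map's O(log n). Compiling a regex is expensive,
// so repeated patterns are compiled once and reused.
class RegexCache
{
public:
    const std::regex& get(const std::string& pattern)
    {
        auto it = cache.find(pattern);
        if (it == cache.end())
            it = cache.emplace(pattern, std::regex(pattern)).first;
        return it->second;
    }

    size_t size() const { return cache.size(); }

private:
    std::unordered_map<std::string, std::regex> cache;
};
```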

- Pre-allocate the file buffer using a size hint to avoid O(n²) reallocation
- Use readFileOptimized() with FILE_FLAG_SEQUENTIAL_SCAN on Windows
- Fall back to the standard FileStream for non-Windows or special files
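Illustrating the size-hint idea with a hypothetical helper (names are not from the PR):

```cpp
#include <string>

// Pre-allocating with a size hint avoids repeated grow-and-copy cycles as
// the buffer fills chunk by chunk; without reserve(), each growth step may
// reallocate and copy everything read so far.
std::string readAllChunks(const char* const* chunks, size_t count, size_t sizeHint)
{
    std::string buffer;
    buffer.reserve(sizeHint); // single up-front allocation from the hint
    for (size_t i = 0; i < count; ++i)
        buffer += chunks[i];
    return buffer;
}
```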

Use multiple reader threads to overlap file I/O with processing during
index building. Reader threads read ahead while the main thread consumes
files in order, improving throughput on systems with high I/O latency.

The number of reader threads scales with available CPU cores, and a
sliding window prevents readers from getting too far ahead.
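A simplified sketch of the sliding-window read-ahead, using std::async in place of the PR's dedicated reader threads (all names here are illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <future>
#include <string>
#include <vector>

// At most `window` reads are in flight ahead of the consumer, which still
// receives file contents strictly in path order. `loadFile` stands in for
// the real per-file read; `consume` for index-building work.
template <typename LoadFn, typename ConsumeFn>
void processFiles(const std::vector<std::string>& paths, size_t window,
                  LoadFn loadFile, ConsumeFn consume)
{
    std::deque<std::future<std::string>> inflight;
    size_t next = 0;

    while (next < paths.size() || !inflight.empty())
    {
        // Fill the window: launch reads ahead of the consumer.
        while (next < paths.size() && inflight.size() < window)
            inflight.push_back(
                std::async(std::launch::async, loadFile, paths[next++]));

        // Consume the oldest in-flight read, preserving file order.
        consume(inflight.front().get());
        inflight.pop_front();
    }
}
```

The window bound is what keeps readers from racing arbitrarily far ahead of the consumer and holding too many file contents in memory at once.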

wojtek commented Feb 3, 2026

COLD BUILD:
BASELINE: 7.96s, 4.45s, 4.48s → avg 4.47s (excl. cold cache)
IMPROVED: 2.68s, 2.66s, 2.83s → avg 2.72s
SPEEDUP: 1.64x (39% faster)

INCREMENTAL:
BASELINE: 1.14s, 1.16s, 1.16s → avg 1.15s
IMPROVED: 1.14s, 1.15s, 1.13s → avg 1.14s
SPEEDUP: ~1.0x (no change)

Tested on the following config:

qgrep config for Unreal Engine 5.6.1

path E:/UnrealEngine-5.6.1-release/Engine

include \.(ini)$
include \.(cpp|c|h|hpp|cc|inl)$
include \.(ispc|isph)$
include \.(cs|vb)$
include \.(cmake)$
include \.(java|js|kt|kts|ts|tsx)$
include \.(md|rst|txt)$
include \.(pl|py|pm|rb)$
include \.(rs)$
include \.(usf|ush|hlsl|glsl|cg|fx|cgfx)$
include \.(xml|yml|yaml)$
include \.(uplugin|uproject)$
include \.(sh|bat)$
exclude ^DerivedDataCache/
exclude ^Intermediate/

186,283 files, 2GB input, 456MB index


zeux commented Feb 26, 2026

Is the main improvement due to read-ahead? The other changes (modulo some issues) would be easy to merge but that part is more unwieldy. So I'm wondering which changes are responsible for which speedups.

notify_all -> notify_one change is not always efficient. If queue reaches its maximum size and multiple threads wait on it, removal of a large item should wake all waiting producers, as otherwise the parallelism might be limited.

Changing the order of normalizeEOL & convertUTF8 is incorrect and would break UTF16 files as far as I can tell.

Some of the changes like map -> unordered_map, findLineEnd and sizeHint are no-brainers and could easily be merged if submitted separately.
