Add program caches (in-memory, sqlite, filestream)#1912
cpcloud wants to merge 1 commit into NVIDIA:main
Conversation
Generated with the help of Cursor GPT-5.4 Extra High Fast High:
Thanks, Phillip! I have this PR in my review backlog 🙏 The most important question: are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress-test. In CuPy, our on-disk cache has been stress-tested on DOE supercomputers.
Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note). High — source-directory include. Medium — over-eviction race. Low — star-import. Added a note in the module docstring.
@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes: `InMemoryProgramCache` and `SQLiteProgramCache` are thread-safe within one process (an `RLock` serialises every method), and `FileStreamProgramCache` is safe across concurrent processes (atomic `os.replace` commits, stat-guarded prunes).

Cross-process coverage is in the `FileStreamProgramCache` stress tests. One concurrency bug this review shook out (over-eviction after a suppressed `FileNotFoundError`) is fixed and has a dedicated regression test.
Convert cuda.core.utils to a package and add ObjectCode caches for
artifacts produced by Program.compile.
Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager. Path-backed ObjectCode is rejected at write
time (would store only the path, not the bytes).
* InMemoryProgramCache -- in-process OrderedDict backend that
stores entries by reference (no pickling). Optional max_entries
and max_size_bytes caps with LRU eviction. __getitem__ promotes
LRU; __contains__ is read-only. threading.RLock serialises every
method.
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction and an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the cap bounds real on-disk usage. __contains__ is
read-only. __len__ prunes corrupt rows. Schema-mismatch on open
drops tables and rebuilds; corrupt / non-SQLite files reinitialise
empty; transient OperationalError propagates without nuking the
file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. blake2b(32)
hashed filenames so arbitrary-length keys never overflow
filesystem limits. Reader pruning, clear(), and _enforce_size_cap
are all stat-guarded (inode/size/mtime snapshot; refuse unlink on
mismatch) so a concurrent writer's os.replace is preserved.
_enforce_size_cap also decrements its running ``total`` when a
concurrent deleter wins the unlink race, so a suppressed
FileNotFoundError cannot over-evict newly committed entries.
Stale temp files swept on open; live temps count toward the size
cap. Windows ERROR_SHARING_VIOLATION (32) and ERROR_LOCK_VIOLATION
(33) on os.replace are retried with bounded backoff (~185ms)
before being treated as a non-fatal cache miss; other
PermissionError and all POSIX failures propagate.
* make_program_cache_key -- stable 32-byte blake2b digest over code,
code_type, ProgramOptions, target_type, name expressions, and
environment probes: cuda-core version, NVRTC version, NVVM lib+IR
version, linker backend+version for PTX inputs (driver version
only on the cuLink path). Backend-specific gates mirror
Program/Linker:
  - code_type lower-cased to match Program.__init__.
- code_type/target_type validated against Program's
SUPPORTED_TARGETS matrix.
- NVRTC side-effect options (create_pch, time,
fdevice_time_trace) and external-content options
(include_path, pre_include, pch, use_pch, pch_dir) require
an extra_digest. NVVM use_libdevice=True likewise. NVRTC
options.name with a directory component (e.g. '/abs/k.cu')
also requires extra_digest (or no_source_include=True) because
NVRTC searches that directory for #include "..." lookups;
bare labels fall back to CWD and stay accepted.
- extra_sources rejected for non-NVVM; bytes-like ``code``
rejected for non-NVVM.
- PTX (Linker) options pass through per-field gates that match
_prepare_nvjitlink_options / _prepare_driver_options;
ptxas_options canonicalised across str/list/tuple/empty
shapes; driver-linker hard rejections (time, ptxas_options,
split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma
collapse under the driver linker.
  - name_expressions gated on backend == "nvrtc".
- Failed environment probes mix the exception class name into a
*_probe_failed label so broken environments never collide
with working ones while staying stable across processes and
repeated calls.
Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
not pull in any cache backend. The cache classes and
make_program_cache_key are exposed via module __getattr__.
_LAZY_CACHE_ATTRS is a single ordered tuple spliced into __all__ via
``*_LAZY_CACHE_ATTRS`` so the two lists cannot drift; star-import
still walks __all__ and therefore resolves every lazy attribute,
which is expected given star-imports are discouraged anyway.
sqlite3 is imported lazily inside SQLiteProgramCache.__init__ so the
package is usable on interpreters built without libsqlite3.
Tests: ~200 cache tests covering single-process CRUD for all three
backends; LRU/size-cap (logical and on-disk, including stat-guarded
race scenarios); over-eviction race (monkeypatched Path.unlink);
InMemory combined caps, overwrite-updates-size, LRU-touch-on-read,
contains-does-not-bump, degenerate caps (single entry > cap,
max_entries=0); NVRTC source-directory path-name guard with
POSIX/Windows separators and both accept paths; corruption +
__len__ pruning; schema-mismatch table-DROP; threaded SQLite and
InMemory (4 writers + 4 readers x 200 ops); cross-process
FileStream stress (writer/reader race exercising the stat-guard
prune; clear/eviction race injection via generator cleanup);
Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow
+ retry, others propagate; partial-conn close on OperationalError);
lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity
test that parses _program.pyx via tokenize + ast.literal_eval; and
end-to-end real CUDA C++ compile -> store -> reopen -> get_kernel
roundtrip parametrized over the two persistent backends.
Closes NVIDIA#177
Closes NVIDIA#178
Closes NVIDIA#179
## Summary

- Convert `cuda.core.utils` from a module to a package; expose cache APIs lazily via `__getattr__` so `from cuda.core.utils import StridedMemoryView` stays lightweight. `_LAZY_CACHE_ATTRS` is a single ordered tuple spliced into `__all__` via `*_LAZY_CACHE_ATTRS`, and the module docstring notes that the laziness guarantee is for explicit imports only (star-import walks `__all__` and therefore resolves every lazy attribute).
- `ProgramCacheResource` ABC with `bytes | str` keys, context manager, pickle-safety warning, and rejection of path-backed `ObjectCode` at write time.
- `make_program_cache_key()` — blake2b(32) digest with backend-specific gates that mirror `Program`/`Linker`:
  - `code_type`/`target_type` validated against `Program.compile`'s `SUPPORTED_TARGETS`; rejects bytes-like `code` for non-NVVM and `extra_sources` for non-NVVM.
  - NVRTC side-effect (`create_pch`, `time`, `fdevice_time_trace`) and external-content (`include_path`, `pre_include`, `pch`, `use_pch`, `pch_dir`) options require `extra_digest`; NVVM `use_libdevice=True` likewise.
  - `options.name` with a directory component (e.g. `/path/to/kernel.cu`) also requires `extra_digest` because NVRTC searches that directory for `#include "..."` lookups; bare labels (`"default_program"`, `"kernel-a"`) fall back to CWD and stay accepted. `no_source_include=True` disables the search and the guard.
  - PTX (Linker) options pass through per-field gates matching `_prepare_nvjitlink_options`/`_prepare_driver_options`; `ptxas_options` canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (`time`, `ptxas_options`, `split_compile`) raise at key time; `ftz`/`prec_div`/`prec_sqrt`/`fma` collapse under the driver linker.
  - Failed environment probes mix the exception class name into a `*_probe_failed` label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
- Three backends — `InMemoryProgramCache`, `SQLiteProgramCache`, `FileStreamProgramCache` — all of which implement `ProgramCacheResource`. See Backends below for design, benefits, and tradeoffs of each.
- `Program.compile(cache=...)` integration is out of scope (tracked by #176).

## Backends
All three implement `ProgramCacheResource` and share the key schema. The two persistent backends pickle `ObjectCode` at `pickle.HIGHEST_PROTOCOL`; the in-memory backend stores it by reference. They differ in storage, concurrency model, and eviction policy:

| Backend | Storage | Concurrency | Eviction |
| --- | --- | --- | --- |
| `InMemoryProgramCache` | `OrderedDict` (no pickling) | single process, thread-safe (`RLock`) | read-aware LRU; `max_entries` + `max_size_bytes` |
| `SQLiteProgramCache` | single SQLite file (WAL mode) | thread-safe (`RLock`); multi-process possible but not the recommended shape | read-aware LRU (`accessed_at` updated on reads); hard `max_size_bytes` at quiescent points |
| `FileStreamProgramCache` | directory of entry files; atomic `os.replace`; stat-guarded prunes | multi-process safe | write-order by `mtime` (oldest written); soft `max_size_bytes` |

### `InMemoryProgramCache`

#### Design
- A `collections.OrderedDict` maps key-digest → `(ObjectCode, size)`. Insertion order encodes LRU: oldest at the front, newest at the back. Values are stored by reference (no pickle round-trip), which is why lookups are the fastest of the three.
- `__getitem__` moves the entry to the back to promote it. `__contains__` is read-only, so a membership probe doesn't shift LRU order.
- `__setitem__` updates the entry and then calls `_evict_to_caps()`, which pops from the front until both optional caps (`max_entries`, `max_size_bytes`) are satisfied.
- A `threading.RLock` serialises every method, so a reader's LRU bump and a writer's eviction can't interleave.

#### Benefits

- Fastest lookups of the three: values are returned by reference with no pickle round-trip.
#### Tradeoffs

- Values are shared by reference, so mutating a returned `ObjectCode` mutates the cached entry.

Use when artifacts only need to live for the lifetime of the process.
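The LRU mechanics described above can be sketched in a few lines; this is a minimal stand-in (plain bytes values instead of `ObjectCode`), not the actual implementation:

```python
from collections import OrderedDict
from threading import RLock


class LRUSketch:
    """Sketch of the OrderedDict-based LRU design, not the real class."""

    def __init__(self, max_entries=None, max_size_bytes=None):
        self._d = OrderedDict()   # oldest at the front, newest at the back
        self._total = 0
        self._max_entries = max_entries
        self._max_size_bytes = max_size_bytes
        self._lock = RLock()      # serialises every method

    def __getitem__(self, key):
        with self._lock:
            value, _size = self._d[key]
            self._d.move_to_end(key)  # promote: a read bumps LRU order
            return value

    def __contains__(self, key):
        with self._lock:
            return key in self._d     # read-only: no LRU bump

    def __setitem__(self, key, value):
        with self._lock:
            if key in self._d:
                self._total -= self._d[key][1]
            self._d[key] = (value, len(value))
            self._d.move_to_end(key)
            self._total += len(value)
            self._evict_to_caps()

    def _evict_to_caps(self):
        # Pop from the front (least recently used) until both caps hold.
        while self._d and (
            (self._max_entries is not None and len(self._d) > self._max_entries)
            or (self._max_size_bytes is not None and self._total > self._max_size_bytes)
        ):
            _, (_, size) = self._d.popitem(last=False)
            self._total -= size
```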
### `SQLiteProgramCache`

#### Design
- An `entries` table: blake2b key-digest PK (BLOB), pickled `ObjectCode` payload (BLOB), `size_bytes`, `created_at`, `accessed_at` (REAL), with an index on `accessed_at` for LRU scans. A `schema_meta` table records `_SQLITE_SCHEMA_VERSION`.
- Reads update `accessed_at`, so eviction always removes the genuinely least-recently-used row.
- When `max_size_bytes` is set, rows are deleted from the head of `ORDER BY accessed_at ASC` until the running sum is under the cap, then `wal_checkpoint(TRUNCATE)` + `VACUUM` run to reclaim disk.
- A `threading.RLock` serialises connection use; `check_same_thread=False` lets one cache object move between threads.
- `DatabaseError` (corruption-shaped) wipes the DB plus its `-wal`/`-shm` companions and reinitialises empty; `OperationalError` (lock/busy) propagates without nuking the file and closes any partial connection.

#### Benefits

- A single file on disk that persists across runs; `wal_checkpoint(TRUNCATE)` + `VACUUM` bounds real on-disk size after evictions.

#### Tradeoffs
- `VACUUM`/`wal_checkpoint(TRUNCATE)` are skipped while any reader or writer is active, so on-disk size drifts above `max_size_bytes` until activity settles. For strict on-disk bounds under concurrent load, `FileStreamProgramCache` is the right backend.
- Every read and write pays a pickle round-trip (unlike `InMemoryProgramCache`).

Use when you want single-process persistent caching under a hard size cap where eviction should reflect actual access frequency rather than write order. The unique win over `FileStreamProgramCache` is read-aware LRU.

### `FileStreamProgramCache`

#### Design
- Each entry lives at `<root>/entries/<blake2b-digest>`, holding a pickled `(schema, stored_key, payload, created_at)` record where `payload` is the pickled `ObjectCode`. A sibling `SCHEMA_VERSION` file records `_FILESTREAM_SCHEMA_VERSION`; a mismatch wipes incompatible entries on open.
- Reads verify `stored_key` against the requested key, so a hash collision surfaces as a key mismatch, not silent corruption.
- Writes land at `<root>/tmp/<uuid>`, `fsync`, then `os.replace` into place. Readers never observe a partial entry. On Windows, `os.replace` retries with bounded backoff (~185 ms) on `ERROR_SHARING_VIOLATION`/`ERROR_LOCK_VIOLATION` before dropping to a non-fatal cache miss.
- `_enforce_size_cap()` lists entries with a stat snapshot, sorts by `mtime`, and unlinks oldest-first. Each unlink is stat-guarded: `_prune_if_stat_unchanged()` compares `(ino, size, mtime_ns)` against the snapshot and refuses if they differ, so a fresh entry a peer just committed via `os.replace` survives eviction. The running `total` decrements whenever a peer wins the unlink race, so over-eviction after a suppressed `FileNotFoundError` can't cascade.

#### Benefits

- Safe across concurrent processes: atomic `os.replace` commits and stat-guarded prunes let several writers share one cache directory.

#### Tradeoffs

- Eviction is by `mtime`, so under heavy read reuse a hot entry can be dropped because it was written earliest.
- The `max_size_bytes` cap is soft; concurrent writers may briefly exceed it.
- Entry files are `fsync`-ed only; the containing directory is not, so a host crash between the write and the next directory commit may lose recently added entries. Surviving entries remain consistent.

Use when multiple processes may hit the cache: parallel build workers, pytest-xdist, distributed training launchers, or any setup with several writers against one cache.
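The atomic-commit plus stat-guarded-prune pattern can be sketched as follows; a minimal stand-in over a throwaway directory, omitting the real backend's hashing, pickling, schema file, and Windows retry logic:

```python
import os
import tempfile
import uuid

# Throwaway layout mirroring <root>/entries and <root>/tmp.
root = tempfile.mkdtemp()
entries = os.path.join(root, "entries")
tmpdir = os.path.join(root, "tmp")
os.makedirs(entries)
os.makedirs(tmpdir)


def write_entry(digest, payload):
    # Write to a private temp file, fsync it, then atomically publish it.
    tmp = os.path.join(tmpdir, uuid.uuid4().hex)
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())          # entry bytes are durable...
    os.replace(tmp, os.path.join(entries, digest))  # ...then atomically visible


def prune_if_stat_unchanged(path, snapshot):
    # Refuse to unlink if a peer's os.replace swapped the file since the
    # snapshot: (inode, size, mtime_ns) would no longer match.
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False                  # a concurrent deleter already won
    if (st.st_ino, st.st_size, st.st_mtime_ns) != snapshot:
        return False
    os.unlink(path)
    return True
```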
## Examples

`Program.compile(cache=...)` integration is out of scope (tracked by #176), so the current pattern is explicit key derivation plus `cache.get` / `cache[key] = ...`. The loop is identical for all three backends — `ProgramCacheResource` is the only interface the caller sees. The differences between the backends are in how each is constructed and what guarantees it offers.
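A sketch of that loop. `derive_key` and `compile_program` below are self-contained stand-ins for `make_program_cache_key` and `Program(...).compile(...)`, and a plain dict stands in for any of the three backends:

```python
import hashlib

# Stand-in for make_program_cache_key: a stable blake2b(32) digest over the
# inputs that determine the compile result.
def derive_key(code, code_type, arch):
    h = hashlib.blake2b(digest_size=32)
    for part in (code, code_type, arch):
        h.update(part.encode())
        h.update(b"\x00")  # separator so field boundaries can't collide
    return h.digest()


compile_calls = []


def compile_program(code):
    # Stand-in for the expensive Program(...).compile(...) call.
    compile_calls.append(code)
    return f"object-code({code})"


cache = {}  # any ProgramCacheResource-shaped mapping works here


def get_or_compile(code):
    key = derive_key(code, "c++", "sm_90")
    obj = cache.get(key)
    if obj is None:            # miss: compile once, then store
        obj = compile_program(code)
        cache[key] = obj
    return obj
```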
### In-process hot loop — `InMemoryProgramCache`

Notebook or REPL compiling many kernel variants (parameter sweeps, autotuning). Fastest, lives for the process.
### Per-user persistent cache — `SQLiteProgramCache`

Single-user CLI tool or long-running service on one machine. One file on disk, reopen across runs, read-aware LRU so hot entries survive eviction.
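The read-aware LRU that keeps hot entries alive can be sketched with stdlib `sqlite3`; a simplified stand-in (raw bytes payloads, no `schema_meta`/WAL/`VACUUM` handling), not the backend's actual schema:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (key BLOB PRIMARY KEY, payload BLOB,"
    " size_bytes INTEGER, created_at REAL, accessed_at REAL)"
)
conn.execute("CREATE INDEX idx_accessed ON entries(accessed_at)")


def put(key, payload):
    now = time.time()
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?, ?)",
        (key, payload, len(payload), now, now),
    )


def get(key):
    row = conn.execute(
        "SELECT payload FROM entries WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        # Reads touch accessed_at, so eviction sees true recency.
        conn.execute(
            "UPDATE entries SET accessed_at = ? WHERE key = ?",
            (time.time(), key),
        )
    return None if row is None else row[0]


def evict_to_cap(max_size_bytes):
    # Walk the LRU head (oldest accessed_at first) until the total fits.
    total = conn.execute(
        "SELECT COALESCE(SUM(size_bytes), 0) FROM entries"
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT key, size_bytes FROM entries ORDER BY accessed_at ASC"
    ).fetchall()
    for key, size in rows:
        if total <= max_size_bytes:
            break
        conn.execute("DELETE FROM entries WHERE key = ?", (key,))
        total -= size
```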
### Parallel workers — `FileStreamProgramCache`

pytest-xdist, CI matrix, or any multi-process build system. Every worker opens the same directory; atomic `os.replace` commits keep concurrent writers safe.

### Read-aware vs write-order LRU
The two persistent backends diverge when `max_size_bytes` is tight and one entry is being re-read while others are being written: `SQLiteProgramCache` touches `accessed_at` on every read, so the hot entry stays at the tail of the eviction order, while `FileStreamProgramCache` evicts by `mtime`, so the hot entry goes first if it was written earliest.

For read-heavy single-process workloads, `SQLiteProgramCache` keeps the hot entry alive. For multi-process workloads, the lack of cross-process LRU coordination is what makes `FileStreamProgramCache` safe under concurrent writers — the tradeoff usually goes that way.

## Test plan
~200 cache tests total, grouped as:

- `InMemoryProgramCache`: combined caps, overwrite-updates-size, LRU-touch-on-read, contains-does-not-bump-LRU, degenerate caps (single entry > cap, `max_entries=0`).
- Corruption + `__len__` pruning of bad rows/files; schema-mismatch table-DROP on `SQLiteProgramCache` open.
- Threaded `SQLiteProgramCache` stress (4 writers + 4 readers × 200 ops); threaded `InMemoryProgramCache` stress.
- Cross-process `FileStreamProgramCache` stress: writer/reader race exercising the stat-guard prune; `clear()` / eviction race injection via generator cleanup.
- Over-eviction race: a monkeypatched `Path.unlink` simulates a concurrent deleter winning exactly once; asserts the fresh entry survives.
- Windows vs POSIX `PermissionError` narrowing: winerror 32/33 swallow + retry, all other codes propagate; partial-connection close on `OperationalError`.
- Lazy-import subprocess test: `from cuda.core.utils import StridedMemoryView` doesn't pull in the cache modules.
- `_SUPPORTED_TARGETS_BY_CODE_TYPE` parity test parses `_program.pyx` via `tokenize` + `ast.literal_eval` to keep the cache-key validator in sync with `Program.compile`'s supported-target map.
- End-to-end: real CUDA C++ compile → store → reopen → `get_kernel` on the deserialised `ObjectCode`, parametrized over the two persistent backends.

Closes #177
Closes #178
Closes #179