
Add program caches (in-memory, sqlite, filestream)#1912

Open
cpcloud wants to merge 1 commit into NVIDIA:main from cpcloud:persistent-program-cache-178

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

  • Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so from cuda.core.utils import StridedMemoryView stays lightweight. _LAZY_CACHE_ATTRS is a single ordered tuple spliced into __all__ via *_LAZY_CACHE_ATTRS, and the module docstring notes that the laziness guarantee is for explicit imports only (star-import walks __all__ and therefore resolves every lazy attribute).
  • Add ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
  • Add make_program_cache_key() — blake2b(32) digest with backend-specific gates that mirror Program/Linker:
    • Versions: cuda-core, NVRTC (c++), libNVVM lib+IR (nvvm), linker backend+version (ptx); driver only on the cuLink path.
    • Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like code for non-NVVM and extra_sources for non-NVVM.
    • NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
    • NVRTC options.name with a directory component (e.g. /path/to/kernel.cu) also requires extra_digest because NVRTC searches that directory for #include "..." lookups; bare labels ("default_program", "kernel-a") fall back to CWD and stay accepted. no_source_include=True disables the search and the guard.
    • PTX (Linker) options pass through per-field gates that match _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under driver linker.
    • Failed env probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
  • Add three concrete backends — InMemoryProgramCache, SQLiteProgramCache, FileStreamProgramCache — all of which implement ProgramCacheResource. See Backends below for design, benefits, and tradeoffs of each.

Program.compile(cache=...) integration is out of scope (tracked by #176).

Backends

All three implement ProgramCacheResource and share the key schema. The two persistent backends pickle ObjectCode at pickle.HIGHEST_PROTOCOL; the in-memory backend stores it by reference. They differ in storage, concurrency model, and eviction policy.

| Backend | Storage | Concurrency | Eviction |
| --- | --- | --- | --- |
| InMemoryProgramCache | in-process OrderedDict (no pickling) | threads (RLock) | true LRU; optional max_entries + max_size_bytes |
| SQLiteProgramCache | one sqlite3 file (WAL) | threads (RLock); multi-process possible but not the recommended shape | true per-read LRU (accessed_at updated on reads); hard max_size_bytes at quiescent points |
| FileStreamProgramCache | directory of atomic files | multi-process via temp + os.replace; stat-guarded prunes | oldest mtime (oldest written); soft max_size_bytes |

InMemoryProgramCache

Design

  • Storage. collections.OrderedDict mapping key-digest → (ObjectCode, size). Insertion order encodes LRU — oldest at the front, newest at the back. Values are stored by reference (no pickle round-trip), which is why lookups are the fastest of the three.
  • Reads. __getitem__ moves the entry to the back to promote it. __contains__ is read-only, so a membership probe doesn't shift LRU order.
  • Writes. __setitem__ updates the entry and then calls _evict_to_caps(), which pops from the front until both optional caps (max_entries, max_size_bytes) are satisfied.
  • Concurrency. A threading.RLock serialises every method, so a reader's LRU bump and a writer's eviction can't interleave.
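The four design points above fit in a short sketch. This is a condensed illustration of the mechanics (class name and value shape are stand-ins), not the PR's actual `InMemoryProgramCache`:

```python
import threading
from collections import OrderedDict


class LRUCacheSketch:
    """Illustrative LRU cache: OrderedDict front = oldest, back = newest."""

    def __init__(self, max_entries=None, max_size_bytes=None):
        self._data = OrderedDict()  # digest -> (value, size)
        self._lock = threading.RLock()
        self._max_entries = max_entries
        self._max_size_bytes = max_size_bytes

    def __getitem__(self, key):
        with self._lock:
            value, _ = self._data[key]
            self._data.move_to_end(key)  # reads promote to MRU
            return value

    def __contains__(self, key):
        with self._lock:
            return key in self._data  # read-only: no LRU bump

    def __setitem__(self, key, value_and_size):
        with self._lock:
            self._data[key] = value_and_size
            self._data.move_to_end(key)
            self._evict_to_caps()

    def _evict_to_caps(self):
        total = sum(size for _, size in self._data.values())
        while self._data and (
            (self._max_entries is not None and len(self._data) > self._max_entries)
            or (self._max_size_bytes is not None and total > self._max_size_bytes)
        ):
            _, (_, size) = self._data.popitem(last=False)  # drop LRU (front)
            total -= size
```

Because every method takes the same RLock, a reader's `move_to_end` cannot interleave with a writer's eviction pass.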

Benefits

  • No pickle overhead; strict per-read LRU; both entry-count and byte caps.
  • Simple: no disk, no schema, no cross-process concerns.

Tradeoffs

  • Process-local; the cache dies with the process.
  • Entries are shared by reference, so mutating a retrieved ObjectCode mutates the cached entry.

Use when artifacts only need to live for the lifetime of the process.

SQLiteProgramCache

Design

  • Storage. One sqlite3 file in WAL + autocommit mode. entries table: blake2b key-digest PK (BLOB), pickled ObjectCode payload (BLOB), size_bytes, created_at, accessed_at (REAL), with an index on accessed_at for LRU scans. schema_meta table records _SQLITE_SCHEMA_VERSION.
  • Reads. SELECT the payload, UPDATE accessed_at — so eviction always removes the genuinely least-recently-used row.
  • Writes. UPSERT. When max_size_bytes is set, delete from the head of ORDER BY accessed_at ASC until the running sum is under the cap, then run wal_checkpoint(TRUNCATE) + VACUUM to reclaim disk.
  • Concurrency. A threading.RLock serialises connection use; check_same_thread=False lets one cache move between threads.
  • Recovery. On open — schema-version mismatch drops the cache tables and rebuilds; DatabaseError (corruption-shaped) wipes the DB plus its -wal / -shm companions and reinitialises empty; OperationalError (lock/busy) propagates without nuking the file and closes any partial connection.
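The read-aware half of this design reduces to two statements: SELECT the payload, then UPDATE `accessed_at`. A minimal sketch, assuming an illustrative schema (the table and column names mirror the description above but are not the PR's exact DDL):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
conn.execute(
    """CREATE TABLE IF NOT EXISTS entries (
           key_digest  BLOB PRIMARY KEY,
           payload     BLOB NOT NULL,
           size_bytes  INTEGER NOT NULL,
           created_at  REAL NOT NULL,
           accessed_at REAL NOT NULL
       )"""
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_accessed ON entries (accessed_at)")


def put(digest: bytes, payload: bytes) -> None:
    now = time.time()
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(key_digest) DO UPDATE SET "
        "payload=excluded.payload, size_bytes=excluded.size_bytes, "
        "accessed_at=excluded.accessed_at",
        (digest, payload, len(payload), now, now),
    )


def get(digest: bytes):
    row = conn.execute(
        "SELECT payload FROM entries WHERE key_digest = ?", (digest,)
    ).fetchone()
    if row is None:
        return None
    # Bump accessed_at on every hit, so an eviction pass ordered by
    # accessed_at ASC removes the genuinely least-recently-used row.
    conn.execute(
        "UPDATE entries SET accessed_at = ? WHERE key_digest = ?",
        (time.time(), digest),
    )
    return row[0]
```

An eviction pass then deletes from the head of `ORDER BY accessed_at ASC` until the size cap is satisfied.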

Benefits

  • Persistent cache packaged as a single file — easy to ship, delete, or locate in a temp dir.
  • True per-read LRU: evictions drop the genuinely least-recently-used entry, not the oldest written.
  • wal_checkpoint(TRUNCATE) + VACUUM bounds real on-disk size after evictions.
  • Corruption degrades to an empty cache rather than breaking the caller.

Tradeoffs

  • WAL permits multi-process sharing, but VACUUM / wal_checkpoint(TRUNCATE) are skipped while any reader or writer is active, so on-disk size drifts above max_size_bytes until activity settles. For strict on-disk bounds under concurrent load, FileStreamProgramCache is the right backend.
  • Pickle serialisation on every write and read (unlike InMemoryProgramCache).

Use when you want single-process persistent caching under a hard size cap where eviction should reflect actual access frequency rather than write order. The unique win over FileStreamProgramCache is read-aware LRU.

FileStreamProgramCache

Design

  • Storage. One file per entry at <root>/entries/<blake2b-digest>, holding a pickled (schema, stored_key, payload, created_at) record where payload is the pickled ObjectCode. A sibling SCHEMA_VERSION file records _FILESTREAM_SCHEMA_VERSION; a mismatch wipes incompatible entries on open.
  • Reads. Load the record, re-verify stored_key against the requested key — so a hash collision surfaces as a key mismatch, not silent corruption.
  • Writes. Stage <root>/tmp/<uuid>, fsync, then os.replace into place. Readers never observe a partial entry. On Windows, os.replace retries with bounded backoff (~185 ms) on ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION before dropping to a non-fatal cache miss.
  • Eviction. _enforce_size_cap() lists entries with a stat snapshot, sorts by mtime, and unlinks oldest-first. Each unlink is stat-guarded — _prune_if_stat_unchanged() compares (ino, size, mtime_ns) against the snapshot and refuses if they differ, so a fresh entry a peer just committed via os.replace survives eviction. The running total decrements whenever a peer wins the unlink race, so over-eviction after a suppressed FileNotFoundError can't cascade.
  • Recovery. Orphan temp files from crashed writers are swept on open, age-based so in-flight writes from other processes are preserved.
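The commit path in the Writes bullet is the standard stage-fsync-replace idiom. A self-contained sketch under assumed directory names (the `entries/` and `tmp/` layout mirrors the description; the function name is illustrative):

```python
import os
import tempfile
import uuid

root = tempfile.mkdtemp()
entries_dir = os.path.join(root, "entries")
tmp_dir = os.path.join(root, "tmp")
os.makedirs(entries_dir, exist_ok=True)
os.makedirs(tmp_dir, exist_ok=True)


def commit_entry(digest_hex: str, record: bytes) -> str:
    # Stage under a unique temp name so concurrent writers never collide.
    staged = os.path.join(tmp_dir, uuid.uuid4().hex)
    with open(staged, "wb") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())  # data durable before the rename
    final = os.path.join(entries_dir, digest_hex)
    # Atomic on POSIX and Windows: readers see the old entry or the new
    # one, never a partially written file.
    os.replace(staged, final)
    return final
```

Note that, as the Tradeoffs section says, only the file is fsync-ed here; the containing directory is not.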

Benefits

  • Multi-process safe without OS-level locks: atomic writes, stat-guarded prunes, bounded Windows retry, and a running total that survives lost unlink races.
  • Hash-named files so arbitrarily-long keys fit within per-component filename limits.
  • Corruption on one entry doesn't corrupt the whole cache — reads prune the bad file and report a miss.

Tradeoffs

  • No cross-process access tracking — eviction picks by mtime, so under heavy read reuse a hot entry can be dropped because it was written earliest.
  • The max_size_bytes cap is soft; concurrent writers may briefly exceed it.
  • Directory of entries rather than a single file (more filesystem metadata overhead).
  • Per-entry fsync only; the containing directory is not fsync-ed, so a host crash between write and the next directory commit may lose recently added entries. Surviving entries remain consistent.

Use when multiple processes may hit the cache: parallel build workers, pytest-xdist, distributed training launchers, or any setup with several writers against one cache.

Examples

Program.compile(cache=...) integration is out of scope (tracked by #176), so the current pattern is explicit key derivation + cache.get / cache[key] = .... The loop below is identical for all three backends — ProgramCacheResource is the only interface the caller sees.

from cuda.core import Program, ProgramOptions
from cuda.core.utils import make_program_cache_key

code = 'extern "C" __global__ void my_kernel() {}'
options = ProgramOptions(arch="sm_90", name="my_kernel")
key = make_program_cache_key(
    code=code, code_type="c++", options=options, target_type="cubin",
)

compiled = cache.get(key)
if compiled is None:
    program = Program(code, "c++", options=options)
    try:
        compiled = program.compile("cubin")
    finally:
        program.close()
    cache[key] = compiled

The differences between the backends are in how each is constructed and what guarantees it offers.

In-process hot loop — InMemoryProgramCache

Notebook or REPL compiling many kernel variants (parameter sweeps, autotuning). Fastest, lives for the process.

from cuda.core.utils import InMemoryProgramCache

cache = InMemoryProgramCache(
    max_entries=128,
    max_size_bytes=64 * 1024 * 1024,
)
for arch in ("sm_80", "sm_90", "sm_100"):
    options = ProgramOptions(arch=arch, name="my_kernel")
    key = make_program_cache_key(code=code, code_type="c++", options=options, target_type="cubin")
    if (obj := cache.get(key)) is None:
        program = Program(code, "c++", options=options)
        try:
            obj = program.compile("cubin")
        finally:
            program.close()
        cache[key] = obj
    run(obj)

Per-user persistent cache — SQLiteProgramCache

Single-user CLI tool or long-running service on one machine. One file on disk, reopen across runs, read-aware LRU so hot entries survive eviction.

from pathlib import Path
from cuda.core.utils import SQLiteProgramCache

cache_path = Path.home() / ".cache/mytool/programs.sqlite"
with SQLiteProgramCache(cache_path, max_size_bytes=256 * 1024 * 1024) as cache:
    # Re-reading a previously-compiled entry updates its accessed_at, so it
    # survives the next eviction even though it was written earliest.
    compiled = cache.get(key) or compile_and_store(cache, key, code, options)

Parallel workers — FileStreamProgramCache

pytest-xdist, CI matrix, or any multi-process build system. Every worker opens the same directory; atomic os.replace commits keep concurrent writers safe.

# Each xdist worker:
from cuda.core.utils import FileStreamProgramCache

with FileStreamProgramCache("/var/tmp/ci/program-cache", max_size_bytes=1 << 30) as cache:
    compiled = cache.get(key) or compile_and_store(cache, key, code, options)

Read-aware vs write-order LRU

The two persistent backends diverge when max_size_bytes is tight and one entry is being re-read while others are being written:

# SQLiteProgramCache — reads update accessed_at
cache[k_a] = obj_a          # writes a
cache[k_b] = obj_b          # writes b
_ = cache[k_a]              # promotes a to most-recently-used
cache[k_c] = obj_c          # over cap → evicts b (genuinely LRU)

# FileStreamProgramCache — eviction picks oldest mtime
cache[k_a] = obj_a          # writes a
cache[k_b] = obj_b          # writes b
_ = cache[k_a]              # mtime unchanged; read doesn't touch file metadata
cache[k_c] = obj_c          # over cap → evicts a (oldest written, regardless of reuse)

For read-heavy single-process workloads, SQLiteProgramCache keeps the hot entry alive. For multi-process workloads, forgoing cross-process LRU coordination is precisely what makes FileStreamProgramCache safe under concurrent writers, so the tradeoff is usually worth accepting.

Test plan

~200 cache tests total, grouped as:

  • CRUD, caps, and corruption
    • Single-process CRUD for all three backends
    • LRU and size-cap enforcement (logical totals and real on-disk bytes)
    • InMemoryProgramCache: combined caps, overwrite-updates-size, LRU-touch-on-read, contains-does-not-bump-LRU, degenerate caps (single entry > cap, max_entries=0)
    • Corruption handling with __len__ pruning of bad rows/files
    • Schema-mismatch table-DROP on SQLiteProgramCache open
  • Concurrency
    • Threaded SQLiteProgramCache stress (4 writers + 4 readers × 200 ops)
    • Threaded InMemoryProgramCache stress
    • Cross-process FileStreamProgramCache stress: writer/reader race exercising the stat-guard prune; clear() / eviction race injection via generator cleanup
    • Over-eviction race: monkeypatched Path.unlink simulates a concurrent deleter winning exactly once, asserts the fresh entry survives
  • Platform and edge cases
    • Windows vs POSIX PermissionError narrowing: winerror 32/33 swallow + retry, all other codes propagate; partial-connection close on OperationalError
    • NVRTC source-directory path-name guard: POSIX and Windows separators, both with accept-paths
    • Lazy-import subprocess test confirms from cuda.core.utils import StridedMemoryView doesn't pull in the cache modules
    • _SUPPORTED_TARGETS_BY_CODE_TYPE parity test parses _program.pyx via tokenize + ast.literal_eval to keep the cache-key validator in sync with Program.compile's supported-target map
  • End-to-end: real CUDA C++ compile → store in cache → reopen → get_kernel on the deserialised ObjectCode, parametrized over the two persistent backends
  • CI: clean across all platforms
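The parity test in the last "Platform and edge cases" bullet rests on extracting a literal from source text without importing it. The sketch below shows that half of the technique; note the real test tokenizes `_program.pyx` (Cython isn't always parseable as Python), whereas this demo uses `ast.parse` on a plain-Python snippet, and the source string and `SUPPORTED_TARGETS` name are stand-ins:

```python
import ast

# Stand-in source text; in the real test this would be read from _program.pyx.
source = '''
SUPPORTED_TARGETS = {
    "c++": ("ptx", "cubin", "ltoir"),
    "ptx": ("cubin",),
}
'''


def extract_literal(src: str, name: str):
    """Recover the literal assigned to `name` without importing the module."""
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == name for t in node.targets
        ):
            return ast.literal_eval(node.value)
    raise LookupError(name)
```

The test then compares the extracted map against the cache-key validator's copy, so the two cannot silently drift.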

Closes #177
Closes #178
Closes #179

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 on April 14, 2026 at 22:15
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c on April 19, 2026 at 13:28
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 3 times, most recently from 2dc5c8f to 5da111b on April 20, 2026 at 12:18
@cpcloud cpcloud requested review from leofang and rwgk April 20, 2026 13:21
@rwgk
Contributor

rwgk commented Apr 20, 2026

Generated with the help of Cursor GPT-5.4 Extra High Fast


High: make_program_cache_key() misses implicit source-directory header dependencies

make_program_cache_key() only forces extra_digest for explicit include/PCH options in cuda_core/cuda/core/utils/_program_cache.py:393 and cuda_core/cuda/core/utils/_program_cache.py:592, but NVRTC also implicitly searches the source file's directory unless no_source_include is set in cuda_core/cuda/core/_program.pyx:1001.

Program passes options.name straight to nvrtcCreateProgram() in cuda_core/cuda/core/_program.pyx:635, while the key builder only hashes that path string in cuda_core/cuda/core/utils/_program_cache.py:778. That means a workflow like options.name="/path/to/kernel.cu" plus #include "local.h" can reuse a stale cached ObjectCode after local.h changes.

The new tests cover explicit include/PCH knobs, but not this default source-directory include path (cuda_core/tests/test_program_cache.py:765).

Medium: FileStreamProgramCache._enforce_size_cap() can over-evict under concurrent capped writers

After the re-stat at cuda_core/cuda/core/utils/_program_cache.py:1515, a concurrent deleter can remove the candidate before path.unlink(). That FileNotFoundError is suppressed at cuda_core/cuda/core/utils/_program_cache.py:1530, but total is not adjusted, so eviction continues and can delete newer entries unnecessarily.

For a backend explicitly documented for multi-process use, that turns ordinary contention into avoidable cache data loss. The current multiprocess coverage exercises concurrent writes and prune races, but not max_size_bytes under concurrency (cuda_core/tests/test_program_cache_multiprocess.py).

Reduced simulation I used locally:

import time
from pathlib import Path

from cuda.core._module import ObjectCode
from cuda.core.utils import FileStreamProgramCache

cache = FileStreamProgramCache("/tmp/cuda_cache_review_race", max_size_bytes=1000)
cache[b"old"] = ObjectCode._init(b"a" * 600, "cubin", name="old")
time.sleep(0.01)

old_path = cache._path_for_key(b"old")
orig_unlink = Path.unlink
state = {"done": False}


def flaky_unlink(self, *args, **kwargs):
    if self == old_path and not state["done"]:
        state["done"] = True
        # Simulate another process deleting the file after stat() but before
        # _enforce_size_cap() updates its bookkeeping.
        orig_unlink(self, *args, **kwargs)
        raise FileNotFoundError(self)
    return orig_unlink(self, *args, **kwargs)


Path.unlink = flaky_unlink
try:
    cache[b"new"] = ObjectCode._init(b"b" * 600, "cubin", name="new")
finally:
    Path.unlink = orig_unlink

remaining = [key for key in (b"old", b"new") if cache.get(key) is not None]
print(remaining)  # []

That produced [] for me: once the first deletion race is swallowed without decrementing total, the loop keeps evicting and drops the fresh entry too.

Low: from cuda.core.utils import * now eagerly imports the cache stack

The package conversion keeps explicit imports like from cuda.core.utils import StridedMemoryView lightweight, but from cuda.core.utils import * walks __all__, resolves the lazy cache symbols, and imports _program_cache (cuda_core/cuda/core/utils/__init__.py:10, cuda_core/cuda/core/utils/__init__.py:32).

I verified that star-import now loads cuda.core.utils._program_cache. That said, this only affects import *, which is already discouraged. I think a short comment explaining that the laziness guarantee is intended for explicit imports, not star-import, seems sufficient here.

@leofang
Member

leofang commented Apr 22, 2026

Thanks, Phillip! I have this PR in my review backlog 🙏

The most important question: Are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress test. In CuPy, our on-disk cache has been stress-tested in DOE supercomputers.

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 3a32786 to cad93d0 on April 22, 2026 at 12:04
@cpcloud
Contributor Author

cpcloud commented Apr 22, 2026

Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note).

High -- source-directory include. make_program_cache_key() now refuses to build an NVRTC key when options.name contains a directory separator and neither extra_digest nor no_source_include=True is set. Scoping the guard to names that actually introduce a new search directory (/abs/kernel.cu, rel/kernel.cu, C:\src\kernel.cu) keeps bare labels like "default_program" or "kernel-a" -- which fall back to CWD, the same search root every NVRTC compile sees -- accepted unchanged. Tests cover POSIX and Windows separators, the extra_digest and no_source_include=True accept paths, and confirm the guard is NVRTC-only (PTX and NVVM unaffected).

Medium -- over-eviction race. FileStreamProgramCache._enforce_size_cap() now decrements total whether it unlinks the candidate itself or a concurrent pruner already removed the file. The FileNotFoundError is still suppressed, but the accounting now matches reality, so the loop stops as soon as the cap is met. Added a test that monkeypatches Path.unlink to simulate a concurrent deleter winning exactly once, then verifies the freshly-committed entry survives.

Low -- star-import. Added a note in cuda_core/cuda/core/utils/__init__.py that the laziness guarantee is for explicit imports only -- from cuda.core.utils import * walks __all__ and therefore resolves every lazy attribute. Star-imports are discouraged anyway, so treat that as expected.

@cpcloud
Contributor Author

cpcloud commented Apr 22, 2026

@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes:

InMemoryProgramCache -- thread-safe, not process-safe. Dict-backed (OrderedDict) cache that lives only in the owning process. threading.RLock serialises every method (__getitem__, __setitem__, __contains__, __delitem__, __len__, clear, and the internal eviction pass) so threads can share one cache object without external locking. It is not process-safe by design: each process has its own independent cache, there is no shared state or IPC; for multi-process sharing use SQLiteProgramCache or FileStreamProgramCache. Note that entries are stored by reference, not copied -- a thread that mutates a returned ObjectCode affects the cached entry, so callers must treat reads as read-only. Stressed by a threaded test with 4 writers + 4 readers x 200 ops against a cache with max_entries set, verifying the cache stays consistent and never exceeds the cap.

SQLiteProgramCache -- thread-safe; multi-process best-effort. check_same_thread=False on the connection plus a threading.RLock serialises every connection-touching method, so threads cannot interleave a read/update or a write/VACUUM pair. WAL + autocommit on open. Stressed by a 4 writers + 4 readers x 200 ops test in test_program_cache.py. Sharing the sqlite file across processes does work (sqlite3 WAL serialises writes at the file level and our size-cap/VACUUM passes run under the same WAL discipline), but the threading.RLock does not cross process boundaries and the VACUUM pass skips under active readers, so the on-disk file can temporarily grow above max_size_bytes until readers release. For workloads with many concurrent processes, FileStreamProgramCache is the better fit.

FileStreamProgramCache -- thread-safe and process-safe. Every write lands on a per-write temp file and is promoted via os.replace, so a reader/writer race either sees the old entry or the new one -- never a half-written file. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded: before unlinking, the code re-stats the candidate and refuses if (ino, size, mtime_ns) differs from the snapshot, so a concurrent writer's os.replace is preserved. Stale temp files are swept on open. On Windows, os.replace can surface ERROR_SHARING_VIOLATION (32) / ERROR_LOCK_VIOLATION (33) against a reader briefly holding the handle; the code retries with bounded backoff (~185ms total) before treating it as a non-fatal cache miss -- all other PermissionErrors and POSIX failures propagate.

Cross-process coverage in test_program_cache_multiprocess.py:

  • concurrent writers producing overlapping keys
  • a writer/reader race exercising the stat-guarded prune path
  • clear/eviction race injection via generator cleanup (the cleanup code after the last yield runs at StopIteration, which is exactly between _enforce_size_cap's scan and its eviction loop)
  • Windows PermissionError narrowing (winerror 32/33 swallow + retry, all others propagate)
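The generator-cleanup trick in the third bullet works because code after a generator's last yield runs only when the consumer exhausts it, i.e. exactly between a scan-then-act loop's scan and its action. A minimal illustration (the names and the scan/act phases below are made up for the demo, not the test suite's code):

```python
def listing_with_injection(snapshot, inject):
    """Yield a scan snapshot, then run `inject` at StopIteration."""
    def gen():
        yield from snapshot
        inject()  # fires after the scan completes, before the caller acts
    return gen()


log = []
entries = ["a", "b"]
it = listing_with_injection(entries, lambda: log.append("peer-clear"))
scanned = list(it)    # exhausting the generator triggers the injection
log.append("evict")   # the "act" phase now observes the injected state
```

Monkeypatching a directory-listing helper to return such a generator lets a test deterministically simulate a peer's clear() or eviction landing in that window.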

One concurrency bug this review shook out (over-eviction after a suppressed FileNotFoundError in _enforce_size_cap) is fixed with its own test. If you see a DOE-style pattern from CuPy's cache that we don't cover yet, happy to add a test that reproduces it -- mapping that stress-testing onto this backend would be useful.

@cpcloud cpcloud changed the title Add PersistentProgramCache (sqlite + filestream backends) Add program caches (in-memory, sqlite, filestream) Apr 22, 2026
Convert cuda.core.utils to a package and add ObjectCode caches for
artifacts produced by Program.compile.

Public API (cuda.core.utils):
  * ProgramCacheResource   -- abstract bytes|str -> ObjectCode mapping
    with context manager. Path-backed ObjectCode is rejected at write
    time (would store only the path, not the bytes).
  * InMemoryProgramCache   -- in-process OrderedDict backend that
    stores entries by reference (no pickling). Optional max_entries
    and max_size_bytes caps with LRU eviction. __getitem__ promotes
    LRU; __contains__ is read-only. threading.RLock serialises every
    method.
  * SQLiteProgramCache     -- single-file sqlite3 backend (WAL mode,
    autocommit) with LRU eviction and an optional size cap. A
    threading.RLock serialises connection use so one cache object is
    safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
    evictions so the cap bounds real on-disk usage. __contains__ is
    read-only. __len__ prunes corrupt rows. Schema-mismatch on open
    drops tables and rebuilds; corrupt / non-SQLite files reinitialise
    empty; transient OperationalError propagates without nuking the
    file (and closes the partial connection).
  * FileStreamProgramCache -- directory of atomically-written entries
    (tmp + os.replace) safe across concurrent processes. blake2b(32)
    hashed filenames so arbitrary-length keys never overflow
    filesystem limits. Reader pruning, clear(), and _enforce_size_cap
    are all stat-guarded (inode/size/mtime snapshot; refuse unlink on
    mismatch) so a concurrent writer's os.replace is preserved.
    _enforce_size_cap also decrements its running ``total`` when a
    concurrent deleter wins the unlink race, so a suppressed
    FileNotFoundError cannot over-evict newly committed entries.
    Stale temp files swept on open; live temps count toward the size
    cap. Windows ERROR_SHARING_VIOLATION (32) and ERROR_LOCK_VIOLATION
    (33) on os.replace are retried with bounded backoff (~185ms)
    before being treated as a non-fatal cache miss; other
    PermissionError and all POSIX failures propagate.
  * make_program_cache_key -- stable 32-byte blake2b digest over code,
    code_type, ProgramOptions, target_type, name expressions, and
    environment probes: cuda-core version, NVRTC version, NVVM lib+IR
    version, linker backend+version for PTX inputs (driver version
    only on the cuLink path). Backend-specific gates mirror
    Program/Linker:
      - code_type lower-cased to match Program.__init__.
      - code_type/target_type validated against Program's
        SUPPORTED_TARGETS matrix.
      - NVRTC side-effect options (create_pch, time,
        fdevice_time_trace) and external-content options
        (include_path, pre_include, pch, use_pch, pch_dir) require
        an extra_digest. NVVM use_libdevice=True likewise. NVRTC
        options.name with a directory component (e.g. '/abs/k.cu')
        also requires extra_digest (or no_source_include=True) because
        NVRTC searches that directory for #include "..." lookups;
        bare labels fall back to CWD and stay accepted.
      - extra_sources rejected for non-NVVM; bytes-like ``code``
        rejected for non-NVVM.
      - PTX (Linker) options pass through per-field gates that match
        _prepare_nvjitlink_options / _prepare_driver_options;
        ptxas_options canonicalised across str/list/tuple/empty
        shapes; driver-linker hard rejections (time, ptxas_options,
        split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma
        collapse under the driver linker.
      - name_expressions gated on backend == "nvrtc".
      - Failed environment probes mix the exception class name into a
        *_probe_failed label so broken environments never collide
        with working ones while staying stable across processes and
        repeated calls.

Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
not pull in any cache backend. The cache classes and
make_program_cache_key are exposed via module __getattr__.
_LAZY_CACHE_ATTRS is a single ordered tuple spliced into __all__ via
``*_LAZY_CACHE_ATTRS`` so the two lists cannot drift; star-import
still walks __all__ and therefore resolves every lazy attribute,
which is expected given star-imports are discouraged anyway.
sqlite3 is imported lazily inside SQLiteProgramCache.__init__ so the
package is usable on interpreters built without libsqlite3.

Tests: ~200 cache tests covering single-process CRUD for all three
backends; LRU/size-cap (logical and on-disk, including stat-guarded
race scenarios); over-eviction race (monkeypatched Path.unlink);
InMemory combined caps, overwrite-updates-size, LRU-touch-on-read,
contains-does-not-bump, degenerate caps (single entry > cap,
max_entries=0); NVRTC source-directory path-name guard with
POSIX/Windows separators and both accept paths; corruption +
__len__ pruning; schema-mismatch table-DROP; threaded SQLite and
InMemory (4 writers + 4 readers x 200 ops); cross-process
FileStream stress (writer/reader race exercising the stat-guard
prune; clear/eviction race injection via generator cleanup);
Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow
+ retry, others propagate; partial-conn close on OperationalError);
lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity
test that parses _program.pyx via tokenize + ast.literal_eval; and
end-to-end real CUDA C++ compile -> store -> reopen -> get_kernel
roundtrip parametrized over the two persistent backends.

Closes NVIDIA#177
Closes NVIDIA#178
Closes NVIDIA#179
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 457cab7 to cfddd08 on April 23, 2026 at 12:35
@cpcloud cpcloud requested a review from rwgk April 23, 2026 15:00


Development

Successfully merging this pull request may close these issues.

  • Add cuda.core.utils.ProgramCacheResource
  • Add cuda.core.utils.PersistentProgramCache
  • Add cuda.core.utils.InMemoryProgramCache