[pull] master from libretro:master#933
Merged
pull[bot] merged 17 commits intoAlexandre1er:masterfrom Apr 16, 2026
Merged
Conversation
Replace per-unload vkQueueWaitIdle calls with deferred deletion
piggybacked on the existing frame fence infrastructure.
Problem
=======
vulkan_unload_texture_internal called vkQueueWaitIdle on every
texture unload. This drains the entire GPU command queue and
blocks the CPU until all submitted work completes. During menu
navigation (scrolling playlists, switching tabs), dozens of icon
textures are unloaded in rapid succession, producing cumulative
stalls of 5-20ms — visible as micro-hitches on lower-end
hardware.
There was even a TODO comment in the code:
/* TODO: We really want to defer this deletion instead,
* but this will do for now. */
Solution
========
Per-frame-slot deferred deletion lists. When a texture is
unloaded, its Vulkan resource handles (VkImage, VkImageView,
VkDeviceMemory, VkBuffer) are copied into a fixed-size array
indexed by the current frame slot. The texture wrapper struct
is freed immediately (it's just a malloc'd pointer), but the
GPU resources are kept alive.
At the start of each frame in vulkan_frame, after the fence
wait for the current slot has completed (guaranteeing all GPU
work from the previous use of this slot is done), the deferred
list for that slot is flushed — all resources are destroyed.
This is the standard Vulkan deferred-delete pattern used by
most game engines and renderers.
Data structure
==============
struct {
struct vk_texture entries[MAX_SWAPCHAIN_IMAGES][64];
unsigned count[MAX_SWAPCHAIN_IMAGES];
} deferred;
Fixed-size arrays avoid heap allocation. 64 entries per slot
is generous — a typical menu category switch unloads ~20-30
icons. If the limit is ever reached (extremely unlikely),
the code falls back to synchronous vkQueueWaitIdle for that
single texture, preserving correctness.
The struct is zero-initialized by calloc at driver creation
(line 4900), so no explicit init is needed.
Flush points
============
1. vulkan_frame — start of each frame, after fence wait.
Flushes the current frame slot.
2. vulkan_free — after vkQueueWaitIdle. Flushes all slots
to ensure no resources leak on shutdown.
3. vulkan_check_swapchain — after vkQueueWaitIdle during
swapchain reinit. Flushes all slots before tearing down
pipelines and framebuffers.
Thread safety
=============
All deferred list operations (append in unload, flush in frame)
run exclusively on the video thread:
- Texture unloads are dispatched to the video thread via
video_thread_texture_handle (commit 17332fe).
- vulkan_frame runs on the video thread.
- vulkan_free and vulkan_check_swapchain run after the video
thread has been joined or drained.
No cross-thread access to the deferred lists. No locks needed.
Fence correctness
=================
The frame fence guarantees that all GPU work submitted before
the corresponding vkQueueSubmit has completed. A texture last
drawn in frame N is unloaded during frame N or later. The
unload tags the resources on slot S = current_frame_index.
The resources are freed when slot S is reused (frame S +
num_swapchain_images), after waiting on fence F(S). Since
frame N was submitted no later than frame S, and fences signal
in submission order, F(S) implies F(N) has already signalled.
The GPU is guaranteed done with the texture. Spec-level
guarantee, no driver-dependent behavior.
Vulkan 1.0 compatible — uses only vkWaitForFences (existing),
vkDestroyImage, vkDestroyImageView, vkFreeMemory, vkDestroyBuffer.
No extensions required.
Performance impact
==================
Menu scrolling: eliminates per-unload GPU stalls. Texture
destruction is batched and amortized across frame boundaries.
Frame rendering: one flush call per frame iterating a short
array (typically 0 entries, occasionally 1-30). Negligible.
Memory: deferred textures hold GPU memory for up to
num_swapchain_images frames longer than before. For menu
icons (~64KB each, ~20-30 per batch), this is <2MB of
additional transient memory. Negligible on any GPU.
Replace per-unload Signal + WaitForSingleObject calls with deferred deletion, matching the Vulkan optimization in e537127. Problem ======= d3d12_gfx_unload_texture_internal called Signal + WaitFor- SingleObject on the command queue fence for every texture unload. This drains the entire GPU pipeline and blocks the CPU until all submitted work completes. During menu navigation (scrolling playlists, switching tabs), dozens of icon textures are unloaded in rapid succession, producing cumulative stalls visible as micro-hitches. Solution ======== When a texture is unloaded, copy its D3D12 resource handles (D3D12Resource, upload buffer, descriptor heap slots) into a fixed-size deferred list on d3d12_video_t. The malloc'd texture wrapper is freed immediately. At the start of each frame in d3d12_gfx_frame, after the existing Signal + WaitForSingleObject fence drain (which guarantees all prior GPU work has completed), flush the deferred list — release all descriptor heap slots and COM Release all D3D12 resources. D3D12's per-frame fence is a full drain (Signal the current value, then wait until GetCompletedValue catches up), so a single flat list is sufficient — no per-frame-slot tracking needed (unlike the Vulkan implementation which uses per-slot arrays because its fences are per-slot). Data structure ============== struct { d3d12_texture_t entries[64]; unsigned count; } deferred; Fixed-size array, zero-initialized by calloc at driver creation. 64 entries is generous for the typical ~20-30 icon unloads per category switch. Overflow falls back to the original synchronous Signal + WaitForSingleObject for that single texture. Flush points ============ 1. d3d12_gfx_frame — after the per-frame fence wait. Flushes the entire list. 2. d3d12_gfx_free — after the shutdown fence drain. Ensures no resources leak. The swapchain resize path (D3D12_ST_FLAG_RESIZE_CHAIN) runs inside d3d12_gfx_frame after the fence wait + deferred flush, so deferred textures are already freed before any chain resize releases render targets. No additional flush point needed. Thread safety ============= All deferred list operations run on the video thread: - Appends: via video_thread_texture_handle (commit a2d7b48) - Flushes: in d3d12_gfx_frame (video thread) and d3d12_gfx_free (after video thread joined) No cross-thread access. No locks needed. Descriptor heap safety ====================== d3d12_release_texture releases descriptor heap slots (heap->map[i] = false) and COM Releases the D3D12 resources. The descriptor heap lives on d3d12_video_t, created once at init and destroyed at shutdown. Deferred flush runs before shutdown destroys the heap. Descriptor slot pointers in deferred copies remain valid for the driver's lifetime. Performance impact ================== Same as the Vulkan optimization: eliminates per-unload GPU stalls during menu navigation. Texture destruction batched and amortized across frame boundaries. Deferred textures hold GPU memory for at most one additional frame (~2MB for a batch of 30 menu icons).
…ycle
Three fixes for the PS3 RSX video driver:
1. Fix texture memory leak
===========================
rsx_unload_texture had rsxFree(texture->data) commented out
with a "TODO fix crash on loading core" note. Every menu
texture unload leaked the RSX memory allocation permanently.
On PS3 with 256MB RAM, extended menu navigation (scrolling
large playlists, switching categories) would exhaust RSX
memory and crash.
Fix: uncomment the rsxFree call and add rsxFinish(context, 0)
before it to drain the RSX command queue. rsxFinish blocks
until all submitted GPU commands complete, ensuring the GPU is
no longer reading the texture memory before it is freed. This
is equivalent to the vkQueueWaitIdle / D3D12 Signal+Wait
pattern used in the other GPU drivers.
2. Fix font memory leak
========================
rsx_font_free had identical #if 0 blocks around rsxFree for
both the font atlas texture (font->texture.data) and the font
vertex buffer (font->vertices). Every font init/free cycle
(font scale change, context reset) leaked both allocations.
Fix: uncomment both rsxFree calls and add rsxFinish before
them. The existing gcmSetWaitFlip only waits for the page
flip, not for all draw commands to complete — rsxFinish
provides the full drain needed for safe resource destruction.
3. Route texture load/unload through video thread
==================================================
When threaded video is active, the main thread called
rsx_load_texture and rsx_unload_texture directly. This
produced two races:
- rsx_load_texture_data calls memcpy into RSX-visible memory
that the GPU may be reading on the video thread.
- rsx_unload_texture (even with the rsxFinish fix) calls
rsxFinish on the RSX context from the main thread while
the video thread may be submitting commands to the same
context.
Fix: split load/unload into internal functions and dispatch
through video_thread_texture_handle when threaded, matching
the pattern established in the D3D10/D3D11/D3D12/Vulkan/Metal
drivers. The rsx_texture_cmd_t struct carries video_data,
image, filter_type, and handle (output for load).
Added #include "../video_thread_wrapper.h" (guarded by
HAVE_THREADS) and #include "../font_driver.h".
The rsxFinish drain in unload is the PS3 equivalent of
vkQueueWaitIdle — it stalls the GPU pipeline on every unload.
A deferred-delete optimization is not practical on RSX because
the hardware lacks per-frame fences. The stall is acceptable
on PS3 where menu rendering is not performance-critical and
the alternative (memory leak) is worse.
No behavioural change when threaded video is disabled.
gx2_unload_texture called MEM2_free on the texture backing memory without ensuring the GPU had finished reading it. gx2_font_free called MEM1_free on the font atlas and UBO without a drain. Add GX2DrawDone() before the free calls in both functions. GX2DrawDone blocks until all submitted GPU commands complete, matching the pattern already used in gx2_free (line 1569) and gx2_gfx_set_shader (line 236). The Wii U build does not define HAVE_THREADS, so there is no cross-thread race — the main thread handles both rendering and resource management sequentially. The drain is still needed because GX2 uses a command buffer model where draw commands are recorded and submitted asynchronously to the GPU. Between the last draw call that referenced a texture and the unload call, the GPU may not have finished processing the command buffer. GX2DrawDone ensures it has.
MaterialUI was missing the context generation counter that XMB (commit 2ac0afa) and Ozone already have. When threaded video is active, materialui_context_destroy runs on the main thread (freeing textures and fonts) while materialui_frame may be mid-render on the video thread. The existing texture snapshot pattern (commit 303168b) captures handle values at frame start, but after context_destroy frees the underlying GPU resources, those snapshot values point to freed resources. Drawing with freed texture handles is undefined behaviour — typically a GPU fault or visual corruption depending on the driver. Similarly, context_destroy sets font pointers to NULL after freeing. If the video thread has already passed the font_bind calls but hasn't reached font_flush yet, the flush dereferences the freed font renderer data. Fix: add the same uint32_t context_generation counter pattern used by XMB and Ozone. - materialui_handle_t gains a context_generation field - materialui_context_destroy increments it before freeing any resources - materialui_frame snapshots it at frame start, checks before font_bind (first resource access) and before font_unbind (after all drawing), bailing to ctx_destroyed on mismatch - ctx_destroyed label skips font unbind and viewport restore, falling through to scope cleanup The generation counter is a uint32_t that wraps at 4 billion increments — effectively never wraps in practice since context destroy/reset cycles happen at most a few times per session. No behavioural change when threaded video is disabled (the generation never changes mid-frame on a single thread).
D3D9 Cg, D3D9 HLSL, D3D10, D3D11, D3D12, and Metal each had
a static gfx_ctx_driver_t installed during init so that
video_context_driver_get_flags() would return valid flags.
These drivers provide flags through their poke interface, not
through a context driver, but video_context_driver_get_flags()
only checked the context driver.
This affected any code path calling the function — shader
preset loading, Shaders menu visibility, shader type
validation, and shader manager initialization.
Fix: make video_context_driver_get_flags() merge flags from
both sources — context driver first, then OR in poke flags.
For GL/Vulkan: the context driver provides platform flags
(shader types, adaptive vsync, core context) and the poke
interface provides driver flags (hard sync, BFI, screenshots).
Both sets are now returned together.
For D3D/Metal: no context driver exists, so context flags are
empty. The poke interface provides all flags. This is what
the fake context was simulating.
With this change, the fake context workaround is unnecessary:
video_driver.c:
- video_context_driver_get_flags merges both flag sources
- Shader preset loading moved from driver init to
video_driver_init_internal, after poke registration
- Guarded by !current_video_context.get_flags (skips
GL/Vulkan which load shaders via real context driver)
- Guarded by non-empty preset path (avoids pointless
device reinit on D3D9 for NULL path)
D3D9 Cg, D3D9 HLSL, D3D10, D3D11, D3D12:
- Removed static fake_context variable
- Removed fake context setup + shader loading from init
Metal:
- Removed 45-line fake context struct
- Removed no-op metal_ctx_swap_interval
- Removed fake context setup + shader loading from init
- Direct metal_ctx_swap_buffers(NULL) call replaces
indirect metal_fake_context.swap_buffers(NULL)
- Removed unused shader_path variable
Note: D3D9 HLSL/Cg crash when toggling threaded video.
This is pre-existing and unrelated to this change.
video_context_driver_get_flags() now merges flags from both the poke interface and the context driver (commit be2d747). This made two functions redundant: - video_driver_get_flags (static) — checked poke only - video_driver_get_flags_wrapper — checked poke then context Both are subsets of what video_context_driver_get_flags() now does. Remove them and simplify video_driver_test_all_flags() to a single video_context_driver_get_flags() call.
The task->error field was in practice only used as a flag, actual message was never shown to the user.
D3D8, D3D9 HLSL, and D3D9 Cg created their Win32 window by calling win32_set_style, win32_window_create, and win32_set_window directly. D3D10, D3D11, and D3D12 use the shared win32_set_video_mode helper which does the same three calls plus a GetMessage dispatch loop. Replace the manual sequence with win32_set_video_mode in all three drivers for consistency. Removed now-unused style and rect local variables. Note: D3D9 HLSL/Cg still crash when toggling threaded video. The crash is a COM double-release during driver teardown (Release called on a stale IDirect3D9 COM pointer), which is a pre-existing bug unrelated to window creation.
The threaded video wrapper's custom command pipeline used
32-bit types for texture handles:
- custom_command_method_t returned int (32-bit)
- custom_command.return_value was int (32-bit)
- video_thread_texture_handle returned unsigned (32-bit)
On 64-bit platforms, texture handles are pointers (uintptr_t).
When a D3D9 COM pointer has upper 32 bits set (e.g.
0x0000020C4E7E2800), the return value is truncated to
0x4E7E2800. The unload path then calls Release() on this
truncated pointer, crashing with an access violation.
This is the root cause of the D3D9 HLSL/Cg crash when running
with threaded video on 64-bit Windows. It affects any driver
whose GPU allocations return addresses above 4GB.
Fix: change the entire pipeline to use uintptr_t:
video_thread_wrapper.h:
- custom_command_method_t returns uintptr_t
- custom_command.return_value is uintptr_t
- video_thread_texture_handle returns uintptr_t
video_thread_wrapper.c:
- Matching implementation change
All video drivers + font_driver.c:
- All _wrap callback functions return uintptr_t
14 files changed, 28 insertions, 28 deletions.
metal_texture_load_wrap was missed in commit b1dc1f3 because the .m file wasn't matched by the sed pattern used for .c files. Fixes build error on macOS/iOS.
The centralized shader loading in video_driver_init_internal called set_shader even with a NULL/empty preset path. On D3D9, set_shader with no path triggers d3d9_hlsl_restore which does a full deinitialize + initialize cycle — tearing down and rebuilding the render chain, fonts, and vertex declarations that were just created during init. This is wasteful on all drivers and actively harmful on D3D9 where the unnecessary device reinit cycle can leave stale COM references. Guard with a non-empty preset check. If no shader preset is configured, the driver's stock shader from init is already correct.
g_pD3D9 was a global extern shared between the D3D9 HLSL and
D3D9 Cg drivers. g_pD3D8 was a file-scoped static in the
D3D8 driver. Both stored the IDirect3D object (the factory
used to create the device) as global state.
This is problematic because:
- The g_pD3D9 extern couples two independent drivers
- Global COM pointers survive driver reinit and can become
stale if teardown order is wrong
- Each driver instance should own its own IDirect3D object
Fix: move the IDirect3D objects into their respective driver
structs.
d3d9_common.h:
- Added LPDIRECT3D9 d3d9 field to d3d9_video_t
- Removed extern LPDIRECT3D9 g_pD3D9
d3d9_common.c:
- Removed LPDIRECT3D9 g_pD3D9 global
- d3d9_get_color_format_backbuffer takes LPDIRECT3D9 param
d3d9hlsl.c, d3d9cg.c:
- All g_pD3D9 references replaced with d3d->d3d9
d3d8.c:
- Added LPDIRECT3D8 d3d8 field to d3d8_video_t
- Removed static LPDIRECT3D8 g_pD3D8
- All g_pD3D8 references replaced with d3d->d3d8
- d3d8_get_color_format_backbuffer takes LPDIRECT3D8 param
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )