Skip to content

[pull] master from libretro:master#933

Merged
pull[bot] merged 17 commits intoAlexandre1er:masterfrom
libretro:master
Apr 16, 2026
Merged

[pull] master from libretro:master#933
pull[bot] merged 17 commits intoAlexandre1er:masterfrom
libretro:master

Conversation

@pull
Copy link
Copy Markdown

@pull pull bot commented Apr 16, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

LibretroAdmin and others added 17 commits April 16, 2026 18:27
Replace per-unload vkQueueWaitIdle calls with deferred deletion
piggybacked on the existing frame fence infrastructure.

Problem
=======

vulkan_unload_texture_internal called vkQueueWaitIdle on every
texture unload.  This drains the entire GPU command queue and
blocks the CPU until all submitted work completes.  During menu
navigation (scrolling playlists, switching tabs), dozens of icon
textures are unloaded in rapid succession, producing cumulative
stalls of 5-20ms — visible as micro-hitches on lower-end
hardware.

There was even a TODO comment in the code:
  /* TODO: We really want to defer this deletion instead,
   * but this will do for now. */

Solution
========

Per-frame-slot deferred deletion lists.  When a texture is
unloaded, its Vulkan resource handles (VkImage, VkImageView,
VkDeviceMemory, VkBuffer) are copied into a fixed-size array
indexed by the current frame slot.  The texture wrapper struct
is freed immediately (it's just a malloc'd pointer), but the
GPU resources are kept alive.

At the start of each frame in vulkan_frame, after the fence
wait for the current slot has completed (guaranteeing all GPU
work from the previous use of this slot is done), the deferred
list for that slot is flushed — all resources are destroyed.

This is the standard Vulkan deferred-delete pattern used by
most game engines and renderers.

Data structure
==============

  struct {
      struct vk_texture entries[MAX_SWAPCHAIN_IMAGES][64];
      unsigned count[MAX_SWAPCHAIN_IMAGES];
  } deferred;

Fixed-size arrays avoid heap allocation.  64 entries per slot
is generous — a typical menu category switch unloads ~20-30
icons.  If the limit is ever reached (extremely unlikely),
the code falls back to synchronous vkQueueWaitIdle for that
single texture, preserving correctness.

The struct is zero-initialized by calloc at driver creation
(line 4900), so no explicit init is needed.

Flush points
============

1. vulkan_frame — start of each frame, after fence wait.
   Flushes the current frame slot.

2. vulkan_free — after vkQueueWaitIdle.  Flushes all slots
   to ensure no resources leak on shutdown.

3. vulkan_check_swapchain — after vkQueueWaitIdle during
   swapchain reinit.  Flushes all slots before tearing down
   pipelines and framebuffers.

Thread safety
=============

All deferred list operations (append in unload, flush in frame)
run exclusively on the video thread:

  - Texture unloads are dispatched to the video thread via
    video_thread_texture_handle (commit 17332fe).
  - vulkan_frame runs on the video thread.
  - vulkan_free and vulkan_check_swapchain run after the video
    thread has been joined or drained.

No cross-thread access to the deferred lists.  No locks needed.

Fence correctness
=================

The frame fence guarantees that all GPU work submitted before
the corresponding vkQueueSubmit has completed.  A texture last
drawn in frame N is unloaded during frame N or later.  The
unload tags the resources on slot S = current_frame_index.
The resources are freed when slot S is reused (frame S +
num_swapchain_images), after waiting on fence F(S).  Since
frame N was submitted no later than frame S, and fences signal
in submission order, F(S) implies F(N) has already signalled.
The GPU is guaranteed done with the texture.  Spec-level
guarantee, no driver-dependent behavior.

Vulkan 1.0 compatible — uses only vkWaitForFences (existing),
vkDestroyImage, vkDestroyImageView, vkFreeMemory, vkDestroyBuffer.
No extensions required.

Performance impact
==================

Menu scrolling: eliminates per-unload GPU stalls.  Texture
destruction is batched and amortized across frame boundaries.

Frame rendering: one flush call per frame iterating a short
array (typically 0 entries, occasionally 1-30).  Negligible.

Memory: deferred textures hold GPU memory for up to
num_swapchain_images frames longer than before.  For menu
icons (~64KB each, ~20-30 per batch), this is <2MB of
additional transient memory.  Negligible on any GPU.
Replace per-unload Signal + WaitForSingleObject calls with
deferred deletion, matching the Vulkan optimization in e537127.

Problem
=======

d3d12_gfx_unload_texture_internal called Signal + WaitFor-
SingleObject on the command queue fence for every texture
unload.  This drains the entire GPU pipeline and blocks the
CPU until all submitted work completes.  During menu
navigation (scrolling playlists, switching tabs), dozens of
icon textures are unloaded in rapid succession, producing
cumulative stalls visible as micro-hitches.

Solution
========

When a texture is unloaded, copy its D3D12 resource handles
(D3D12Resource, upload buffer, descriptor heap slots) into a
fixed-size deferred list on d3d12_video_t.  The malloc'd
texture wrapper is freed immediately.

At the start of each frame in d3d12_gfx_frame, after the
existing Signal + WaitForSingleObject fence drain (which
guarantees all prior GPU work has completed), flush the
deferred list — release all descriptor heap slots and COM
Release all D3D12 resources.

D3D12's per-frame fence is a full drain (Signal the current
value, then wait until GetCompletedValue catches up), so a
single flat list is sufficient — no per-frame-slot tracking
needed (unlike the Vulkan implementation which uses per-slot
arrays because its fences are per-slot).

Data structure
==============

  struct {
      d3d12_texture_t entries[64];
      unsigned count;
  } deferred;

Fixed-size array, zero-initialized by calloc at driver
creation.  64 entries is generous for the typical ~20-30 icon
unloads per category switch.  Overflow falls back to the
original synchronous Signal + WaitForSingleObject for that
single texture.

Flush points
============

1. d3d12_gfx_frame — after the per-frame fence wait.
   Flushes the entire list.

2. d3d12_gfx_free — after the shutdown fence drain.
   Ensures no resources leak.

The swapchain resize path (D3D12_ST_FLAG_RESIZE_CHAIN) runs
inside d3d12_gfx_frame after the fence wait + deferred flush,
so deferred textures are already freed before any chain resize
releases render targets.  No additional flush point needed.

Thread safety
=============

All deferred list operations run on the video thread:
  - Appends: via video_thread_texture_handle (commit a2d7b48)
  - Flushes: in d3d12_gfx_frame (video thread) and
    d3d12_gfx_free (after video thread joined)

No cross-thread access.  No locks needed.

Descriptor heap safety
======================

d3d12_release_texture releases descriptor heap slots
(heap->map[i] = false) and COM Releases the D3D12 resources.
The descriptor heap lives on d3d12_video_t, created once at
init and destroyed at shutdown.  Deferred flush runs before
shutdown destroys the heap.  Descriptor slot pointers in
deferred copies remain valid for the driver's lifetime.

Performance impact
==================

Same as the Vulkan optimization: eliminates per-unload GPU
stalls during menu navigation.  Texture destruction batched
and amortized across frame boundaries.  Deferred textures
hold GPU memory for at most one additional frame (~2MB for
a batch of 30 menu icons).
…ycle

Three fixes for the PS3 RSX video driver:

1. Fix texture memory leak
===========================

rsx_unload_texture had rsxFree(texture->data) commented out
with a "TODO fix crash on loading core" note.  Every menu
texture unload leaked the RSX memory allocation permanently.
On PS3 with 256MB RAM, extended menu navigation (scrolling
large playlists, switching categories) would exhaust RSX
memory and crash.

Fix: uncomment the rsxFree call and add rsxFinish(context, 0)
before it to drain the RSX command queue.  rsxFinish blocks
until all submitted GPU commands complete, ensuring the GPU is
no longer reading the texture memory before it is freed.  This
is equivalent to the vkQueueWaitIdle / D3D12 Signal+Wait
pattern used in the other GPU drivers.

2. Fix font memory leak
========================

rsx_font_free had identical #if 0 blocks around rsxFree for
both the font atlas texture (font->texture.data) and the font
vertex buffer (font->vertices).  Every font init/free cycle
(font scale change, context reset) leaked both allocations.

Fix: uncomment both rsxFree calls and add rsxFinish before
them.  The existing gcmSetWaitFlip only waits for the page
flip, not for all draw commands to complete — rsxFinish
provides the full drain needed for safe resource destruction.

3. Route texture load/unload through video thread
==================================================

When threaded video is active, the main thread called
rsx_load_texture and rsx_unload_texture directly.  This
produced two races:

  - rsx_load_texture_data calls memcpy into RSX-visible memory
    that the GPU may be reading on the video thread.

  - rsx_unload_texture (even with the rsxFinish fix) calls
    rsxFinish on the RSX context from the main thread while
    the video thread may be submitting commands to the same
    context.

Fix: split load/unload into internal functions and dispatch
through video_thread_texture_handle when threaded, matching
the pattern established in the D3D10/D3D11/D3D12/Vulkan/Metal
drivers.  The rsx_texture_cmd_t struct carries video_data,
image, filter_type, and handle (output for load).

Added #include "../video_thread_wrapper.h" (guarded by
HAVE_THREADS) and #include "../font_driver.h".

The rsxFinish drain in unload is the PS3 equivalent of
vkQueueWaitIdle — it stalls the GPU pipeline on every unload.
A deferred-delete optimization is not practical on RSX because
the hardware lacks per-frame fences.  The stall is acceptable
on PS3 where menu rendering is not performance-critical and
the alternative (memory leak) is worse.

No behavioural change when threaded video is disabled.
gx2_unload_texture called MEM2_free on the texture backing
memory without ensuring the GPU had finished reading it.
gx2_font_free called MEM1_free on the font atlas and UBO
without a drain.

Add GX2DrawDone() before the free calls in both functions.
GX2DrawDone blocks until all submitted GPU commands complete,
matching the pattern already used in gx2_free (line 1569) and
gx2_gfx_set_shader (line 236).

The Wii U build does not define HAVE_THREADS, so there is no
cross-thread race — the main thread handles both rendering and
resource management sequentially.  The drain is still needed
because GX2 uses a command buffer model where draw commands are
recorded and submitted asynchronously to the GPU.  Between the
last draw call that referenced a texture and the unload call,
the GPU may not have finished processing the command buffer.
GX2DrawDone ensures it has.
MaterialUI was missing the context generation counter that XMB
(commit 2ac0afa) and Ozone already have.

When threaded video is active, materialui_context_destroy runs
on the main thread (freeing textures and fonts) while
materialui_frame may be mid-render on the video thread.  The
existing texture snapshot pattern (commit 303168b) captures
handle values at frame start, but after context_destroy frees
the underlying GPU resources, those snapshot values point to
freed resources.  Drawing with freed texture handles is
undefined behaviour — typically a GPU fault or visual
corruption depending on the driver.

Similarly, context_destroy sets font pointers to NULL after
freeing.  If the video thread has already passed the font_bind
calls but hasn't reached font_flush yet, the flush dereferences
the freed font renderer data.

Fix: add the same uint32_t context_generation counter pattern
used by XMB and Ozone.

  - materialui_handle_t gains a context_generation field
  - materialui_context_destroy increments it before freeing
    any resources
  - materialui_frame snapshots it at frame start, checks before
    font_bind (first resource access) and before font_unbind
    (after all drawing), bailing to ctx_destroyed on mismatch
  - ctx_destroyed label skips font unbind and viewport restore,
    falling through to scope cleanup

The generation counter is a uint32_t that wraps at 4 billion
increments — effectively never wraps in practice since context
destroy/reset cycles happen at most a few times per session.

No behavioural change when threaded video is disabled (the
generation never changes mid-frame on a single thread).
D3D9 Cg, D3D9 HLSL, D3D10, D3D11, D3D12, and Metal each had
a static gfx_ctx_driver_t installed during init so that
video_context_driver_get_flags() would return valid flags.
These drivers provide flags through their poke interface, not
through a context driver, but video_context_driver_get_flags()
only checked the context driver.

This affected any code path calling the function — shader
preset loading, Shaders menu visibility, shader type
validation, and shader manager initialization.

Fix: make video_context_driver_get_flags() merge flags from
both sources — context driver first, then OR in poke flags.

For GL/Vulkan: the context driver provides platform flags
(shader types, adaptive vsync, core context) and the poke
interface provides driver flags (hard sync, BFI, screenshots).
Both sets are now returned together.

For D3D/Metal: no context driver exists, so context flags are
empty.  The poke interface provides all flags.  This is what
the fake context was simulating.

With this change, the fake context workaround is unnecessary:

  video_driver.c:
    - video_context_driver_get_flags merges both flag sources
    - Shader preset loading moved from driver init to
      video_driver_init_internal, after poke registration
    - Guarded by !current_video_context.get_flags (skips
      GL/Vulkan which load shaders via real context driver)
    - Guarded by non-empty preset path (avoids pointless
      device reinit on D3D9 for NULL path)

  D3D9 Cg, D3D9 HLSL, D3D10, D3D11, D3D12:
    - Removed static fake_context variable
    - Removed fake context setup + shader loading from init

  Metal:
    - Removed 45-line fake context struct
    - Removed no-op metal_ctx_swap_interval
    - Removed fake context setup + shader loading from init
    - Direct metal_ctx_swap_buffers(NULL) call replaces
      indirect metal_fake_context.swap_buffers(NULL)
    - Removed unused shader_path variable

Note: D3D9 HLSL/Cg crash when toggling threaded video.
This is pre-existing and unrelated to this change.
video_context_driver_get_flags() now merges flags from both
the poke interface and the context driver (commit be2d747).
This made two functions redundant:

  - video_driver_get_flags (static) — checked poke only
  - video_driver_get_flags_wrapper — checked poke then context

Both are subsets of what video_context_driver_get_flags() now
does.  Remove them and simplify video_driver_test_all_flags()
to a single video_context_driver_get_flags() call.
The task->error field was in practice only used as a flag, actual
message was never shown to the user.
D3D8, D3D9 HLSL, and D3D9 Cg created their Win32 window by
calling win32_set_style, win32_window_create, and
win32_set_window directly.  D3D10, D3D11, and D3D12 use the
shared win32_set_video_mode helper which does the same three
calls plus a GetMessage dispatch loop.

Replace the manual sequence with win32_set_video_mode in all
three drivers for consistency.  Removed now-unused style and
rect local variables.

Note: D3D9 HLSL/Cg still crash when toggling threaded video.
The crash is a COM double-release during driver teardown
(Release called on a stale IDirect3D9 COM pointer), which is
a pre-existing bug unrelated to window creation.
The threaded video wrapper's custom command pipeline used
32-bit types for texture handles:

  - custom_command_method_t returned int (32-bit)
  - custom_command.return_value was int (32-bit)
  - video_thread_texture_handle returned unsigned (32-bit)

On 64-bit platforms, texture handles are pointers (uintptr_t).
When a D3D9 COM pointer has upper 32 bits set (e.g.
0x0000020C4E7E2800), the return value is truncated to
0x4E7E2800.  The unload path then calls Release() on this
truncated pointer, crashing with an access violation.

This is the root cause of the D3D9 HLSL/Cg crash when running
with threaded video on 64-bit Windows.  It affects any driver
whose GPU allocations return addresses above 4GB.

Fix: change the entire pipeline to use uintptr_t:

  video_thread_wrapper.h:
    - custom_command_method_t returns uintptr_t
    - custom_command.return_value is uintptr_t
    - video_thread_texture_handle returns uintptr_t

  video_thread_wrapper.c:
    - Matching implementation change

  All video drivers + font_driver.c:
    - All _wrap callback functions return uintptr_t

14 files changed, 28 insertions, 28 deletions.
metal_texture_load_wrap was missed in commit b1dc1f3 because
the .m file wasn't matched by the sed pattern used for .c
files.  Fixes build error on macOS/iOS.
The centralized shader loading in video_driver_init_internal
called set_shader even with a NULL/empty preset path.  On
D3D9, set_shader with no path triggers d3d9_hlsl_restore
which does a full deinitialize + initialize cycle — tearing
down and rebuilding the render chain, fonts, and vertex
declarations that were just created during init.

This is wasteful on all drivers and actively harmful on D3D9
where the unnecessary device reinit cycle can leave stale
COM references.

Guard with a non-empty preset check.  If no shader preset
is configured, the driver's stock shader from init is already
correct.
g_pD3D9 was a global extern shared between the D3D9 HLSL and
D3D9 Cg drivers.  g_pD3D8 was a file-scoped static in the
D3D8 driver.  Both stored the IDirect3D object (the factory
used to create the device) as global state.

This is problematic because:
  - The g_pD3D9 extern couples two independent drivers
  - Global COM pointers survive driver reinit and can become
    stale if teardown order is wrong
  - Each driver instance should own its own IDirect3D object

Fix: move the IDirect3D objects into their respective driver
structs.

  d3d9_common.h:
    - Added LPDIRECT3D9 d3d9 field to d3d9_video_t
    - Removed extern LPDIRECT3D9 g_pD3D9

  d3d9_common.c:
    - Removed LPDIRECT3D9 g_pD3D9 global
    - d3d9_get_color_format_backbuffer takes LPDIRECT3D9 param

  d3d9hlsl.c, d3d9cg.c:
    - All g_pD3D9 references replaced with d3d->d3d9

  d3d8.c:
    - Added LPDIRECT3D8 d3d8 field to d3d8_video_t
    - Removed static LPDIRECT3D8 g_pD3D8
    - All g_pD3D8 references replaced with d3d->d3d8
    - d3d8_get_color_format_backbuffer takes LPDIRECT3D8 param
@pull pull bot locked and limited conversation to collaborators Apr 16, 2026
@pull pull bot added the ⤵️ pull label Apr 16, 2026
@pull pull bot merged commit 4996391 into Alexandre1er:master Apr 16, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants