
[pull] master from libretro:master#935

Merged
pull[bot] merged 7 commits into Alexandre1er:master from libretro:master
Apr 17, 2026

Conversation


@pull pull bot commented Apr 17, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


LibretroAdmin and others added 7 commits April 17, 2026 12:48
a workaround/hack since we don't have a proper solution
win32_common.c mixed window/gfx infrastructure with pure UI concerns
(menu bar, file browser, content loading, "Pick Core" dialog). Move
936 lines of UI code to ui_win32.c where it belongs.

Functions moved:
- pick_core_proc, win32_resources_pick_core_dialog, align_dword,
  append_wstr (dialog construction)
- win32_load_content_from_gui, win32_drag_query_file (content loading)
- win32_browser, g_win32_browser_mode (file browser wrapper)
- win32_menu_loop (WM_COMMAND dispatch for menu bar)
- win32_resources_create_menu (programmatic menu bar)
- menu_id_to_label_enum, menu_id_to_meta_key, win32_meta_key_to_name,
  win32_localize_menu (menu localisation)

The UI resource ID enum moves to ui_win32.h so both translation units
can reference it. Functions called cross-file (win32_drag_query_file,
win32_menu_loop, win32_load_content_from_gui, win32_localize_menu)
become non-static and are declared in ui_win32.h. Existing #ifdef
guards (HAVE_MENU, HAVE_THREADS, __WINRT__, LEGACY_WIN32) are
preserved.

No functional change.

win32_common.c: 3403 -> 2413 lines
ui_win32.c:      490 -> 1458 lines
…ast-path threshold

Two independent performance improvements to the async image loader
path used by menu thumbnails, wallpapers, and icons.

1) task_image.c: upscale_image() now uses malloc instead of calloc.

   The nearest-neighbour scale loop writes every destination pixel
   before returning: the x_src expansion loop fills the top row of
   each scale_factor-high block, then memcpy duplicates it into the
   remaining rows.  No pixel is ever read before being written, so
   the zero-fill that calloc performs is pure waste -- the kernel
   zeros every cache line and the scale loop then overwrites every
   cache line, doubling the write traffic through memory.

   Measured on x86_64 at -O2 across typical thumbnail sizes
   (64x64..512x512 sources, 2x..8x scale factors): 37-55% reduction
   in upscale_image() wall time, consistent across runs.  Larger
   destinations see the biggest win because zero-fill cost scales
   with output size.

   Correctness verified by running the modified loop over a
   deliberately poisoned (memset 0xCD) destination buffer and
   confirming byte-identical output to the calloc variant across
   11 cases including edge cases (1x1, scale_factor=1, odd
   dimensions, non-square).

2) task_file_transfer.c: NBIO_SMALL_FILE_THRESHOLD raised from
   256 KiB to 1 MiB.

   Files under this threshold finish their iterative transfer in a
   single tick rather than spreading work across several frames.
   The previous 256 KiB limit was tuned for small config files and
   low-res thumbnails and left modern box-art PNGs (typically
   400-600 KiB at 512x720) in the multi-frame iterative path,
   which is visibly laggy when scrolling a playlist.

   A blocking 1 MiB read completes in well under a frame on every
   supported platform, so the larger threshold does not threaten
   frame pacing.  The comment in the header is updated to record
   the rationale.
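
The argument in item 1, that every destination pixel is written before it is read, can be checked with a small standalone sketch. This is not RetroArch's `upscale_image()`; `upscale_nn_into` and `upscale_poison_check` are hypothetical names, and the 32-bit pixel format is an assumption. The check mirrors the poisoned-buffer test described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the scale loop, NOT RetroArch's upscale_image().
 * Writes a (w*scale) x (h*scale) image into dst; every destination pixel
 * is stored before the function returns. */
void upscale_nn_into(uint32_t *dst, const uint32_t *src,
      unsigned w, unsigned h, unsigned scale)
{
   size_t dst_w = (size_t)w * scale;
   unsigned x, y, s;

   for (y = 0; y < h; y++)
   {
      uint32_t *top = dst + (size_t)y * scale * dst_w;

      /* Expand one source row into the top row of its block. */
      for (x = 0; x < w; x++)
         for (s = 0; s < scale; s++)
            top[(size_t)x * scale + s] = src[(size_t)y * w + x];

      /* Duplicate that row into the remaining scale-1 rows. */
      for (s = 1; s < scale; s++)
         memcpy(top + (size_t)s * dst_w, top, dst_w * sizeof(*dst));
   }
}

/* Returns 1 if a zeroed and a 0xCD-poisoned destination end up identical.
 * w, h and scale are assumed to be non-zero. */
int upscale_poison_check(unsigned w, unsigned h, unsigned scale)
{
   size_t n_src       = (size_t)w * h;
   size_t n_dst       = n_src * scale * scale;
   uint32_t *src      = (uint32_t*)malloc(n_src * sizeof(*src));
   uint32_t *zeroed   = (uint32_t*)calloc(n_dst, sizeof(*zeroed));
   uint32_t *poisoned = (uint32_t*)malloc(n_dst * sizeof(*poisoned));
   int ok             = 0;
   size_t i;

   if (src && zeroed && poisoned)
   {
      for (i = 0; i < n_src; i++)
         src[i] = (uint32_t)i * 2654435761u; /* arbitrary test pattern */
      memset(poisoned, 0xCD, n_dst * sizeof(*poisoned));
      upscale_nn_into(zeroed, src, w, h, scale);
      upscale_nn_into(poisoned, src, w, h, scale);
      ok = memcmp(zeroed, poisoned, n_dst * sizeof(*zeroed)) == 0;
   }
   free(src);
   free(zeroed);
   free(poisoned);
   return ok;
}
```

If any pixel were left unwritten, the poisoned buffer would keep a 0xCD byte there and the comparison would fail, which is why the zero-fill can be dropped safely.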

No behavioural change beyond the above; both files compile cleanly
with -fsyntax-only against the existing RetroArch headers.
* video: add display query slots to video_display_server_t and init early

Add get_refresh_rate, get_video_output_size, get_video_output_prev,
get_video_output_next, and get_metrics to the display server vtable.
These operations are platform concerns (they query the display
hardware) rather than driver concerns, but were previously only
accessible through per-driver poke/ctx interfaces.

Wire all 5 slots in dispserv_win32.c, delegating to the existing
win32_get_refresh_rate, win32_get_video_output_size,
win32_get_video_output_prev/next, and win32_get_metrics functions.
Other display servers (x11, kms, android, apple, null) get NULL
for now.
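
A minimal sketch of how optional vtable slots like these are typically consumed, assuming a caller that checks for NULL and falls back. The struct, the function signatures, and all names below are hypothetical; the real `video_display_server_t` layout is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for a display server vtable.
 * The real slot signatures may differ. */
typedef struct display_server_sketch
{
   float (*get_refresh_rate)(void *data);
   const char *ident;
} display_server_sketch_t;

/* Pretend platform query; a real one would ask the OS. */
static float sketch_fixed_refresh_rate(void *data)
{
   (void)data;
   return 60.0f;
}

/* One server with the slot wired, one with the slot left NULL. */
const display_server_sketch_t dispserv_wired   = { sketch_fixed_refresh_rate, "wired" };
const display_server_sketch_t dispserv_unwired = { NULL, "unwired" };

/* Callers must tolerate a NULL slot and use their own fallback. */
float query_refresh_rate(const display_server_sketch_t *ds,
      void *data, float fallback)
{
   if (ds && ds->get_refresh_rate)
      return ds->get_refresh_rate(data);
   return fallback;
}
```

With this shape, a server that leaves a slot NULL keeps working as long as the caller supplies a fallback, such as the existing poke/ctx path.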

Add get_display_type to frontend_ctx_driver_t so the platform can
report its display type without needing a window.  Implement for
Win32 (compile-time constant) and Unix (runtime detection via
WAYLAND_DISPLAY / DISPLAY environment variables).  All other
frontends (darwin, uwp, ctr, dos, gx, orbis, ps2, ps3, psp,
qnx, switch, wiiu, xdk, xenon, emscripten, null) pass NULL
which falls back to RARCH_DISPLAY_NONE.
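
The Unix runtime detection could look roughly like the sketch below. The enum, the function names, and the choice to check WAYLAND_DISPLAY before DISPLAY are assumptions for illustration; the message only states that both variables are consulted. The environment lookup is injected so the logic can be exercised without touching the real environment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical enum; the real code reports RARCH_DISPLAY_* values. */
enum display_type_sketch
{
   DISPLAY_SKETCH_NONE = 0,
   DISPLAY_SKETCH_X11,
   DISPLAY_SKETCH_WAYLAND
};

typedef const char *(*env_lookup_fn)(const char *name);

/* Assumption: a Wayland session wins when both variables are set. */
enum display_type_sketch detect_display_type(env_lookup_fn lookup)
{
   const char *wl = lookup("WAYLAND_DISPLAY");
   const char *x  = lookup("DISPLAY");

   if (wl && *wl)
      return DISPLAY_SKETCH_WAYLAND;
   if (x && *x)
      return DISPLAY_SKETCH_X11;
   return DISPLAY_SKETCH_NONE;
}

/* Production lookup: read the process environment. */
const char *env_from_process(const char *name)
{
   return getenv(name);
}

/* Fake environments used only to illustrate the three outcomes. */
const char *env_x11_only(const char *name)
{
   return strcmp(name, "DISPLAY") == 0 ? ":0" : NULL;
}

const char *env_both(const char *name)
{
   if (strcmp(name, "WAYLAND_DISPLAY") == 0)
      return "wayland-0";
   if (strcmp(name, "DISPLAY") == 0)
      return ":0";
   return NULL;
}

const char *env_headless(const char *name)
{
   (void)name;
   return NULL;
}
```

Calling `detect_display_type(env_from_process)` gives the answer for the running process; no window or display connection is needed, which is the point of moving this query into the frontend driver.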

Move video_display_server_init() to run early - right after
frontend_driver_init_first() in rarch_main() - so the display
server is available before video_driver_init_internal() computes
window dimensions.  The existing late init inside
video_driver_init_internal() remains as a safety net and will
reinit if the display type changed.

This is the infrastructure commit.  Follow-up commits can:
- Route video_driver.c dispatch through the display server
  instead of poke/ctx for these 5 operations
- Remove the identical boilerplate wrappers from d3d10/11/12/gdi
- Use display server queries for max window size instead of
  DEFAULT_WINDOW_AUTO_WIDTH_MAX

No functional change - existing poke/ctx call paths are untouched.

* Fix for frontend_driver.h - include gfx/video_defines.h
PNG decode time for RGBA images is dominated by the per-scanline
reverse filter, which walks each byte of the row with a serial
recurrence decoded[i] = raw[i] + f(decoded[i-bpp], prev[i],
prev[i-bpp]).  The scalar loop stalls the pipeline on that
dependency and — for PAETH — runs two unpredictable branches per
byte.
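
For reference, the serial recurrence described above can be written as a scalar loop. This is a sketch following the PNG specification's filter definitions, not the rpng source; the function and enum names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Filter type codes from the PNG specification. */
enum { PNG_SKETCH_SUB = 1, PNG_SKETCH_AVERAGE = 3, PNG_SKETCH_PAETH = 4 };

/* Paeth predictor: a = left, b = above, c = upper-left. */
static uint8_t paeth_ref(int a, int b, int c)
{
   int p  = a + b - c;
   int pa = abs(p - a);
   int pb = abs(p - b);
   int pc = abs(p - c);

   if (pa <= pb && pa <= pc)
      return (uint8_t)a;
   if (pb <= pc)
      return (uint8_t)b;
   return (uint8_t)c;
}

/* decoded[i] = raw[i] + f(decoded[i-bpp], prev[i], prev[i-bpp]) mod 256.
 * Each byte depends on the byte bpp positions earlier in the same row. */
void reverse_filter_scalar(uint8_t *decoded, const uint8_t *raw,
      const uint8_t *prev, size_t len, size_t bpp, int filter)
{
   size_t i;

   for (i = 0; i < len; i++)
   {
      int a = (i >= bpp) ? decoded[i - bpp] : 0;
      int b = prev[i];
      int c = (i >= bpp) ? prev[i - bpp] : 0;
      int pred;

      switch (filter)
      {
         case PNG_SKETCH_SUB:     pred = a;                  break;
         case PNG_SKETCH_AVERAGE: pred = (a + b) >> 1;       break;
         case PNG_SKETCH_PAETH:   pred = paeth_ref(a, b, c); break;
         default:                 pred = 0;                  break;
      }
      decoded[i] = (uint8_t)(raw[i] + pred);
   }
}

/* Worked SUB example on two RGBA pixels; 10 + 250 wraps to 4. */
int reverse_filter_example(void)
{
   static const uint8_t raw[8]    = {250, 2, 3, 4, 10, 1, 1, 1};
   static const uint8_t prev[8]   = {0};
   static const uint8_t expect[8] = {250, 2, 3, 4, 4, 3, 4, 5};
   uint8_t out[8];

   reverse_filter_scalar(out, raw, prev, 8, 4, PNG_SKETCH_SUB);
   return memcmp(out, expect, 8) == 0;
}
```

The two `if` statements in `paeth_ref` are the data-dependent branches mentioned above, and the read of `decoded[i - bpp]` is the dependency that serialises the loop.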

For RGBA the recurrence distance is exactly one pixel (4 bytes),
so we can process a pixel's 4 channels in parallel inside one
SIMD register while still respecting the pixel-to-pixel chain.
This removes the per-byte branches and the byte-level dependency
chain; only the pixel-to-pixel dependency remains.

Adds three helpers under the existing RPNG_SIMD_SSE2 / RPNG_SIMD_NEON
gates:

  rpng_filter_sub_rgba     — SUB,    bpp==4
  rpng_filter_avg_rgba     — AVERAGE, bpp==4
  rpng_filter_paeth_rgba   — PAETH,  bpp==4, branch-free predictor

PAETH uses the standard libpng-style branch-free selection via
max(x, -x) for 16-bit abs and cmpgt/and/andnot/or blend for the
three-way pick.  All arithmetic is in 16-bit lanes to keep the
wrap-around semantics of PNG's mod-256 filter.
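
The selection scheme described above can be sketched with SSE2 intrinsics as follows. This is an illustrative reconstruction, not the code from this commit: `paeth_predict_rgba` and `paeth_exhaustive_check` are hypothetical names, and a scalar fallback is included only so the sketch builds on non-x86 targets.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/* Branching reference predictor from the PNG specification. */
static uint8_t paeth_ref(int a, int b, int c)
{
   int p  = a + b - c;
   int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
   if (pa <= pb && pa <= pc) return (uint8_t)a;
   if (pb <= pc)             return (uint8_t)b;
   return (uint8_t)c;
}

/* Predict all four channels of one RGBA pixel at once.
 * a = left pixel, b = pixel above, c = upper-left pixel. */
void paeth_predict_rgba(uint8_t out[4], const uint8_t a[4],
      const uint8_t b[4], const uint8_t c[4])
{
#ifdef SKETCH_HAVE_SSE2
   const __m128i zero = _mm_setzero_si128();
   uint32_t wa, wb, wc, wo;
   __m128i va, vb, vc, pa, pb, pc, not_a, c_over_b, bc, res;

   memcpy(&wa, a, 4);
   memcpy(&wb, b, 4);
   memcpy(&wc, c, 4);
   /* Widen the bytes to 16-bit lanes. */
   va = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)wa), zero);
   vb = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)wb), zero);
   vc = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)wc), zero);
   /* With p = a + b - c: p-a = b-c, p-b = a-c, p-c = (b-c) + (a-c). */
   pa = _mm_sub_epi16(vb, vc);
   pb = _mm_sub_epi16(va, vc);
   pc = _mm_add_epi16(pa, pb);
   /* 16-bit abs as max(x, -x). */
   pa = _mm_max_epi16(pa, _mm_sub_epi16(zero, pa));
   pb = _mm_max_epi16(pb, _mm_sub_epi16(zero, pb));
   pc = _mm_max_epi16(pc, _mm_sub_epi16(zero, pc));
   /* not_a: lanes where pa > min(pb, pc), so a is not selected. */
   not_a    = _mm_cmpgt_epi16(pa, _mm_min_epi16(pb, pc));
   /* c_over_b: lanes where pb > pc, so c wins over b. */
   c_over_b = _mm_cmpgt_epi16(pb, pc);
   bc  = _mm_or_si128(_mm_and_si128(c_over_b, vc),
                      _mm_andnot_si128(c_over_b, vb));
   res = _mm_or_si128(_mm_and_si128(not_a, bc),
                      _mm_andnot_si128(not_a, va));
   /* Narrow back to bytes; values are already in 0..255. */
   wo = (uint32_t)_mm_cvtsi128_si32(_mm_packus_epi16(res, res));
   memcpy(out, &wo, 4);
#else
   int i;
   for (i = 0; i < 4; i++)
      out[i] = paeth_ref(a[i], b[i], c[i]);
#endif
}

/* Compare against the reference for every (a, b, c) byte triple.
 * Returns 1 when all 256^3 combinations match. */
int paeth_exhaustive_check(void)
{
   int a, b, c0, i;

   for (a = 0; a < 256; a++)
      for (b = 0; b < 256; b++)
         for (c0 = 0; c0 < 256; c0 += 4)
         {
            uint8_t va[4], vb[4], vc[4], out[4];
            for (i = 0; i < 4; i++)
            {
               va[i] = (uint8_t)a;
               vb[i] = (uint8_t)b;
               vc[i] = (uint8_t)(c0 + i);
            }
            paeth_predict_rgba(out, va, vb, vc);
            for (i = 0; i < 4; i++)
               if (out[i] != paeth_ref(a, b, c0 + i))
                  return 0;
         }
   return 1;
}
```

The intermediate differences range from -510 to 510, which is why the distance computation is done in 16-bit lanes and only narrowed back to bytes at the end.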

rpng_reverse_filter_copy_line dispatches to these when pngp->bpp
is 4 and SSE2/NEON is available; for other bpp or non-SIMD builds
the scalar paths are unchanged.

Correctness: 1805 randomised tests passed against the scalar
reference (20 widths from 1 to 1920 pixels × 30 seeds × 3 filters
+ all-zero / all-0xFF edge cases + three deliberately misaligned
input offsets exercising the memcpy load path).  Output is
byte-identical.

Measured on x86-64, -O2, per-scanline wall time:

              SUB           AVERAGE       PAETH
  64 px      1.87x         2.61x         1.71x
  128 px     3.33x         1.89x         1.76x
  256 px     3.60x         2.20x         1.75x
  512 px     3.47x         1.87x         1.72x
  1024 px    4.14x         2.00x         1.68x
  1920 px    3.71x         2.03x         2.14x

SUB benefits most because the scalar version is a strictly sequential
chain of byte adds with no ILP, while the SIMD version handles a whole
pixel per add.  AVERAGE and PAETH do more work per iteration, so the
relative gain is smaller, but both still run roughly 1.7x to 2.6x
faster.

Loads and stores use memcpy into an aligned temporary rather than
casting through (int32_t*) — the scanline buffer is not guaranteed
to be 4-byte aligned at the start of every filter step.  The
memcpy compiles to a single movd at -O2.
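
A sketch of the memcpy-based load/store pattern, applied to the SUB filter for RGBA rows. This is an assumption-laden illustration rather than the committed `rpng_filter_sub_rgba`: the names are hypothetical, byte-wise `_mm_add_epi8` is used because SUB needs only mod-256 addition, and a scalar fallback is provided for non-SSE2 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/* Reverse the SUB filter for npixels RGBA pixels (bpp == 4).
 * raw and decoded may start at any byte offset. */
void filter_sub_rgba_sketch(uint8_t *decoded, const uint8_t *raw,
      size_t npixels)
{
   size_t i;
#ifdef SKETCH_HAVE_SSE2
   __m128i left = _mm_setzero_si128(); /* previously decoded pixel */

   for (i = 0; i < npixels; i++)
   {
      uint32_t w;
      /* memcpy instead of *(uint32_t*)ptr: no alignment requirement. */
      memcpy(&w, raw + 4 * i, 4);
      /* Four independent byte adds, each wrapping mod 256. */
      left = _mm_add_epi8(left, _mm_cvtsi32_si128((int)w));
      w    = (uint32_t)_mm_cvtsi128_si32(left);
      memcpy(decoded + 4 * i, &w, 4);
   }
#else
   for (i = 0; i < npixels * 4; i++)
      decoded[i] = (uint8_t)(raw[i] + (i >= 4 ? decoded[i - 4] : 0));
#endif
}

/* Run the filter at byte offsets 1..3 and compare with a scalar
 * reference. Returns 1 if every offset matches. */
int sub_rgba_misaligned_check(void)
{
   enum { NPIX = 37, LEN = NPIX * 4 };
   uint8_t raw_buf[LEN + 3], out_buf[LEN + 3], ref[LEN];
   unsigned off;
   size_t i;

   for (off = 1; off <= 3; off++)
   {
      uint8_t *raw = raw_buf + off;
      uint8_t *out = out_buf + off;

      for (i = 0; i < LEN; i++)
         raw[i] = (uint8_t)(i * 37u + off * 101u);
      for (i = 0; i < LEN; i++)
         ref[i] = (uint8_t)(raw[i] + (i >= 4 ? ref[i - 4] : 0));

      filter_sub_rgba_sketch(out, raw, NPIX);
      if (memcmp(out, ref, LEN) != 0)
         return 0;
   }
   return 1;
}
```

Dereferencing a misaligned `uint32_t*` is undefined behaviour in C even on hardware that tolerates it, so the fixed-size memcpy is the portable way to express the same 4-byte load.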

Build-gating follows the existing rpng_filter_up pattern.  No new
public symbols.  The NEON path compiles but has not been tested on
ARM hardware in this change; it is a structural analogue of the SSE2
path.

No behavioural change for bpp != 4 or for non-SSE2/NEON builds.
@pull pull bot locked and limited conversation to collaborators Apr 17, 2026
@pull pull bot added the ⤵️ pull label Apr 17, 2026
@pull pull bot merged commit 3464ffe into Alexandre1er:master Apr 17, 2026
17 of 36 checks passed