[pull] master from libretro:master#935
Merged
pull[bot] merged 7 commits into Alexandre1er:master from libretro:master on Apr 17, 2026
a workaround/hack since we don't have a proper solution
win32_common.c mixed window/gfx infrastructure with pure UI concerns
(menu bar, file browser, content loading, "Pick Core" dialog). Move
936 lines of UI code to ui_win32.c where it belongs.

Functions moved:
- pick_core_proc, win32_resources_pick_core_dialog, align_dword,
  append_wstr (dialog construction)
- win32_load_content_from_gui, win32_drag_query_file (content loading)
- win32_browser, g_win32_browser_mode (file browser wrapper)
- win32_menu_loop (WM_COMMAND dispatch for menu bar)
- win32_resources_create_menu (programmatic menu bar)
- menu_id_to_label_enum, menu_id_to_meta_key, win32_meta_key_to_name,
  win32_localize_menu (menu localisation)

The UI resource ID enum moves to ui_win32.h so both translation units
can reference it. Functions called cross-file (win32_drag_query_file,
win32_menu_loop, win32_load_content_from_gui, win32_localize_menu)
become non-static and are declared in ui_win32.h.

Existing #ifdef guards (HAVE_MENU, HAVE_THREADS, __WINRT__,
LEGACY_WIN32) are preserved. No functional change.

win32_common.c: 3403 -> 2413 lines
ui_win32.c: 490 -> 1458 lines
for --disable-menu
…ast-path threshold

Two independent performance improvements to the async image loader
path used by menu thumbnails, wallpapers, and icons.

1) task_image.c: upscale_image() now uses malloc instead of calloc.

The nearest-neighbour scale loop writes every destination pixel before
returning: the x_src expansion loop fills the top row of each
scale_factor-high block, then memcpy duplicates it into the remaining
rows. No pixel is ever read before being written, so the zero-fill
that calloc performs is pure waste -- the kernel zeros every cache
line and the scale loop then overwrites every cache line, doubling the
write traffic through memory.

Measured on x86_64 at -O2 across typical thumbnail sizes (64x64..512x512
sources, 2x..8x scale factors): 37-55% reduction in upscale_image()
wall time, consistent across runs. Larger destinations see the biggest
win because zero-fill cost scales with output size.

Correctness verified by running the modified loop over a deliberately
poisoned (memset 0xCD) destination buffer and confirming byte-identical
output to the calloc variant across 11 cases including edge cases
(1x1, scale_factor=1, odd dimensions, non-square).

2) task_file_transfer.c: NBIO_SMALL_FILE_THRESHOLD raised from 256 KiB
to 1 MiB.

Files under this threshold finish their iterative transfer in a single
tick rather than spreading work across several frames. The previous
256 KiB limit was tuned for small config files and low-res thumbnails
and left modern box-art PNGs (typically 400-600 KiB at 512x720) in the
multi-frame iterative path, which is visibly laggy when scrolling a
playlist. A blocking 1 MiB read completes in well under a frame on
every supported platform, so the larger threshold does not threaten
frame pacing. The comment in the header is updated to record the
rationale.

No behavioural change beyond the above; both files compile cleanly
with -fsyntax-only against the existing RetroArch headers.
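For illustration, the write-before-read property that makes malloc safe here can be sketched as follows. This is a hypothetical stand-in, not the task_image.c code: the name upscale_nearest, its signature, and the 32-bit pixel layout are assumptions, and overflow checks on the size computation are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a nearest-neighbour upscale. Returns a
 * malloc'd buffer of (w * scale) x (h * scale) 32-bit pixels, or
 * NULL on bad arguments or allocation failure. */
static uint32_t *upscale_nearest(const uint32_t *src,
      unsigned w, unsigned h, unsigned scale)
{
   unsigned x, y, r, k;
   size_t dst_w = (size_t)w * scale;
   size_t dst_h = (size_t)h * scale;
   uint32_t *dst;

   if (!src || !w || !h || !scale)
      return NULL;

   /* malloc rather than calloc: every destination pixel is written
    * by the loops below before anything reads it. */
   dst = (uint32_t*)malloc(dst_w * dst_h * sizeof(*dst));
   if (!dst)
      return NULL;

   for (y = 0; y < h; y++)
   {
      uint32_t *top = dst + (size_t)y * scale * dst_w;

      /* Expand one source row into the top row of its block. */
      for (x = 0; x < w; x++)
         for (k = 0; k < scale; k++)
            top[(size_t)x * scale + k] = src[(size_t)y * w + x];

      /* Duplicate that top row into the remaining rows of the block. */
      for (r = 1; r < scale; r++)
         memcpy(top + (size_t)r * dst_w, top, dst_w * sizeof(*dst));
   }
   return dst;
}
```

The caller owns the returned buffer and releases it with free(). Because the two inner loops together cover every row of every block, a poisoned buffer and a zeroed buffer would produce the same output.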
* video: add display query slots to video_display_server_t and init early

Add get_refresh_rate, get_video_output_size, get_video_output_prev,
get_video_output_next, and get_metrics to the display server vtable.
These operations are platform concerns (they query the display
hardware) rather than driver concerns, but were previously only
accessible through per-driver poke/ctx interfaces.

Wire all 5 slots in dispserv_win32.c, delegating to the existing
win32_get_refresh_rate, win32_get_video_output_size,
win32_get_video_output_prev/next, and win32_get_metrics functions.
Other display servers (x11, kms, android, apple, null) get NULL for
now.

Add get_display_type to frontend_ctx_driver_t so the platform can
report its display type without needing a window. Implement for Win32
(compile-time constant) and Unix (runtime detection via
WAYLAND_DISPLAY / DISPLAY environment variables). All other frontends
(darwin, uwp, ctr, dos, gx, orbis, ps2, ps3, psp, qnx, switch, wiiu,
xdk, xenon, emscripten, null) pass NULL which falls back to
RARCH_DISPLAY_NONE.

Move video_display_server_init() to run early - right after
frontend_driver_init_first() in rarch_main() - so the display server
is available before video_driver_init_internal() computes window
dimensions. The existing late init inside video_driver_init_internal()
remains as a safety net and will reinit if the display type changed.

This is the infrastructure commit. Follow-up commits can:
- Route video_driver.c dispatch through the display server instead of
  poke/ctx for these 5 operations
- Remove the identical boilerplate wrappers from d3d10/11/12/gdi
- Use display server queries for max window size instead of
  DEFAULT_WINDOW_AUTO_WIDTH_MAX

No functional change - existing poke/ctx call paths are untouched.

* Fix for frontend_driver.h - include gfx/video_defines.h
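The optional-slot pattern and the environment-based detection described above might look roughly like the sketch below. Every sketch_ name, the enum, and the struct layout are hypothetical and do not reproduce the RetroArch definitions; checking WAYLAND_DISPLAY before DISPLAY is also an assumption, since the commit message does not state the order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical display types; the real enum lives in RetroArch. */
enum sketch_display_type
{
   SKETCH_DISPLAY_NONE = 0,
   SKETCH_DISPLAY_X11,
   SKETCH_DISPLAY_WAYLAND
};

/* Hypothetical vtable with one optional slot. A NULL slot means
 * "this display server does not implement the query". */
typedef struct sketch_display_server
{
   float (*get_refresh_rate)(void *data);
} sketch_display_server_t;

/* Example slot implementation that always reports 60 Hz. */
static float sketch_fixed_60hz(void *data)
{
   (void)data;
   return 60.0f;
}

/* Dispatch through the slot if present, else return the fallback. */
static float sketch_refresh_rate(const sketch_display_server_t *srv,
      void *data, float fallback)
{
   if (srv && srv->get_refresh_rate)
      return srv->get_refresh_rate(data);
   return fallback;
}

/* Pure helper: classify from the two environment values.
 * Assumption: a non-empty WAYLAND_DISPLAY wins over DISPLAY. */
static enum sketch_display_type sketch_display_type_from_env(
      const char *wayland, const char *x11)
{
   if (wayland && *wayland)
      return SKETCH_DISPLAY_WAYLAND;
   if (x11 && *x11)
      return SKETCH_DISPLAY_X11;
   return SKETCH_DISPLAY_NONE;
}

/* Runtime detection without needing a window. */
static enum sketch_display_type sketch_get_display_type(void)
{
   return sketch_display_type_from_env(
         getenv("WAYLAND_DISPLAY"), getenv("DISPLAY"));
}
```

Keeping the classification in a pure helper makes the detection testable without touching the process environment, and the NULL check in the dispatcher is what lets servers leave slots unimplemented.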
PNG decode time for RGBA images is dominated by the per-scanline
reverse filter, which walks each byte of the row with a serial
recurrence decoded[i] = raw[i] + f(decoded[i-bpp], prev[i],
prev[i-bpp]). The scalar loop stalls the pipeline on that
dependency and — for PAETH — runs two unpredictable branches per
byte.
For RGBA the recurrence distance is exactly one pixel (4 bytes),
so we can process a pixel's 4 channels in parallel inside one
SIMD register while still respecting the pixel-to-pixel chain.
This loses the per-byte branch and dependency chain completely.
Adds three helpers under the existing RPNG_SIMD_SSE2 / RPNG_SIMD_NEON
gates:
rpng_filter_sub_rgba — SUB, bpp==4
rpng_filter_avg_rgba — AVERAGE, bpp==4
rpng_filter_paeth_rgba — PAETH, bpp==4, branch-free predictor
PAETH uses the standard libpng-style branch-free selection via
max(x, -x) for 16-bit abs and cmpgt/and/andnot/or blend for the
three-way pick. All arithmetic is in 16-bit lanes to keep the
wrap-around semantics of PNG's mod-256 filter.
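
As a rough scalar illustration of the branch-free selection described above (not the SIMD code from this change), the sketch below computes the three Paeth distances in 16-bit values, takes abs via max(x, -x), and blends with all-ones/all-zero masks in the same and/andnot/or shape. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference predictor as defined by the PNG specification.
 * a = left, b = above, c = upper-left. */
static uint8_t paeth_reference(uint8_t a, uint8_t b, uint8_t c)
{
   int p  = a + b - c;
   int pa = abs(p - a);
   int pb = abs(p - b);
   int pc = abs(p - c);
   if (pa <= pb && pa <= pc)
      return a;
   if (pb <= pc)
      return b;
   return c;
}

/* abs via max(x, -x); inputs stay within [-510, 510]. */
static int16_t sketch_abs16(int16_t x)
{
   int16_t n = (int16_t)-x;
   return x > n ? x : n;
}

/* 0xFFFF when cond is non-zero, else 0, like a SIMD compare result. */
static uint16_t sketch_mask(int cond)
{
   return (uint16_t)(0u - (unsigned)(cond != 0));
}

/* Mask-based selection: no data-dependent branches in the pick. */
static uint8_t paeth_branchfree(uint8_t a, uint8_t b, uint8_t c)
{
   int16_t  pa    = (int16_t)(b - c);          /* p - a */
   int16_t  pb    = (int16_t)(a - c);          /* p - b */
   int16_t  pc    = sketch_abs16((int16_t)(pa + pb)); /* |p - c| */
   uint16_t not_a, not_b, bc;

   pa    = sketch_abs16(pa);
   pb    = sketch_abs16(pb);
   not_a = (uint16_t)(sketch_mask(pa > pb) | sketch_mask(pa > pc));
   not_b = sketch_mask(pb > pc);
   /* Inner blend picks b or c, outer blend picks a or that result. */
   bc    = (uint16_t)((not_b & c) | (~not_b & b));
   return (uint8_t)((not_a & bc) | (~not_a & a));
}

/* Exhaustive comparison over all 256^3 inputs; returns 1 on match. */
static int paeth_sketch_matches(void)
{
   unsigned a, b, c;
   for (a = 0; a < 256; a++)
      for (b = 0; b < 256; b++)
         for (c = 0; c < 256; c++)
            if (paeth_branchfree((uint8_t)a, (uint8_t)b, (uint8_t)c)
                  != paeth_reference((uint8_t)a, (uint8_t)b, (uint8_t)c))
               return 0;
   return 1;
}
```

A SIMD version would apply the same arithmetic to four 16-bit lanes at once, one per RGBA channel.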
rpng_reverse_filter_copy_line dispatches to these when pngp->bpp
is 4 and SSE2/NEON is available; for other bpp or non-SIMD builds
the scalar paths are unchanged.
Correctness: 1805 randomised tests passed against the scalar
reference (20 widths from 1 to 1920 pixels × 30 seeds × 3 filters
+ all-zero / all-0xFF edge cases + three deliberately misaligned
input offsets exercising the memcpy load path). Output is
byte-identical.
Measured on x86-64, -O2, per-scanline wall time:
  Width      SUB      AVERAGE   PAETH
    64 px    1.87x    2.61x     1.71x
   128 px    3.33x    1.89x     1.76x
   256 px    3.60x    2.20x     1.75x
   512 px    3.47x    1.87x     1.72x
  1024 px    4.14x    2.00x     1.68x
  1920 px    3.71x    2.03x     2.14x
SUB benefits most because the scalar version is pure sequential
adds with no ILP; the SIMD version is just an add-and-chain.
AVERAGE and PAETH have more per-iteration work so the fraction
gained is smaller, but both still nearly double.
Loads and stores use memcpy into an aligned temporary rather than
casting through (int32_t*) — the scanline buffer is not guaranteed
to be 4-byte aligned at the start of every filter step. The
memcpy compiles to a single movd at -O2.
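
A minimal sketch of the memcpy load/store pattern, applied to the SUB filter for bpp == 4, is shown below. The function names are hypothetical and this is not the rpng code; the SSE2 path is taken only when the compiler defines __SSE2__, and other builds fall back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: reverse SUB filter, 4 bytes per pixel.
 * The left neighbour of the first pixel is treated as zero. */
static void sub_rgba_scalar(uint8_t *dst, const uint8_t *raw,
      size_t pixels)
{
   size_t i;
   for (i = 0; i < pixels * 4; i++)
      dst[i] = (uint8_t)(raw[i] + (i >= 4 ? dst[i - 4] : 0));
}

/* One pixel (4 channels) per step; the running pixel stays in a
 * register, so the only serial dependency is pixel to pixel. */
static void sub_rgba_sketch(uint8_t *dst, const uint8_t *raw,
      size_t pixels)
{
#if defined(__SSE2__)
   size_t  i;
   __m128i prev = _mm_setzero_si128();

   for (i = 0; i < pixels; i++)
   {
      int32_t tmp;
      /* memcpy instead of a pointer cast: raw may be unaligned. */
      memcpy(&tmp, raw + i * 4, sizeof(tmp));
      /* Per-byte wrapping add, matching PNG's mod-256 arithmetic. */
      prev = _mm_add_epi8(_mm_cvtsi32_si128(tmp), prev);
      tmp  = _mm_cvtsi128_si32(prev);
      memcpy(dst + i * 4, &tmp, sizeof(tmp));
   }
#else
   sub_rgba_scalar(dst, raw, pixels);
#endif
}
```

Both functions accept buffers at any byte offset, which is the case the misaligned-offset tests above are meant to exercise.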
Build-gating follows the existing rpng_filter_up pattern. No new
public symbols. NEON path compiles but has not been tested on
ARM hardware in this change; structural analog of the SSE2 path.
No behavioural change for bpp != 4 or for non-SSE2/NEON builds.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)