Pull request overview
Adds end-to-end support for the P010 (AV_PIX_FMT_P010LE) 10-bit semi-planar YUV 4:2:0 pixel format, including validated frame representation, row-walker API, RGB conversion row primitives, and SIMD backends.
Changes:
- Introduce `P010Frame` + validation (including optional low-bit checking) and a new `yuv::P010` row-walker (`p010_to`) with `P010Row`/`P010Sink`.
- Add P010→RGB row primitives for both `u8` RGB and native-depth `u16` RGB, with SIMD implementations (NEON, SSE4.1, AVX2, AVX-512BW, wasm simd128) plus scalar reference and equivalence tests.
- Extend `MixedSinker` to consume P010 rows and optionally populate both `u8` RGB and `u16` RGB outputs; add benchmarks for P010 throughput.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/yuv/p010.rs | New P010 source-format marker, row type, sink trait, and row-walker (p010_to). |
| src/yuv/mod.rs | Register/export the new P010 module and public API. |
| src/frame.rs | Add P010Frame + errors and validation (incl. optional sample low-bit scan). |
| src/sinker/mixed.rs | Add MixedSinker<P010> implementation and tests for P010 behavior. |
| src/row/scalar.rs | Add scalar P010→RGB (u8 and u16) kernels and tests. |
| src/row/mod.rs | Add public P010 row dispatchers with SIMD selection. |
| src/row/arch/neon.rs | NEON P010 SIMD kernels + scalar-equivalence tests. |
| src/row/arch/x86_sse41.rs | SSE4.1 P010 SIMD kernels + scalar-equivalence tests. |
| src/row/arch/x86_avx2.rs | AVX2 P010 SIMD kernels + scalar-equivalence tests. |
| src/row/arch/x86_avx512.rs | AVX-512BW P010 SIMD kernels + scalar-equivalence tests. |
| src/row/arch/wasm_simd128.rs | wasm simd128 P010 SIMD kernels + scalar-equivalence tests. |
| benches/p010_to_rgb.rs | New Criterion benchmarks for P010→RGB (u8 and u16). |
| Cargo.toml | Register the new p010_to_rgb bench target. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Benchmark Results
Date: 2026-04-19 09:29:01 UTC
Configurations reported:
- macos-aarch64-neon
- macos-aarch64-scalar
- ubuntu-x86_64-avx2-max
- ubuntu-x86_64-default
- ubuntu-x86_64-native
- ubuntu-x86_64-scalar
- ubuntu-x86_64-sse41-max
- windows-x86_64-default

Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
Benchmark Results
Date: 2026-04-19 09:36:57 UTC
Configurations reported: macos-aarch64-neon, macos-aarch64-scalar, ubuntu-x86_64-avx2-max, ubuntu-x86_64-default, ubuntu-x86_64-native, ubuntu-x86_64-scalar, ubuntu-x86_64-sse41-max, windows-x86_64-default.
Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
    let expected_elements = self.frame_bytes(3)?;
    if buf.len() < expected_elements {
        return Err(MixedSinkerError::RgbU16BufferTooShort {
            expected: expected_elements,
            actual: buf.len(),
        });
    }
MixedSinkerError::RgbU16BufferTooShort’s doc comment says the rgb_u16 buffer is only written by Yuv420p10, but this PR also writes it for P010. Please update that error variant’s documentation to reflect that P010 can populate rgb_u16 as well (and any other formats that do).
Benchmark Results
Date: 2026-04-19 09:43:27 UTC
Configurations reported: macos-aarch64-neon, macos-aarch64-scalar, ubuntu-x86_64-avx2-max, ubuntu-x86_64-default, ubuntu-x86_64-native, ubuntu-x86_64-scalar, ubuntu-x86_64-sse41-max, windows-x86_64-default.
Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
Summary
Ships `AV_PIX_FMT_P010LE` end-to-end: the HDR hardware-decode keystone
format emitted by Apple VideoToolbox, VA-API, NVDEC, D3D11VA, and Intel
QSV for 10-bit HEVC / AV1 output. Two output paths: u8 RGB (fast,
downshifts 10→8 in a single Q15 shift) and native-depth u16 RGB
(lossless, low-bit-packed `yuv420p10le`-style for downstream HDR
consumers).
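For readers unfamiliar with folding a depth reduction into a fixed-point scale, here is a minimal sketch of the idea for full-range luma only. The constant `SCALE_10_TO_8_Q15` and the function `narrow_full_range` are hypothetical names for illustration; the actual kernels derive their parameters from `range_params_n<10, OUT_BITS>` and apply a colour matrix that is not shown here.

```rust
/// Hypothetical Q15 scale mapping full-range 10-bit (0..=1023) to
/// 8-bit (0..=255): round(255 / 1023 * 2^15), computed with integer rounding.
const SCALE_10_TO_8_Q15: i32 = ((255i64 * (1 << 15) * 2 + 1023) / (2 * 1023)) as i32;

/// Narrows a 10-bit full-range sample to 8 bits with one multiply and one
/// rounding right shift by 15. The 10→8 reduction lives inside the scale,
/// so no separate `>> 2` is needed.
fn narrow_full_range(y10: u16) -> u8 {
    let v = (i32::from(y10) * SCALE_10_TO_8_Q15 + (1 << 14)) >> 15;
    // Out-of-range inputs (above 1023) are clamped rather than wrapped.
    v.clamp(0, 255) as u8
}
```

Because the scale already carries the 255/1023 ratio, the same multiply-and-shift structure could target a 10-bit output by swapping in a different constant.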
Builds on the Ship 2 (yuv420p10) u16 kernel template: same Q15
pipeline, same `range_params_n<10, OUT_BITS>`, same chroma-bias math.
The new per-backend work is entirely in the u16 semi-planar UV
deinterleave and the `sample >> 6` packing extraction (P010's 10
active bits live in the high 10 of each u16).
API additions
- `frame::P010Frame` + `P010FrameError` + opt-in `try_new_checked` (low-6-bits-zero sanity check, not a provenance validator; see docstring).
- `yuv::{P010, P010Row, P010Sink, p010_to}`: marker, row struct, Sink subtrait, row walker.
- `row::p010_to_rgb_row` (u8 out) + `row::p010_to_rgb_u16_row` (u16 native-depth out) dispatchers with SIMD/scalar toggle.
- `MixedSinker<P010>` impl: `with_rgb` (u8), type-gated `with_rgb_u16` (low-bit-packed), `with_luma` (`>> 8` extracts the top byte of the high-bit-packed sample), `with_hsv` (via u8 RGB scratch).
- `RowSlice::UvHalf10` for the semi-planar u16 UV row, new `P010FrameError::SampleLowBitsSet` with `P010FramePlane` enum.

Kernel design
Per-backend u16 UV deinterleave:
- NEON: `vld2q_u16` (single instruction, the ideal case).
- SSE4.1: `_mm_shuffle_epi8` per-128-lane split + `_mm_unpacklo/hi_epi64` combine.
- AVX2: `_mm256_shuffle_epi8` + `_mm256_permute4x64_epi64` + `_mm256_permute2x128_si256`.
- AVX-512BW: `_mm512_shuffle_epi8` + `_mm512_permutexvar_epi64` + `_mm512_permutex2var_epi64` (no DQ required).
- wasm simd128: `u8x16_swizzle` + `i8x16_shuffle` combine.

After deinterleave + `sample >> 6`, every backend runs the same Q15
pipeline as yuv420p10: scale, upsample, clamp, narrow, write.

Cross-format equivalence test: same logical 10-bit samples fed
through `yuv420p10` (low-packed) and `P010` (high-packed) produce
byte-identical u8 RGB. Verified across every matrix × range.
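As a scalar picture of what each backend's deinterleave computes, the sketch below splits an interleaved UV row into separate U and V rows while applying the `>> 6` extraction. Both function names are hypothetical and this is not the crate's scalar kernel; the inverse helper exists only to show the round trip that a cross-format equivalence test relies on.

```rust
/// Splits an interleaved P010 UV row (U0, V0, U1, V1, ...) into separate
/// U and V rows, shifting each sample down to its 10-bit value.
fn deinterleave_uv_p010(uv: &[u16], u_out: &mut [u16], v_out: &mut [u16]) {
    assert_eq!(uv.len() % 2, 0, "UV row must hold whole (U, V) pairs");
    assert_eq!(u_out.len(), uv.len() / 2);
    assert_eq!(v_out.len(), uv.len() / 2);
    let pairs = uv.chunks_exact(2);
    for ((pair, u), v) in pairs.zip(u_out.iter_mut()).zip(v_out.iter_mut()) {
        *u = pair[0] >> 6;
        *v = pair[1] >> 6;
    }
}

/// Builds a P010-style interleaved UV row from planar, low-packed 10-bit
/// U and V rows (the yuv420p10-style layout).
fn interleave_uv_p010(u: &[u16], v: &[u16]) -> Vec<u16> {
    assert_eq!(u.len(), v.len());
    u.iter()
        .zip(v)
        .flat_map(|(&a, &b)| [a << 6, b << 6])
        .collect()
}
```

Feeding the output of `interleave_uv_p010` back through `deinterleave_uv_p010` recovers the original planar rows, so any pipeline that consumes the deinterleaved values sees the same inputs for both layouts.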
SIMD backends
All 5 shipped with scalar-equivalence tests across every matrix
(BT.601/709/2020-NCL/SMPTE240M/FCC/YCgCo) × both range modes × tail
widths (18/30/34/1922/1920). NEON has an additional
out-of-range-input adversarial regression (p010-style, random noise,
all-bits-set variants).
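The shape of a scalar-equivalence test over tail widths can be sketched without any intrinsics. In the hypothetical code below, an 8-lane chunked loop stands in for a SIMD body and hands the remainder to the scalar reference; the real tests compare full P010→RGB kernels, not this simplified unpack.

```rust
/// Scalar reference: unpack every P010 sample to its 10-bit value.
fn unpack_row_scalar(src: &[u16], dst: &mut [u16]) {
    assert_eq!(src.len(), dst.len());
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s >> 6;
    }
}

/// Stand-in for a SIMD kernel: a fixed 8-lane main loop plus a scalar tail.
fn unpack_row_chunked(src: &[u16], dst: &mut [u16]) {
    assert_eq!(src.len(), dst.len());
    const LANES: usize = 8;
    let main = src.len() - src.len() % LANES;
    let dst_chunks = dst[..main].chunks_exact_mut(LANES);
    for (d, s) in dst_chunks.zip(src[..main].chunks_exact(LANES)) {
        for lane in 0..LANES {
            d[lane] = s[lane] >> 6;
        }
    }
    // Widths that are not a multiple of LANES finish on the scalar path.
    unpack_row_scalar(&src[main..], &mut dst[main..]);
}

/// Returns true when both kernels agree for a row of `width` samples.
fn kernels_agree(width: usize) -> bool {
    // Deterministic pseudo-random input with the low 6 bits cleared.
    let mut state = 0x1234_5678u32;
    let src: Vec<u16> = (0..width)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            ((state >> 16) as u16) & 0xFFC0
        })
        .collect();
    let mut a = vec![0u16; width];
    let mut b = vec![0u16; width];
    unpack_row_scalar(&src, &mut a);
    unpack_row_chunked(&src, &mut b);
    a == b
}
```

Widths such as 18, 30, 34 and 1922 leave a non-empty remainder for an 8-lane body, so they exercise the tail path, while 1920 exercises the main loop alone.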
CI benchmark (1920px row, scalar → SIMD, ns/iter)
Results from the bench runner across all configured tiers (full
report):
Notes:
- `colconv_force_scalar` rows validate the dispatch gate.
- […] 4.5–6.2×), reflecting the added u16 UV deinterleave cost that semi-planar layouts pay over planar, the same delta seen in NV12 vs YUV420p in earlier ships.
- […] `write_rgb_u16_8` uses 9 shuffles + 6 ORs per 8 u16-pixel chunk, a per-pixel cost that wider tiers absorb better. Room for optimization in a follow-up.
Test plan
- `cargo test --lib`: 160 tests pass (NEON native)
- `cargo build --lib --target x86_64-pc-windows-msvc`: clean
- `cargo build --lib --target wasm32-unknown-unknown` (simd128): clean
- `p010_to_rgb`: 2.4–5.7× SIMD speedup across aarch64 + x86_64 + Windows tiers (see table)

🤖 Generated with Claude Code