
CI: Introduce FAAC Benchmark Suite for automated regression testing#78

Open
nschimme wants to merge 5 commits into knik0:master from nschimme:benchmark

Conversation

@nschimme
Contributor

@nschimme nschimme commented Mar 4, 2026

This PR introduces the FAAC Benchmark Suite, an automated CI/CD pipeline designed to provide objective data on every change.

Currently, the project lacks a formal regression and testing suite. For a maintainer, this makes merging optimizations or refactors a high-risk activity. This suite aims to act as a "safety net," providing the metrics needed to ensure that new code maintains the project's standards for quality, speed, and size.

The "Golden Triangle" Philosophy

I've designed the benchmarking logic around three pillars critical to the FAAC mission. Note that these are a first draft—I am completely open to adjusting this philosophy or the specific metrics based on what you value most for the project.

  1. Audio Fidelity: Uses the ViSQOL model to predict Mean Opinion Score (MOS). This ensures psychoacoustic changes don't introduce audible artifacts like "metallic" ringing.
  2. Computational Efficiency: Measures normalized throughput. While FAAC targets being fast, this ensures we don't accidentally introduce regressions that impact real-time performance on low-power cores.
  3. Minimal Footprint: Tracks the binary size of libfaac.so. For embedded VSS and IoT targets, binary size is a primary feature, and this suite makes any "bloat" immediately visible.
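As a rough illustration of pillar 2, "normalized throughput" can be expressed as a realtime factor: seconds of audio encoded per second of wall-clock time. This is a hypothetical sketch, not the suite's actual measurement code; the `faac` CLI invocation and file paths are assumptions.

```python
import subprocess
import time
import wave

def normalized_throughput(audio_seconds: float, encode_seconds: float) -> float:
    """Realtime factor: audio duration divided by encode time. Higher is faster;
    a value below 1.0 means the encoder cannot keep up in real time."""
    return audio_seconds / encode_seconds

def realtime_factor(wav_path: str, out_path: str) -> float:
    """Encode one WAV with the faac CLI and return its realtime factor.
    (Illustrative only: assumes `faac` is on PATH and accepts `-o`.)"""
    with wave.open(wav_path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    start = time.perf_counter()
    subprocess.run(["faac", "-o", out_path, wav_path],
                   check=True, capture_output=True)
    return normalized_throughput(duration, time.perf_counter() - start)
```

Comparing this factor between branches, rather than raw encode times, keeps results stable across CI runners with different clock speeds.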

Implementation

  • GitHub Actions: Runs on every pull request, comparing the PR branch (Candidate) against the Master branch (Baseline).
  • Automated Reporting: Generates a high-signal Markdown report in the PR comments, highlighting regressions (red), wins (green), and bit-identical refactors (verified by comparing MD5 digests of the encoded output).
  • Datasets: Includes scripts to pull speech and music samples from TCD-VoIP and PMLT2014 to test real-world scenarios.

Focus & Feedback Requested

The primary goal of this draft is to establish the metrics. I would value your feedback on:

  • The Thresholds: Currently, a 0.1 MOS drop or a 10% throughput drop triggers a "Failure" icon. Are these the right sensitivities for you?
  • The "Why": If you feel the focus should shift (e.g., more weight on bitrate accuracy vs. throughput), I’m happy to retune the reporting logic.
  • CI Usage: To keep things conservative, we can set this to run only on a manual trigger or specifically for PRs targeting the master branch.
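For concreteness, the threshold logic described above (a 0.1 MOS drop or a 10% throughput drop marks a failure) could be sketched as follows. The function name, the "neutral" bucket, and the exact win condition are illustrative assumptions for discussion, not the shipped reporting code.

```python
def verdict(mos_delta: float, throughput_ratio: float) -> str:
    """Classify one benchmark row.

    mos_delta: candidate MOS minus baseline MOS (negative = quality loss).
    throughput_ratio: candidate throughput / baseline throughput
                      (below 1.0 = slower).
    """
    if mos_delta <= -0.1 or throughput_ratio <= 0.90:
        return "failure"   # red icon: regression beyond threshold
    if mos_delta > 0.0 or throughput_ratio > 1.0:
        return "win"       # green icon: measurable improvement
    return "neutral"       # within noise; no icon
```

Tuning the two constants (0.1 and 0.90) is exactly the feedback being requested: tighter values catch more regressions but generate more false alarms from run-to-run CI noise.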

This is intended to be a collaborative baseline. I want to ensure the metrics we track are the ones that give you the most confidence when reviewing contributions.

Sample Report from this PR: benchmark-report-full.zip

@nschimme
Contributor Author

nschimme commented Mar 4, 2026

Seems we need to tweak some permissions for it to leave a GH comment: https://github.com/knik0/faac/actions/runs/22672563501/job/65726481319?pr=78

We should have seen something like: nschimme#38

@fabiangreffrath
Collaborator

> Seems we need to tweak some permissions for it to leave a GH comment: https://github.com/knik0/faac/actions/runs/22672563501/job/65726481319?pr=78

Frankly, it feels a bit uneasy to introduce a test suite that's about as big as the library itself and that downloads some random samples from somewhere else under a questionable license.

I'll put trust in your justice if you tell me that the changes you suggest will generate output identical to before.

You know, for me this is just a little side project. I'm the last one alive here with commit rights. I haven't written a single line of the actual codec myself.

@nschimme
Contributor Author

nschimme commented Mar 4, 2026

I get the hesitation, but I’m doing this specifically so you don't have to trust my 'justice.' I’ve already verified the changes are 100% bit-identical, and this suite is just the math to prove it to you so you don't have to audit code you didn't write.

On the license/size stuff: the samples aren't in the repo, the CI just pulls them to run the check. It keeps the library clean. If the suite ever becomes a maintenance headache or the 'uneasiness' doesn't go away, just rm -rf tests/ and delete the workflow. I'll be the one maintaining it anyway, so if it breaks, that's on me.

I’d rather have the data than fly blind. How about we run with it, and if it's a pain in the ass, we scrap it?

@nschimme
Contributor Author

nschimme commented Mar 4, 2026

That does beg the question, do you have access to give other people write access? If not, maybe we create a new faac organization and start putting our changes there. This becomes a mirror.

@fabiangreffrath
Collaborator

> That does beg the question, do you have access to give other people write access? If not, maybe we create a new faac organization and start putting our changes there. This becomes a mirror.

I only have commit rights, I cannot change anything about the repository.

My idea is to get the remaining three PRs merged into the code (without the test suite) and release this as 1.40. Then I'd abandon this repository as well and will happily hand over maintenance to a more active fork.

And please don't forget about the brother project faad2.

@nschimme
Contributor Author

nschimme commented Mar 4, 2026

Sounds good. I'll keep maintaining it on my side and leave comments with the results in my PRs.

@nschimme
Contributor Author

nschimme commented Mar 4, 2026

We could be cheeky... I see that https://github.com/FAACD is free 😈

@nschimme
Contributor Author

nschimme commented Mar 5, 2026

I extracted the code out into a repo that I own and exposed it as GitHub action. This PR just uses it now. See the extracted solution at https://github.com/nschimme/faac-benchmark

@nschimme
Contributor Author

nschimme commented Mar 5, 2026

Answering my own question: I think I'll have to tweak the failure and win thresholds a bit (and possibly batch related changes together). I'll post this table here for our reference:

### FAAC Optimization Impact Estimates (2026 Roadmap)

| Task / Feature               | MOS Impact | Throughput (CPU) | Binary Size |
|:-----------------------------|:-----------|:-----------------|:------------|
| Adaptive Rounding (AQR)      | +15-20%    | 0% (Negligible)  | <1%         |
| MDCT-based Psychoacoustic    | +5-10%     | +30-40%          | +2-5%       |
| Stereo Mode Hysteresis       | +5%        | 0%               | <1%         |
| Transient Detection Tuning   | +10%       | 0%               | <1%         |
| ATH Scaling (VoIP/VSS)       | +5%        | 0%               | <1%         |
| Bit Reservoir Control        | +10-15%    | -5% (Overhead)   | +1-2%       |
| Temporal Noise Shaping (TNS) | +8-12%     | -10% (Complexity)| +3-5%       |

---
NOTES:
- Adaptive Rounding addresses the historical "shimmer" issues
- MDCT-PAM targets the 30%+ CPU gain and better masking alignment
- Stereo Hysteresis stabilizes the soundstage image in complex passages

@nschimme nschimme marked this pull request as draft March 6, 2026 14:37
@nschimme nschimme marked this pull request as ready for review March 14, 2026 00:08