Conversation
Pull request overview
Adds batch-oriented inference APIs and batched chunk processing to the soundevents classifier, while also updating crate metadata/licensing and improving doctest coverage across the workspace for the 0.2.0 release.
Changes:
- Add equal-length batch inference APIs (`predict_raw_scores_batch*`, `classify_*_batch`) and use batched model calls for chunked inference via `ChunkingOptions::batch_size`.
- Tighten model output validation for batched outputs and add regression tests comparing batched vs sequential inference.
- Update crate metadata (workspace-inherited fields, docs.rs config, include/readme/keywords/categories), add packaged license files, and convert README snippets to compile-checked doctests.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| soundevents/src/lib.rs | Implements batched inference + chunk batching, adds new errors/validation, and adds tests. |
| soundevents/Cargo.toml | Bumps to 0.2.0 and improves publish/docs.rs metadata. |
| soundevents/LICENSE-MIT | Adds MIT license file to crate package contents. |
| soundevents/LICENSE-APACHE | Adds Apache 2.0 license file to crate package contents. |
| soundevents-dataset/src/lib.rs | Removes deny(warnings) from crate-level lint settings. |
| soundevents-dataset/README.md | Updates badges/links and makes examples doctest-friendly. |
| soundevents-dataset/LICENSE-MIT | Adds MIT license file to crate package contents. |
| soundevents-dataset/LICENSE-APACHE | Adds Apache 2.0 license file to crate package contents. |
| soundevents-dataset/Cargo.toml | Bumps to 0.2.0 and improves publish metadata while keeping edition/MSRV intent. |
| README.md | Documents new batching and updates examples to no_run doctests. |
| CHANGELOG.md | Reworks changelog format and adds 0.2.0/0.1.0 entries. |
| Cargo.toml | Introduces [workspace.package] and bumps workspace dependency version for soundevents-dataset. |
```diff
@@ -33,9 +34,12 @@ Production-oriented Rust inference for [CED](https://arxiv.org/abs/2308.11957) A
 soundevents = "0.1"
```

The README still recommends `soundevents = "0.1"`, but this PR bumps the crate to 0.2.0. Using `"0.1"` will prevent users from getting the new batch APIs and metadata changes. Update the version in the Quick start snippet to `0.2` (or an explicit `0.2.0`).

Suggested change:

```diff
- soundevents = "0.1"
+ soundevents = "0.2"
```
```diff
@@ -119,10 +128,18 @@ Enable the `bundled-tiny` feature to embed `models/tiny.onnx` into your binary
 soundevents = { version = "0.1", features = ["bundled-tiny"] }
```

The bundled model example still pins `soundevents = { version = "0.1", ... }` even though the crate version is now 0.2.0. This will keep pulling the pre-0.2 release. Update this dependency snippet to `0.2`/`0.2.0`.

Suggested change:

```diff
- soundevents = { version = "0.1", features = ["bundled-tiny"] }
+ soundevents = { version = "0.2", features = ["bundled-tiny"] }
```
```markdown
- Initial public release of the ONNX Runtime wrapper for CED AudioSet classifiers.
- Added file, memory, and bundled-model loading paths plus configurable graph optimization.
- Added ranked top-k helpers, raw-score accessors, and chunked inference with mean/max aggregation.
- Added equal-length batch APIs for clip inference and chunked window batching for higher-throughput services.
```

The 0.1.0 changelog entry claims "Added equal-length batch APIs…", but those APIs are introduced in this PR alongside the 0.2.0 bump. This makes the release notes misleading; consider moving that bullet into the 0.2.0 section (or removing it from 0.1.0) so each version's section reflects what shipped in that release.
```rust
let total_samples = batch_size * chunk_len;

self.input_scratch.clear();
self.input_scratch.reserve(total_samples);
```

`total_samples` is computed as `batch_size * chunk_len`, which can overflow `usize` in release builds (wrapping silently) and lead to an undersized `reserve()` followed by a panic or allocation failure during `extend_from_slice`. Use `checked_mul` (and ideally also guard the subsequent `reserve`) and return a structured error if the requested batch is too large to pack safely.

Suggested change:

```rust
let total_samples = batch_size
    .checked_mul(chunk_len)
    .ok_or(ClassifierError::BatchTooLarge {
        batch_size,
        chunk_len,
    })?;
self.input_scratch.clear();
self.input_scratch
    .try_reserve(total_samples)
    .map_err(|_| ClassifierError::BatchTooLarge {
        batch_size,
        chunk_len,
    })?;
```
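As a standalone illustration of the suggested guard (a sketch only: the real code would return the crate's `ClassifierError::BatchTooLarge`, which is stubbed here as a plain `String`), both `checked_mul` and `try_reserve` report failure where a plain `*` would silently wrap:

```rust
// Sketch: the error type is a stand-in String, not the crate's ClassifierError.
fn pack_len(batch_size: usize, chunk_len: usize) -> Result<usize, String> {
    batch_size
        .checked_mul(chunk_len)
        .ok_or_else(|| format!("batch too large: {batch_size} chunks x {chunk_len} samples"))
}

fn main() {
    // A realistic batch packs fine.
    assert_eq!(pack_len(8, 160_000), Ok(1_280_000));

    // Plain `*` would wrap silently in release builds; checked_mul reports it.
    assert!(pack_len(usize::MAX, 2).is_err());

    // try_reserve surfaces allocation failure instead of aborting the process.
    let mut scratch: Vec<f32> = Vec::new();
    assert!(scratch.try_reserve(1_280_000).is_ok());
    assert!(scratch.try_reserve(usize::MAX).is_err());
}
```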
```text
@@ -0,0 +1,25 @@
Copyright (c) 2015 The Rust Project Developers
```

The MIT license text here is copied verbatim from the Rust Project template ("Copyright (c) 2015 The Rust Project Developers"), but the crate README elsewhere asserts copyright by the FinDIT authors. If this crate is not actually copyrighted by the Rust Project Developers, update the copyright line (or use a generic MIT template) so the packaged license accurately reflects the project's copyright holders.

Suggested change:

```diff
- Copyright (c) 2015 The Rust Project Developers
+ Copyright (c) [year] [copyright holders]
```
```text
@@ -0,0 +1,25 @@
Copyright (c) 2015 The Rust Project Developers
```

The MIT license text here uses the Rust Project's copyright header ("Copyright (c) 2015 The Rust Project Developers"). If soundevents-dataset is copyrighted by your project authors (as indicated in the README), this should be replaced with the correct copyright notice (or a placeholder MIT template) before publishing.

Suggested change:

```diff
- Copyright (c) 2015 The Rust Project Developers
+ Copyright (c) [year] [copyright holders]
```
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on pull request comments.

Thanks for integrating Codecov - we've got you covered ☂️
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
```rust
std::iter::from_fn(move || {
    let first = chunks.next()?;
    let first_len = first.len();
    let mut batch = Vec::with_capacity(options.batch_size());
    batch.push(first);

    while batch.len() < options.batch_size() {
        let Some(next_len) = chunks.peek().map(|chunk| chunk.len()) else {
```

`chunk_batches` preallocates `Vec::with_capacity(options.batch_size())`. Because `batch_size` is fully caller-controlled and only validated as non-zero, a very large value can cause a huge allocation (or abort) even for small clips. Consider not preallocating, or capping the capacity to a reasonable upper bound or the number of remaining chunks.
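A minimal sketch of the capped-capacity idea, using simplified stand-in types rather than the crate's actual chunk iterator and options: the preallocation is bounded by the iterator's `size_hint` upper bound, so an absurd `batch_size` no longer forces a huge allocation.

```rust
// Sketch: batch an iterator of chunks without trusting the caller-supplied
// batch_size for preallocation. Capacity is capped by size_hint's upper bound.
fn batches<I: Iterator<Item = Vec<f32>>>(
    mut chunks: I,
    batch_size: usize,
) -> impl Iterator<Item = Vec<Vec<f32>>> {
    std::iter::from_fn(move || {
        let first = chunks.next()?;
        // Never reserve more slots than could actually be filled.
        let upper = chunks
            .size_hint()
            .1
            .map_or(1, |rest| batch_size.min(rest + 1));
        let mut batch = Vec::with_capacity(upper);
        batch.push(first);
        while batch.len() < batch_size {
            match chunks.next() {
                Some(chunk) => batch.push(chunk),
                None => break,
            }
        }
        Some(batch)
    })
}

fn main() {
    // Even with an absurd batch_size, only 5 slots ever get preallocated.
    let chunks = vec![vec![0.0f32; 4]; 5].into_iter();
    let sizes: Vec<usize> = batches(chunks, usize::MAX).map(|b| b.len()).collect();
    assert_eq!(sizes, vec![5]);

    // Normal batching still splits as expected: 5 chunks at batch_size 2.
    let chunks = vec![vec![0.0f32; 4]; 5].into_iter();
    let sizes: Vec<usize> = batches(chunks, 2).map(|b| b.len()).collect();
    assert_eq!(sizes, vec![2, 2, 1]);
}
```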
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
```diff
@@ -119,10 +130,18 @@ Enable the `bundled-tiny` feature to embed `models/tiny.onnx` into your binary
 soundevents = { version = "0.1", features = ["bundled-tiny"] }
```

The bundled-tiny dependency snippet still uses `version = "0.1"`, but this PR bumps the crate to 0.2.0 and earlier examples already use `soundevents = "0.2"`. Please update this snippet to `0.2` to avoid readers copying an outdated version.

Suggested change:

```diff
- soundevents = { version = "0.1", features = ["bundled-tiny"] }
+ soundevents = { version = "0.2", features = ["bundled-tiny"] }
```
```rust
std::iter::from_fn(move || {
    let first = chunks.next()?;
    let first_len = first.len();
    let mut batch = Vec::with_capacity(options.batch_size());
```

`chunk_batches` preallocates with `Vec::with_capacity(options.batch_size())`. Since `batch_size` is user-controlled and only validated as non-zero, an extremely large value can trigger an attempted huge allocation and panic/abort before any inference runs. Consider avoiding preallocation based on `batch_size` (e.g., start with an empty `Vec` and let it grow to the actual batch length, or cap the preallocation to a safe upper bound).

Suggested change:

```diff
- let mut batch = Vec::with_capacity(options.batch_size());
+ let mut batch = Vec::new();
```
```rust
/// Classify a batch of equal-length mono 16 kHz clips and return the top `k` classes for each clip.
pub fn classify_batch(
    &mut self,
    batch_16k: &[&[f32]],
    top_k: usize,
) -> Result<Vec<Vec<EventPrediction>>, ClassifierError> {
    let _ = validate_batch_inputs(batch_16k)?;

    if top_k == 0 {
        return Ok((0..batch_16k.len()).map(|_| Vec::new()).collect());
    }

    self.with_raw_scores_batch(batch_16k, |raw_scores, batch_size| {
        raw_scores
```

`classify_batch` calls `validate_batch_inputs(batch_16k)?` and then immediately calls `with_raw_scores_batch`, which re-validates the same batch. This doubles the input scan cost for large batches; consider restructuring so validation happens only once (while still short-circuiting when `top_k == 0`).
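One possible restructuring, sketched with stand-in types and a toy scoring function since the crate's internals aren't shown in this diff: validate once in the public entry point, then delegate to an internal helper that assumes already-validated input and does no second scan.

```rust
// Sketch of validate-once layering. `Error`, `validate`, and `score_validated`
// are illustrative stand-ins, not the crate's actual items.
#[derive(Debug, PartialEq)]
enum Error {
    EmptyBatch,
    UnequalLengths,
}

fn validate(batch: &[&[f32]]) -> Result<usize, Error> {
    let first = batch.first().ok_or(Error::EmptyBatch)?;
    if batch.iter().any(|clip| clip.len() != first.len()) {
        return Err(Error::UnequalLengths);
    }
    Ok(batch.len())
}

// Internal helper: the caller guarantees the batch was already validated,
// so this only does the (toy) scoring work.
fn score_validated(batch: &[&[f32]]) -> Vec<f32> {
    batch.iter().map(|clip| clip.iter().sum()).collect()
}

fn classify(batch: &[&[f32]], top_k: usize) -> Result<Vec<f32>, Error> {
    let n = validate(batch)?; // validated exactly once
    if top_k == 0 {
        return Ok(vec![0.0; n]); // short-circuit still happens after validation
    }
    Ok(score_validated(batch))
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [3.0f32, 4.0];
    assert_eq!(classify(&[&a, &b], 1), Ok(vec![3.0, 7.0]));
    assert_eq!(classify(&[&a, &b[..1]], 1), Err(Error::UnequalLengths));
}
```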
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
```rust
let expected_values = expected_batch_size.saturating_mul(NUM_CLASSES);
if raw_scores.len() != expected_values {
    if raw_scores.len() % expected_batch_size.max(1) == 0 {
        return Err(ClassifierError::UnexpectedClassCount {
            expected: NUM_CLASSES,
            actual: raw_scores.len() / expected_batch_size.max(1),
        });
```

`validate_output` uses `saturating_mul` to compute the expected element count. If `expected_batch_size * NUM_CLASSES` overflows, this silently clamps and can yield confusing errors (and potentially accept a nonsensical `expected_values`). Consider using the existing `checked_batch_len(expected_batch_size, NUM_CLASSES)` here, returning `ClassifierError::BatchTooLarge` on overflow, and then using that checked value for the length comparison.
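The clamping behavior is easy to demonstrate in isolation. In this sketch, `NUM_CLASSES` is an illustrative stand-in (527, the AudioSet label count), not a reference to the crate's actual constant:

```rust
// Sketch: saturating_mul silently clamps on overflow, while checked_mul
// surfaces it so the caller can return a structured error instead.
const NUM_CLASSES: usize = 527; // illustrative stand-in for the crate's constant

fn expected_len_saturating(batch: usize) -> usize {
    batch.saturating_mul(NUM_CLASSES)
}

fn expected_len_checked(batch: usize) -> Option<usize> {
    batch.checked_mul(NUM_CLASSES)
}

fn main() {
    // Sane batch size: both agree.
    assert_eq!(expected_len_saturating(4), 2108);
    assert_eq!(expected_len_checked(4), Some(2108));

    // Overflowing batch size: saturating clamps to usize::MAX (a nonsense
    // "expected" length), while checked reports the overflow explicitly.
    assert_eq!(expected_len_saturating(usize::MAX), usize::MAX);
    assert_eq!(expected_len_checked(usize::MAX), None);
}
```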
```rust
let first = chunks.next()?;
let first_len = first.len();
let mut batch = Vec::with_capacity(options.batch_size());
batch.push(first);

while batch.len() < options.batch_size() {
    let Some(next_len) = chunks.peek().map(|chunk| chunk.len()) else {
```

`chunk_batches` preallocates with `Vec::with_capacity(options.batch_size())`. Since `batch_size` is user-controlled, very large values can trigger a huge allocation (panic/OOM) even for short inputs. Prefer not preallocating from the raw batch size (e.g., start with an empty vec and let it grow), or cap/guard the capacity via a checked/`try_reserve` path that maps to `ClassifierError::BatchTooLarge`.
```rust
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;
```

The README examples use a repo-relative path (`"soundevents/models/tiny.onnx"`) when calling `Classifier::from_file`. For crates.io users this path typically won't exist at runtime, so the snippet is likely to fail when copied. Consider switching the example to a placeholder like `"path/to/model.onnx"` (and/or pointing to `Classifier::tiny` behind `bundled-tiny`) so the quick start is correct outside of a git checkout.

Suggested change:

```diff
- let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;
+ let mut classifier = Classifier::from_file("path/to/model.onnx")?;
```
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
```rust
std::iter::from_fn(move || {
    let first = chunks.next()?;
    let first_len = first.len();
    let mut batch = Vec::with_capacity(options.batch_size());
```

`chunk_batches` preallocates with `Vec::with_capacity(options.batch_size())`, but `batch_size` is user-controlled. A very large batch size can trigger an immediate huge allocation / OOM even if only a few chunks are actually present. Consider avoiding capacity preallocation based on `batch_size` (or capping it to a reasonable maximum) so chunked inference can fail gracefully rather than risking an allocation spike.

Suggested change:

```diff
- let mut batch = Vec::with_capacity(options.batch_size());
+ let mut batch = Vec::new();
```
No description provided.