Conversation
Pull request overview
Adds batch-oriented inference APIs and batched chunk processing to the soundevents classifier, while also updating crate metadata/licensing and improving doctest coverage across the workspace for the 0.2.0 release.
Changes:
- Add equal-length batch inference APIs (`predict_raw_scores_batch*`, `classify_*_batch`) and use batched model calls for chunked inference via `ChunkingOptions::batch_size`.
- Tighten model output validation for batched outputs and add regression tests comparing batched vs sequential inference.
- Update crate metadata (workspace-inherited fields, docs.rs config, include/readme/keywords/categories), add packaged license files, and convert README snippets to compile-checked doctests.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| soundevents/src/lib.rs | Implements batched inference + chunk batching, adds new errors/validation, and adds tests. |
| soundevents/Cargo.toml | Bumps to 0.2.0 and improves publish/docs.rs metadata. |
| soundevents/LICENSE-MIT | Adds MIT license file to crate package contents. |
| soundevents/LICENSE-APACHE | Adds Apache 2.0 license file to crate package contents. |
| soundevents-dataset/src/lib.rs | Removes deny(warnings) from crate-level lint settings. |
| soundevents-dataset/README.md | Updates badges/links and makes examples doctest-friendly. |
| soundevents-dataset/LICENSE-MIT | Adds MIT license file to crate package contents. |
| soundevents-dataset/LICENSE-APACHE | Adds Apache 2.0 license file to crate package contents. |
| soundevents-dataset/Cargo.toml | Bumps to 0.2.0 and improves publish metadata while keeping edition/MSRV intent. |
| README.md | Documents new batching and updates examples to no_run doctests. |
| CHANGELOG.md | Reworks changelog format and adds 0.2.0/0.1.0 entries. |
| Cargo.toml | Introduces [workspace.package] and bumps workspace dependency version for soundevents-dataset. |
```diff
@@ -33,9 +34,12 @@ Production-oriented Rust inference for [CED](https://arxiv.org/abs/2308.11957) A
 soundevents = "0.1"
```

The README still recommends `soundevents = "0.1"`, but this PR bumps the crate to 0.2.0. Using `"0.1"` will prevent users from getting the new batch APIs and metadata changes. Update the version in the Quick start snippet to `0.2` (or an explicit `0.2.0`).

Suggested change:

```diff
- soundevents = "0.1"
+ soundevents = "0.2"
```
```diff
@@ -119,10 +128,18 @@ Enable the `bundled-tiny` feature to embed `models/tiny.onnx` into your binary
 soundevents = { version = "0.1", features = ["bundled-tiny"] }
```

The bundled model example still pins `soundevents = { version = "0.1", ... }` even though the crate version is now 0.2.0. This will keep pulling the pre-0.2 release. Update this dependency snippet to `0.2`/`0.2.0`.

Suggested change:

```diff
- soundevents = { version = "0.1", features = ["bundled-tiny"] }
+ soundevents = { version = "0.2", features = ["bundled-tiny"] }
```
```markdown
- Initial public release of the ONNX Runtime wrapper for CED AudioSet classifiers.
- Added file, memory, and bundled-model loading paths plus configurable graph optimization.
- Added ranked top-k helpers, raw-score accessors, and chunked inference with mean/max aggregation.
- Added equal-length batch APIs for clip inference and chunked window batching for higher-throughput services.
```

The 0.1.0 changelog entry claims "Added equal-length batch APIs…", but those APIs are introduced in this PR alongside the 0.2.0 bump. This makes the release notes misleading; consider moving that bullet into the 0.2.0 section (or removing it from 0.1.0) so each version's section reflects what shipped in that release.
```rust
let total_samples = batch_size * chunk_len;

self.input_scratch.clear();
self.input_scratch.reserve(total_samples);
```

`total_samples` is computed as `batch_size * chunk_len`, which can overflow `usize` in release builds (wrapping silently) and lead to an undersized `reserve()` followed by a panic or allocation failure during `extend_from_slice`. Use `checked_mul` (and ideally also guard the subsequent `reserve`) and return a structured error if the requested batch is too large to pack safely.

Suggested change:

```rust
let total_samples = batch_size
    .checked_mul(chunk_len)
    .ok_or(ClassifierError::BatchTooLarge {
        batch_size,
        chunk_len,
    })?;
self.input_scratch.clear();
self.input_scratch
    .try_reserve(total_samples)
    .map_err(|_| ClassifierError::BatchTooLarge {
        batch_size,
        chunk_len,
    })?;
```
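As a standalone illustration of the suggested guard (a sketch only: the real code would return the crate's `ClassifierError::BatchTooLarge`, which is stubbed here as a plain `String`), both `checked_mul` and `try_reserve` report failure where a plain `*` would silently wrap:

```rust
// Sketch: the error type is a stand-in String, not the crate's ClassifierError.
fn pack_len(batch_size: usize, chunk_len: usize) -> Result<usize, String> {
    batch_size
        .checked_mul(chunk_len)
        .ok_or_else(|| format!("batch too large: {batch_size} chunks x {chunk_len} samples"))
}

fn main() {
    // A realistic batch packs fine.
    assert_eq!(pack_len(8, 160_000), Ok(1_280_000));

    // Plain `*` would wrap silently in release builds; checked_mul reports it.
    assert!(pack_len(usize::MAX, 2).is_err());

    // try_reserve surfaces allocation failure instead of aborting the process.
    let mut scratch: Vec<f32> = Vec::new();
    assert!(scratch.try_reserve(1_280_000).is_ok());
    assert!(scratch.try_reserve(usize::MAX).is_err());
}
```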
```text
@@ -0,0 +1,25 @@
Copyright (c) 2015 The Rust Project Developers
```

The MIT license text here is copied verbatim from the Rust Project template ("Copyright (c) 2015 The Rust Project Developers"), but the crate README elsewhere asserts copyright by the FinDIT authors. If this crate is not actually copyrighted by the Rust Project Developers, update the copyright line (or use a generic MIT template) so the packaged license accurately reflects the project's copyright holders.

Suggested change:

```diff
- Copyright (c) 2015 The Rust Project Developers
+ Copyright (c) [year] [copyright holders]
```
```text
@@ -0,0 +1,25 @@
Copyright (c) 2015 The Rust Project Developers
```

The MIT license text here uses the Rust Project's copyright header ("Copyright (c) 2015 The Rust Project Developers"). If soundevents-dataset is copyrighted by your project authors (as indicated in the README), this should be replaced with the correct copyright notice (or a placeholder MIT template) before publishing.

Suggested change:

```diff
- Copyright (c) 2015 The Rust Project Developers
+ Copyright (c) [year] [copyright holders]
```
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on pull request comments.

Thanks for integrating Codecov - we've got you covered ☂️
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
```rust
std::iter::from_fn(move || {
    let first = chunks.next()?;
    let first_len = first.len();
    let mut batch = Vec::with_capacity(options.batch_size());
    batch.push(first);

    while batch.len() < options.batch_size() {
        let Some(next_len) = chunks.peek().map(|chunk| chunk.len()) else {
```

`chunk_batches` preallocates `Vec::with_capacity(options.batch_size())`. Because `batch_size` is fully caller-controlled and only validated as non-zero, a very large value can cause a huge allocation (or abort) even for small clips. Consider not preallocating, or capping the capacity to a reasonable upper bound or the number of remaining chunks.
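A minimal sketch of the capped-capacity idea, using simplified stand-in types rather than the crate's actual chunk iterator and options: the preallocation is bounded by the iterator's `size_hint` upper bound, so an absurd `batch_size` no longer forces a huge allocation.

```rust
// Sketch: batch an iterator of chunks without trusting the caller-supplied
// batch_size for preallocation. Capacity is capped by size_hint's upper bound.
fn batches<I: Iterator<Item = Vec<f32>>>(
    mut chunks: I,
    batch_size: usize,
) -> impl Iterator<Item = Vec<Vec<f32>>> {
    std::iter::from_fn(move || {
        let first = chunks.next()?;
        // Never reserve more slots than could actually be filled.
        let upper = chunks
            .size_hint()
            .1
            .map_or(1, |rest| batch_size.min(rest + 1));
        let mut batch = Vec::with_capacity(upper);
        batch.push(first);
        while batch.len() < batch_size {
            match chunks.next() {
                Some(chunk) => batch.push(chunk),
                None => break,
            }
        }
        Some(batch)
    })
}

fn main() {
    // Even with an absurd batch_size, only 5 slots ever get preallocated.
    let chunks = vec![vec![0.0f32; 4]; 5].into_iter();
    let sizes: Vec<usize> = batches(chunks, usize::MAX).map(|b| b.len()).collect();
    assert_eq!(sizes, vec![5]);

    // Normal batching still splits as expected: 5 chunks at batch_size 2.
    let chunks = vec![vec![0.0f32; 4]; 5].into_iter();
    let sizes: Vec<usize> = batches(chunks, 2).map(|b| b.len()).collect();
    assert_eq!(sizes, vec![2, 2, 1]);
}
```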
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
```diff
@@ -119,10 +130,18 @@ Enable the `bundled-tiny` feature to embed `models/tiny.onnx` into your binary
 soundevents = { version = "0.1", features = ["bundled-tiny"] }
```

The bundled-tiny dependency snippet still uses `version = "0.1"`, but this PR bumps the crate to 0.2.0 and earlier examples already use `soundevents = "0.2"`. Please update this snippet to `0.2` to avoid readers copying an outdated version.

Suggested change:

```diff
- soundevents = { version = "0.1", features = ["bundled-tiny"] }
+ soundevents = { version = "0.2", features = ["bundled-tiny"] }
```
```rust
std::iter::from_fn(move || {
    let first = chunks.next()?;
    let first_len = first.len();
    let mut batch = Vec::with_capacity(options.batch_size());
```

`chunk_batches` preallocates with `Vec::with_capacity(options.batch_size())`. Since `batch_size` is user-controlled and only validated as non-zero, an extremely large value can trigger an attempted huge allocation and panic/abort before any inference runs. Consider avoiding preallocation based on `batch_size` (e.g., start with an empty `Vec` and let it grow to the actual batch length, or cap the preallocation to a safe upper bound).

Suggested change:

```diff
- let mut batch = Vec::with_capacity(options.batch_size());
+ let mut batch = Vec::new();
```
```rust
/// Classify a batch of equal-length mono 16 kHz clips and return the top `k` classes for each clip.
pub fn classify_batch(
    &mut self,
    batch_16k: &[&[f32]],
    top_k: usize,
) -> Result<Vec<Vec<EventPrediction>>, ClassifierError> {
    let _ = validate_batch_inputs(batch_16k)?;

    if top_k == 0 {
        return Ok((0..batch_16k.len()).map(|_| Vec::new()).collect());
    }

    self.with_raw_scores_batch(batch_16k, |raw_scores, batch_size| {
        raw_scores
```

`classify_batch` calls `validate_batch_inputs(batch_16k)?` and then immediately calls `with_raw_scores_batch`, which re-validates the same batch. This doubles the input scan cost for large batches; consider restructuring so validation happens only once (while still short-circuiting when `top_k == 0`).
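One possible restructuring, sketched with stand-in types and a toy scoring function since the crate's internals aren't shown in this diff: validate once in the public entry point, then delegate to an internal helper that assumes already-validated input and does no second scan.

```rust
// Sketch of validate-once layering. `Error`, `validate`, and `score_validated`
// are illustrative stand-ins, not the crate's actual items.
#[derive(Debug, PartialEq)]
enum Error {
    EmptyBatch,
    UnequalLengths,
}

fn validate(batch: &[&[f32]]) -> Result<usize, Error> {
    let first = batch.first().ok_or(Error::EmptyBatch)?;
    if batch.iter().any(|clip| clip.len() != first.len()) {
        return Err(Error::UnequalLengths);
    }
    Ok(batch.len())
}

// Internal helper: the caller guarantees the batch was already validated,
// so this only does the (toy) scoring work.
fn score_validated(batch: &[&[f32]]) -> Vec<f32> {
    batch.iter().map(|clip| clip.iter().sum()).collect()
}

fn classify(batch: &[&[f32]], top_k: usize) -> Result<Vec<f32>, Error> {
    let n = validate(batch)?; // validated exactly once
    if top_k == 0 {
        return Ok(vec![0.0; n]); // short-circuit still happens after validation
    }
    Ok(score_validated(batch))
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [3.0f32, 4.0];
    assert_eq!(classify(&[&a, &b], 1), Ok(vec![3.0, 7.0]));
    assert_eq!(classify(&[&a, &b[..1]], 1), Err(Error::UnequalLengths));
}
```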
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
```rust
let expected_values = expected_batch_size.saturating_mul(NUM_CLASSES);
if raw_scores.len() != expected_values {
    if raw_scores.len() % expected_batch_size.max(1) == 0 {
        return Err(ClassifierError::UnexpectedClassCount {
            expected: NUM_CLASSES,
            actual: raw_scores.len() / expected_batch_size.max(1),
        });
```

`validate_output` uses `saturating_mul` to compute the expected element count. If `expected_batch_size * NUM_CLASSES` overflows, this silently clamps and can yield confusing errors (and potentially accept a nonsensical `expected_values`). Consider using the existing `checked_batch_len(expected_batch_size, NUM_CLASSES)` here, returning `ClassifierError::BatchTooLarge` on overflow, and then using that checked value for the length comparison.
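The clamping behavior is easy to demonstrate in isolation. In this sketch, `NUM_CLASSES` is an illustrative stand-in (527, the AudioSet label count), not a reference to the crate's actual constant:

```rust
// Sketch: saturating_mul silently clamps on overflow, while checked_mul
// surfaces it so the caller can return a structured error instead.
const NUM_CLASSES: usize = 527; // illustrative stand-in for the crate's constant

fn expected_len_saturating(batch: usize) -> usize {
    batch.saturating_mul(NUM_CLASSES)
}

fn expected_len_checked(batch: usize) -> Option<usize> {
    batch.checked_mul(NUM_CLASSES)
}

fn main() {
    // Sane batch size: both agree.
    assert_eq!(expected_len_saturating(4), 2108);
    assert_eq!(expected_len_checked(4), Some(2108));

    // Overflowing batch size: saturating clamps to usize::MAX (a nonsense
    // "expected" length), while checked reports the overflow explicitly.
    assert_eq!(expected_len_saturating(usize::MAX), usize::MAX);
    assert_eq!(expected_len_checked(usize::MAX), None);
}
```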
```rust
let first = chunks.next()?;
let first_len = first.len();
let mut batch = Vec::with_capacity(options.batch_size());
batch.push(first);

while batch.len() < options.batch_size() {
    let Some(next_len) = chunks.peek().map(|chunk| chunk.len()) else {
```

`chunk_batches` preallocates with `Vec::with_capacity(options.batch_size())`. Since `batch_size` is user-controlled, very large values can trigger a huge allocation (panic/OOM) even for short inputs. Prefer not preallocating from the raw batch size (e.g., start with an empty vec and let it grow), or cap/guard the capacity via a checked/`try_reserve` path that maps to `ClassifierError::BatchTooLarge`.
```rust
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;
```

The README examples use a repo-relative path (`"soundevents/models/tiny.onnx"`) when calling `Classifier::from_file`. For crates.io users this path typically won't exist at runtime, so the snippet is likely to fail when copied. Consider switching the example to a placeholder like `"path/to/model.onnx"` (and/or pointing to `Classifier::tiny` behind `bundled-tiny`) so the quick start is correct outside of a git checkout.

Suggested change:

```diff
- let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;
+ let mut classifier = Classifier::from_file("path/to/model.onnx")?;
```
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
```rust
std::iter::from_fn(move || {
    let first = chunks.next()?;
    let first_len = first.len();
    let mut batch = Vec::with_capacity(options.batch_size());
```

`chunk_batches` preallocates with `Vec::with_capacity(options.batch_size())`, but `batch_size` is user-controlled. A very large batch size can trigger an immediate huge allocation / OOM even if only a few chunks are actually present. Consider avoiding capacity preallocation based on `batch_size` (or capping it to a reasonable maximum) so chunked inference can fail gracefully rather than risking an allocation spike.

Suggested change:

```diff
- let mut batch = Vec::with_capacity(options.batch_size());
+ let mut batch = Vec::new();
```
No description provided.