soundevents

Production-oriented Rust inference for CED AudioSet sound-event classifiers — load an ONNX model, feed it 16 kHz mono audio, get back ranked RatedSoundEvent predictions with names, ids, and confidences. Long clips are handled via configurable chunking.

Highlights

  • Drop-in CED inference — load any CED AudioSet ONNX model (or use the bundled tiny variant) and run it directly on &[f32] PCM samples. No Python, no preprocessing pipeline.
  • Typed labels, not bare integers — every prediction comes back as an EventPrediction carrying a &'static RatedSoundEvent from soundevents-dataset, so you get the canonical AudioSet name, the /m/... id, the model class index, and the confidence in one struct.
  • Compile-time class-count guarantee — the NUM_CLASSES = 527 constant comes from the rated dataset at codegen time. If a model returns the wrong number of classes you get a typed ClassifierError::UnexpectedClassCount instead of a silent mismatch.
  • Long-clip chunking built in — classify_chunked / classify_all_chunked window the input at a configurable hop, run inference on each chunk, and aggregate the per-chunk confidences with either Mean or Max. Defaults match CED's 10 s training window (160 000 samples at 16 kHz), and fixed-size chunk batches can be packed into one model call.
  • Top-k via a tiny min-heap — classify(samples, k) does not allocate a full 527-element score vector to find the top results.
  • Batch-ready low-level API — predict_raw_scores_batch, predict_raw_scores_batch_flat, predict_raw_scores_batch_into, classify_all_batch, and classify_batch accept equal-length clip batches for service-layer batching.
  • Bring-your-own model or bundle one — load from a path, from in-memory bytes, or enable the bundled-tiny feature to embed models/tiny.onnx directly into your binary.
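The top-k trick mentioned above can be sketched in plain std Rust: keep a min-heap of at most k (score, index) pairs and evict the smallest entry as you scan, so memory stays O(k) instead of O(527). This is a standalone illustration under that description, not the crate's actual internals:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Standalone sketch of top-k selection with a k-sized min-heap; an
// illustration, not the crate's internals. `to_bits` is a valid ordering
// key here because confidences are non-negative finite floats, whose
// IEEE-754 bit patterns sort in the same order as their values.
fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Reverse<(u32, usize)>> = BinaryHeap::with_capacity(k + 1);
    for (i, &s) in scores.iter().enumerate() {
        heap.push(Reverse((s.to_bits(), i)));
        if heap.len() > k {
            heap.pop(); // evict the smallest of the current top-k
        }
    }
    // `into_sorted_vec` sorts ascending by `Reverse`, i.e. best score first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse((_, i))| i)
        .collect()
}
```

The same idea generalizes to any fixed-size class vector where k is much smaller than the number of classes.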

Quick start

```toml
[dependencies]
soundevents = "0.2"
```

```rust
use soundevents::Classifier;

fn load_mono_16k_audio(_: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    Ok(vec![0.0; 16_000])
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;

    // Bring your own decoder/resampler — soundevents expects mono f32
    // samples at 16 kHz, in [-1.0, 1.0].
    let samples: Vec<f32> = load_mono_16k_audio("clip.wav")?;

    // Top-5 predictions for a clip up to ~10 s long.
    for prediction in classifier.classify(&samples, 5)? {
        println!(
            "{:>5.1}%  {:>3}  {}  ({})",
            prediction.confidence() * 100.0,
            prediction.index(),
            prediction.name(),
            prediction.id(),
        );
    }
    Ok(())
}
```

Long clips: chunked inference

Classifier::classify_chunked slides a window over the input and aggregates each chunk's per-class confidences. The defaults (10 s window, 10 s hop, mean aggregation) match CED's training setup; tune them for overlap or peak-pooling.

```rust
use soundevents::{ChunkAggregation, ChunkingOptions, Classifier};

fn load_long_clip() -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    Ok(vec![0.0; 320_000])
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;
    let samples: Vec<f32> = load_long_clip()?;

    let opts = ChunkingOptions::default()
        // 5 s overlap (50%) between adjacent windows
        .with_hop_samples(80_000)
        // Batch up to 4 equal-length windows per session.run()
        .with_batch_size(4)
        // Keep the loudest detection in any window instead of averaging
        .with_aggregation(ChunkAggregation::Max);

    let top3 = classifier.classify_chunked(&samples, 3, opts)?;
    for prediction in top3 {
        println!("{}: {:.2}", prediction.name(), prediction.confidence());
    }
    Ok(())
}
```

Models

The four CED variants are sourced from the mispeech Hugging Face organisation, exported to ONNX, and checked into this repo under soundevents/models/. You should not normally need to download anything — git clone gives you a working classifier out of the box.

| Variant | File | Size | Hugging Face source |
|---------|------|------|---------------------|
| tiny | soundevents/models/tiny.onnx | 6.4 MB | mispeech/ced-tiny |
| mini | soundevents/models/mini.onnx | 10 MB | mispeech/ced-mini |
| small | soundevents/models/small.onnx | 22 MB | mispeech/ced-small |
| base | soundevents/models/base.onnx | 97 MB | mispeech/ced-base |

All four expose the same input/output contract: mono f32 PCM at 16 kHz in, 527-class scores out (the SAMPLE_RATE_HZ and NUM_CLASSES constants). They differ only in parameter count and the accuracy/latency trade-off, so you can swap variants without touching application code.

Note — the four ONNX files together are ~135 MB. If you fork this repo and want to keep the working tree slim, consider tracking soundevents/models/*.onnx with git LFS.
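Because all four variants share one contract, switching models is just a path change. A hypothetical helper (not part of the crate) might map variant names to the checked-in files:

```rust
// Hypothetical helper mapping a variant name to its checked-in ONNX path.
// The crate does not ship this function; it only illustrates that swapping
// variants requires no other code changes.
fn model_path(variant: &str) -> Option<&'static str> {
    match variant {
        "tiny" => Some("soundevents/models/tiny.onnx"),
        "mini" => Some("soundevents/models/mini.onnx"),
        "small" => Some("soundevents/models/small.onnx"),
        "base" => Some("soundevents/models/base.onnx"),
        _ => None,
    }
}
```

A CLI could then feed the resolved path straight to Classifier::from_file.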

Refreshing models from upstream

If upstream releases new weights, or you cloned without the model files, refetch them with:

```sh
# Requires huggingface_hub:  pip install --user huggingface_hub
./scripts/download_models.sh

# Or just one variant
./scripts/download_models.sh tiny
```

The script downloads the *.onnx artifact from each mispeech/ced-* Hugging Face repo and writes it as soundevents/models/<variant>.onnx.

See THIRD_PARTY_NOTICES.md for upstream model sources and attribution details.

Bundled tiny model

Enable the bundled-tiny feature to embed models/tiny.onnx into your binary — useful for CLI tools and self-contained services where you don't want to ship a separate model file.

```toml
[dependencies]
soundevents = { version = "0.2", features = ["bundled-tiny"] }
```

```rust
use soundevents::{Classifier, Options};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // No model file on disk needed — the weights are compiled in.
    let mut classifier = Classifier::tiny(Options::default())?;
    let _ = &mut classifier; // ...classify as usual
    Ok(())
}
```

Features

| Feature | Default | What you get |
|---------|---------|--------------|
| bundled-tiny | No | Embeds models/tiny.onnx into the crate so Classifier::tiny() works without an external file. |

The full input/output contract:

| Constant | Value | Meaning |
|----------|-------|---------|
| SAMPLE_RATE_HZ | 16_000 | Required input sample rate (mono f32). |
| DEFAULT_CHUNK_SAMPLES | 160_000 | Default 10 s window/hop for chunked inference. |
| NUM_CLASSES | 527 | Number of CED output classes — derived at compile time from RatedSoundEvent::events().len(). |

For low-level batching, every clip in predict_raw_scores_batch* / classify_*_batch must be non-empty and have the same sample count. predict_raw_scores_batch_flat returns one row-major Vec<f32>, and predict_raw_scores_batch_into lets callers reuse their own output buffer to avoid per-call result allocations. classify_chunked applies the same equal-length restriction internally when ChunkingOptions::batch_size() > 1; fixed-size windows satisfy it naturally, and the final short tail chunk automatically falls back to a smaller batch.
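The flat batch output can be indexed with simple row-major arithmetic. A sketch, mirroring the crate's NUM_CLASSES value as a local constant:

```rust
// Row-major layout implied by predict_raw_scores_batch_flat: clip `i`'s
// score for class `c` lives at flat[i * NUM_CLASSES + c]. The constant
// mirrors the crate's value; the indexing arithmetic is the point here.
const NUM_CLASSES: usize = 527;

fn score_at(flat: &[f32], clip: usize, class: usize) -> f32 {
    flat[clip * NUM_CLASSES + class]
}
```

This is the layout a caller would also fill when passing a reusable buffer to predict_raw_scores_batch_into.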

Development

Regenerate the dataset from upstream sources:

cargo xtask codegen

Run the test suite:

cargo test

License

soundevents is distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details. Bundled third-party model attributions and source licenses are documented in THIRD_PARTY_NOTICES.md.

Copyright (c) 2026 FinDIT studio authors.
