Seatbelt Crate Analysis
Analysis of the seatbelt crate covering performance, correctness, and usability.
Performance
1. Circuit Breaker Engine Map Uses Mutex<HashMap> — Consider a Concurrent Map
File: breaker/engine/engines.rs:15-17
Engines wraps its map in a Mutex<HashMap<BreakerId, Arc<Engine>>>. Every execute() call acquires this mutex to look up (or insert) an engine by BreakerId. Under high concurrency with per-host breaker IDs, this serializes all lookups through a single lock.
Suggestion: Replace with a concurrent map (e.g., dashmap or a sharded lock scheme) to allow concurrent reads without contention. Alternatively, use a RwLock so that the common read-path (engine already exists) doesn't block other readers.
2. Telemetry Attribute Arrays Are Allocated on Every Event
Files: breaker/engine/engine_telemetry.rs, timeout/service.rs:200-208, retry/service.rs:236-248
Each telemetry report constructs a fresh &[KeyValue] array with Cow::clone() on every event. For hot paths (every retry attempt, every circuit breaker enter/exit, every timeout), this causes small but repeated heap allocations for the Cow::Owned pipeline/strategy names.
Suggestion: Pre-allocate a base Vec<KeyValue> containing the static pipeline/strategy/event attributes in the *Shared struct at construction time. On each call, clone or extend only the dynamic attributes (e.g., attempt index). This avoids re-cloning Cow<str> values on every invocation.
3. BackoffOptions Is Cloned on Every delays() Call
File: retry/backoff.rs:49-55
DelayBackoff::delays() clones the entire BackoffOptions (including the Rnd enum which contains an Arc) into each DelaysIter. Since a new iterator is created per retry cycle, this is a clone per request.
Impact: Minor — the clone is cheap (a few fields + an Arc::clone). Noting it only because in extreme throughput scenarios it's a small inefficiency. No change recommended unless profiling shows it as a hotspot.
4. HealthMetrics Window Cleanup Is Linear
File: breaker/health.rs:119-124
The record() method removes expired windows with a while loop calling pop_front(). With 10 fixed windows this is bounded and acceptable. However, the health_info() method iterates all windows to sum successes/failures on every call.
Impact: Negligible with WINDOW_COUNT = 10. If the window count ever becomes configurable and grows large, consider maintaining running totals.
Correctness
5. Metrics VERSION Constant Is Stale
File: metrics.rs:8
const VERSION: &str = "v0.1.0";
The crate version in Cargo.toml is 0.3.0, but the OpenTelemetry instrumentation scope reports v0.1.0. This means metrics consumers see the wrong version. This hardcoded constant will silently drift on every release.
Suggestion: Either use env!("CARGO_PKG_VERSION") to derive it automatically, or at minimum update it to "v0.3.0" and add a test that asserts consistency.
6. failure_threshold Clamps High Values but Allows Negative Values
File: breaker/layer.rs:157-159
pub fn failure_threshold(mut self, threshold: f32) -> Self {
self.failure_threshold = threshold.min(1.0);
self
}
The threshold is clamped to 1.0 on the upper end, but negative values (e.g., -0.5) pass through unchecked. A negative threshold would cause the circuit breaker to open immediately on any traffic exceeding min_throughput, since failure_rate >= negative_threshold is always true.
Suggestion: Clamp to threshold.clamp(0.0, 1.0) or at minimum threshold.max(0.0).min(1.0).
7. Window::update Doesn't Use Saturating Arithmetic
File: breaker/health.rs:169-174
fn update(&mut self, result: ExecutionResult) {
match result {
ExecutionResult::Success => self.successes += 1,
ExecutionResult::Failure => self.failures += 1,
}
}
While HealthInfo::new and Stats methods use saturating_add, individual Window counters use plain +=. Under extreme traffic within a single window, this could theoretically overflow in debug builds (panic) or wrap in release builds.
Suggestion: Use saturating_add(1) for consistency with the rest of the crate.
8. Typo in Telemetry Field Name: successfull
File: breaker/engine/engine_telemetry.rs:162
circuit_breaker.probes.successfull = stats.probes_successes,
The log field name is misspelled as successfull (double l) instead of successful. Consumers querying or filtering logs by this field name will need to use the misspelled version.
9. RecoveryKind Wildcard Match May Mask New Variants
File: retry/service.rs:184
RecoveryKind::Never | RecoveryKind::Unknown | _ => false,
The wildcard _ silently handles any future RecoveryKind variants as non-recoverable. If a new variant is added that should be recoverable, this match arm would silently suppress it. The #[non_exhaustive] attribute on RecoveryKind makes this a practical concern.
Suggestion: Remove the wildcard and explicitly match all known variants, or add a comment explaining the deliberate catch-all design decision. Note: this is behind mutants::skip so it may be intentional.
10. ExecutionResult::from_recovery Treats Unknown as Success
File: breaker/execution_result.rs:27-33
match recovery.kind() {
RecoveryKind::Retry | RecoveryKind::Unavailable => Self::Failure,
_ => Self::Success,
}
RecoveryKind::Unknown maps to Success, which means the circuit breaker won't count unknown recoveries as failures. This might be intentional (conservative: don't trip breaker on unknowns), but it's worth documenting the design choice explicitly since it differs from how one might intuitively expect "unknown" to be treated.
Usability
11. ResilienceContext Requires Explicit Type Parameters
File: context.rs:16
pub struct ResilienceContext<In, Out> { ... }
The In and Out phantom type parameters exist only to carry type information for the middleware pipeline. Users must specify them explicitly in some contexts (e.g., ResilienceContext::<String, Result<String, String>>::new(&clock)), which is verbose. The types are only used to constrain layers at build time, not at construction time.
Suggestion: Consider whether ResilienceContext could be type-erased and have the layer methods constrain In/Out instead, or provide a convenience constructor that infers types from usage.
12. No Default Implementation for HalfOpenMode
File: breaker/half_open_mode.rs
HalfOpenMode doesn't implement Default, even though BreakerLayer defaults to progressive(None). Adding impl Default for HalfOpenMode with progressive(None) would let users write HalfOpenMode::default() and align with the documented default.
13. BreakerLayer::layer() Can Panic at Runtime Despite Type-State Pattern
File: breaker/layer.rs:348-349
recovery: self.recovery.clone().expect("recovery must be set in Ready state"),
rejected_input: self.rejected_input.clone().expect("rejected_input must be set in Ready state"),
The type-state pattern (Set/NotSet) prevents calling layer() without setting required fields, but the actual values are stored as Option<T>. The .expect() calls should never fire, but they represent a mismatch between the type-level and value-level guarantees. The same pattern exists in RetryLayer and TimeoutLayer.
Suggestion: This is a minor internal concern (not a user-facing bug) since the type system prevents the invalid state. However, storing the values directly (not wrapped in Option) in the Set state would eliminate the .expect() calls entirely.
14. min_throughput Accepts 0 Without Warning
File: breaker/layer.rs:170-173
Setting min_throughput(0) means the circuit breaker can trip on the very first failure, regardless of how little traffic there is. This is likely never the intended behavior and could cause immediate circuit opening in low-traffic scenarios.
Suggestion: Either enforce a minimum of 1, or document the 0 behavior explicitly.
15. No Way to Inspect or Reset Circuit Breaker State
File: breaker/service.rs, breaker/engine/engines.rs
Users cannot programmatically query the circuit breaker's current state (open/closed/half-open) or manually reset it. This makes operational tasks difficult — for example, after deploying a fix for a downstream service, operators cannot force the circuit closed to resume traffic immediately.
Suggestion: Expose a state() or is_open() method on Breaker, and consider a reset() method for operational needs.
16. Circuit Breaker Engine Map Grows Without Bound
File: breaker/engine/engines.rs:32-42
The Engines map never evicts entries. If BreakerId values are derived from dynamic input (e.g., per-host circuit breakers and hosts come and go), the map will grow indefinitely. The doc comment on BreakerId warns about this, but there's no runtime protection.
Suggestion: Consider adding an optional eviction policy (LRU, TTL, or max-entries cap) to prevent unbounded memory growth in long-running services.
Summary
| # |
Category |
Severity |
Item |
| 1 |
Performance |
Medium |
Mutex<HashMap> in engine map serializes lookups |
| 2 |
Performance |
Low |
Telemetry attributes re-allocated on every event |
| 3 |
Performance |
Negligible |
BackoffOptions cloned per retry cycle |
| 4 |
Performance |
Negligible |
HealthMetrics linear window scan |
| 5 |
Correctness |
Medium |
Metrics VERSION hardcoded at v0.1.0, crate is at 0.3.0 |
| 6 |
Correctness |
Low |
failure_threshold allows negative values |
| 7 |
Correctness |
Low |
Window::update uses non-saturating arithmetic |
| 8 |
Correctness |
Low |
Typo successfull in telemetry field |
| 9 |
Correctness |
Low |
Wildcard match on RecoveryKind may mask new variants |
| 10 |
Correctness |
Info |
Unknown recovery treated as success in breaker |
| 11 |
Usability |
Low |
ResilienceContext requires explicit type parameters |
| 12 |
Usability |
Low |
No Default for HalfOpenMode |
| 13 |
Usability |
Info |
Type-state pattern still uses .expect() internally |
| 14 |
Usability |
Low |
min_throughput(0) has surprising behavior |
| 15 |
Usability |
Medium |
No way to inspect or reset circuit breaker state |
| 16 |
Usability |
Medium |
Engine map grows without bound |
Human Notes
Some of the points are valid ones, will get through these and try to apply some reasoning and then fix ones that are relevant.
Seatbelt Crate Analysis
Analysis of the
seatbeltcrate covering performance, correctness, and usability.Performance
1. Circuit Breaker Engine Map Uses
Mutex<HashMap>— Consider a Concurrent MapFile:
breaker/engine/engines.rs:15-17Engineswraps its map in aMutex<HashMap<BreakerId, Arc<Engine>>>. Everyexecute()call acquires this mutex to look up (or insert) an engine byBreakerId. Under high concurrency with per-host breaker IDs, this serializes all lookups through a single lock.Suggestion: Replace with a concurrent map (e.g.,
dashmapor a sharded lock scheme) to allow concurrent reads without contention. Alternatively, use aRwLockso that the common read-path (engine already exists) doesn't block other readers.2. Telemetry Attribute Arrays Are Allocated on Every Event
Files:
breaker/engine/engine_telemetry.rs,timeout/service.rs:200-208,retry/service.rs:236-248Each telemetry report constructs a fresh
&[KeyValue]array withCow::clone()on every event. For hot paths (every retry attempt, every circuit breaker enter/exit, every timeout), this causes small but repeated heap allocations for theCow::Ownedpipeline/strategy names.Suggestion: Pre-allocate a base
Vec<KeyValue>containing the static pipeline/strategy/event attributes in the*Sharedstruct at construction time. On each call, clone or extend only the dynamic attributes (e.g., attempt index). This avoids re-cloningCow<str>values on every invocation.3.
BackoffOptionsIs Cloned on Everydelays()CallFile:
retry/backoff.rs:49-55DelayBackoff::delays()clones the entireBackoffOptions(including theRndenum which contains anArc) into eachDelaysIter. Since a new iterator is created per retry cycle, this is a clone per request.Impact: Minor — the clone is cheap (a few fields + an
Arc::clone). Noting it only because in extreme throughput scenarios it's a small inefficiency. No change recommended unless profiling shows it as a hotspot.4.
HealthMetricsWindow Cleanup Is LinearFile:
breaker/health.rs:119-124The
record()method removes expired windows with awhileloop callingpop_front(). With 10 fixed windows this is bounded and acceptable. However, thehealth_info()method iterates all windows to sum successes/failures on every call.Impact: Negligible with
WINDOW_COUNT = 10. If the window count ever becomes configurable and grows large, consider maintaining running totals.Correctness
5. Metrics
VERSIONConstant Is StaleFile:
metrics.rs:8The crate version in
Cargo.tomlis0.3.0, but the OpenTelemetry instrumentation scope reportsv0.1.0. This means metrics consumers see the wrong version. This hardcoded constant will silently drift on every release.Suggestion: Either use
env!("CARGO_PKG_VERSION")to derive it automatically, or at minimum update it to"v0.3.0"and add a test that asserts consistency.6.
failure_thresholdClamps High Values but Allows Negative ValuesFile:
breaker/layer.rs:157-159The threshold is clamped to
1.0on the upper end, but negative values (e.g.,-0.5) pass through unchecked. A negative threshold would cause the circuit breaker to open immediately on any traffic exceedingmin_throughput, sincefailure_rate >= negative_thresholdis always true.Suggestion: Clamp to
threshold.clamp(0.0, 1.0)or at minimumthreshold.max(0.0).min(1.0).7.
Window::updateDoesn't Use Saturating ArithmeticFile:
breaker/health.rs:169-174While
HealthInfo::newandStatsmethods usesaturating_add, individualWindowcounters use plain+=. Under extreme traffic within a single window, this could theoretically overflow in debug builds (panic) or wrap in release builds.Suggestion: Use
saturating_add(1)for consistency with the rest of the crate.8. Typo in Telemetry Field Name:
successfullFile:
breaker/engine/engine_telemetry.rs:162The log field name is misspelled as
successfull(doublel) instead ofsuccessful. Consumers querying or filtering logs by this field name will need to use the misspelled version.9.
RecoveryKindWildcard Match May Mask New VariantsFile:
retry/service.rs:184The wildcard
_silently handles any futureRecoveryKindvariants as non-recoverable. If a new variant is added that should be recoverable, this match arm would silently suppress it. The#[non_exhaustive]attribute onRecoveryKindmakes this a practical concern.Suggestion: Remove the wildcard and explicitly match all known variants, or add a comment explaining the deliberate catch-all design decision. Note: this is behind
mutants::skipso it may be intentional.10.
ExecutionResult::from_recoveryTreatsUnknownas SuccessFile:
breaker/execution_result.rs:27-33RecoveryKind::Unknownmaps toSuccess, which means the circuit breaker won't count unknown recoveries as failures. This might be intentional (conservative: don't trip breaker on unknowns), but it's worth documenting the design choice explicitly since it differs from how one might intuitively expect "unknown" to be treated.Usability
11.
ResilienceContextRequires Explicit Type ParametersFile:
context.rs:16The
InandOutphantom type parameters exist only to carry type information for the middleware pipeline. Users must specify them explicitly in some contexts (e.g.,ResilienceContext::<String, Result<String, String>>::new(&clock)), which is verbose. The types are only used to constrain layers at build time, not at construction time.Suggestion: Consider whether
ResilienceContextcould be type-erased and have the layer methods constrainIn/Outinstead, or provide a convenience constructor that infers types from usage.12. No
DefaultImplementation forHalfOpenModeFile:
breaker/half_open_mode.rsHalfOpenModedoesn't implementDefault, even thoughBreakerLayerdefaults toprogressive(None). Addingimpl Default for HalfOpenModewithprogressive(None)would let users writeHalfOpenMode::default()and align with the documented default.13.
BreakerLayer::layer()Can Panic at Runtime Despite Type-State PatternFile:
breaker/layer.rs:348-349The type-state pattern (
Set/NotSet) prevents callinglayer()without setting required fields, but the actual values are stored asOption<T>. The.expect()calls should never fire, but they represent a mismatch between the type-level and value-level guarantees. The same pattern exists inRetryLayerandTimeoutLayer.Suggestion: This is a minor internal concern (not a user-facing bug) since the type system prevents the invalid state. However, storing the values directly (not wrapped in
Option) in theSetstate would eliminate the.expect()calls entirely.14.
min_throughputAccepts 0 Without WarningFile:
breaker/layer.rs:170-173Setting
min_throughput(0)means the circuit breaker can trip on the very first failure, regardless of how little traffic there is. This is likely never the intended behavior and could cause immediate circuit opening in low-traffic scenarios.Suggestion: Either enforce a minimum of 1, or document the
0behavior explicitly.15. No Way to Inspect or Reset Circuit Breaker State
File:
breaker/service.rs,breaker/engine/engines.rsUsers cannot programmatically query the circuit breaker's current state (open/closed/half-open) or manually reset it. This makes operational tasks difficult — for example, after deploying a fix for a downstream service, operators cannot force the circuit closed to resume traffic immediately.
Suggestion: Expose a
state()oris_open()method onBreaker, and consider areset()method for operational needs.16. Circuit Breaker Engine Map Grows Without Bound
File:
breaker/engine/engines.rs:32-42The
Enginesmap never evicts entries. IfBreakerIdvalues are derived from dynamic input (e.g., per-host circuit breakers and hosts come and go), the map will grow indefinitely. The doc comment onBreakerIdwarns about this, but there's no runtime protection.Suggestion: Consider adding an optional eviction policy (LRU, TTL, or max-entries cap) to prevent unbounded memory growth in long-running services.
Summary
Mutex<HashMap>in engine map serializes lookupsBackoffOptionscloned per retry cycleHealthMetricslinear window scanVERSIONhardcoded atv0.1.0, crate is at0.3.0failure_thresholdallows negative valuesWindow::updateuses non-saturating arithmeticsuccessfullin telemetry fieldRecoveryKindmay mask new variantsUnknownrecovery treated as success in breakerResilienceContextrequires explicit type parametersDefaultforHalfOpenMode.expect()internallymin_throughput(0)has surprising behaviorHuman Notes
Some of the points are valid ones, will get through these and try to apply some reasoning and then fix ones that are relevant.