Skip to content

feat: export iroh relay metrics via OpenTelemetry#489

Open
ThetaSinner wants to merge 1 commit intomainfrom
iroh-relay-otel-metrics
Open

feat: export iroh relay metrics via OpenTelemetry#489
ThetaSinner wants to merge 1 commit intomainfrom
iroh-relay-otel-metrics

Conversation

@ThetaSinner
Copy link
Copy Markdown
Member

@ThetaSinner ThetaSinner commented Mar 24, 2026

I notice that we're actually very short on metrics for the bootstrap service. When metrics were first integrated here, it was really only for SBD. So we now expose the Iroh metrics using opentelemetry but we could do with following up to add more bootstrap metrics.

Summary

  • Bridge iroh-relay-holochain metrics to OTEL using observable counters: bytes_sent, bytes_recv, accepts, disconnects, unique_client_keys, send_packets_dropped, other_packets_dropped, bytes_rx_ratelimited, conns_rx_ratelimited
  • Move OTEL meter provider initialization out of the sbd feature gate into a feature-independent metrics module, so --otlp_endpoint works with the iroh-relay backend
  • Add integration test proving the full chain: relay handles real client traffic → iroh-metrics atomics increment → OTEL exporter captures non-zero values

Test plan

  • cargo make static passes (both iroh-relay and sbd features)
  • Integration test relay_metrics_exported_via_otel passes
  • Deploy with --otlp_endpoint and verify metrics appear in the collector

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Mar 24, 2026

Deploying kitsune2 with  Cloudflare Pages  Cloudflare Pages

Latest commit: a2ca824
Status: ✅  Deploy successful!
Preview URL: https://80531d6b.kitsune2.pages.dev
Branch Preview URL: https://iroh-relay-otel-metrics.kitsune2.pages.dev

View logs

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 24, 2026

The following will be added to the changelog


[0.5.0-dev.1] - 2026-04-16

Features

  • Export iroh relay metrics via OpenTelemetry
    • Bridge iroh-relay-holochain metrics (bytes_sent, bytes_recv, accepts, disconnects, unique_client_keys, packets dropped, rate limiting) to OTEL using observable counters that read the iroh-metrics atomics at export time.
    • Move OTEL meter provider initialization out of the SBD feature gate so metrics export works regardless of which relay backend is active.
    • Add integration test that verifies the full chain: relay handles client traffic, iroh-metrics atomics increment, OTEL exporter captures the values.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Workspace OpenTelemetry dependencies were consolidated and expanded. The bootstrap server gained a public metrics module and a feature-gated iroh_relay_metrics module that registers OTEL observable counters. OTLP metrics initialization was moved to run unconditionally and returns an optional meter provider for flush control; shutdown now attempts a forced flush. Relay startup was changed to retain OTEL registration handles for the server lifetime. A feature-gated integration test verifies relay metrics are exported via an in-memory OTLP provider.

Suggested reviewers

  • jost-s
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: export iroh relay metrics via OpenTelemetry' clearly summarizes the main objective: exporting iroh relay metrics via OTEL observable counters, which is the primary change across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The pull request description is clearly related to the changeset and provides meaningful context about the metrics integration work being performed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch iroh-relay-otel-metrics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/bootstrap_srv/src/http.rs (1)

300-326: ⚠️ Potential issue | 🔴 Critical

Keep the relay metrics guard alive for the server lifetime.

register() returns a guard whose callbacks stay active only while that value is alive. This binding is scoped to the setup block and is dropped at Line 327, before any server future is awaited, so the relay observers are unregistered during startup. Please lift the guard to the outer async scope or store it on Server/Ready.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/bootstrap_srv/src/http.rs` around lines 300 - 326, The relay metrics
guard returned by
crate::iroh_relay_metrics::register(relay_state.metrics.clone()) is created in a
short-lived block and assigned to the local relay_otel_metrics binding which is
dropped at the end of that block, causing observers to unregister during
startup; move the guard into a longer-lived scope by returning or storing it on
a longer-lived object (for example add a field to your Server or Ready struct
like relay_otel_metrics or push it into an outer Vec of guards) so the value
returned by register() remains alive for the server lifetime; update the setup
code that currently defines relay_otel_metrics inside the inner block (and any
uses of create_relay_state/create_relay_state_with_allowlist) to instead assign
the guard to the new Server/Ready storage or an outer-scope variable so the
guard is not dropped on function exit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/bootstrap_srv/src/metrics.rs`:
- Line 20: Replace the current tracing::info log that emits the raw OTLP
endpoint (tracing::info!("Enabling OpenTelemetry metrics export to {endpoint}"))
so it no longer logs operator-supplied secrets; either log a generic message
like "Enabling OpenTelemetry metrics export" or redact the endpoint to only
scheme/host before including it. Update the logging call in the metrics
initialization code (the tracing::info invocation that references endpoint) to
avoid outputting the full URL with credentials or query tokens.

In `@crates/bootstrap_srv/tests/relay_metrics.rs`:
- Around line 37-47: The test currently bypasses production wiring by calling
iroh_relay_axum::create_relay_state and iroh_relay_metrics::register and
hand-building Router with relay_handler; instead, construct the server through
the production bootstrap HTTP wiring (the module that builds/starts the real
HTTP app) so the metrics guard is installed and retained exactly as in prod;
replace the manual Router/metrics/register calls with a call to the bootstrap
HTTP entrypoint that returns or starts the App/Server used in production (the
same code that lives in the bootstrap HTTP module) and exercise that server in
the test so the real installation/lifetime of the metrics guard is covered.

---

Outside diff comments:
In `@crates/bootstrap_srv/src/http.rs`:
- Around line 300-326: The relay metrics guard returned by
crate::iroh_relay_metrics::register(relay_state.metrics.clone()) is created in a
short-lived block and assigned to the local relay_otel_metrics binding which is
dropped at the end of that block, causing observers to unregister during
startup; move the guard into a longer-lived scope by returning or storing it on
a longer-lived object (for example add a field to your Server or Ready struct
like relay_otel_metrics or push it into an outer Vec of guards) so the value
returned by register() remains alive for the server lifetime; update the setup
code that currently defines relay_otel_metrics inside the inner block (and any
uses of create_relay_state/create_relay_state_with_allowlist) to instead assign
the guard to the new Server/Ready storage or an outer-scope variable so the
guard is not dropped on function exit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f796de26-013a-4b63-aa60-7c707880f031

📥 Commits

Reviewing files that changed from the base of the PR and between ee51894 and 3c104bc.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml
  • crates/bootstrap_srv/Cargo.toml
  • crates/bootstrap_srv/src/bin/kitsune2-bootstrap-srv.rs
  • crates/bootstrap_srv/src/http.rs
  • crates/bootstrap_srv/src/iroh_relay_metrics.rs
  • crates/bootstrap_srv/src/lib.rs
  • crates/bootstrap_srv/src/metrics.rs
  • crates/bootstrap_srv/tests/relay_metrics.rs

Comment thread crates/bootstrap_srv/src/metrics.rs Outdated
Comment thread crates/bootstrap_srv/tests/relay_metrics.rs Outdated
@synchwire synchwire force-pushed the iroh-relay-otel-metrics branch from 3c104bc to 42386bd Compare April 13, 2026 13:47
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/bootstrap_srv/tests/relay_metrics.rs (1)

37-47: ⚠️ Potential issue | 🟠 Major

Route this test through the production bootstrap HTTP wiring.

Line 37 through Line 47 still build relay state/router manually, so this test can pass even if production wiring in crates/bootstrap_srv/src/http.rs regresses (especially metrics guard install/retention). Please instantiate the server through the same bootstrap HTTP entrypoint used in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/bootstrap_srv/tests/relay_metrics.rs` around lines 37 - 47, The test
currently builds the relay state and Router manually via create_relay_state,
iroh_relay_metrics::register and Router::new/route/with_state, which bypasses
the production wiring in crates/bootstrap_srv/src/http.rs; replace that manual
setup by calling the production HTTP entrypoint provided in the bootstrap_srv
http module (the function that constructs/returns the production Router or
starts the server) so the test uses the same wiring (including metrics guard
installation/retention) as production—remove the manual
create_relay_state/iroh_relay_metrics::register/Router::new logic and
instantiate the server/router through the HTTP entrypoint from the
bootstrap_srv::http module instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/bootstrap_srv/tests/relay_metrics.rs`:
- Around line 37-47: The test currently builds the relay state and Router
manually via create_relay_state, iroh_relay_metrics::register and
Router::new/route/with_state, which bypasses the production wiring in
crates/bootstrap_srv/src/http.rs; replace that manual setup by calling the
production HTTP entrypoint provided in the bootstrap_srv http module (the
function that constructs/returns the production Router or starts the server) so
the test uses the same wiring (including metrics guard installation/retention)
as production—remove the manual
create_relay_state/iroh_relay_metrics::register/Router::new logic and
instantiate the server/router through the HTTP entrypoint from the
bootstrap_srv::http module instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0caf8b45-572b-40f3-ba44-0c8bc595e9dc

📥 Commits

Reviewing files that changed from the base of the PR and between 3c104bc and 42386bd.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml
  • crates/bootstrap_srv/Cargo.toml
  • crates/bootstrap_srv/src/bin/kitsune2-bootstrap-srv.rs
  • crates/bootstrap_srv/src/http.rs
  • crates/bootstrap_srv/src/iroh_relay_metrics.rs
  • crates/bootstrap_srv/src/lib.rs
  • crates/bootstrap_srv/src/metrics.rs
  • crates/bootstrap_srv/tests/relay_metrics.rs
✅ Files skipped from review due to trivial changes (1)
  • crates/bootstrap_srv/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (5)
  • crates/bootstrap_srv/src/lib.rs
  • Cargo.toml
  • crates/bootstrap_srv/src/metrics.rs
  • crates/bootstrap_srv/src/http.rs
  • crates/bootstrap_srv/src/iroh_relay_metrics.rs

@synchwire synchwire force-pushed the iroh-relay-otel-metrics branch 2 times, most recently from e9035fb to 37206ef Compare April 13, 2026 14:53
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/bootstrap_srv/tests/relay_metrics.rs (1)

1-150: Move this end-to-end test up to kitsune2.

This is a real product-wiring test: it boots BootstrapSrv, opens actual relay connections, and validates exported OTEL metrics end to end. That coverage is better owned by the top-level kitsune2 crate; if you want to keep crate-local coverage here, keep it unit-scoped around the relay-metrics bridge and move the full-stack flow upward.

Based on learnings, Place genuine integration tests using real production modules in the kitsune2 crate, not in lower-level crates. As per coding guidelines, "Prioritize unit tests as the default testing layer, covering as much as possible at this level".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/bootstrap_srv/tests/relay_metrics.rs` around lines 1 - 150, The
end-to-end test relay_metrics_exported_via_otel boots the production
BootstrapSrv and validates OTEL metrics and should be moved from this
lower-level crate into the top-level kitsune2 integration tests; remove this
test from crates/bootstrap_srv/tests/relay_metrics.rs and add an equivalent
integration test under the kitsune2 crate (e.g., tests/relay_metrics.rs) that
imports BootstrapSrv, Config, RelayUrl, RelayMap, Endpoint, etc., preserving the
same logic and feature gating (#![cfg(feature = "iroh-relay")]) and ensuring the
test uses the production wiring (BootstrapSrv::new) and the same OTEL setup; if
you want to keep crate-local coverage here, replace the removed test with a
smaller unit-scoped test that targets only the relay-metrics bridge functions
(call sites/types: BootstrapSrv, Config, RelayMap, RelayUrl,
relay_metrics_exported_via_otel) rather than full end-to-end server startup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/bootstrap_srv/src/metrics.rs`:
- Around line 33-40: The enable_otlp_metrics() helper currently installs the
SdkMeterProvider globally but returns Result<()> so callers cannot flush pending
metric batches; change enable_otlp_metrics() to return the created
opentelemetry_sdk::metrics::SdkMeterProvider (or a guard/handle) instead of
unit, keep calling global::set_meter_provider(meter_provider.clone()) or set the
provider then return the provider/guard, and update the caller to accept and
store that returned provider so it can call provider.force_flush() (or
provider.shutdown()/force_flush as appropriate) during graceful shutdown to
flush periodic exporter batches before process exit; refer to
enable_otlp_metrics, SdkMeterProvider, global::set_meter_provider, and
force_flush/shutdown in your changes.

---

Nitpick comments:
In `@crates/bootstrap_srv/tests/relay_metrics.rs`:
- Around line 1-150: The end-to-end test relay_metrics_exported_via_otel boots
the production BootstrapSrv and validates OTEL metrics and should be moved from
this lower-level crate into the top-level kitsune2 integration tests; remove
this test from crates/bootstrap_srv/tests/relay_metrics.rs and add an equivalent
integration test under the kitsune2 crate (e.g., tests/relay_metrics.rs) that
imports BootstrapSrv, Config, RelayUrl, RelayMap, Endpoint, etc., preserving the
same logic and feature gating (#![cfg(feature = "iroh-relay")]) and ensuring the
test uses the production wiring (BootstrapSrv::new) and the same OTEL setup; if
you want to keep crate-local coverage here, replace the removed test with a
smaller unit-scoped test that targets only the relay-metrics bridge functions
(call sites/types: BootstrapSrv, Config, RelayMap, RelayUrl,
relay_metrics_exported_via_otel) rather than full end-to-end server startup.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d3a0cd8-14be-43d1-ae4d-d1d7251d9e63

📥 Commits

Reviewing files that changed from the base of the PR and between 42386bd and 37206ef.

📒 Files selected for processing (3)
  • crates/bootstrap_srv/src/http.rs
  • crates/bootstrap_srv/src/metrics.rs
  • crates/bootstrap_srv/tests/relay_metrics.rs

Comment thread crates/bootstrap_srv/src/metrics.rs Outdated
@synchwire synchwire force-pushed the iroh-relay-otel-metrics branch 2 times, most recently from 2ebfab0 to 2bb1498 Compare April 13, 2026 16:33
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/bootstrap_srv/src/metrics.rs`:
- Around line 15-17: The rustdoc intra-doc links to SdkMeterProvider and
SdkMeterProvider::force_flush are unresolved; update the doc comment to use
either fully-qualified paths (e.g. opentelemetry_sdk::SdkMeterProvider and
opentelemetry_sdk::SdkMeterProvider::force_flush) or convert the links to plain
code spans (`SdkMeterProvider`, `SdkMeterProvider::force_flush`) so rustdoc can
resolve them and the docs build without warnings.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4cef3817-af94-4fd3-af0c-18b135bc2ebe

📥 Commits

Reviewing files that changed from the base of the PR and between 37206ef and 2ebfab0.

📒 Files selected for processing (4)
  • crates/bootstrap_srv/src/bin/kitsune2-bootstrap-srv.rs
  • crates/bootstrap_srv/src/http.rs
  • crates/bootstrap_srv/src/metrics.rs
  • crates/bootstrap_srv/tests/relay_metrics.rs
✅ Files skipped from review due to trivial changes (1)
  • crates/bootstrap_srv/src/http.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/bootstrap_srv/src/bin/kitsune2-bootstrap-srv.rs
  • crates/bootstrap_srv/tests/relay_metrics.rs

Comment thread crates/bootstrap_srv/src/metrics.rs Outdated
@ThetaSinner ThetaSinner force-pushed the iroh-relay-otel-metrics branch from 2bb1498 to ea2c890 Compare April 16, 2026 11:28
Bridge iroh-relay-holochain metrics (bytes_sent, bytes_recv, accepts,
disconnects, unique_client_keys, packets dropped, rate limiting) to
OTEL using observable counters that read the iroh-metrics atomics at
export time.

Move OTEL meter provider initialization out of the SBD feature gate
so metrics export works regardless of which relay backend is active.

Add integration test that verifies the full chain: relay handles
client traffic, iroh-metrics atomics increment, OTEL exporter
captures the values.
@ThetaSinner ThetaSinner force-pushed the iroh-relay-otel-metrics branch from 17983f2 to a2ca824 Compare April 16, 2026 13:57
@cocogitto-bot
Copy link
Copy Markdown

cocogitto-bot Bot commented Apr 16, 2026

✔️ a2ca824 - Conventional commits check succeeded.

@ThetaSinner ThetaSinner requested a review from a team April 16, 2026 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant