feat: Add histogram latency metrics for all inference phases #474

Open
zbennett10 wants to merge 1 commit into triton-inference-server:main from zbennett10:feat/histogram-latency-metrics

zbennett10 commented Feb 24, 2026

Summary

Adds 5 new Prometheus histogram metrics as equivalents to the existing 5 summary latency metrics. Unlike summary quantiles, histogram bucket counts can be summed across instances in Prometheus, so operators can compute latency percentiles over multiple Triton instances from bucket-based distributions.

New Metrics

| Histogram | Corresponding Summary | Default Buckets (us) |
|---|---|---|
| nv_inference_request_duration_histogram_us | nv_inference_request_summary_us | 1000, 5000, 25000, 50000, 100000 |
| nv_inference_queue_duration_histogram_us | nv_inference_queue_summary_us | 100, 1000, 5000, 10000, 50000 |
| nv_inference_compute_input_duration_histogram_us | nv_inference_compute_input_summary_us | 100, 500, 1000, 5000, 10000 |
| nv_inference_compute_infer_duration_histogram_us | nv_inference_compute_infer_summary_us | 1000, 5000, 25000, 50000, 100000 |
| nv_inference_compute_output_duration_histogram_us | nv_inference_compute_output_summary_us | 100, 500, 1000, 5000, 10000 |
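
For readers less familiar with Prometheus bucket semantics, the following is a minimal sketch of how a single latency observation maps onto boundaries like the defaults listed here. `LatencyHistogram` and its methods are hypothetical names made up for illustration; this is not the class Triton or prometheus-cpp uses, and it assumes the usual "less than or equal" bucket rule with an implicit +Inf bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Prometheus-style histogram. This is not the
// class Triton or prometheus-cpp uses; it only illustrates bucket semantics.
class LatencyHistogram {
 public:
  explicit LatencyHistogram(std::vector<double> upper_bounds_us)
      : bounds_(std::move(upper_bounds_us)), counts_(bounds_.size() + 1, 0)
  {
    // Bucket boundaries must be ascending for the binary search below.
    assert(std::is_sorted(bounds_.begin(), bounds_.end()));
  }

  // Record one sample. It lands in the first bucket whose upper bound is
  // >= the value; anything larger goes to the implicit +Inf bucket.
  void Observe(double value_us)
  {
    const auto it = std::lower_bound(bounds_.begin(), bounds_.end(), value_us);
    ++counts_[static_cast<std::size_t>(it - bounds_.begin())];
    sum_us_ += value_us;
  }

  // Cumulative counts: entry i is the number of samples <= bounds_[i], and
  // the final entry (+Inf) is the total number of samples.
  std::vector<std::uint64_t> CumulativeCounts() const
  {
    std::vector<std::uint64_t> cumulative(counts_.size());
    std::uint64_t running = 0;
    for (std::size_t i = 0; i < counts_.size(); ++i) {
      running += counts_[i];
      cumulative[i] = running;
    }
    return cumulative;
  }

  double SumUs() const { return sum_us_; }

 private:
  std::vector<double> bounds_;         // upper bounds, ascending
  std::vector<std::uint64_t> counts_;  // per-bucket counts, last is +Inf
  double sum_us_ = 0.0;                // running sum of all samples
};
```

Under these assumptions, a 30000 us sample recorded against the request-duration boundaries is counted in the 50000 bucket and every larger one, while a 250000 us sample shows up only in +Inf. Memory use depends on the number of buckets, not on the number of observations.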

Design

  • Enabled via --metrics-config histogram_latencies=true (same flag as nv_inference_first_response_histogram_ms)
  • Custom bucket boundaries configurable per-model via model_metrics.metric_control[] in config.pbtxt
  • Request duration histogram is disabled when cache is enabled (matching the summary behavior, see DLIS-4762)
  • Follows the exact same lifecycle as the existing TTFT histogram: registered in Metrics singleton, exposed via Family* accessors, initialized in MetricModelReporter::InitializeHistograms(), observed via ObserveHistogram()

Files Changed (6 files, +113/-4)

| File | Change |
|---|---|
| src/constants.h | Add 5 new histogram constant names |
| src/metrics.h | Add 5 static family accessors + 5 member variables |
| src/metrics.cc | Register 5 new histogram families in constructor |
| src/metric_model_reporter.h | Add metric_map entries + default bucket boundaries |
| src/metric_model_reporter.cc | Wire histogram families in InitializeHistograms() |
| src/infer_stats.cc | Add ObserveHistogram() calls alongside existing summary observations |

Motivation

Multiple users have requested histogram equivalents for the summary metrics (see #7672). Histograms are preferred over summaries for production monitoring because:

  1. Aggregatable: Histogram buckets can be aggregated across instances; summary quantiles cannot
  2. Predictable cost: Histograms have fixed memory overhead regardless of observation count
  3. Alerting-friendly: histogram_quantile() in PromQL allows flexible percentile calculations from histograms
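
To make the aggregation point concrete, here is a rough sketch of why bucket counts combine across instances while precomputed quantiles do not. `MergeBuckets` and `EstimateQuantile` are hypothetical helpers written for this illustration; the interpolation is only similar in spirit to PromQL's histogram_quantile() and is not a reimplementation of it. It assumes every instance exports the same bucket boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Cumulative bucket counts from one instance: counts[i] is the number of
// observations <= bounds[i]; the final entry is the +Inf bucket (total).
using CumulativeCounts = std::vector<std::uint64_t>;

// Hypothetical helper: combine two instances by summing matching buckets.
CumulativeCounts MergeBuckets(
    const CumulativeCounts& a, const CumulativeCounts& b)
{
  assert(a.size() == b.size());
  CumulativeCounts merged(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    merged[i] = a[i] + b[i];
  }
  return merged;
}

// Hypothetical helper: estimate quantile q (0..1) by linear interpolation
// inside the bucket that contains the target rank.
double EstimateQuantile(
    double q, const std::vector<double>& bounds,
    const CumulativeCounts& counts)
{
  assert(!bounds.empty() && counts.size() == bounds.size() + 1);
  assert(q >= 0.0 && q <= 1.0);
  const double total = static_cast<double>(counts.back());
  if (total == 0.0) {
    return std::numeric_limits<double>::quiet_NaN();  // no observations
  }
  const double rank = q * total;

  // Find the first finite bucket whose cumulative count reaches the rank.
  std::size_t i = 0;
  while (i < bounds.size() && static_cast<double>(counts[i]) < rank) {
    ++i;
  }
  if (i == bounds.size()) {
    return bounds.back();  // rank is in +Inf; report the largest finite bound
  }

  const double lower = (i == 0) ? 0.0 : bounds[i - 1];
  const double below = (i == 0) ? 0.0 : static_cast<double>(counts[i - 1]);
  const double in_bucket = static_cast<double>(counts[i]) - below;
  if (in_bucket == 0.0) {
    return lower;  // only reachable when q == 0 and the first bucket is empty
  }
  return lower + (bounds[i] - lower) * (rank - below) / in_bucket;
}
```

For example, with the request-duration boundaries and two instances reporting cumulative counts {10, 60, 90, 98, 100, 100} and {0, 20, 70, 90, 100, 100}, the merged counts are {10, 80, 160, 188, 200, 200} and the estimated median is 10000 us. Averaging each instance's own p50 would not, in general, give the same answer, which is the limitation of summaries described in item 1.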

PR #8580 extended the TTFT histogram to non-decoupled models. This PR completes the histogram story by adding the remaining 5 latency metrics that users have been requesting.

Related Issues

Test Plan

  • Pre-commit hooks pass (clang-format, codespell, trailing whitespace)
  • Constructor initializer order matches member declaration order (no compiler warnings)
  • Cache guard correctly applied to request_duration (matches summary behavior)
  • All metric names consistent across registration, metric_map, and observation calls
  • L0_metrics tests pass with Triton server (requires full build environment)
  • Histogram output verified in /metrics endpoint with histogram_latencies=true

Add 5 new Prometheus histogram metrics as equivalents to the existing
summary latency metrics. This gives operators bucket-based latency
distributions that are more efficient for aggregation in Prometheus
compared to summaries.

New metrics (enabled via --metrics-config histogram_latencies=true):
- nv_inference_request_duration_histogram_us
- nv_inference_queue_duration_histogram_us
- nv_inference_compute_input_duration_histogram_us
- nv_inference_compute_infer_duration_histogram_us
- nv_inference_compute_output_duration_histogram_us

Each histogram supports custom bucket boundaries via the model_metrics
configuration in config.pbtxt, following the same pattern as the
existing nv_inference_first_response_histogram_ms metric.

Addresses #7672
@zbennett10
Author

Companion PR -> triton-inference-server/server#8675

@whoisj
Contributor

whoisj commented Feb 25, 2026

@zbennett10, same as your other PR: please make sure you've completed the CLA requirements so that we can accept your contribution. Thanks!

whoisj self-requested a review February 25, 2026 19:32