feat: Add histogram latency metrics for all inference phases by zbennett10 · Pull Request #474 · triton-inference-server/core

zbennett10 · 2026-02-24T18:33:38Z

Summary

Adds 5 new Prometheus histogram metrics as equivalents to the existing 5 summary latency metrics. This gives operators bucket-based latency distributions that are more efficient for aggregation across instances in Prometheus compared to summaries.

New Metrics

| Histogram | Corresponding Summary | Default Buckets (us) |
| --- | --- | --- |
| `nv_inference_request_duration_histogram_us` | `nv_inference_request_summary_us` | 1000, 5000, 25000, 50000, 100000 |
| `nv_inference_queue_duration_histogram_us` | `nv_inference_queue_summary_us` | 100, 1000, 5000, 10000, 50000 |
| `nv_inference_compute_input_duration_histogram_us` | `nv_inference_compute_input_summary_us` | 100, 500, 1000, 5000, 10000 |
| `nv_inference_compute_infer_duration_histogram_us` | `nv_inference_compute_infer_summary_us` | 1000, 5000, 25000, 50000, 100000 |
| `nv_inference_compute_output_duration_histogram_us` | `nv_inference_compute_output_summary_us` | 100, 500, 1000, 5000, 10000 |
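
To make the bucket semantics concrete, here is a minimal Python sketch of how a single observation lands in Prometheus-style cumulative buckets, using the default request-duration boundaries from the table above. The `observe` helper and the sample latencies are illustrative assumptions, not code from this PR; the actual observation happens in the C++ metrics path.

```python
from bisect import bisect_left

# Default request-duration boundaries from the table above, in microseconds.
REQUEST_DURATION_BUCKETS_US = [1000, 5000, 25000, 50000, 100000]


def observe(cumulative_counts, boundaries, value_us):
    """Record one observation in cumulative (le-style) buckets.

    cumulative_counts has len(boundaries) + 1 entries; the last entry is the
    implicit +Inf bucket, which counts every observation.
    """
    # Index of the first bucket whose upper bound is >= the observed value.
    first = bisect_left(boundaries, value_us)
    # Cumulative buckets: that bucket and every larger one are incremented.
    for i in range(first, len(cumulative_counts)):
        cumulative_counts[i] += 1


counts = [0] * (len(REQUEST_DURATION_BUCKETS_US) + 1)
for latency_us in (800, 4200, 30000, 250000):  # hypothetical latencies
    observe(counts, REQUEST_DURATION_BUCKETS_US, latency_us)

print(counts)  # [1, 2, 2, 3, 3, 4]
```

The 250000 us request exceeds the largest default boundary, so it is only visible in the +Inf bucket. That is the main reason to override the defaults for models whose latency regularly exceeds 100 ms.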

Design

- Enabled via `--metrics-config histogram_latencies=true` (same flag as `nv_inference_first_response_histogram_ms`)
- Custom bucket boundaries are configurable per model via `model_metrics.metric_control[]` in `config.pbtxt`
- The request duration histogram is disabled when the cache is enabled (matching the summary behavior, see DLIS-4762)
- Follows the same lifecycle as the existing TTFT histogram: registered in the `Metrics` singleton, exposed via `Family*` accessors, initialized in `MetricModelReporter::InitializeHistograms()`, observed via `ObserveHistogram()`
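
As an illustration of the per-model override, the fragment below mirrors the `model_metrics` layout documented for the existing `nv_inference_first_response_histogram_ms` metric. Applying it to one of the new families is an assumption based on the description above, and the bucket values are hypothetical.

```
# config.pbtxt fragment (sketch): override the queue-duration buckets for one model.
model_metrics {
  metric_control: [
    {
      metric_identifier: {
        family: "nv_inference_queue_duration_histogram_us"
      }
      histogram_options: {
        # Hypothetical boundaries in microseconds.
        buckets: [ 50, 250, 1000, 5000, 25000 ]
      }
    }
  ]
}
```

The server would still need to be started with `--metrics-config histogram_latencies=true` for the family to be reported.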

Files Changed (6 files, +113/-4)

| File | Change |
| --- | --- |
| `src/constants.h` | Add 5 new histogram constant names |
| `src/metrics.h` | Add 5 static family accessors + 5 member variables |
| `src/metrics.cc` | Register 5 new histogram families in constructor |
| `src/metric_model_reporter.h` | Add `metric_map` entries + default bucket boundaries |
| `src/metric_model_reporter.cc` | Wire histogram families in `InitializeHistograms()` |
| `src/infer_stats.cc` | Add `ObserveHistogram()` calls alongside existing summary observations |

Motivation

Multiple users have requested histogram equivalents for the summary metrics (see #7672). Histograms are preferred over summaries for production monitoring because:

- Aggregatable: histogram buckets can be aggregated across instances; summary quantiles cannot
- Predictable cost: histograms have fixed memory overhead regardless of observation count
- Alerting-friendly: `histogram_quantile()` in PromQL allows flexible percentile calculations from histograms
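
The aggregation argument can be sketched in a few lines of Python. Bucket counts from several instances are summed element-wise, and a quantile is then estimated by linear interpolation inside the bucket that contains the target rank, which approximates what PromQL's `histogram_quantile()` does. The function names and the per-instance counts are hypothetical, and the sketch is a simplification rather than Prometheus's exact implementation.

```python
def merge_buckets(per_instance_counts):
    """Sum cumulative bucket counts element-wise across instances."""
    return [sum(column) for column in zip(*per_instance_counts)]


def estimate_quantile(q, boundaries, cumulative_counts):
    """Estimate a quantile from cumulative buckets by linear interpolation.

    boundaries are the finite upper bounds; cumulative_counts has one extra
    trailing entry for the +Inf bucket.
    """
    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if index == len(boundaries):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return float(boundaries[-1])
    lower = boundaries[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    if in_bucket == 0:
        return float(lower)
    return lower + (boundaries[index] - lower) * (rank - below) / in_bucket


BOUNDARIES_US = [1000, 5000, 25000, 50000, 100000]
instance_a = [90, 98, 100, 100, 100, 100]  # hypothetical fast instance
instance_b = [10, 40, 90, 100, 100, 100]   # hypothetical slow instance

merged = merge_buckets([instance_a, instance_b])
print(merged)  # [100, 138, 190, 200, 200, 200]
p90_us = estimate_quantile(0.9, BOUNDARIES_US, merged)
print(round(p90_us))
```

A summary only exposes precomputed quantiles per instance, and averaging those does not yield the fleet-wide quantile. The merged bucket counts carry enough information to estimate it.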

PR #8580 extended the TTFT histogram to non-decoupled models. This PR completes the histogram story by adding the remaining 5 latency metrics that users have been requesting.

Related Issues

- Addresses triton-inference-server/server#7672 (Histogram Metric for multi-instance tail latency aggregation)
- Builds on PR #8580 (TTFT histogram for non-decoupled models)
- Companion PR: triton-inference-server/server (docs + test patterns)

Test Plan

- Pre-commit hooks pass (clang-format, codespell, trailing whitespace)
- Constructor initializer order matches member declaration order (no compiler warnings)
- Cache guard correctly applied to `request_duration` (matches summary behavior)
- All metric names consistent across registration, `metric_map`, and observation calls
- L0_metrics tests pass with Triton server (requires full build environment)
- Histogram output verified in the `/metrics` endpoint with `histogram_latencies=true`
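
One way to script the last check is to parse the Prometheus text exposition and confirm that a family's buckets are present and cumulative. The sketch below works on an embedded sample rather than a live server. The sample lines, label values, and helper names are hypothetical; only the `_bucket`, `_sum`, and `_count` naming with an `le` label is standard Prometheus histogram exposition.

```python
import re

# Hypothetical excerpt of /metrics output for one model.
SAMPLE = """\
# TYPE nv_inference_request_duration_histogram_us histogram
nv_inference_request_duration_histogram_us_bucket{model="m",version="1",le="1000"} 1
nv_inference_request_duration_histogram_us_bucket{model="m",version="1",le="5000"} 2
nv_inference_request_duration_histogram_us_bucket{model="m",version="1",le="25000"} 2
nv_inference_request_duration_histogram_us_bucket{model="m",version="1",le="50000"} 3
nv_inference_request_duration_histogram_us_bucket{model="m",version="1",le="100000"} 3
nv_inference_request_duration_histogram_us_bucket{model="m",version="1",le="+Inf"} 4
nv_inference_request_duration_histogram_us_sum{model="m",version="1"} 285000
nv_inference_request_duration_histogram_us_count{model="m",version="1"} 4
"""

BUCKET_RE = re.compile(r'^(?P<name>\w+)_bucket\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LE_RE = re.compile(r'le="([^"]+)"')


def parse_buckets(text, family):
    """Return {le: count} for the bucket lines of one histogram family."""
    buckets = {}
    for line in text.splitlines():
        match = BUCKET_RE.match(line.strip())
        if not match or match.group("name") != family:
            continue
        le = LE_RE.search(match.group("labels"))
        if le:
            buckets[le.group(1)] = float(match.group("value"))
    return buckets


def looks_valid(buckets):
    """True if a +Inf bucket exists and counts never decrease as le grows."""
    ordered = sorted(buckets.items(), key=lambda item: float(item[0]))
    counts = [count for _, count in ordered]
    return "+Inf" in buckets and counts == sorted(counts)


buckets = parse_buckets(SAMPLE, "nv_inference_request_duration_histogram_us")
print(looks_valid(buckets))  # True
```

Against a running server, `SAMPLE` would be replaced by the body fetched from the metrics endpoint.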

Add 5 new Prometheus histogram metrics as equivalents to the existing summary latency metrics. This gives operators bucket-based latency distributions that are more efficient for aggregation in Prometheus compared to summaries.

New metrics (enabled via --metrics-config histogram_latencies=true):
- nv_inference_request_duration_histogram_us
- nv_inference_queue_duration_histogram_us
- nv_inference_compute_input_duration_histogram_us
- nv_inference_compute_infer_duration_histogram_us
- nv_inference_compute_output_duration_histogram_us

Each histogram supports custom bucket boundaries via the model_metrics configuration in config.pbtxt, following the same pattern as the existing nv_inference_first_response_histogram_ms metric.

Addresses #7672

zbennett10 · 2026-02-24T18:36:15Z

Companion PR -> triton-inference-server/server#8675

whoisj · 2026-02-25T19:31:41Z

@zbennett10, same as your other PR: please make sure you've completed the CLA requirements so that we can accept your contribution. Thanks!

zbennett10 mentioned this pull request Feb 24, 2026

docs: Add histogram latency metrics documentation and test patterns triton-inference-server/server#8675

whoisj self-requested a review February 25, 2026 19:32

zbennett10 wants to merge 1 commit into triton-inference-server:main from zbennett10:feat/histogram-latency-metrics
