feat: Add histogram latency metrics for all inference phases #474
Open
zbennett10 wants to merge 1 commit into triton-inference-server:main from
Add 5 new Prometheus histogram metrics as equivalents to the existing summary latency metrics. This gives operators bucket-based latency distributions that are more efficient for aggregation in Prometheus compared to summaries.

New metrics (enabled via --metrics-config histogram_latencies=true):
- nv_inference_request_duration_histogram_us
- nv_inference_queue_duration_histogram_us
- nv_inference_compute_input_duration_histogram_us
- nv_inference_compute_infer_duration_histogram_us
- nv_inference_compute_output_duration_histogram_us

Each histogram supports custom bucket boundaries via the model_metrics configuration in config.pbtxt, following the same pattern as the existing nv_inference_first_response_histogram_ms metric.

Addresses #7672
Author
Companion PR -> triton-inference-server/server#8675
Contributor
@zbennett10, same as your other PR: please make sure you've completed the CLA requirements so that we can accept your contribution. Thanks!
Summary
Adds 5 new Prometheus histogram metrics as equivalents to the existing 5 summary latency metrics. This gives operators bucket-based latency distributions that are more efficient for aggregation across instances in Prometheus compared to summaries.
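To illustrate why bucket-based distributions aggregate well across instances, here is a minimal Python sketch, not code from this PR. The bucket bounds, sample latencies, and helper names (`observe`, `merge`) are all hypothetical. It stores per-bucket counts for simplicity, whereas Prometheus exposes cumulative `le` buckets.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in microseconds (not the PR's defaults).
BOUNDS_US = [100, 500, 1000, 5000]


def new_histogram():
    """One counter per finite bucket plus a final +Inf overflow slot."""
    return [0] * (len(BOUNDS_US) + 1)


def observe(counts, value_us):
    """Count value_us in the first bucket whose upper bound is >= value_us."""
    counts[bisect_left(BOUNDS_US, value_us)] += 1


def merge(a, b):
    """Histograms with identical bounds merge by adding bucket counts."""
    return [x + y for x, y in zip(a, b)]


# Two hypothetical server instances with made-up request latencies.
instance_a = new_histogram()
for latency in (80, 450, 450, 7000):
    observe(instance_a, latency)

instance_b = new_histogram()
for latency in (120, 900, 3000):
    observe(instance_b, latency)

fleet = merge(instance_a, instance_b)
print(fleet)  # [1, 3, 1, 1, 1]
```

Precomputed summary quantiles from two instances cannot be combined this way, because the quantile of a merged population is not a function of the per-instance quantiles.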
New Metrics
| Histogram metric | Equivalent summary metric |
| --- | --- |
| nv_inference_request_duration_histogram_us | nv_inference_request_summary_us |
| nv_inference_queue_duration_histogram_us | nv_inference_queue_summary_us |
| nv_inference_compute_input_duration_histogram_us | nv_inference_compute_input_summary_us |
| nv_inference_compute_infer_duration_histogram_us | nv_inference_compute_infer_summary_us |
| nv_inference_compute_output_duration_histogram_us | nv_inference_compute_output_summary_us |

Design
- --metrics-config histogram_latencies=true (same flag as nv_inference_first_response_histogram_ms)
- model_metrics.metric_control[] in config.pbtxt
- Metrics singleton, exposed via Family* accessors, initialized in MetricModelReporter::InitializeHistograms(), observed via ObserveHistogram()

Files Changed (6 files, +113/-4)
- src/constants.h
- src/metrics.h
- src/metrics.cc
- src/metric_model_reporter.h
- src/metric_model_reporter.cc
- src/infer_stats.cc

Motivation
Multiple users have requested histogram equivalents for the summary metrics (see #7672). Histograms are preferred over summaries for production monitoring because:
- histogram_quantile() in PromQL allows flexible percentile calculations from histograms

PR #8580 extended the TTFT histogram to non-decoupled models. This PR completes the histogram story by adding the remaining 5 latency metrics that users have been requesting.
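For readers unfamiliar with how a percentile is recovered from buckets, the following Python sketch approximates the linear interpolation that histogram_quantile() performs over cumulative bucket counts. It is a simplified illustration, not Prometheus's implementation and not code from this PR. The function name `estimate_quantile` and the sample bucket data are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total_count). Simplified sketch only.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread uniformly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Hypothetical cumulative bucket counts, bounds in microseconds.
buckets = [(100, 1), (500, 4), (1000, 5), (5000, 6), (math.inf, 7)]
print(round(estimate_quantile(0.5, buckets), 2))  # 433.33
print(estimate_quantile(0.99, buckets))           # 5000
```

The estimate is only as precise as the bucket layout around the quantile of interest, which is one reason configurable bucket boundaries are useful.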
Related Issues
Test Plan
- L0_metrics tests pass with Triton server (requires full build environment)
- /metrics endpoint with histogram_latencies=true