
Ensemble Model only marginally faster than separate single-model calls -- is this expected? #8740

@corentin87

Description


Hi everyone,

I am evaluating whether a Triton ensemble pipeline (with BLS for conditional
logic) can outperform a two-call single-model approach for an RT-DETR
detection + ReID re-identification pipeline. In my measurements, server-side
preprocessing (Python/DALI) and postprocessing (Python) are significantly
slower than keeping pre/post-processing on the client. I'd like to understand
whether this is expected behavior or whether there are configuration
improvements I'm missing.

Setup

  • Triton Server: custom image with the Python, TensorRT and DALI backends from 2.66.0 (nvcr.io/nvidia/tritonserver:26.02-py3)
  • GPU: NVIDIA RTX A2000
  • Models:
| Model | Backend | Precision | Max Batch |
|---|---|---|---|
| rt-detr | TensorRT (GPU) | FP16 | 8 |
| reid | TensorRT (GPU) | FP16 | 256 |
| detect_preprocessing | Python (CPU) or DALI (GPU) | | |
| detect_postprocessing_reid | Python (CPU), BLS | | |

Pre/post-processing steps in detail:

  • detect_preprocessing is doing:
    • resizing
    • RGB conversion
    • normalisation
  • detect_postprocessing_reid is doing:
    • post processing on the bounding boxes
    • cropping the detections from the original image
    • calling ReID (sync)
    • normalising the feature vectors
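To make the postprocessing steps concrete, here is a minimal NumPy sketch of what detect_postprocessing_reid does. This is an illustration, not the actual model code: in Triton it runs inside the Python backend and the ReID step is a synchronous BLS `pb_utils.InferenceRequest` to the reid model, which is replaced below by a random-vector stub. Box format and score activation are assumptions.

```python
import numpy as np

def postprocess(boxes, logits, image, conf_thresh):
    """Sketch of the detect_postprocessing_reid steps (NumPy stand-in;
    in the real model the ReID step is a BLS call, not the stub below)."""
    # 1. Post-process the raw detections: sigmoid the logits and keep
    #    everything above the confidence threshold.
    scores = 1.0 / (1.0 + np.exp(-logits))
    conf = scores.max(axis=-1)
    labels = scores.argmax(axis=-1)
    keep = conf >= conf_thresh
    boxes, conf, labels = boxes[keep], conf[keep], labels[keep]

    # 2. Crop each detection out of the original image
    #    (boxes assumed here to be pixel-space x1, y1, x2, y2).
    crops = [image[int(y1):int(y2), int(x1):int(x2)]
             for x1, y1, x2, y2 in boxes]

    # 3. "Call ReID" -- random 128-d vectors stand in for the BLS request;
    #    with BLS this step only runs when crops is non-empty.
    feats = np.random.randn(len(crops), 128).astype(np.float32)

    # 4. L2-normalise the feature vectors.
    feats /= np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-12
    return boxes, conf, labels, feats
```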

Pipeline Architectures

Single mode: The client makes two separate async gRPC calls per frame and
handles all pre/post-processing locally:

Client (pre/post-processing) → gRPC 1 → rt-detr → Client (post/pre) → gRPC 2 → reid → Client (post)
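The single-mode orchestration can be sketched as follows. The two infer functions are stubs standing in for the two async gRPC round-trips (with tritonclient these would go through `tritonclient.grpc.aio.InferenceServerClient.infer`); the preprocessing, crop, and shapes are placeholders, not the real client code.

```python
import asyncio
import numpy as np

# Stubs standing in for the two gRPC round-trips; in the real client these
# are async calls to the Triton server.
async def infer_rt_detr(preprocessed):
    await asyncio.sleep(0)                        # network + inference latency
    return np.zeros((1, 4)), np.zeros((1, 2))     # boxes, logits

async def infer_reid(crops):
    await asyncio.sleep(0)
    return np.zeros((len(crops), 128))            # descriptors

async def process_frame(frame):
    x = frame.astype(np.float32) / 255.0          # client-side preprocessing
    boxes, logits = await infer_rt_detr(x)        # gRPC call 1: rt-detr
    crops = [frame for _ in boxes]                # client-side crop (sketch)
    feats = await infer_reid(crops)               # gRPC call 2: reid
    return boxes, feats                           # client-side postprocessing

boxes, feats = asyncio.run(process_frame(np.zeros((8, 8, 3), np.uint8)))
```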

Ensemble mode: A single async gRPC call drives the full pipeline server-side. BLS
is used inside detect_postprocessing_reid so that reid is only invoked when
detections are present (conditional logic that the static ensemble DAG cannot
express):
raw_image → detect_preprocessing → rt-detr → detect_postprocessing_reid (BLS) → reid → outputs

Ensemble config.pbtxt:

platform: "ensemble"
max_batch_size: 4

input [
  {
    name: "raw_image"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
  },
  {
    name: "confidence_threshold"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

output [
  {
    name: "detection_boxes_out"
    data_type: TYPE_FP32
    dims: [ 4 ]
  },
  {
    name: "detection_confidences_out"
    data_type: TYPE_FP32
    dims: [ 1 ]
  },
  {
    name: "detection_labels_out"
    data_type: TYPE_FP32
    dims: [ 1 ]
  },
  {
    name: "descriptors_out"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "detect_preprocessing"
      model_version: -1
      input_map {
        key: "image"
        value: "raw_image"
      }
      output_map {
        key: "image"
        value: "image"
      }
    },
    {
      model_name: "rt-detr"
      model_version: -1
      input_map {
        key: "inputs"
        value: "image"
      }
      output_map {
        key: "pred_boxes"
        value: "detection_boxes"
      }
      output_map {
        key: "pred_logits"
        value: "detection_logits"
      }
    },
    {
      model_name: "detect_postprocessing_reid"
      model_version: -1
      input_map {
        key: "detection_logits"
        value: "detection_logits"
      }
      input_map {
        key: "detection_boxes"
        value: "detection_boxes"
      }
      input_map {
        key: "image"
        value: "raw_image"
      }
      input_map {
        key: "confidence_threshold"
        value: "confidence_threshold"
      }
      output_map {
        key: "detection_boxes"
        value: "detection_boxes_out"
      }
      output_map {
        key: "detection_confidences"
        value: "detection_confidences_out"
      }
      output_map {
        key: "detection_labels"
        value: "detection_labels_out"
      }
      output_map {
        key: "descriptors"
        value: "descriptors_out"
      }
    }
  ]
}

Based on initial experiments, detect_preprocessing running as a Python backend
model appeared to be a bottleneck. To evaluate this, preprocessing was tested in
three variants:

  • Client-side — preprocessing on the client before sending to the ensemble
    (not part of the ensemble graph)
  • Python backend — preprocessing on the server using OpenCV (CPU)
  • DALI backend — preprocessing on the server using NVIDIA DALI (GPU)
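For reference, the preprocessing steps can be sketched in plain NumPy as below. This is illustrative only: the Python-backend variant uses OpenCV and the DALI variant a DALI pipeline, and the target size, channel order, and [0, 1] scaling are assumptions rather than the actual model parameters.

```python
import numpy as np

def preprocess(image, size=(640, 640)):
    """Illustrative NumPy version of the detect_preprocessing steps
    (the server variants use OpenCV or DALI; size/scaling are assumed)."""
    h, w = image.shape[:2]
    # Resize via nearest-neighbour index sampling (cv2.resize in practice)
    ys = np.arange(size[0]) * h // size[0]
    xs = np.arange(size[1]) * w // size[1]
    resized = image[ys][:, xs]
    # BGR -> RGB conversion
    rgb = resized[..., ::-1]
    # Normalise to [0, 1] and convert to CHW float32
    return (rgb.astype(np.float32) / 255.0).transpose(2, 0, 1)
```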

Results before Model Analyzer

Pipeline Throughput (FPS) — concurrency sweep, no dynamic batching, 1 client

| Pipeline | Pre-processing | c=1 | c=4 | c=8 |
|---|---|---|---|---|
| Single | Client | 8.0 FPS | 74.3 FPS | 73.0 FPS |
| Ensemble | Client | 11.6 FPS | 77.4 FPS | 75.3 FPS |
| Ensemble | Python backend | 13.0 FPS | 38.0 FPS | 61.4 FPS |
| Ensemble | DALI backend | 12.7 FPS | 39.7 FPS | 67.7 FPS |

Pipeline Throughput with dynamic batching enabled (concurrency=8), 1 client

| Configuration | No dynamic batching | Dynamic batching | Delta |
|---|---|---|---|
| Single + Client | 73.0 FPS | 73.6 FPS | +1% |
| Ensemble + Client | 75.3 FPS | 78.9 FPS | +5% |
| Ensemble + Python | 61.4 FPS | 56.2 FPS | −8% |
| Ensemble + DALI | 67.7 FPS | 66.2 FPS | −2% |
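For reference, dynamic batching was enabled by adding a stanza of the following shape to each composing model's config.pbtxt (the preferred batch sizes and queue delay shown here are illustrative, not the values actually tested):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```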

GPU utilisation (concurrency=8)

| Configuration | GPU Util (%) | CPU (%) |
|---|---|---|
| Single + Client | 51–54 | 41–43 |
| Ensemble + Client | 54–55 | 52 |
| Ensemble + Python | 44–46 | 57 |
| Ensemble + DALI | 52–54 | 60–62 |

Below are plots of the average per-stage times (dynamic batching enabled).


Instance count sweep for the Python and DALI backends


Results with Model Analyzer

After getting the mixed results above, I attempted to use Model Analyzer to
find the optimal configuration for each ensemble variant. However, I could not
profile the full ensemble because it includes a BLS call, so I removed the
ReID BLS call from detect_postprocessing_reid in order to run Model Analyzer
and obtain the best configs (named "config best" in the plots).

Model analyzer results:

  • detect_preprocessing_config_7: 4 CPU instances with a max batch size of 4 (Python/DALI backend)
  • rt-detr_config_3: 1 GPU instance with a max batch size of 4 (TensorRT backend)
  • detect_postprocessing_reid_config_1: 2 CPU instances with a max batch size of 4 (Python backend)

Pipeline Throughput (comparison 1 vs 6 clients)

| Configuration | Throughput, 1 client (FPS) | Throughput, 6 clients (FPS) |
|---|---|---|
| Single + Client | 73.6 | 96.0 |
| Ensemble + Python | 71.9 | 71.2 |
| Ensemble + DALI | 70.8 | 70.0 |

Observations

Using the optimal configurations from Model Analyzer for both the Python and DALI backends, the Ensemble pipeline consistently shows lower throughput than the Single Pipeline (client-side orchestration):

1 client: Ensemble is ~2–4% slower
6 clients: Ensemble is ~25% slower

This is the opposite of what we expected when moving to an ensemble. The performance gap also widens with concurrency — the more inferences in flight, the more the ensemble falls behind.

Questions

Is it expected that an ensemble (with BLS) shows only marginal — or even negative — improvement over separate client-side model calls at high concurrency?

Our hypothesis was that eliminating the second gRPC round-trip and keeping tensors server-side would yield a noticeable speedup, but the benefit appears to be minimal (or negative under load). Is this assumption wrong?

Is the Ensemble backend not recommended for workloads with a large number of concurrent requests? If so, what is the recommended pattern for chaining preprocessing → inference → postprocessing when throughput is the primary concern?

Any guidance on whether these results are in line with what's expected for
this type of pipeline, or on optimizations I might be missing, would be much
appreciated.
