Hi everyone,
I am evaluating whether a Triton ensemble pipeline (with BLS for conditional
logic) can outperform a two-call single-model approach for an RT-DETR
detection + ReID re-identification pipeline. In my tests, server-side preprocessing
(Python/DALI) and postprocessing (Python) are significantly slower than keeping
pre/post-processing on the client. I'd like to
understand if this is expected behavior or if there are configuration
improvements I'm missing.
Setup
- Triton Server: custom image with the Python, TensorRT and DALI backends from 2.66.0
(nvcr.io/nvidia/tritonserver:26.02-py3)
- GPU: NVIDIA RTX A2000
- Models:
| Model | Backend | Precision | Max Batch |
|---|---|---|---|
| rt-detr | TensorRT (GPU) | FP16 | 8 |
| reid | TensorRT (GPU) | FP16 | 256 |
| detect_preprocessing | Python (CPU) or DALI (GPU) | — | — |
| detect_postprocessing_reid | Python (CPU), BLS | — | — |
Pre/post-processing steps in detail:
- detect_preprocessing performs:
  - resizing
  - RGB conversion
  - normalisation
- detect_postprocessing_reid performs:
  - post-processing of the bounding boxes
  - cropping the detections from the original image
  - calling ReID (sync)
  - normalising the feature vectors
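For reference, the preprocessing and the final descriptor normalisation are numerically light. Here is a minimal NumPy sketch of the logic (the output size, the nearest-neighbour resize and the [0, 1] scaling are my assumptions for illustration, not the exact model code):

```python
import numpy as np

def preprocess(img_bgr, out_h=640, out_w=640):
    """Resize (nearest-neighbour for brevity), BGR -> RGB, scale to [0, 1], CHW."""
    h, w = img_bgr.shape[:2]
    ys = np.arange(out_h) * h // out_h          # nearest-neighbour row indices
    xs = np.arange(out_w) * w // out_w          # nearest-neighbour column indices
    resized = img_bgr[ys][:, xs]
    rgb = resized[..., ::-1]                    # BGR -> RGB
    return (rgb.astype(np.float32) / 255.0).transpose(2, 0, 1)

def l2_normalise(features, eps=1e-12):
    """Normalise ReID feature vectors to unit length."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)
```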
Pipeline Architectures
Single mode: The client makes two separate async gRPC calls per frame and
handles all pre/post-processing locally:
Client (pre/post-processing) → gRPC 1 → rt-detr → Client (post/pre) → gRPC 2 → reid → Client (post)
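Schematically, with the two gRPC round-trips stubbed out (the function names and timings here are illustrative, not the tritonclient API), the per-frame flow looks like this, with multiple frames in flight at once to reach a given concurrency:

```python
import asyncio

async def infer_rtdetr(preprocessed):   # stub for gRPC call 1 (rt-detr)
    await asyncio.sleep(0.01)
    return "boxes"

async def infer_reid(crops):            # stub for gRPC call 2 (reid)
    await asyncio.sleep(0.01)
    return "descriptors"

async def process_frame(frame):
    pre = f"pre({frame})"               # client-side preprocessing
    boxes = await infer_rtdetr(pre)     # round-trip 1
    crops = f"crops({boxes})"           # client-side post/pre between calls
    descriptors = await infer_reid(crops)  # round-trip 2
    return descriptors                  # client-side postprocessing

async def main(n_frames=8):
    # Concurrency c=8: eight frames in flight, each doing two round-trips.
    return await asyncio.gather(*(process_frame(i) for i in range(n_frames)))

results = asyncio.run(main())
```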
Ensemble mode: A single async gRPC call drives the full pipeline server-side. BLS
is used inside detect_postprocessing_reid so that reid is only invoked when
detections are present (conditional logic that the static ensemble DAG cannot
express):
raw_image → detect_preprocessing → rt-detr → detect_postprocessing_reid (BLS) → reid → outputs
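The conditional inside detect_postprocessing_reid looks roughly like this, sketched in plain Python with the BLS call to reid stubbed out (in the real model it is a pb_utils.InferenceRequest executed synchronously; the 128-dim descriptor size matches the ensemble outputs above, the rest is illustrative):

```python
import numpy as np

def reid_bls_call(crops):
    """Stub standing in for the synchronous BLS call to the reid model."""
    return np.ones((len(crops), 128), dtype=np.float32)

def postprocess_with_conditional_reid(scores, boxes, image, threshold):
    keep = scores >= threshold
    boxes = boxes[keep]
    if len(boxes) == 0:
        # No detections: skip the reid call entirely and return empty outputs.
        return boxes, np.empty((0, 128), dtype=np.float32)
    crops = [image[int(y1):int(y2), int(x1):int(x2)]
             for x1, y1, x2, y2 in boxes]
    descriptors = reid_bls_call(crops)
    return boxes, descriptors
```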
Ensemble config.pbtxt:
platform: "ensemble"
max_batch_size: 4
input [
{
name: "raw_image"
data_type: TYPE_UINT8
dims: [ -1, -1, 3 ]
},
{
name: "confidence_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
}
]
output [
{
name: "detection_boxes_out"
data_type: TYPE_FP32
dims: [ 4 ]
},
{
name: "detection_confidences_out"
data_type: TYPE_FP32
dims: [ 1 ]
},
{
name: "detection_labels_out"
data_type: TYPE_FP32
dims: [ 1 ]
},
{
name: "descriptors_out"
data_type: TYPE_FP32
dims: [ 128 ]
}
]
ensemble_scheduling {
step [
{
model_name: "detect_preprocessing"
model_version: -1
input_map {
key: "image"
value: "raw_image"
}
output_map {
key: "image"
value: "image"
}
},
{
model_name: "rt-detr"
model_version: -1
input_map {
key: "inputs"
value: "image"
}
output_map {
key: "pred_boxes"
value: "detection_boxes"
}
output_map {
key: "pred_logits"
value: "detection_logits"
}
},
{
model_name: "detect_postprocessing_reid"
model_version: -1
input_map {
key: "detection_logits"
value: "detection_logits"
}
input_map {
key: "detection_boxes"
value: "detection_boxes"
}
input_map {
key: "image"
value: "raw_image"
}
input_map {
key: "confidence_threshold"
value: "confidence_threshold"
}
output_map {
key: "detection_boxes"
value: "detection_boxes_out"
}
output_map {
key: "detection_confidences"
value: "detection_confidences_out"
}
output_map {
key: "detection_labels"
value: "detection_labels_out"
}
output_map {
key: "descriptors"
value: "descriptors_out"
}
}
]
}
Based on initial experiments, detect_preprocessing running as a Python backend
model appeared to be a bottleneck. To evaluate this, preprocessing was tested in
three variants:
- Client-side — preprocessing on the client before sending to the ensemble
(not part of the ensemble graph)
- Python backend — preprocessing on the server using OpenCV (CPU)
- DALI backend — preprocessing on the server using NVIDIA DALI (GPU)
Results before Model Analyzer
Pipeline Throughput (FPS) — concurrency sweep, no dynamic batching, 1 client
| Pipeline | Pre-processing | c=1 | c=4 | c=8 |
|---|---|---|---|---|
| Single | Client | 8.0 FPS | 74.3 FPS | 73.0 FPS |
| Ensemble | Client | 11.6 FPS | 77.4 FPS | 75.3 FPS |
| Ensemble | Python Backend | 13.0 FPS | 38.0 FPS | 61.4 FPS |
| Ensemble | DALI Backend | 12.7 FPS | 39.7 FPS | 67.7 FPS |
Pipeline Throughput with dynamic batching enabled (concurrency=8), 1 client
| Configuration | No dynamic batching | Dynamic batching | Delta |
|---|---|---|---|
| Single + Client | 73.0 FPS | 73.6 FPS | +1% |
| Ensemble + Client | 75.3 FPS | 78.9 FPS | +5% |
| Ensemble + Python | 61.4 FPS | 56.2 FPS | −8% |
| Ensemble + DALI | 67.7 FPS | 66.2 FPS | −2% |
GPU utilisation (concurrency=8)
| Configuration | GPU Util (%) | CPU (%) |
|---|---|---|
| Single + Client | 51–54 | 41–43 |
| Ensemble + Client | 54–55 | 52 |
| Ensemble + Python | 44–46 | 57 |
| Ensemble + DALI | 52–54 | 60–62 |
Below are the plots of the different average times (dynamic batching enabled).
Instance count sweep for the Python and DALI backends
Results with Model Analyzer
After getting these mixed results, I attempted to use Model Analyzer to find the
optimal configuration for each ensemble variant. However, I could not profile my ensemble because it includes a BLS call.
Therefore, I removed the ReID BLS call from detect_postprocessing_reid to be able to run MA and get the best config (named "config best" in the plots).
Model analyzer results:
- detect_preprocessing_config_7: 4 CPU instances with a max batch size of 4 on platform python/dali
- rt-detr_config_3: 1 GPU instance with a max batch size of 4 on platform tensorrt
- detect_postprocessing_reid_config_1: 2 CPU instances with a max batch size of 4 on platform python
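For reference, the MA recommendation for detect_preprocessing maps onto a config.pbtxt fragment like this (a sketch built from the result above; only the fields shown are implied by the MA output):

```
backend: "python"
max_batch_size: 4
instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]
dynamic_batching { }
```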
Pipeline Throughput (comparison 1 vs 6 clients)
| Configuration | Throughput (1 client, FPS) | Throughput (6 clients, FPS) |
|---|---|---|
| Single + Client | 73.6 | 96.0 |
| Ensemble + Python | 71.9 | 71.2 |
| Ensemble + DALI | 70.8 | 70.0 |
Observations
Using the optimal configurations from Model Analyzer for both the Python and DALI backends, the Ensemble pipeline consistently shows lower throughput than the Single Pipeline (client-side orchestration):
- 1 client: Ensemble is ~2–4% slower
- 6 clients: Ensemble is ~25% slower
This is the opposite of what we expected when moving to an ensemble. The performance gap also widens with concurrency — the more inferences in flight, the more the ensemble falls behind.
Questions
Is it expected that an ensemble (with BLS) shows only marginal — or even negative — improvement over separate client-side model calls at high concurrency?
Our hypothesis was that eliminating the second gRPC round-trip and keeping tensors server-side would yield a noticeable speedup, but the benefit appears to be minimal (or negative under load). Is this assumption wrong?
Is the Ensemble backend not recommended for workloads with a large number of concurrent requests? If so, what is the recommended pattern for chaining preprocessing → inference → postprocessing when throughput is the primary concern?
Any guidance on whether these results are in line with what's expected for this
type of pipeline, or if there are optimizations I am missing, would be very
appreciated.