Hi everyone,
I am evaluating whether a Triton ensemble pipeline (with BLS for conditional
logic) can outperform a two-call single-model approach for an RT-DETR
detection + ReID re-identification pipeline. In my tests, server-side preprocessing
(Python/DALI) and postprocessing (Python) are significantly slower than keeping
pre/post-processing on the client. I'd like to
understand if this is expected behavior or if there are configuration
improvements I'm missing.
Setup
- Triton Server: custom image with the Python, TensorRT and DALI backends from 2.66.0
(nvcr.io/nvidia/tritonserver:26.02-py3)
- GPU: NVIDIA RTX A2000
- Models:
| Model | Backend | Precision | Max Batch |
|---|---|---|---|
| rt-detr | TensorRT (GPU) | FP16 | 8 |
| reid | TensorRT (GPU) | FP16 | 256 |
| detect_preprocessing | Python (CPU) or DALI (GPU) | — | — |
| detect_postprocessing_reid | Python (CPU), BLS | — | — |
Pre/post-processing steps in detail:
- detect_preprocessing performs:
  - resizing
  - RGB conversion
  - normalisation
- detect_postprocessing_reid performs:
  - post-processing of the bounding boxes
  - cropping the detections from the original image
  - calling ReID (sync)
  - normalising the feature vectors
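For reference, the preprocessing and the final descriptor normalisation are numerically light. Here is a minimal NumPy sketch of the logic (the output size, the nearest-neighbour resize and the [0, 1] scaling are my assumptions for illustration, not the exact model code):

```python
import numpy as np

def preprocess(img_bgr, out_h=640, out_w=640):
    """Resize (nearest-neighbour for brevity), BGR -> RGB, scale to [0, 1], CHW."""
    h, w = img_bgr.shape[:2]
    ys = np.arange(out_h) * h // out_h          # nearest-neighbour row indices
    xs = np.arange(out_w) * w // out_w          # nearest-neighbour column indices
    resized = img_bgr[ys][:, xs]
    rgb = resized[..., ::-1]                    # BGR -> RGB
    return (rgb.astype(np.float32) / 255.0).transpose(2, 0, 1)

def l2_normalise(features, eps=1e-12):
    """Normalise ReID feature vectors to unit length."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)
```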
Pipeline Architectures
Single mode: The client makes two separate async gRPC calls per frame and
handles all pre/post-processing locally:
Client (pre/post-processing) → gRPC 1 → rt-detr → Client (post/pre) → gRPC 2 → reid → Client (post)
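Schematically, with the two gRPC round-trips stubbed out (the function names and timings here are illustrative, not the tritonclient API), the per-frame flow looks like this, with multiple frames in flight at once to reach a given concurrency:

```python
import asyncio

async def infer_rtdetr(preprocessed):   # stub for gRPC call 1 (rt-detr)
    await asyncio.sleep(0.01)
    return "boxes"

async def infer_reid(crops):            # stub for gRPC call 2 (reid)
    await asyncio.sleep(0.01)
    return "descriptors"

async def process_frame(frame):
    pre = f"pre({frame})"               # client-side preprocessing
    boxes = await infer_rtdetr(pre)     # round-trip 1
    crops = f"crops({boxes})"           # client-side post/pre between calls
    descriptors = await infer_reid(crops)  # round-trip 2
    return descriptors                  # client-side postprocessing

async def main(n_frames=8):
    # Concurrency c=8: eight frames in flight, each doing two round-trips.
    return await asyncio.gather(*(process_frame(i) for i in range(n_frames)))

results = asyncio.run(main())
```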
Ensemble mode: A single async gRPC call drives the full pipeline server-side. BLS
is used inside detect_postprocessing_reid so that reid is only invoked when
detections are present (conditional logic that the static ensemble DAG cannot
express):
raw_image → detect_preprocessing → rt-detr → detect_postprocessing_reid (BLS) → reid → outputs
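The conditional inside detect_postprocessing_reid looks roughly like this, sketched in plain Python with the BLS call to reid stubbed out (in the real model it is a pb_utils.InferenceRequest executed synchronously; the 128-dim descriptor size matches the ensemble outputs above, the rest is illustrative):

```python
import numpy as np

def reid_bls_call(crops):
    """Stub standing in for the synchronous BLS call to the reid model."""
    return np.ones((len(crops), 128), dtype=np.float32)

def postprocess_with_conditional_reid(scores, boxes, image, threshold):
    keep = scores >= threshold
    boxes = boxes[keep]
    if len(boxes) == 0:
        # No detections: skip the reid call entirely and return empty outputs.
        return boxes, np.empty((0, 128), dtype=np.float32)
    crops = [image[int(y1):int(y2), int(x1):int(x2)]
             for x1, y1, x2, y2 in boxes]
    descriptors = reid_bls_call(crops)
    return boxes, descriptors
```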
Ensemble config.pbtxt:
platform: "ensemble"
max_batch_size: 4
input [
{
name: "raw_image"
data_type: TYPE_UINT8
dims: [ -1, -1, 3 ]
},
{
name: "confidence_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
}
]
output [
{
name: "detection_boxes_out"
data_type: TYPE_FP32
dims: [ 4 ]
},
{
name: "detection_confidences_out"
data_type: TYPE_FP32
dims: [ 1 ]
},
{
name: "detection_labels_out"
data_type: TYPE_FP32
dims: [ 1 ]
},
{
name: "descriptors_out"
data_type: TYPE_FP32
dims: [ 128 ]
}
]
ensemble_scheduling {
step [
{
model_name: "detect_preprocessing"
model_version: -1
input_map {
key: "image"
value: "raw_image"
}
output_map {
key: "image"
value: "image"
}
},
{
model_name: "rt-detr"
model_version: -1
input_map {
key: "inputs"
value: "image"
}
output_map {
key: "pred_boxes"
value: "detection_boxes"
}
output_map {
key: "pred_logits"
value: "detection_logits"
}
},
{
model_name: "detect_postprocessing_reid"
model_version: -1
input_map {
key: "detection_logits"
value: "detection_logits"
}
input_map {
key: "detection_boxes"
value: "detection_boxes"
}
input_map {
key: "image"
value: "raw_image"
}
input_map {
key: "confidence_threshold"
value: "confidence_threshold"
}
output_map {
key: "detection_boxes"
value: "detection_boxes_out"
}
output_map {
key: "detection_confidences"
value: "detection_confidences_out"
}
output_map {
key: "detection_labels"
value: "detection_labels_out"
}
output_map {
key: "descriptors"
value: "descriptors_out"
}
}
]
}
Based on initial experiments, detect_preprocessing running as a Python backend
model appeared to be a bottleneck. To evaluate this, preprocessing was tested in
three variants:
- Client-side — preprocessing on the client before sending to the ensemble
(not part of the ensemble graph)
- Python backend — preprocessing on the server using OpenCV (CPU)
- DALI backend — preprocessing on the server using NVIDIA DALI (GPU)
Results before Model Analyzer
Pipeline Throughput (FPS) — concurrency sweep, no dynamic batching, 1 client
| Pipeline | Pre-processing | c=1 | c=4 | c=8 |
|---|---|---|---|---|
| Single | Client | 8.0 FPS | 74.3 FPS | 73.0 FPS |
| Ensemble | Client | 11.6 FPS | 77.4 FPS | 75.3 FPS |
| Ensemble | Python Backend | 13.0 FPS | 38.0 FPS | 61.4 FPS |
| Ensemble | DALI Backend | 12.7 FPS | 39.7 FPS | 67.7 FPS |
Pipeline Throughput with dynamic batching enabled (concurrency=8), 1 client
| Configuration | No dynamic batching | Dynamic batching | Delta |
|---|---|---|---|
| Single + Client | 73.0 FPS | 73.6 FPS | +1% |
| Ensemble + Client | 75.3 FPS | 78.9 FPS | +5% |
| Ensemble + Python | 61.4 FPS | 56.2 FPS | −8% |
| Ensemble + DALI | 67.7 FPS | 66.2 FPS | −2% |
GPU utilisation (concurrency=8)
| Configuration | GPU Util (%) | CPU (%) |
|---|---|---|
| Single + Client | 51–54 | 41–43 |
| Ensemble + Client | 54–55 | 52 |
| Ensemble + Python | 44–46 | 57 |
| Ensemble + DALI | 52–54 | 60–62 |
Below are the plots of the different average times (dynamic batching enabled).
Instance count sweep for the Python and DALI backends
Results with Model Analyzer
After getting these mixed results, I attempted to use Model Analyzer to find the
optimal configuration for each ensemble variant. However, I could not profile my ensemble because it includes a BLS call.
Therefore, I removed the ReID BLS call from detect_postprocessing_reid to be able to run MA and get the best config (named "config best" in the plots).
Model analyzer results:
- detect_preprocessing_config_7: 4 CPU instances with a max batch size of 4 on platform python/dali
- rt-detr_config_3: 1 GPU instance with a max batch size of 4 on platform tensorrt
- detect_postprocessing_reid_config_1: 2 CPU instances with a max batch size of 4 on platform python
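For reference, the MA recommendation for detect_preprocessing maps onto a config.pbtxt fragment like this (a sketch built from the result above; only the fields shown are implied by the MA output):

```
backend: "python"
max_batch_size: 4
instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]
dynamic_batching { }
```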
Pipeline Throughput (comparison 1 vs 6 clients)
| Configuration | Throughput (1 client, FPS) | Throughput (6 clients, FPS) |
|---|---|---|
| Single + Client | 73.6 | 96.0 |
| Ensemble + Python | 71.9 | 71.2 |
| Ensemble + DALI | 70.8 | 70.0 |
Observations
Using the optimal configurations from Model Analyzer for both the Python and DALI backends, the Ensemble pipeline consistently shows lower throughput than the Single Pipeline (client-side orchestration):
- 1 client: Ensemble is ~2–4% slower
- 6 clients: Ensemble is ~25% slower
This is the opposite of what we expected when moving to an ensemble. The performance gap also widens with concurrency — the more inferences in flight, the more the ensemble falls behind.
Questions
Is it expected that an ensemble (with BLS) shows only marginal — or even negative — improvement over separate client-side model calls at high concurrency?
Our hypothesis was that eliminating the second gRPC round-trip and keeping tensors server-side would yield a noticeable speedup, but the benefit appears to be minimal (or negative under load). Is this assumption wrong?
Is the Ensemble backend not recommended for workloads with a large number of concurrent requests? If so, what is the recommended pattern for chaining preprocessing → inference → postprocessing when throughput is the primary concern?
Any guidance on whether these results are in line with what's expected for this
type of pipeline, or if there are optimizations I am missing, would be very
appreciated.