
fix: async model load/unload to prevent evhtp thread starvation#8737

Open
itsnothuy wants to merge 5 commits into triton-inference-server:main from
itsnothuy:fix-8635-evhtp-thread-imbalance

Conversation

@itsnothuy

What does the PR do?

Fixes synchronous model load/unload calls blocking evhtp worker threads
in HandleRepositoryControl(), causing HTTP thread starvation under
concurrent load/unload traffic. Health probes, inference requests, and
metadata queries are starved until all load/unload operations complete.

The fix applies the same evhtp_request_pause → detached std::thread →
evthr_defer async pattern already used by InferRequestClass for
inference requests. A new ControlRequestClass (modelled exactly on
InferRequestClass) handles the lifecycle: capture the owning evhtp
thread, pause the request to free the worker, run the blocking
TRITONSERVER_ServerLoadModelWithParameters /
TRITONSERVER_ServerUnloadModel call on a detached thread, then post
the reply back via evthr_defer. An std::atomic<int> counter gates
concurrency and returns HTTP 503 when the limit is reached, preventing
thread explosion.


Checklist

  • I have read the Contribution guidelines and signed the Contributor License
    Agreement
  • PR title reflects the change and is of the format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all-files)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

None. This change is self-contained to src/http_server.h, src/http_server.cc, and src/main.cc. It does not touch the gRPC, SageMaker, or Vertex AI servers, or any backend.


Where should the reviewer start?

  1. src/http_server.h: ControlRequestClass (new inner class,
    ~65 lines, after InferRequestClass's enclosing section). Compare
    side-by-side with InferRequestClass to verify the async pattern is
    identical.

  2. src/http_server.cc: HandleRepositoryControl() (~line 1447).
    The load path and unload path both follow the same structure:
    parse synchronously → gate with fetch_add → new ControlRequestClass(req)
    → std::thread(...).detach() → inside the thread: call the API,
    fetch_sub, evthr_defer(ReplyCallback).

  3. src/main.cc: StartHttpService() (~line 138). One-liner:
    model_load_thread_count_ inserted as the new
    control_request_concurrency argument to HTTPAPIServer::Create().

  4. Constructor init in src/http_server.cc (~line 1167):
    max_control_requests_ = control_request_concurrency, plus a log line.
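
Stitched together, the load path reads roughly like the pseudocode below. This is non-compiling pseudocode, not the literal diff: error handling and JSON parsing are elided, and ctrl_req->thread() is a hypothetical accessor for the captured evhtp thread; the remaining identifiers are the ones named in this description.

```cpp
// Pseudocode only: will not compile as-is.
void HTTPAPIServer::HandleRepositoryControl(evhtp_request_t* req /*, ... */)
{
  // 1. Parse the request body synchronously (cheap, non-blocking).
  // ...
  // 2. Gate: reject with 503 once the concurrency limit is reached.
  if (control_request_cnt_.fetch_add(1) >= max_control_requests_) {
    control_request_cnt_.fetch_sub(1);
    evhtp_send_reply(req, 503);
    return;
  }
  // 3. Capture the owning evhtp thread and pause the request, freeing
  //    the worker immediately.
  auto* ctrl_req = new ControlRequestClass(req);  // records htpconn->thread
  evhtp_request_pause(req);
  // 4. Run the blocking call on a detached thread.
  std::thread([this, ctrl_req /*, ... */]() {
    TRITONSERVER_ServerLoadModelWithParameters(server_.get() /*, ... */);
    control_request_cnt_.fetch_sub(1);
    // 5. Post the reply back onto the original evhtp thread, where
    //    ReplyCallback runs evhtp_send_reply + evhtp_request_resume.
    evthr_defer(ctrl_req->thread(), ControlRequestClass::ReplyCallback,
                ctrl_req);
  }).detach();
}
```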


Test plan:

Manual reproduction of the bug (before this fix):

# 1. Start Triton with a slow-loading Python model (time.sleep(60) in initialize())
tritonserver \
  --model-repository=/models \
  --model-control-mode=explicit \
  --http-thread-count=2 \
  --model-load-thread-count=2

# 2. In parallel: issue 2 concurrent load requests + 1 health probe
curl -X POST localhost:8000/v2/repository/models/slow_model_1/load &
curl -X POST localhost:8000/v2/repository/models/slow_model_2/load &
# Before fix: this hangs for 60s — all evhtp threads are blocked:
curl localhost:8000/v2/health/live

# After fix: health probe returns 200 immediately even while loads are in-flight

Existing test coverage:

  • No new L0 test added in this PR. The fix follows the same pattern
    as InferRequestClass, which is covered by existing qa/L0_http/
    inference tests (so the evhtp_request_pause /
    evthr_defer lifecycle is already exercised).
  • Existing qa/L0_http/ tests should pass without regression; the
    async path is only taken when --model-load-thread-count > 0
    (default 4).

Note: I do not yet have access to a Linux + CUDA build environment
to run the full CI locally. I am happy to add an L0 test in a follow-up
PR once the approach is confirmed acceptable, or if a reviewer can
advise on the test infrastructure setup.


Caveats:

  1. --model-load-thread-count dual use: This flag now controls both
    Triton's internal model loading parallelism
    (TRITONSERVER_ServerOptionsSetModelLoadThreadCount) and the HTTP
    concurrency limit for load/unload requests. Setting it to 0 disables
    the async path and falls back to the original synchronous behaviour.
    Documentation update for this dual use is pending and can be done in a
    follow-up.
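
For reference, the two modes look like this on the command line (illustrative invocations of the existing flags named above, not output from a test run):

```shell
# Default: --model-load-thread-count is 4, so up to 4 concurrent HTTP
# load/unload requests run asynchronously; further requests get HTTP 503.
tritonserver --model-repository=/models --model-control-mode=explicit

# Opt out: 0 disables the async HTTP path and restores the original
# synchronous load/unload behaviour.
tritonserver --model-repository=/models --model-control-mode=explicit \
  --model-load-thread-count=0
```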

  2. Shutdown safety: Detached threads hold raw pointers to
    HTTPAPIServer members (control_request_cnt_, server_.get()).
    If the server is destroyed while a model load/unload is in-flight, a
    use-after-free is theoretically possible. This is the same
    pre-existing risk
    accepted by InferRequestClass; the fix does not
    introduce a new hazard. A comprehensive fix (e.g., shared_ptr
    to the counter, or joining on shutdown) should address all async paths
    simultaneously in a separate PR.

  3. evthr_defer return value unchecked: Same as InferRequestClass
    line 4017 in the existing code. If the event loop has stopped,
    ctrl_req will leak. Not introduced by this PR.

  4. Second Create() overload (options-map path, used in some
    embedded/SageMaker deployments): hardcodes control_concurrency = 4
    and reads from an options key "control_request_concurrency" that is
    not currently set by any caller. Normal main.cc startup is
    unaffected. Happy to align this in a follow-up if the reviewer prefers.


Background

Reported in issue #8635 by @aleksn7. Root cause confirmed:
HandleRepositoryControl() in src/http_server.cc calls
TRITONSERVER_ServerLoadModelWithParameters() synchronously on the evhtp
worker thread with no evhtp_request_pause() before the call. The evhtp
thread is fully blocked for the duration of model initialization
(potentially minutes for large models). With --http-thread-count=N,
N concurrent load requests exhaust the entire thread pool, starving
inference and Kubernetes health probes.

@aleksn7 provided explicit guidance on the approach in the issue:

"another approach is to skip the thread pool entirely and create a new
thread for each model_repository call … don't forget to limit the
concurrent request count … you could use --model-load-thread-count
CLI parameter."

This PR implements exactly that approach.

The InferRequestClass async pattern (lines 3931–3944, 3887–3910 of
src/http_server.cc) was used as the template. ControlRequestClass
is a simplified version: no inference cancellation in the fini hook
(not needed — there is no in-flight inference to cancel), no response
count tracking. All other mechanics (evhtp_request_pause,
htpconn->thread capture, evhtp_hook_on_request_fini,
evthr_defer, evhtp_send_reply + evhtp_request_resume in the
callback) are identical.


Related Issues:


Fixes triton-inference-server#8635

HandleRepositoryControl() was calling TRITONSERVER_ServerLoadModelWithParameters()
and TRITONSERVER_ServerUnloadModel() synchronously on evhtp worker threads. Under
concurrent model load/unload traffic all http-thread-count evhtp workers could
become blocked, starving inference requests, health probes, and metadata queries.

Apply the same async pattern used by InferRequestClass for inference:
- New ControlRequestClass captures the evhtp thread and calls
  evhtp_request_pause(req) to free the worker immediately.
- A detached std::thread executes the blocking TRITONSERVER API call.
- evthr_defer posts ReplyCallback back onto the original evhtp thread
  where evhtp_send_reply() + evhtp_request_resume() execute safely.
- An std::atomic<int> concurrency gate (sourced from --model-load-thread-count,
  default 4) caps concurrent detached threads and returns HTTP 503 when
  the limit is exceeded, preventing unbounded thread creation.
- When --model-load-thread-count=0 the handler falls back to the original
  synchronous behavior for backward compatibility.

Copilot AI review requested due to automatic review settings (April 14, 2026 09:23)

Copilot AI left a comment


Pull request overview

This PR fixes HTTP evhtp worker thread starvation during model repository load/unload by offloading the blocking Triton load/unload calls onto detached threads and replying back on the owning evhtp thread via evthr_defer, with a concurrency gate to prevent unbounded thread creation.

Changes:

  • Add an async request lifecycle for repository control (load/unload) modeled after the existing inference async pattern (evhtp_request_pause → detached thread → evthr_defer reply).
  • Introduce a concurrency limit for concurrent load/unload operations and return HTTP 503 when exceeded.
  • Wire the new control-request concurrency parameter through HTTPAPIServer::Create() from --model-load-thread-count.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

Reviewed files:

  • src/main.cc: Passes model_load_thread_count_ into HTTPAPIServer::Create() as the control-request concurrency limit.
  • src/http_server.h: Adds ControlRequestClass and new control-request concurrency members; updates Create()/ctor signatures.
  • src/http_server.cc: Implements async load/unload handling in HandleRepositoryControl(), initializes the concurrency limit, updates Create() overloads.
Comments suppressed due to low confidence (1)

src/http_server.cc:4944

  • This comment says "Use model_load_thread_count as concurrency limit" but the code actually reads control_request_concurrency from the options map (default 4). Please update the comment to match the implementation, or rename the option/key if the intent is to wire through model_load_thread_count.
    return;
  } else if (RE2::FullMatch(
                 std::string(req->uri->path->full), modelcontrol_regex_,


itsnothuy and others added 2 commits April 14, 2026 02:40
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Move ControlRequestClass::ReplyCallback definition to .cc file
  (was using anonymous namespace symbols from header - wouldn't compile)
- Remove unused <functional> include from http_server.h
- Fix inaccurate comment about std::function copyability
- Wrap std::thread creation in try/catch for both load and unload paths
  to handle std::system_error if thread creation fails under resource
  exhaustion (decrement counter, resume request, cleanup ctrl_req)

Fixes triton-inference-server#8635


Successfully merging this pull request may close these issues.

HTTP Connection Distribution Imbalance in evhtp Causes Sequential Request Processing

2 participants