fix: async model load/unload to prevent evhtp thread starvation #8737

Open

itsnothuy wants to merge 5 commits into triton-inference-server:main
Conversation
Fixes triton-inference-server#8635. `HandleRepositoryControl()` was calling `TRITONSERVER_ServerLoadModelWithParameters()` and `TRITONSERVER_ServerUnloadModel()` synchronously on evhtp worker threads. Under concurrent model load/unload traffic, all `--http-thread-count` evhtp workers could become blocked, starving inference requests, health probes, and metadata queries. This change applies the same async pattern used by `InferRequestClass` for inference:

- A new `ControlRequestClass` captures the evhtp thread and calls `evhtp_request_pause(req)` to free the worker immediately.
- A detached `std::thread` executes the blocking TRITONSERVER API call.
- `evthr_defer` posts `ReplyCallback` back onto the original evhtp thread, where `evhtp_send_reply()` + `evhtp_request_resume()` execute safely.
- An `std::atomic<int>` concurrency gate (sourced from `--model-load-thread-count`, default 4) caps concurrent detached threads and returns HTTP 503 when the limit is exceeded, preventing unbounded thread creation.
- When `--model-load-thread-count=0`, the handler falls back to the original synchronous behavior for backward compatibility.
Contributor
Pull request overview
This PR fixes HTTP evhtp worker thread starvation during model repository load/unload by offloading the blocking Triton load/unload calls onto detached threads and replying back on the owning evhtp thread via evthr_defer, with a concurrency gate to prevent unbounded thread creation.
Changes:
- Add an async request lifecycle for repository control (load/unload) modeled after the existing inference async pattern (`evhtp_request_pause` → detached thread → `evthr_defer` reply).
- Introduce a concurrency limit for concurrent load/unload operations and return HTTP 503 when exceeded.
- Wire the new control-request concurrency parameter through `HTTPAPIServer::Create()` from `--model-load-thread-count`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/main.cc | Passes model_load_thread_count_ into HTTPAPIServer::Create() as the control-request concurrency limit. |
| src/http_server.h | Adds ControlRequestClass and new control-request concurrency members; updates Create()/ctor signatures. |
| src/http_server.cc | Implements async load/unload handling in HandleRepositoryControl(), initializes concurrency limit, updates Create() overloads. |
Comments suppressed due to low confidence (1)
src/http_server.cc:4944
- This comment says "Use model_load_thread_count as concurrency limit" but the code actually reads `control_request_concurrency` from the options map (default 4). Please update the comment to match the implementation, or rename the option/key if the intent is to wire through `model_load_thread_count`.
```cpp
    return;
  } else if (RE2::FullMatch(
                 std::string(req->uri->path->full), modelcontrol_regex_,
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Move `ControlRequestClass::ReplyCallback` definition to the .cc file (it was using anonymous-namespace symbols from the header and would not compile).
- Remove unused `<functional>` include from http_server.h.
- Fix inaccurate comment about `std::function` copyability.
- Wrap `std::thread` creation in try/catch for both load and unload paths to handle `std::system_error` if thread creation fails under resource exhaustion (decrement counter, resume request, clean up `ctrl_req`).

Fixes triton-inference-server#8635
itsnothuy added a commit to itsnothuy/server that referenced this pull request on Apr 14, 2026
What does the PR do?
Fixes synchronous model load/unload calls blocking evhtp worker threads in `HandleRepositoryControl()`, causing HTTP thread starvation under concurrent load/unload traffic. Health probes, inference requests, and metadata queries are starved until all load/unload operations complete.

The fix applies the same `evhtp_request_pause` → detached `std::thread` → `evthr_defer` async pattern already used by `InferRequestClass` for inference requests. A new `ControlRequestClass` (modelled exactly on `InferRequestClass`) handles the lifecycle: capture the owning evhtp thread, pause the request to free the worker, run the blocking `TRITONSERVER_ServerLoadModelWithParameters`/`TRITONSERVER_ServerUnloadModel` call on a detached thread, then post the reply back via `evthr_defer`. An `std::atomic<int>` counter gates concurrency and returns HTTP 503 when the limit is reached, preventing thread explosion.
Related PRs:
None. This change is self-contained to `src/http_server.h`, `src/http_server.cc`, and `src/main.cc`. It does not touch the gRPC, SageMaker, or Vertex AI servers, or any backend.
Where should the reviewer start?
src/http_server.h—ControlRequestClass(new inner class,~65 lines after
InferRequestClass's enclosing section). Compareside-by-side with
InferRequestClassto verify the async pattern isidentical.
src/http_server.cc—HandleRepositoryControl()(~line 1447).The load path and unload path both follow the same structure:
parse synchronously → gate with
fetch_add→new ControlRequestClass(req)→
std::thread(...).detach()→ inside thread: call API,fetch_sub,evthr_defer(ReplyCallback).src/main.cc—StartHttpService()(~line 138). One-liner:model_load_thread_count_inserted as the newcontrol_request_concurrencyargument toHTTPAPIServer::Create().Constructor init in
src/http_server.cc(~line 1167):max_control_requests_ = control_request_concurrency+ log line.Test plan:
Manual reproduction of the bug (before this fix):
Existing test coverage:
- `ControlRequestClass` uses the same lifecycle mechanics as `InferRequestClass`, which is covered by existing `qa/L0_http/` inference tests (verifying the `evhtp_request_pause`/`evthr_defer` lifecycle is exercised).
- `qa/L0_http/` tests should pass without regression; the async path is only taken when `--model-load-thread-count > 0` (default 4).
Caveats:

- `--model-load-thread-count` dual use: this flag now controls both Triton's internal model loading parallelism (`TRITONSERVER_ServerOptionsSetModelLoadThreadCount`) and the HTTP concurrency limit for load/unload requests. Setting it to `0` disables the async path and falls back to the original synchronous behaviour. A documentation update for this dual use is pending and can be done in a follow-up.
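For illustration, the flag's two modes under this PR would look like the following (hypothetical invocations; the model-repository path is a placeholder):

```shell
# After this PR, --model-load-thread-count both sets Triton's internal
# model-load parallelism and caps concurrent async HTTP load/unload
# requests (requests beyond the cap get HTTP 503).
tritonserver --model-repository=/models --model-load-thread-count=8

# Setting it to 0 disables the async path: load/unload requests are
# handled synchronously on the evhtp worker, as before this PR.
tritonserver --model-repository=/models --model-load-thread-count=0
```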
- Shutdown safety: detached threads hold raw pointers to `HTTPAPIServer` members (`control_request_cnt_`, `server_.get()`). If the server is destroyed while a model load/unload is in flight, a use-after-free is theoretically possible. This is the same pre-existing risk accepted by `InferRequestClass`; the fix does not introduce a new hazard. A comprehensive fix (e.g., a `shared_ptr` to the counter, or joining on shutdown) should address all async paths simultaneously in a separate PR.
- `evthr_defer` return value unchecked: same as `InferRequestClass` at line 4017 in the existing code. If the event loop has stopped, `ctrl_req` will leak. Not introduced by this PR.
- Second `Create()` overload (options-map path, used in some embedded/SageMaker deployments): hardcodes `control_concurrency = 4` and reads from an options key `"control_request_concurrency"` that is not currently set by any caller. Normal `main.cc` startup is unaffected. Happy to align this in a follow-up if the reviewer prefers.
Background
Reported in issue #8635 by @aleksn7. Root cause confirmed: `HandleRepositoryControl()` in `src/http_server.cc` calls `TRITONSERVER_ServerLoadModelWithParameters()` synchronously on the evhtp worker thread with no `evhtp_request_pause()` before the call. The evhtp thread is fully blocked for the duration of model initialization (potentially minutes for large models). With `--http-thread-count=N`, N concurrent load requests exhaust the entire thread pool, starving inference and Kubernetes health probes.
@aleksn7 provided explicit guidance on the approach in the issue; this PR implements exactly that approach.
The `InferRequestClass` async pattern (lines 3931–3944 and 3887–3910 of `src/http_server.cc`) was used as the template. `ControlRequestClass` is a simplified version: no inference cancellation in the fini hook (not needed: there is no in-flight inference to cancel), and no response count tracking. All other mechanics (`evhtp_request_pause`, `htpconn->thread` capture, `evhtp_hook_on_request_fini`, `evthr_defer`, `evhtp_send_reply` + `evhtp_request_resume` in the callback) are identical.
Related Issues: triton-inference-server#8635