fix: RequestTracker counter mismatch#483
Conversation
f3c395f to
c0467bb
Compare
| irequest->SetResponseCallback( | ||
| reinterpret_cast<ResponseAllocator*>(allocator_.get()), step->get(), | ||
| ResponseComplete, step->get()); | ||
| irequest->SetReleaseCallback(RequestComplete, request_tracker_); |
There was a problem hiding this comment.
what does the release callback do? if it calls free() on request_tracker_ we have an ownership problem because the std::shared_ptr<T> will do the same.
There was a problem hiding this comment.
RequestComplete() does not directly free the RequestTracker object.
In InitStep(), the release callback userp is a heap-allocated RequestTrackerReference (std::shared_ptr<RequestTracker>), not the raw RequestTracker*. Then in RequestComplete(), we destroy that heap-allocated shared_ptr wrapper, which only drops one shared reference.
The callback also deletes the internal step TRITONSERVER_InferenceRequest* and calls DecrementCounter(), but that does not delete the RequestTracker object. The tracker lifetime is owned by EnsembleContext::request_tracker_ plus any outstanding callback-held references, and the context drops its own reference later in FinishEnsemble(). There is also an early-failure cleanup in ScheduleSteps() if InferAsync() fails before the release callback runs.
So there should not be a double-free here: the callback no longer does delete request_tracker; it only drops its shared reference.
| irequest->SetReleaseCallback(RequestComplete, request_tracker_ref.get()); | ||
|
|
||
| RETURN_IF_ERROR(irequest->PrepareForInference()); | ||
| (*step)->tracker_ref_ = request_tracker_ref.release(); |
There was a problem hiding this comment.
what is happening here?
are you releasing ownership of an allocation managed by request_tracker_ref and assigning ownership to tracker_ref_ at the same time?
Why is request_tracking_ref not a std::shared_ptr to begin with? Seems to me that it would make this whole dance less confusing and more obvious as to what is happening here.
Is there a reason to use std::unique_ptr?
There was a problem hiding this comment.
Good question. RequestTrackerReference is already a std::shared_ptr<RequestTracker> (here), so the std::unique_ptr in InitStep() is not owning the RequestTracker itself. It is only owning the heap allocation that we pass through the C void* userp release-callback API.
So there are two layers:
- the inner
shared_ptr<RequestTracker>keeps the actualRequestTrackeralive - the outer heap allocation is just the callback payload that gives
SetReleaseCallback()a stable address to store inuserp
The flow is:
InitStep()allocates a heapRequestTrackerReferencefromrequest_tracker_, which adds one shared reference to the tracker.- Before
PrepareForInference()succeeds, that heap allocation is owned by the localunique_ptr, so an early return cleans it up automatically. - After
PrepareForInference()succeeds, ownership of the heap allocation is transferred tostep->callback_tracker_ref_. - On the normal path,
RequestComplete()adopts that raw pointer into a localunique_ptr, which deletes the heap allocation when the callback exits. - On the failure path where the request never reaches the callback-owned release path,
ScheduleSteps()deletesstep->callback_tracker_ref_directly. - Later,
FinishEnsemble()drops the context's ownshared_ptrwithrequest_tracker_.reset().
So this is not a double-free of RequestTracker: the single-owner heap callback payload is destroyed in exactly one place, and destroying that payload only drops one shared_ptr<RequestTracker> reference.
* Fix RequestTracker counter mismatch in ScheduleSteps with parallel failures (cherry picked from commit feede8e)
* Fix RequestTracker counter mismatch in ScheduleSteps with parallel failures
What does the PR do?
Fixes a
RequestTrackercounter and lifetime mismatch inScheduleSteps()that could prematurely release the top-level ensemble request when a parallel step failed to enqueue, typically surfacing afterFAILED_ENQUEUE/Exceeds maximum queue sizeas a SIGSEGV during ensemble error finalization.Issue
When multiple steps were prepared in a single
ScheduleSteps()pass, the per-step cleanup path always calledRequestTracker::DecrementCounter(), even for steps that never calledIncrementCounter()because scheduling had already failed or been skipped. That could corruptinflight_request_counter_and release the top-level request whileFinishEnsemble()was still using it, leaving later error-handling and cleanup paths operating on invalid request-tracker state.Fix
Preserve the shared per-step
max_inflight_requestslimiter flow frommainwhile correcting the failure path inScheduleSteps().Keep the top-level
RequestTrackeralive across async release and finalization usingstd::shared_ptr, and route cancellation, logging, and error response paths through synchronized helper methods.Pass a heap-allocated
RequestTrackerReferencethrough the C release callback and explicitly clean it up for prepared steps that never reach the callback-owned release path.Only call
RequestTracker::DecrementCounter()for steps that actually calledIncrementCounter(), while still decrementinginflight_step_counter_for unscheduled steps and releasing any acquired limiter slot.Checklist
<commit_type>: <Title>Commit Type:
Check the conventional commit type
box here and add the label to the github PR.
Related PRs:
triton-inference-server/server#8722
Where should the reviewer start?
The request processing follows this order:
PrepareSteps() in src/ensemble_scheduler/ensemble_scheduler.cc: 1011-1015
ScheduleSteps() decision to schedule and take the lifetime ref: 1467-1474
ScheduleSteps() handoff to InferAsync() and failure capture: 1496-1506
ScheduleSteps() local cleanup for unscheduled / failed-dispatch steps: 1512-1528
FinishEnsemble() final release of the top-level tracker: 1292-1297
Test plan:
Test case here:
triton-inference-server/server#8722
47868062
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)