
fix: RequestTracker counter mismatch#483

Merged
mattwittwer merged 6 commits into main from mwittwer/ensemble-schedule-steps-counter-cleanup
Apr 14, 2026

@mattwittwer mattwittwer commented Mar 30, 2026

What does the PR do?

Fixes a RequestTracker counter and lifetime mismatch in ScheduleSteps() that could prematurely release the top-level ensemble request when a parallel step failed to enqueue. The bug typically surfaced as a SIGSEGV during ensemble error finalization, after a FAILED_ENQUEUE / "Exceeds maximum queue size" error.

Issue
When multiple steps were prepared in a single ScheduleSteps() pass, the per-step cleanup path always called RequestTracker::DecrementCounter(), even for steps that never called IncrementCounter() because scheduling had already failed or been skipped. That could corrupt inflight_request_counter_ and release the top-level request while FinishEnsemble() was still using it, leaving later error-handling and cleanup paths operating on invalid request-tracker state.
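
The mismatch can be illustrated with a toy model. This is only a sketch, not the Triton code: `CounterModel` and `UnbalancedCleanupReleasesEarly` are hypothetical names, and the model assumes the top-level request is released as soon as the counter reaches zero.

```cpp
#include <cassert>

// Hypothetical stand-in for RequestTracker. Assumption: the top-level
// request is treated as released once the in-flight counter hits zero.
struct CounterModel {
  int inflight = 0;
  bool released = false;

  void IncrementCounter() { ++inflight; }

  void DecrementCounter()
  {
    assert(inflight > 0);
    if (--inflight == 0) {
      released = true;
    }
  }
};

// Models the pre-fix cleanup: two steps are prepared in one pass, only
// step 0 is scheduled (and increments), yet cleanup decrements for step 1.
// Returns true if the request was released while step 0 was still in flight.
inline bool
UnbalancedCleanupReleasesEarly()
{
  CounterModel tracker;
  tracker.IncrementCounter();  // step 0 scheduled successfully
  // step 1 failed to enqueue and never called IncrementCounter()
  tracker.DecrementCounter();  // unconditional cleanup for step 1
  return tracker.released;     // step 0 has not completed yet
}
```

In this model the single unmatched decrement is enough to drive the counter to zero while one step is still outstanding, which matches the premature release described above.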

Fix
Preserve the shared per-step max_inflight_requests limiter flow from main while correcting the failure path in ScheduleSteps().

Keep the top-level RequestTracker alive across async release and finalization using std::shared_ptr, and route cancellation, logging, and error response paths through synchronized helper methods.

Pass a heap-allocated RequestTrackerReference through the C release callback and explicitly clean it up for prepared steps that never reach the callback-owned release path.

Only call RequestTracker::DecrementCounter() for steps that actually called IncrementCounter(), while still decrementing inflight_step_counter_ for unscheduled steps and releasing any acquired limiter slot.
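
The corrected cleanup rule can be sketched roughly as follows. All names here (`StepRecord`, `SchedulerState`, `CleanupUnscheduled`, `RunCleanupDemo`) are hypothetical, the limiter is modeled as a plain slot count, and the real ScheduleSteps() logic is more involved than this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-step bookkeeping; field names are illustrative only.
struct StepRecord {
  bool incremented_tracker;  // did this step call IncrementCounter()?
  bool holds_limiter_slot;   // did it acquire a max_inflight_requests slot?
  bool dispatched;           // did InferAsync() accept the request?
};

struct SchedulerState {
  int tracker_counter = 0;       // stands in for inflight_request_counter_
  int step_counter = 0;          // stands in for inflight_step_counter_
  int limiter_slots_in_use = 0;  // stands in for the shared limiter
};

// Local cleanup for steps that were prepared but never dispatched.
inline void
CleanupUnscheduled(const std::vector<StepRecord>& steps, SchedulerState& state)
{
  for (const StepRecord& step : steps) {
    if (step.dispatched) {
      continue;  // the release callback owns cleanup for this step
    }
    --state.step_counter;  // always undone for an unscheduled step
    if (step.holds_limiter_slot) {
      --state.limiter_slots_in_use;  // give the slot back
    }
    if (step.incremented_tracker) {
      --state.tracker_counter;  // only balance an increment that happened
    }
  }
}

// Step 0 is dispatched, step 1 incremented but failed to dispatch,
// step 2 was never scheduled at all.
inline SchedulerState
RunCleanupDemo()
{
  std::vector<StepRecord> steps = {
      {true, true, true},
      {true, true, false},
      {false, false, false},
  };
  SchedulerState state;
  state.tracker_counter = 2;
  state.step_counter = 3;
  state.limiter_slots_in_use = 2;
  CleanupUnscheduled(steps, state);
  return state;
}
```

After the pass, only the dispatched step still contributes to each counter, so the later release callback for that step finds the state it expects.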

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#8722

Where should the reviewer start?

The request processing follows this order:
PrepareSteps() in src/ensemble_scheduler/ensemble_scheduler.cc: 1011-1015
ScheduleSteps() decision to schedule and take the lifetime ref: 1467-1474
ScheduleSteps() handoff to InferAsync() and failure capture: 1496-1506
ScheduleSteps() local cleanup for unscheduled / failed-dispatch steps: 1512-1528
FinishEnsemble() final release of the top-level tracker: 1292-1297

Test plan:

Test case here:
triton-inference-server/server#8722

  • CI Pipeline ID:

47868062

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@mattwittwer mattwittwer self-assigned this Mar 30, 2026
@mattwittwer mattwittwer force-pushed the mwittwer/ensemble-schedule-steps-counter-cleanup branch from f3c395f to c0467bb Compare April 2, 2026 23:36
@mattwittwer mattwittwer changed the title draft: RequestTracker counter mismatch in ScheduleSteps with parallel fa… fix: RequestTracker counter mismatch in ScheduleSteps with parallel fa… Apr 6, 2026
@mattwittwer mattwittwer changed the title fix: RequestTracker counter mismatch in ScheduleSteps with parallel fa… fix: RequestTracker counter mismatch Apr 6, 2026
@mattwittwer mattwittwer requested review from pskiran1, whoisj and yinggeh and removed request for whoisj April 7, 2026 00:19
irequest->SetResponseCallback(
    reinterpret_cast<ResponseAllocator*>(allocator_.get()), step->get(),
    ResponseComplete, step->get());
irequest->SetReleaseCallback(RequestComplete, request_tracker_);
Contributor

What does the release callback do? If it calls free() on request_tracker_, we have an ownership problem because the std::shared_ptr<T> will do the same.

Contributor Author

RequestComplete() does not directly free the RequestTracker object.

In InitStep(), the release callback userp is a heap-allocated RequestTrackerReference (std::shared_ptr<RequestTracker>), not the raw RequestTracker*. Then in RequestComplete(), we destroy that heap-allocated shared_ptr wrapper, which only drops one shared reference.

The callback also deletes the internal step TRITONSERVER_InferenceRequest* and calls DecrementCounter(), but that does not delete the RequestTracker object. The tracker lifetime is owned by EnsembleContext::request_tracker_ plus any outstanding callback-held references, and the context drops its own reference later in FinishEnsemble(). There is also an early-failure cleanup in ScheduleSteps() if InferAsync() fails before the release callback runs.

So there should not be a double-free here: the callback no longer does delete request_tracker; it only drops its shared reference.
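
As a rough illustration of why the callback cannot double-free the tracker, here is a minimal sketch. `ToyTracker`, `ToyTrackerReference`, `ToyRequestComplete`, and `UseCountAfterCallback` are hypothetical names; the real RequestComplete() also deletes the step request, which is omitted here.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for RequestTracker.
struct ToyTracker {
  int counter = 0;
};

// Mirrors the shape described above: the callback payload is a
// heap-allocated shared_ptr, not a raw tracker pointer.
using ToyTrackerReference = std::shared_ptr<ToyTracker>;

// C-style release callback: userp points at a heap ToyTrackerReference.
inline void
ToyRequestComplete(void* userp)
{
  assert(userp != nullptr);
  // Adopt the payload so the wrapper is deleted when the callback returns.
  std::unique_ptr<ToyTrackerReference> ref(
      static_cast<ToyTrackerReference*>(userp));
  --(*ref)->counter;
  // Destroying `ref` drops exactly one shared reference. The ToyTracker
  // itself stays alive while any other reference exists.
}

// Returns the owner's use_count after the callback has run.
inline long
UseCountAfterCallback()
{
  ToyTrackerReference owner = std::make_shared<ToyTracker>();
  owner->counter = 1;
  void* userp = new ToyTrackerReference(owner);  // use_count is now 2
  ToyRequestComplete(userp);                     // wrapper deleted
  assert(owner->counter == 0);                   // tracker still valid
  return owner.use_count();
}
```

The owner's reference plays the role of EnsembleContext::request_tracker_ here: the tracker is destroyed only when that last reference is reset.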

irequest->SetReleaseCallback(RequestComplete, request_tracker_ref.get());

RETURN_IF_ERROR(irequest->PrepareForInference());
(*step)->tracker_ref_ = request_tracker_ref.release();
Contributor

What is happening here?

Are you releasing ownership of an allocation managed by request_tracker_ref and assigning ownership to tracker_ref_ at the same time?

Why is request_tracker_ref not a std::shared_ptr to begin with? Seems to me that it would make this whole dance less confusing and more obvious as to what is happening here.

Is there a reason to use std::unique_ptr?

Contributor Author

Good question. RequestTrackerReference is already a std::shared_ptr<RequestTracker>, so the std::unique_ptr in InitStep() does not own the RequestTracker itself. It only owns the heap allocation that we pass through the C void* userp release-callback API.

So there are two layers:

  • the inner shared_ptr<RequestTracker> keeps the actual RequestTracker alive
  • the outer heap allocation is just the callback payload that gives SetReleaseCallback() a stable address to store in userp

The flow is:

  1. InitStep() allocates a heap RequestTrackerReference from request_tracker_, which adds one shared reference to the tracker.
  2. Before PrepareForInference() succeeds, that heap allocation is owned by the local unique_ptr, so an early return cleans it up automatically.
  3. After PrepareForInference() succeeds, ownership of the heap allocation is transferred to step->callback_tracker_ref_.
  4. On the normal path, RequestComplete() adopts that raw pointer into a local unique_ptr, which deletes the heap allocation when the callback exits.
  5. On the failure path where the request never reaches the callback-owned release path, ScheduleSteps() deletes step->callback_tracker_ref_ directly.
  6. Later, FinishEnsemble() drops the context's own shared_ptr with request_tracker_.reset().

So this is not a double-free of RequestTracker: the single-owner heap callback payload is destroyed in exactly one place, and destroying that payload only drops one shared_ptr<RequestTracker> reference.
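
The six-step flow above can be sketched end to end as follows. Every name prefixed with `Demo` is hypothetical, `prepare_ok` is an assumed stand-in for the result of PrepareForInference(), and the sketch leaves out the counters and the actual inference request.

```cpp
#include <cassert>
#include <memory>

struct DemoTracker {};
using DemoTrackerRef = std::shared_ptr<DemoTracker>;

struct DemoStep {
  DemoTrackerRef* callback_tracker_ref = nullptr;  // raw, single owner
};

// Steps 1-3: allocate the payload, guard it, hand it off on success.
inline bool
DemoInitStep(const DemoTrackerRef& tracker, bool prepare_ok, DemoStep& step)
{
  std::unique_ptr<DemoTrackerRef> payload(new DemoTrackerRef(tracker));
  if (!prepare_ok) {
    return false;  // the unique_ptr deletes the payload on early return
  }
  step.callback_tracker_ref = payload.release();
  return true;
}

// Step 4: the normal path adopts and deletes the payload in the callback.
inline void
DemoRequestComplete(void* userp)
{
  assert(userp != nullptr);
  std::unique_ptr<DemoTrackerRef> adopted(static_cast<DemoTrackerRef*>(userp));
}

// Step 5: the failure path deletes a payload the callback will never see.
inline void
DemoDropUndispatched(DemoStep& step)
{
  delete step.callback_tracker_ref;
  step.callback_tracker_ref = nullptr;
}

// Walks all three outcomes and checks the reference count at each stage.
inline bool
DemoLifetime()
{
  DemoTrackerRef context_ref = std::make_shared<DemoTracker>();
  std::weak_ptr<DemoTracker> watch = context_ref;
  DemoStep failed_prepare, normal, failed_dispatch;

  bool ok = !DemoInitStep(context_ref, false, failed_prepare);
  ok = ok && context_ref.use_count() == 1;  // early return cleaned up

  ok = ok && DemoInitStep(context_ref, true, normal);
  ok = ok && DemoInitStep(context_ref, true, failed_dispatch);
  ok = ok && context_ref.use_count() == 3;  // context plus two payloads

  DemoRequestComplete(normal.callback_tracker_ref);
  DemoDropUndispatched(failed_dispatch);
  ok = ok && context_ref.use_count() == 1;  // only the context remains

  context_ref.reset();  // step 6: the context drops its own reference
  return ok && watch.expired();
}
```

Each payload is deleted on exactly one of the three paths, so the tracker is destroyed once, when the context drops the last reference.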

Contributor

@whoisj whoisj left a comment

LGTM

@mattwittwer mattwittwer merged commit feede8e into main Apr 14, 2026
1 check passed
mattwittwer added a commit that referenced this pull request Apr 15, 2026
* Fix RequestTracker counter mismatch in ScheduleSteps with parallel failures

(cherry picked from commit feede8e)
nightflight-dk pushed a commit to nightflight-dk/core that referenced this pull request Apr 22, 2026
* Fix RequestTracker counter mismatch in ScheduleSteps with parallel failures