fix: Fix segfault during shutdown when using Triton metrics in Python backend #429
aleksn7 wants to merge 7 commits into triton-inference-server:main
Conversation
@aleksn7 thank you for your contribution. Have you completed the required steps outlined in our Contributing section? Specifically, have you submitted a signed CLA? If not, please do so.
// objects. If the scoped_interpreter is destroyed before the stub object,
// this process will no longer hold the GIL lock and destruction of the stub
// will result in segfault.
stub->DestroyPythonObjects();
this makes sense, but please move the call to Stub::DestroyInstance().
non_graceful_exit = true;

// Destroy stub and exit.
stub->DestroyPythonObjects();
this makes sense, but please move this call to Stub::DestroyInstance().
Pull request overview
Fixes a shutdown-time segmentation fault in the Python backend when Triton custom metrics are enabled by ensuring Python-bound objects are released in a safe order relative to the Stub singleton teardown.
Changes:
- Added Stub::DestroyPythonObjects() to explicitly drop Python-owned objects before Stub destruction.
- Updated Stub::DestroyInstance() to call DestroyPythonObjects() prior to resetting the singleton.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/pb_stub.h | Declares the new DestroyPythonObjects() API and documents intended destructor ordering. |
| src/pb_stub.cc | Calls DestroyPythonObjects() during singleton destruction and implements the cleanup routine. |
@@ -1088,6 +1088,7 @@ Stub::GetOrCreateInstance()
void
Stub::DestroyInstance()
{
Stub::DestroyInstance() unconditionally dereferences stub_instance. If DestroyInstance() is called before GetOrCreateInstance() (or called twice), this will crash. Add a null check (e.g., early-return if !stub_instance) before calling DestroyPythonObjects() / reset().
Suggested change:
- {
+ {
+ if (!stub_instance) {
+ return;
+ }
py::gil_scoped_acquire acquire;
py::object async_event_loop_local(std::move(async_event_loop_));
py::object background_futures_local(std::move(background_futures_));
py::object model_instance_local(std::move(model_instance_));
DestroyPythonObjects() only clears async_event_loop_, background_futures_, and model_instance_, but Stub also owns other py::object members (deserialize_bytes_, serialize_bytes_). If those remain non-empty, they will be decref'd later during Stub destruction (potentially after py::scoped_interpreter teardown / without the GIL), which can still segfault. Consider moving/clearing all py::object members here (and ideally reuse this helper from ~Stub() to keep the cleanup logic in one place).
Suggested change:
- py::object model_instance_local(std::move(model_instance_));
+ py::object model_instance_local(std::move(model_instance_));
+ py::object deserialize_bytes_local(std::move(deserialize_bytes_));
+ py::object serialize_bytes_local(std::move(serialize_bytes_));
@whoisj What do you think about this? Should we follow Copilot's suggestion here?
if the other fields are not a problem, then it doesn't matter, honestly.
/// Cleans up Python objects and must be called before the destructor.
/// This prevents problems that occur when Python object destructors
/// call Stub::GetOrCreate.
The comment mentions Stub::GetOrCreate, but the singleton accessor is Stub::GetOrCreateInstance(). Update the comment to match the actual API to avoid confusion when debugging shutdown/destructor-order issues.
Suggested change:
- /// call Stub::GetOrCreate.
+ /// call Stub::GetOrCreateInstance().
This is still blocking:
I'm working on it
Problem
When the Triton Inference Server is shut down while a Python backend model is using Triton metrics, a segmentation fault occurs. This happens because Metric::Clear attempts to access the Stub singleton while the Stub destructor is executing.
Stack trace:
Solution
Added a new DestroyPythonObjects() method to the Stub class that explicitly releases all Python objects before the Stub destructor is invoked. This ensures proper destruction order:
1. DestroyPythonObjects() is called first, releasing all Python-bound objects (including MetricFamily instances)
2. Stub::~Stub() is called afterward, when no Python objects depend on it
This approach guarantees that MetricFamily and other Python objects are fully destroyed while Stub is still valid and accessible.
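The ordering described above can be illustrated with a minimal sketch in plain C++, without pybind11. All names here (Owner, Handle, DestroyOwnedObjects, g_instance, and the two counters) are hypothetical stand-ins and not part of the Python backend: Owner plays the role of Stub, and Handle plays the role of a Python-bound object whose destructor looks up the singleton. The sketch relies on the standard behavior of std::unique_ptr::reset(), which clears the stored pointer before deleting the old object, so members destroyed from inside the owner's destructor no longer see the singleton.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Python-bound object (such as a metric) whose
// destructor calls back into the singleton.
struct Handle {
  ~Handle();
};

// Hypothetical stand-in for Stub.
class Owner {
 public:
  void Add() { handles_.push_back(std::make_unique<Handle>()); }

  // Phase 1: move the members into a local so their destructors run at the
  // end of this function, while the singleton pointer is still set.
  void DestroyOwnedObjects() {
    std::vector<std::unique_ptr<Handle>> local(std::move(handles_));
    handles_.clear();  // leave the moved-from member in a known empty state
  }

 private:
  std::vector<std::unique_ptr<Handle>> handles_;
};

static std::unique_ptr<Owner> g_instance;  // plays the role of the singleton
static int g_seen_live_owner = 0;          // destructor saw a valid singleton
static int g_seen_null_owner = 0;          // destructor saw a cleared singleton

Handle::~Handle() {
  if (g_instance) {
    ++g_seen_live_owner;
  } else {
    ++g_seen_null_owner;
  }
}

// Phase 2 happens in reset(). The null check makes repeated calls harmless.
void DestroyInstance(bool two_phase) {
  if (!g_instance) {
    return;
  }
  if (two_phase) {
    g_instance->DestroyOwnedObjects();
  }
  g_instance.reset();
}
```

Calling DestroyInstance(true) after adding a handle increments only g_seen_live_owner, while DestroyInstance(false) increments g_seen_null_owner, because the handle is then destroyed from inside the owner's destructor. In the real backend the objects are py::object members, and the cleanup shown in this PR additionally acquires the GIL before releasing them.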
To reproduce
Run Triton server with the custom metrics example model, then shut it down by sending SIGINT to the main server process:
kill -2 <main triton server process pid>