fix: Fix segfault during shutdown when using Triton metrics in Python backend #429

Open
aleksn7 wants to merge 7 commits into triton-inference-server:main from aleksn7:main

Conversation

@aleksn7 aleksn7 commented Feb 10, 2026

Problem
When the Triton Inference Server shuts down with the Python backend while Triton metrics are in use, a segmentation fault occurs. This happens because Metric::Clear attempts to access the Stub singleton while the Stub destructor is already running.

Stack trace:

#0  triton::backend::python::Metric::SaveToSharedMemory()
#1  triton::backend::python::Metric::Clear()
#2  triton::backend::python::MetricFamily::~MetricFamily()
#3  std::_Sp_counted_base<>::_M_release()
#4  pybind11::class_<MetricFamily>::dealloc()
#5  pybind11_object_dealloc()
#6  libpython3.12.so.1.0
#7  libpython3.12.so.1.0
#8  triton::backend::python::Stub::~Stub()
#9  main()

Solution
Added a new DestroyPythonObjects() method to the Stub class that explicitly releases all Python objects before the Stub destructor is invoked. This ensures the proper destruction order:

1. DestroyPythonObjects() is called first, releasing all Python-bound objects (including MetricFamily instances).
2. Stub::~Stub() is called afterward, when no Python objects depend on it.

This approach guarantees that MetricFamily and other Python objects are fully destroyed while Stub is still valid and accessible.
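
The ordering argument above can be illustrated with a minimal, self-contained sketch. It does not use the real Stub, MetricFamily, or pybind11 types: FakeStub, FakeMetricFamily, g_stub_alive, EventLog, and Shutdown are hypothetical names invented for this illustration, and the "alive" flag only approximates the condition under which the real Metric::Clear() crashes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// All names below are hypothetical stand-ins, not the Python backend's types.

// Records the order of events so it can be inspected afterwards.
static std::vector<std::string>& EventLog() {
  static std::vector<std::string> log;
  return log;
}

// True while the stub may still be used by the objects it owns.
static bool g_stub_alive = false;

// Stands in for a Python-bound object whose destructor needs the stub.
struct FakeMetricFamily {
  ~FakeMetricFamily() {
    EventLog().push_back(g_stub_alive ? "metric released while stub alive"
                                      : "metric released after stub teardown");
  }
};

class FakeStub {
 public:
  FakeStub() { g_stub_alive = true; }
  ~FakeStub() {
    g_stub_alive = false;
    EventLog().push_back("stub destructor");
    // Anything still held in owned_ is destroyed after this body returns.
  }

  void Own(std::shared_ptr<FakeMetricFamily> family) {
    owned_.push_back(std::move(family));
  }

  // Explicit cleanup step: must run before the destructor starts.
  void DestroyPythonObjects() {
    assert(g_stub_alive);
    // Move the members into a local so they are released at the end of
    // this scope, while the stub is still usable.
    std::vector<std::shared_ptr<FakeMetricFamily>> local(std::move(owned_));
  }

 private:
  std::vector<std::shared_ptr<FakeMetricFamily>> owned_;
};

// Runs one shutdown and returns the recorded event order.
std::vector<std::string> Shutdown(bool release_first) {
  EventLog().clear();
  auto stub = std::make_unique<FakeStub>();
  stub->Own(std::make_shared<FakeMetricFamily>());
  if (release_first) {
    stub->DestroyPythonObjects();
  }
  stub.reset();
  return EventLog();
}
```

Calling Shutdown(false) records the stub destructor before the metric release, which mirrors the failing order in the stack trace; Shutdown(true) records the metric release first, which is the order this PR aims for.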

To reproduce
Run Triton server with the custom metrics example model, then shut it down by sending SIGINT: kill -2 <main triton server process pid>

Contributor

whoisj commented Apr 10, 2026

@aleksn7 thank you for your contribution. Have you completed the required steps outlined in our Contributing section?

Specifically, have you submitted a signed CLA? If not, please do so.

@whoisj whoisj requested a review from yinggeh April 10, 2026 18:27
@yinggeh yinggeh requested review from pskiran1 and whoisj April 10, 2026 18:37
@yinggeh yinggeh added the bug Something isn't working label Apr 10, 2026
Comment thread src/pb_stub.cc Outdated
// objects. If the scoped_interpreter is destroyed before the stub object,
// this process will no longer hold the GIL lock and destruction of the stub
// will result in segfault.
stub->DestroyPythonObjects();
Contributor

This makes sense, but please move the call into Stub::DestroyInstance().

Author

@aleksn7 aleksn7 Apr 13, 2026

Done ✅

Comment thread src/pb_stub.cc Outdated
non_graceful_exit = true;

// Destroy stub and exit.
stub->DestroyPythonObjects();
Contributor

This makes sense, but please move this call into Stub::DestroyInstance().

Comment thread src/pb_stub.cc
Copilot AI left a comment

Pull request overview

Fixes a shutdown-time segmentation fault in the Python backend when Triton custom metrics are enabled by ensuring Python-bound objects are released in a safe order relative to the Stub singleton teardown.

Changes:

  • Added Stub::DestroyPythonObjects() to explicitly drop Python-owned objects before Stub destruction.
  • Updated Stub::DestroyInstance() to call DestroyPythonObjects() prior to resetting the singleton.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/pb_stub.h | Declares the new DestroyPythonObjects() API and documents intended destructor ordering. |
| src/pb_stub.cc | Calls DestroyPythonObjects() during singleton destruction and implements the cleanup routine. |

Comment thread src/pb_stub.cc
@@ -1088,6 +1088,7 @@ Stub::GetOrCreateInstance()
void
Stub::DestroyInstance()
{
Copilot AI Apr 13, 2026

Stub::DestroyInstance() unconditionally dereferences stub_instance. If DestroyInstance() is called before GetOrCreateInstance() (or called twice), this will crash. Add a null check (e.g., early-return if !stub_instance) before calling DestroyPythonObjects() / reset().

Suggested change
- {
+ {
+   if (!stub_instance) {
+     return;
+   }

Author

Done ✅
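
The null check suggested in this thread can be sketched in isolation as follows. This is a hedged illustration, not the code from src/pb_stub.cc: DemoStub, Slot, Counter, and CleanupCalls are hypothetical names, and the bool return value of DestroyInstance() is an assumption added only so the behaviour can be checked (the hunk above shows the real function returning void).

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the Stub singleton; not the backend's real class.
class DemoStub {
 public:
  static DemoStub* GetOrCreateInstance() {
    if (!Slot()) {
      Slot().reset(new DemoStub());
    }
    return Slot().get();
  }

  // Returns true if an instance was destroyed, false if there was nothing
  // to destroy. Safe to call before creation and safe to call twice.
  static bool DestroyInstance() {
    if (!Slot()) {
      return false;  // Early return: never dereference an empty singleton.
    }
    Slot()->DestroyPythonObjects();  // Release dependents first.
    Slot().reset();                  // Then destroy the singleton itself.
    assert(!Slot());
    return true;
  }

  // Number of times the cleanup step has run (for illustration only).
  static int CleanupCalls() { return Counter(); }

 private:
  DemoStub() = default;

  void DestroyPythonObjects() { ++Counter(); }

  static std::unique_ptr<DemoStub>& Slot() {
    static std::unique_ptr<DemoStub> slot;
    return slot;
  }

  static int& Counter() {
    static int count = 0;
    return count;
  }
};
```

With the guard in place, a second DestroyInstance() call becomes a no-op instead of dereferencing an empty pointer, and the cleanup step runs exactly once per created instance.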

Comment thread src/pb_stub.cc
py::gil_scoped_acquire acquire;
py::object async_event_loop_local(std::move(async_event_loop_));
py::object background_futures_local(std::move(background_futures_));
py::object model_instance_local(std::move(model_instance_));
Copilot AI Apr 13, 2026

DestroyPythonObjects() only clears async_event_loop_, background_futures_, and model_instance_, but Stub also owns other py::object members (deserialize_bytes_, serialize_bytes_). If those remain non-empty, they will be decref'd later during Stub destruction (potentially after py::scoped_interpreter teardown / without the GIL), which can still segfault. Consider moving/clearing all py::object members here (and ideally reuse this helper from ~Stub() to keep the cleanup logic in one place).

Suggested change
- py::object model_instance_local(std::move(model_instance_));
+ py::object model_instance_local(std::move(model_instance_));
+ py::object deserialize_bytes_local(std::move(deserialize_bytes_));
+ py::object serialize_bytes_local(std::move(serialize_bytes_));

Author

@aleksn7 aleksn7 Apr 13, 2026

@whoisj What do you think about this? Should we follow Copilot's suggestion here?

Contributor

If the other fields are not causing a problem, then honestly it doesn't matter.
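
For readers weighing the question in this thread, the move-into-a-local idiom can be sketched without pybind11. Everything here is a hypothetical stand-in: Handle approximates py::object, ScopedLock approximates py::gil_scoped_acquire, and Owner, g_lock_held, and ReleaseAndDestroy are illustration-only names. The sketch shows only where each member ends up being released; whether a late release is actually harmful for the real members is the open question discussed above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stands in for "the current thread holds the GIL".
static bool g_lock_held = false;

// Hypothetical stand-in for py::gil_scoped_acquire.
struct ScopedLock {
  ScopedLock() { g_lock_held = true; }
  ~ScopedLock() { g_lock_held = false; }
};

// Hypothetical move-only stand-in for py::object. On destruction it logs
// whether the lock was held at the moment it was released.
class Handle {
 public:
  Handle() = default;
  Handle(std::string name, std::vector<std::string>* log)
      : name_(std::move(name)), log_(log) {}
  Handle(Handle&& other) noexcept
      : name_(std::move(other.name_)), log_(other.log_) {
    other.log_ = nullptr;  // The moved-from handle becomes empty.
  }
  ~Handle() {
    if (log_ != nullptr) {
      log_->push_back(name_ + (g_lock_held ? ":locked" : ":unlocked"));
    }
  }
  bool empty() const { return log_ == nullptr; }

 private:
  std::string name_;
  std::vector<std::string>* log_ = nullptr;
};

struct Owner {
  Handle model_instance;
  Handle serialize_bytes;

  // Clears only one member, as in the first version of the cleanup.
  void ReleaseSome() {
    ScopedLock lock;
    Handle model_instance_local(std::move(model_instance));
  }  // The local is destroyed first, then the lock is released.

  // Clears every member while the lock is held.
  void ReleaseAll() {
    ScopedLock lock;
    Handle model_instance_local(std::move(model_instance));
    Handle serialize_bytes_local(std::move(serialize_bytes));
    assert(model_instance.empty() && serialize_bytes.empty());
  }
};

// Returns the release log for one owner lifetime.
std::vector<std::string> ReleaseAndDestroy(bool release_all) {
  std::vector<std::string> log;
  {
    Owner owner{Handle("model_instance", &log),
                Handle("serialize_bytes", &log)};
    if (release_all) {
      owner.ReleaseAll();
    } else {
      owner.ReleaseSome();
    }
  }  // Members that were not moved out are released here, without the lock.
  return log;
}
```

In this sketch, ReleaseAndDestroy(false) leaves serialize_bytes to be released by the owner's destructor without the lock, while ReleaseAndDestroy(true) releases both members under the lock.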

Comment thread src/pb_stub.h Outdated

/// Cleans up Python objects and must be called before the destructor.
/// This prevents problems that occur when Python object destructors
/// call Stub::GetOrCreate.
Copilot AI Apr 13, 2026

The comment mentions Stub::GetOrCreate, but the singleton accessor is Stub::GetOrCreateInstance(). Update the comment to match the actual API to avoid confusion when debugging shutdown/destructor-order issues.

Suggested change
- /// call Stub::GetOrCreate.
+ /// call Stub::GetOrCreateInstance().

Author

Done ✅

Contributor

@whoisj whoisj left a comment

LGTM. @yinggeh can you add your review as well?

Contributor

whoisj commented Apr 14, 2026

This is still blocking:

> @aleksn7 thank you for your contribution. Have you completed the required steps outlined in our Contributing section?
>
> Specifically, have you submitted a signed CLA? If not, please do so.

Author

aleksn7 commented Apr 16, 2026

> This is still blocking:
>
> > @aleksn7 thank you for your contribution. Have you completed the required steps outlined in our Contributing section?
> >
> > Specifically, have you submitted a signed CLA? If not, please do so.

I'm working on it.

Labels

bug Something isn't working

4 participants