
Gracefully shut down services on test failure and context cancellation #63

Draft
samschlegel wants to merge 4 commits into dzbarsky:master from samschlegel:fix-graceful-shutdown-on-failure


@samschlegel

This branch fixes two related issues with service shutdown behavior in svcinit:

Fix 1: Gracefully shut down services on test failure (partially fixes #60)

Problem: When a test failed (or a service exited uncleanly) in one-shot mode, svcinit called os.Exit(1) or log.Fatal() immediately, skipping StopAll(). Services never received SIGTERM and were orphaned or killed abruptly.

Fix (cmd/svcinit/main.go):

  • Added defer r.StopAll() so services are always stopped on exit
  • Replaced immediate os.Exit(1) / log.Fatal with exit code tracking, allowing the normal shutdown path to run
  • On service failure: cancels the test, waits for it to exit, then falls through to StopAll() and exits with code 1

Test: New integration test (tests/graceful_shutdown_on_failure/) that runs a service with a failing test and verifies the service receives SIGTERM via a marker file.

Fix 2: Prevent context cancellation from sending SIGKILL (fixes #58)

Problem: exec.CommandContext automatically sends SIGKILL when the context is cancelled. This bypasses the configured shutdown_signal (SIGTERM) and shutdown_timeout, killing services instantly before StopAll() can perform orderly shutdown. A previous attempt (PR #59) tried using cmd.Cancel to send SIGTERM, but that broke reverse-dependency ordering by signaling all services simultaneously.

Fix (runner/runner.go):

  • Set cmd.Cancel to a no-op — context cancellation no longer kills processes at all
  • StopAll() (called via defer or explicitly) handles shutdown in the correct reverse dependency order using each service's configured signal and timeout
  • Set cmd.WaitDelay to the configured shutdown_timeout as a safety net so cmd.Wait() doesn't block indefinitely

samschlegel and others added 4 commits March 12, 2026 20:53
Services are not gracefully shut down when a test fails because
os.Exit(1) is called before StopAll() in svcinit's main loop.
This test verifies that services receive SIGTERM even when the
inner test exits with a non-zero code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously, os.Exit(1) and log.Fatal() were called before StopAll()
when a test failed or a service crashed. This meant services were
never sent SIGTERM and couldn't perform cleanup. Now we track the
exit code, fall through to StopAll() and metrics reporting, then
exit with the appropriate code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Set cmd.Cancel to a no-op so that exec.CommandContext does not
automatically SIGKILL child processes when the context is cancelled.
Instead, StopAll() handles orderly shutdown in reverse dependency
order using each service's configured shutdown_signal and
shutdown_timeout.

Also set WaitDelay to the configured shutdown timeout so cmd.Wait()
doesn't block indefinitely if the process outlives the signal.

Fixes dzbarsky#58

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@samschlegel samschlegel marked this pull request as draft March 12, 2026 21:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35fad36540

Comment thread runner/runner.go
Comment on lines +274 to +275
if shutdownTimeout, err := time.ParseDuration(s.ShutdownTimeout); err == nil && shutdownTimeout > waitDelay {
	waitDelay = shutdownTimeout

P1: Keep WaitDelay short to avoid false shutdown timeouts

Setting cmd.WaitDelay to shutdown_timeout means cmd.Wait() can now block for the full shutdown window when a service exits but leaves stdout/stderr pipes open (the exact orphaned-child case the previous 50ms delay handled). In that scenario ServiceInstance.StopWithSignal waits on s.isDone() and times out at the same duration, so it can log a spurious graceful-shutdown failure and send SIGKILL (or error when enforce_forceful_shutdown is enabled) even though the main process already handled SIGTERM. This is a behavioral regression from the prior short wait-delay logic.

@samschlegel
Author

Looks like StopAll isn't completely idempotent as it logs a bunch of stuff. Will clean that up since I think keeping the defer around is a good idea
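
One common way to make a shutdown method safe to call both explicitly and via defer is `sync.Once`. The sketch below is only an illustration of that pattern, not the cleanup that was eventually made; the `Runner` type and its `stops` counter are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Runner is a hypothetical stand-in for the real service runner.
type Runner struct {
	stopOnce sync.Once
	stops    int // how many times the shutdown body actually ran
}

// StopAll runs the shutdown body at most once; later calls are silent
// no-ops, so a deferred call after an explicit one logs nothing extra.
func (r *Runner) StopAll() {
	r.stopOnce.Do(func() {
		r.stops++
		fmt.Println("stopping services")
	})
}

func main() {
	r := &Runner{}
	defer r.StopAll() // safety net
	r.StopAll()       // explicit call on the normal path
	fmt.Println("shutdown body ran", r.stops, "time(s)")
}
```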


Successfully merging this pull request may close these issues.

  • Killing a run service doesn't cause svcinit to fail when running an itest_service
  • exec.CommandContext sends SIGKILL before graceful shutdown can run