Gracefully shut down services on test failure and context cancellation#63
Gracefully shut down services on test failure and context cancellation#63samschlegel wants to merge 4 commits intodzbarsky:masterfrom
Conversation
Services are not gracefully shut down when a test fails because os.Exit(1) is called before StopAll() in svcinit's main loop. This test verifies that services receive SIGTERM even when the inner test exits with a non-zero code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, os.Exit(1) and log.Fatal() were called before StopAll() when a test failed or a service crashed. This meant services were never sent SIGTERM and couldn't perform cleanup. Now we track the exit code, fall through to StopAll() and metrics reporting, then exit with the appropriate code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set cmd.Cancel to a no-op so that exec.CommandContext does not automatically SIGKILL child processes when the context is cancelled. Instead, StopAll() handles orderly shutdown in reverse dependency order using each service's configured shutdown_signal and shutdown_timeout. Also set WaitDelay to the configured shutdown timeout so cmd.Wait() doesn't block indefinitely if the process outlives the signal. Fixes dzbarsky#58 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35fad36540
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if shutdownTimeout, err := time.ParseDuration(s.ShutdownTimeout); err == nil && shutdownTimeout > waitDelay { | ||
| waitDelay = shutdownTimeout |
There was a problem hiding this comment.
Keep WaitDelay short to avoid false shutdown timeouts
Setting cmd.WaitDelay to shutdown_timeout means cmd.Wait() can now block for the full shutdown window when a service exits but leaves stdout/stderr pipes open (the exact orphaned-child case the previous 50ms delay handled). In that scenario ServiceInstance.StopWithSignal waits on s.isDone() and times out at the same duration, so it can log a spurious graceful-shutdown failure and send SIGKILL (or error when enforce_forceful_shutdown is enabled) even though the main process already handled SIGTERM. This is a behavioral regression from the prior short wait-delay logic.
Useful? React with 👍 / 👎.
|
Looks like StopAll isn't completely idempotent as it logs a bunch of stuff. Will clean that up since I think keeping the defer around is a good idea |
This branch fixes two related issues with service shutdown behavior in svcinit:
Fix 1: Gracefully shut down services on test failure (fixes #60 partially)
Problem: When a test failed (or a service exited uncleanly) in one-shot mode, svcinit called
os.Exit(1)orlog.Fatal()immediately, skippingStopAll(). Services never received SIGTERM and were orphaned or killed abruptly.Fix (
cmd/svcinit/main.go):defer r.StopAll()so services are always stopped on exitos.Exit(1)/log.Fatalwith exit code tracking, allowing the normal shutdown path to runStopAll()and exits with code 1Test: New integration test (
tests/graceful_shutdown_on_failure/) that runs a service with a failing test and verifies the service receives SIGTERM via a marker file.Fix 2: Prevent context cancellation from sending SIGKILL (fixes #58)
Problem:
exec.CommandContextautomatically sends SIGKILL when the context is cancelled. This bypasses the configuredshutdown_signal(SIGTERM) andshutdown_timeout, killing services instantly beforeStopAll()can perform orderly shutdown. A previous attempt (PR #59) tried usingcmd.Cancelto send SIGTERM, but that broke reverse-dependency ordering by signaling all services simultaneously.Fix (
runner/runner.go):cmd.Cancelto a no-op — context cancellation no longer kills processes at allStopAll()(called viadeferor explicitly) handles shutdown in the correct reverse dependency order using each service's configured signal and timeoutcmd.WaitDelayto the configuredshutdown_timeoutas a safety net socmd.Wait()doesn't block indefinitely