
Backend service_manager.py graceful shutdown times out in Podman/Docker, causing zombie SIGKILL failures #3728

@yashisthebatman

Description:
When shutting Augur down via docker-compose down (especially noticeable in the "End-to-end test (Podman)" CI pipeline), the augur-1 backend container frequently fails to exit within the default 10-second grace period. Docker/Podman then sends SIGKILL to the container, producing daemon errors and test suite failures like:
Error response from daemon: given PID did not die within timeout

Root Cause:
The AugurServiceManager.shutdown_signal_handler() in augur/application/service_manager.py processes the shutdown sequence serially and launches slow blocking subprocesses:

  1. Sequential Waits: It sends a .terminate() to the Gunicorn server and waits up to 5 seconds. Then it loops through the Celery workers, terminating and waiting up to 3 seconds for each. Then it terminates Celery Beat and waits up to 3 seconds.
  2. Redundant CLI Subprocesses: After the processes stop, clear_redis_caches() runs a celery purge -f command through a fresh subprocess.call. Booting that CLI environment takes ~3 seconds on its own, and the call is fully redundant because the method directly below it (clear_rabbitmq_messages) already invokes the native celery_app.control.purge().
  3. Hanging cURL Command: Inside clear_all_message_queues(), a raw curl command hits the RabbitMQ management API. It is hardcoded to http://localhost:15672 instead of using the parsed RabbitMQ hostname, so in a Docker environment the connection often hangs until curl reaches its very long default timeout.

With Docker's 10-second SIGTERM grace period already ticking, this stack of blocking waits (5 s + 3 s + 3 s + 3 s + a hanging curl) practically guarantees the timeout and a zombie SIGKILL.

Proposed Solution:
The shutdown_signal_handler needs to be refactored for concurrency:

  1. Broadcast Terminate: Send .terminate() to the Server, Beat, and all Celery Workers simultaneously in one pass, before calling .wait() on any of them.
  2. Remove CLI Purge: Delete the celery -A ... purge -f subprocess call in clear_redis_caches(), relying instead on the much faster native celery_app.control.purge() function.
  3. Fix cURL Command: Update the cleanup curl command to use parsed.hostname (e.g. rabbitmq) instead of localhost, and inject strict --connect-timeout 2 --max-time 2 flags so it fails fast if the RabbitMQ container has already stopped.
