
Backend service_manager.py graceful shutdown times out in Podman/Docker, causing zombie SIGKILL failures #3728

@yashisthebatman

Description:
When shutting Augur down via docker-compose down (especially noticeable in the "End-to-end test (Podman)" CI pipeline), the augur-1 backend container frequently fails to exit within the default 10-second grace period. Docker/Podman then sends SIGKILL to the container, producing daemon errors and test suite failures like:
Error response from daemon: given PID did not die within timeout

Root Cause:
The AugurServiceManager.shutdown_signal_handler() in augur/application/service_manager.py processes the shutdown sequence serially and launches slow blocking subprocesses:

  1. Sequential Waits: It sends a .terminate() to the Gunicorn server and waits up to 5 seconds. Then it loops through the Celery workers, terminating and waiting up to 3 seconds for each. Then it terminates Celery Beat and waits up to 3 seconds.
  2. Redundant CLI Subprocesses: After the processes stop, clear_redis_caches() runs a celery purge -f command through a fresh subprocess.call. Booting that CLI environment takes ~3 seconds on its own, and the call is fully redundant because the method directly below it (clear_rabbitmq_messages) already invokes the native celery_app.control.purge().
  3. Hanging cURL Command: Inside clear_all_message_queues(), a raw curl command hits the RabbitMQ management API. It is hardcoded to http://localhost:15672 instead of using the parsed RabbitMQ hostname, so in a Docker environment the connection often hangs until curl reaches its very long default timeout.

With Docker's 10-second SIGTERM grace period already ticking, this stack of blocking waits (5 s + 3 s + 3 s + 3 s + a hanging curl) practically guarantees the timeout and a zombie SIGKILL.

Proposed Solution:
The shutdown_signal_handler needs to be refactored for concurrency:

  1. Broadcast Terminate: Send .terminate() to the Server, Beat, and all Celery Workers simultaneously in one pass, before calling .wait() on any of them.
  2. Remove CLI Purge: Delete the celery -A ... purge -f subprocess call in clear_redis_caches(), relying instead on the much faster native celery_app.control.purge() function.
  3. Fix cURL Command: Update the cleanup curl command to use parsed.hostname (e.g. rabbitmq) instead of localhost, and inject strict --connect-timeout 2 --max-time 2 flags so it fails fast if the RabbitMQ container has already stopped.
