Description:
When shutting down Augur via `docker-compose down` (especially noticeable in the "End-to-end test (Podman)" CI pipeline), the `augur-1` backend container frequently fails to stop within the default 10-second timeout. Docker/Podman then forcefully SIGKILLs the container, causing daemon errors and test-suite failures like:

```
Error response from daemon: given PID did not die within timeout
```
Root Cause:
The `AugurServiceManager.shutdown_signal_handler()` method in `augur/application/service_manager.py` processes the shutdown sequence serially and launches slow, blocking subprocesses:
- **Sequential Waits:** It sends a `.terminate()` to the Gunicorn server and waits up to 5 seconds. It then loops through the Celery workers, terminating and waiting up to 3 seconds for each, and finally terminates Celery Beat and waits up to 3 more seconds.
- **Redundant CLI Subprocesses:** After the processes stop, `clear_redis_caches()` executes a `celery purge -f` command via a new Python `subprocess.call`. Booting this CLI environment takes ~3 seconds by itself, and it is fully redundant because the method directly below it (`clear_rabbitmq_messages`) natively runs `celery_app.control.purge()`.
- **Hanging cURL Command:** Inside `clear_all_message_queues()`, a raw `curl` command is used to hit the RabbitMQ management API. However, it is hardcoded to `http://localhost:15672` (instead of using the parsed RabbitMQ hostname). In a Docker environment, `localhost` often hangs indefinitely until `curl` reaches its very long default timeout.
Because Docker's 10-second SIGTERM grace period is ticking throughout, this sequence of blocking waits (5s + 3s per worker + 3s + ~3s CLI boot + a hanging curl) practically guarantees the container outlives the timeout and gets killed.
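The serial pattern described above can be sketched roughly as follows (function and parameter names are hypothetical; the real handler lives in `augur/application/service_manager.py`):

```python
import subprocess

def serial_shutdown(server, workers, beat):
    """Sketch of the problematic serial shutdown: each wait() blocks
    before the next process is even asked to terminate, so the
    per-process timeouts add up instead of overlapping."""
    server.terminate()
    server.wait(timeout=5)          # up to 5 s for the Gunicorn server

    for worker in workers:
        worker.terminate()
        worker.wait(timeout=3)      # up to 3 s *per* Celery worker

    beat.terminate()
    beat.wait(timeout=3)            # up to 3 s for Celery Beat

    # clear_redis_caches() then shells out to a `celery ... purge -f`
    # subprocess, paying ~3 s of CLI startup on top of the waits above.
```

In the worst case the waits alone consume 5s + 3s×N + 3s before any queue cleanup even begins.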
Proposed Solution:
The shutdown_signal_handler needs to be refactored for concurrency:
- **Broadcast Terminate:** Send `.terminate()` to the Server, Beat, and all Celery Workers simultaneously in one pass, before calling `.wait()` on any of them.
- **Remove CLI Purge:** Delete the `celery -A ... purge -f` subprocess call in `clear_redis_caches()`, relying instead on the much faster native `celery_app.control.purge()` function.
- **Fix cURL Command:** Update the cleanup curl command to use `parsed.hostname` (e.g. `rabbitmq`) instead of `localhost`, and inject strict `--connect-timeout 2 --max-time 2` flags so it fails fast if the RabbitMQ container has already stopped.
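A minimal sketch of the broadcast-then-wait refactor (names are assumed, not the actual Augur API; the 5-second grace value is illustrative):

```python
import subprocess
import time

def concurrent_shutdown(server, workers, beat, grace=5):
    """Sketch of the proposed fix: signal every process first, then wait
    on all of them within one shared grace period, so the individual
    waits overlap instead of adding up."""
    procs = [server, beat, *workers]

    # Phase 1: broadcast SIGTERM in one pass, without waiting.
    for p in procs:
        p.terminate()

    # Phase 2: wait for all processes against a single shared deadline.
    deadline = time.monotonic() + grace
    for p in procs:
        remaining = max(deadline - time.monotonic(), 0)
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            p.kill()    # escalate only for processes that are truly stuck
```

With this shape, total shutdown time is bounded by the single `grace` window rather than the sum of per-process timeouts. The cleanup curl call would likewise become fail-fast, e.g. `curl --connect-timeout 2 --max-time 2 "http://{parsed.hostname}:15672/..."`, so it errors out quickly if RabbitMQ is already gone.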