Conversation
Test Results: 162 files ±0, 162 suites ±0, 11m 34s ⏱️ +44s. For more details on these failures, see this check. Results for commit fb7bdf1. ± Comparison against base commit bd6ecdf. ♻️ This comment has been updated with latest results.
```java
logger.trace("GET /api/v1/jobClusters/{}/healthcheck called", clusterName);

return parameterMap(params -> {
    String jobIdsParam = params.get("job-ids");
```
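The snippet above shows the endpoint path and the optional `job-ids` query parameter. As a rough client-side illustration (the class and method names here are hypothetical, only the path and parameter come from the snippet), building the request URL might look like:

```java
// Hypothetical client-side sketch: constructing the healthcheck URL with
// the optional job-ids parameter. Only the path segments and the param
// name are taken from the PR; everything else is illustrative.
public class HealthcheckUrl {
    static String buildUrl(String base, String clusterName, String jobIdsCsv) {
        String url = base + "/api/v1/jobClusters/" + clusterName + "/healthcheck";
        // Omitting job-ids triggers the default check of all active jobs.
        return (jobIdsCsv == null) ? url : url + "?job-ids=" + jobIdsCsv;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8100", "my-cluster", null));
        System.out.println(buildUrl("http://localhost:8100", "my-cluster", "job-1,job-2"));
    }
}
```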
What about having users provide the region, and we determine the health for all jobs in that region?
Here it seems we expect users to provide a list of job ids. Do we have a use case where we only need to inspect a subset of jobs within a given region?
By default, it checks all active jobs. The `job-ids` param is optional; if it's null, the default check runs. But if you do specify it, only the given job ids are checked.
This probably won't be used right away; it's more of a forward-looking change. In the future, when we have a fully managed Mantis CD system, this will let us check just the jobs that were recently deployed. It's also an additional knob that gives some of the power users more flexibility.
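The optional-param behavior described above could be sketched like this (a minimal illustration; `parseJobIds` and the comma-separated format are assumptions, not from the PR):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of interpreting the optional "job-ids" query param. The method
// name and CSV format are hypothetical; only the null-means-default
// behavior comes from the discussion above.
public class JobIdsParam {
    // Null or empty param -> empty list, meaning "check all active jobs".
    static List<String> parseJobIds(String jobIdsParam) {
        if (jobIdsParam == null || jobIdsParam.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(jobIdsParam.split(","));
    }

    public static void main(String[] args) {
        System.out.println(parseJobIds(null));          // default: all active jobs
        System.out.println(parseJobIds("job-1,job-2")); // only the listed jobs
    }
}
```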
```java
if (worker.getState() != MantisJobState.Started) {
    failedWorkers.add(new FailedWorker(
        worker.getWorkerIndex(),
        worker.getWorkerNumber(),
        worker.getState().name()));
```
this is categorizing pending workers as failed?
IMO the health logic is something like a composite of: all workers for each index have started; the failed/resubmitted worker count in the past X duration is below a threshold; alert pack, etc.
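As a rough sketch of this composite idea (all names and the threshold mechanics here are hypothetical, not from the PR):

```java
import java.util.List;

// Illustrative composite health check: healthy only if every worker has
// started AND recent resubmits stay under a threshold. Signatures are
// invented for the sketch; the real logic would live in the master.
public class CompositeHealth {
    static boolean isHealthy(List<String> workerStates, int recentResubmits, int resubmitThreshold) {
        boolean allStarted = workerStates.stream().allMatch("Started"::equals);
        return allStarted && recentResubmits < resubmitThreshold;
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(List.of("Started", "Started"), 1, 5));  // healthy
        System.out.println(isHealthy(List.of("Started", "Accepted"), 0, 5)); // a worker is still pending
    }
}
```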
Yeah, maybe I can improve the naming here, but the behavior is intentional: the idea is that after a deploy you can call this endpoint with some number of retries until the workers have started. So if some workers are stuck in accepted/pending, the healthcheck will fail.
Alternatively, since we're using SSE, we could just keep the connection open and send an event when the workers start or a timeout is reached. But I think the way it's implemented is easier to build on top of.
Edit: Didn't see your second comment until refreshing. Agreed on the composite; I'm just planning to add the rest of that within nfmantis.
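The deploy-time retry pattern described above could look roughly like this from the caller's side (a sketch; `waitUntilHealthy` and the supplier stand-in for the HTTP call are assumptions):

```java
import java.util.function.BooleanSupplier;

// Sketch of polling the healthcheck endpoint with retries after a deploy.
// checkOnce is a stand-in for the actual HTTP call to the endpoint.
public class RetryHealthcheck {
    static boolean waitUntilHealthy(BooleanSupplier checkOnce, int maxRetries, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (checkOnce.getAsBoolean()) {
                return true; // all workers started
            }
            Thread.sleep(sleepMillis); // workers still accepted/pending; retry
        }
        return false; // still unhealthy after all retries
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated endpoint that becomes healthy on the third call.
        BooleanSupplier fake = () -> ++calls[0] >= 3;
        System.out.println(waitUntilHealthy(fake, 5, 1)); // true
    }
}
```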
maybe change the name to something like "notReady"?
Good call, went with `unready`.
Context
Adds a health check endpoint that verifies that all workers for every active job in a given job cluster have the status of `started`. Originally, the plan was to also include an interface that would allow plugging in custom health checks (experimented with this here). However, I've pivoted away from this because it added a lot of complexity and constrained callers. For customizing health checking, users should proxy this Akka route and add additional health checks afterwards.
Checklist
- `./gradlew build` compiles code correctly
- `./gradlew test` passes all tests