Skip to content

Healthcheck Endpoint#843

Open
james-lubin wants to merge 14 commits intomasterfrom
healthcheck-endpoint-no-interface
Open

Healthcheck Endpoint#843
james-lubin wants to merge 14 commits intomasterfrom
healthcheck-endpoint-no-interface

Conversation

@james-lubin
Copy link
Copy Markdown
Collaborator

@james-lubin james-lubin commented Apr 10, 2026

Context

Adds a health check endpoint that verifies that all workers for every active job in a given job cluster have the status of started. Originally, the plan was to also include an interface that would allow plugging in custom health checks (experimented with this here). However, I've pivoted away from this because it added a lot of complexity, and constrained callers. For customizing health checking, user should proxy this Akka route, and then add additional health checks afterwards.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 10, 2026

Test Results

162 files  ±0  162 suites  ±0   11m 34s ⏱️ +44s
781 tests +4  767 ✅ +2  11 💤 ±0  3 ❌ +2 
782 runs  +4  768 ✅ +2  11 💤 ±0  3 ❌ +2 

For more details on these failures, see this check.

Results for commit fb7bdf1. ± Comparison against base commit bd6ecdf.

♻️ This comment has been updated with latest results.

@james-lubin james-lubin changed the title Healthcheck endpoint no interface Healthcheck Endpoint Apr 10, 2026
@james-lubin james-lubin marked this pull request as ready for review April 10, 2026 18:52
logger.trace("GET /api/v1/jobClusters/{}/healthcheck called", clusterName);

return parameterMap(params -> {
String jobIdsParam = params.get("job-ids");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about we have users to provide the region, and we will determine the health for all jobs in that region?

here it seems we are expecting users to provide list of job ids, do we have use case where we only need inspect subset of jobs within a given region?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, it checks any active jobs. The job-ids param is optional, it just does the default check if it's null. But if you do specify it, it only checks the job-ids that were inserted.

This probably won't be used right away, it's more of a forward looking change. In the future, where we have a fully managed Mantis CD system, this will allow us to check just jobs that were recently deployed. It's also an additional knob that gives some of the power users more flexibility.

Comment on lines +2662 to +2666
if (worker.getState() != MantisJobState.Started) {
failedWorkers.add(new FailedWorker(
worker.getWorkerIndex(),
worker.getWorkerNumber(),
worker.getState().name()));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is categorizing pending workers as failed?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo the health logic is something like a composite of (all workers for each index have started, failed/resubmit workers count in past x duration is below a threshold; alert pack etc).

Copy link
Copy Markdown
Collaborator Author

@james-lubin james-lubin Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe I can improve the naming here, but the behavior is intentional, the idea is that after a deploy you can call this endpoint with n amount of retries until the workers have started. So if there are some stuck in accepted/pending the healthcheck will fail.

Alternatively, since we're using SSE, we could just keep the connection open and send an event when the workers start or a timeout is reached. But I think the way its implemented is easier to build on top of.

Edit: Didn't see your second comment until refreshing. Agreed on the composite, I'm just planning to add the rest of that within nfmantis.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change the name to something like "notReady"?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, went with unready

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants