Healthcheck Endpoint by james-lubin · Pull Request #843 · Netflix/mantis

james-lubin · 2026-04-10T16:48:07Z

Context

Adds a health check endpoint that verifies that all workers for every active job in a given job cluster have the status Started. Originally, the plan was to also include an interface for plugging in custom health checks (experimented with this here). However, I've pivoted away from this because it added a lot of complexity and constrained callers. To customize health checking, users should proxy this Akka route and then add additional health checks afterwards.
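To illustrate the "proxy the route, then add checks" approach described above, here is a minimal sketch of how a caller might layer its own checks on top of the base result. It is not code from this PR: `CompositeHealthcheck`, `ExtraCheck`, and the failure strings are hypothetical names, and the `baseCheck` predicate merely stands in for a proxied call to the healthcheck route.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names exist in Mantis.
final class CompositeHealthcheck {

    /** A caller-defined check: returns a failure reason, or empty if the check passes. */
    interface ExtraCheck {
        Optional<String> run(String clusterName);
    }

    /**
     * Runs the base check first (a stand-in for the proxied healthcheck route),
     * then each extra check. An empty result list means the cluster is healthy.
     */
    static List<String> check(String clusterName,
                              Predicate<String> baseCheck,
                              List<ExtraCheck> extraChecks) {
        List<String> failures = new ArrayList<>();
        if (!baseCheck.test(clusterName)) {
            failures.add("workers not started");
            return failures; // skip the extras when the base check already fails
        }
        for (ExtraCheck extra : extraChecks) {
            extra.run(clusterName).ifPresent(failures::add);
        }
        return failures;
    }
}
```

Keeping the extra checks outside the control plane means each caller can decide what "healthy" means beyond worker state, which seems to be the flexibility the extension interface was originally meant to provide.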

Checklist

./gradlew build compiles code correctly
Added new tests where applicable
./gradlew test passes all tests
Extended README or added javadocs where applicable

github-actions · 2026-04-10T16:52:08Z

Test Results

162 files ±0 | 162 suites ±0 | 11m 34s ⏱️ +44s
781 tests +4 | 767 ✅ +2 | 11 💤 ±0 | 3 ❌ +2
782 runs +4 | 768 ✅ +2 | 11 💤 ±0 | 3 ❌ +2

For more details on these failures, see this check.

Results for commit fb7bdf1. ± Comparison against base commit bd6ecdf.

♻️ This comment has been updated with latest results.

hellolittlej · 2026-04-10T22:17:08Z

...ontrol-plane-server/src/main/java/io/mantisrx/master/api/akka/route/v1/JobClustersRoute.java

+        logger.trace("GET /api/v1/jobClusters/{}/healthcheck called", clusterName);
+
+        return parameterMap(params -> {
+            String jobIdsParam = params.get("job-ids");


What if we have users provide the region, and we determine the health of all jobs in that region?

Here it seems we are expecting users to provide a list of job IDs. Do we have a use case where we only need to inspect a subset of jobs within a given region?

By default, it checks all active jobs. The job-ids param is optional; if it's null, the default check runs. If you do specify it, only the job IDs that were passed in are checked.

This probably won't be used right away; it's more of a forward-looking change. In the future, when we have a fully managed Mantis CD system, this will allow us to check just the jobs that were recently deployed. It's also an additional knob that gives power users more flexibility.
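The optional-filter behavior described in this reply can be sketched roughly as follows. This is an illustration only, not the code in the PR: `JobIdFilter` and its methods are hypothetical, the comma-separated format for `job-ids` is an assumption (the thread does not state the format), and dropping requested IDs that are not active is likewise an assumed choice.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the optional job-ids filter.
final class JobIdFilter {

    /**
     * Parses the optional job-ids value. A null or blank value yields an empty set,
     * which callers treat as "no filter". The comma-separated format is an assumption.
     */
    static Set<String> parseJobIds(String jobIdsParam) {
        if (jobIdsParam == null || jobIdsParam.trim().isEmpty()) {
            return new LinkedHashSet<>();
        }
        return Arrays.stream(jobIdsParam.split(","))
                .map(String::trim)
                .filter(id -> !id.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Returns the jobs to check: every active job by default, else only the requested ones. */
    static List<String> selectJobs(List<String> activeJobIds, String jobIdsParam) {
        Set<String> requested = parseJobIds(jobIdsParam);
        if (requested.isEmpty()) {
            return activeJobIds;
        }
        // Assumption: requested IDs that are not active are silently ignored.
        return activeJobIds.stream()
                .filter(requested::contains)
                .collect(Collectors.toList());
    }
}
```

A deployment system could then pass only the job IDs it just launched, which matches the forward-looking use case mentioned above.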

Andyz26 · 2026-04-10T22:33:22Z

...mantis-control-plane-server/src/main/java/io/mantisrx/master/jobcluster/JobClusterActor.java

+                            if (worker.getState() != MantisJobState.Started) {
+                                failedWorkers.add(new FailedWorker(
+                                        worker.getWorkerIndex(),
+                                        worker.getWorkerNumber(),
+                                        worker.getState().name()));


Is this categorizing pending workers as failed?

IMO the health logic is something like a composite of: all workers for each index have started; the failed/resubmitted worker count over the past x duration is below a threshold; alert pack; etc.

Yeah, maybe I can improve the naming here, but the behavior is intentional. The idea is that after a deploy you can call this endpoint with n retries until the workers have started. So if some workers are stuck in accepted/pending, the healthcheck will fail.

Alternatively, since we're using SSE, we could just keep the connection open and send an event when the workers start or a timeout is reached. But I think the way it's implemented is easier to build on top of.

Edit: Didn't see your second comment until refreshing. Agreed on the composite; I'm just planning to add the rest of that within nfmantis.

maybe change the name to something like "notReady"?

good call, went with unready
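The retry pattern discussed in this thread (call the endpoint up to n times after a deploy until no worker is unready) might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: `UnreadyWorkerPoller`, `Worker`, and `State` are hypothetical stand-ins for the real Mantis types, and the `fetchWorkers` supplier stands in for a call to the healthcheck endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: stand-in types, not the Mantis classes from the diff.
final class UnreadyWorkerPoller {

    /** Stand-in for the worker states; only Started counts as ready here. */
    enum State { Accepted, Launched, StartInitiated, Started, Failed }

    static final class Worker {
        final int index;
        final int number;
        final State state;

        Worker(int index, int number, State state) {
            this.index = index;
            this.number = number;
            this.state = state;
        }
    }

    /** Every worker not in Started is reported as unready, including pending ones. */
    static List<Worker> unready(List<Worker> workers) {
        List<Worker> result = new ArrayList<>();
        for (Worker worker : workers) {
            if (worker.state != State.Started) {
                result.add(worker);
            }
        }
        return result;
    }

    /**
     * Polls up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns true as soon as no worker is unready, false if attempts run out
     * or the thread is interrupted.
     */
    static boolean awaitReady(Supplier<List<Worker>> fetchWorkers, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (unready(fetchWorkers.get()).isEmpty()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }
}
```

Since one of the commits in this PR is titled "Have unhealthy workers return 2xx status code", a real client would presumably need to inspect the response body for unready workers rather than rely on the HTTP status alone.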

james-lubin added 7 commits April 8, 2026 18:24

Add healthcheck endpoint

836f2f0

Refactor

0c66a84

Additional tweaks

5f240aa

Add tests

153ca32

No health check extension interface

8aefc73

Tweaks

828ff61

Refactor

2feb895

james-lubin added 3 commits April 10, 2026 12:53

More tweaks

2d1b5da

Modify tests

5abb3c4

Have unhealthy workers return 2xx status code

01240fd

james-lubin changed the title ~~Healthcheck endpoint no interface~~ Healthcheck Endpoint Apr 10, 2026

james-lubin marked this pull request as ready for review April 10, 2026 18:52

james-lubin requested review from Andyz26, calvin681, dtrager02, fdc-ntflx and hellolittlej as code owners April 10, 2026 18:52

hellolittlej reviewed Apr 10, 2026


Andyz26 reviewed Apr 10, 2026


james-lubin added 4 commits April 13, 2026 19:01

Rename workerFailure -> workersUnready

cb81244

UnreadyWorkers

b27700c

Update JsonProperty argument

ecab323

Force preStart to finish by calling getRegisteredTaskExecutors

fb7bdf1

james-lubin wants to merge 14 commits into master from healthcheck-endpoint-no-interface

3 participants
