
Adds support for multiple managers running distributed fate #6168

Open
keith-turner wants to merge 15 commits into apache:main from keith-turner:dist-fate3

Conversation

@keith-turner
Contributor

Lays the foundation for multiple managers with the following changes. The best place to start looking at these changes is the Manager.run() method, which sets everything up and ties it all together.

  • Each manager process now acquires two zookeeper locks: a primary lock and an assistant lock. Only one manager process can obtain the primary lock, and when it does it assumes the role of primary manager. All manager processes acquire an assistant lock, which is similar to a tserver or compactor lock. The assistant lock advertises the manager process as being available to other Accumulo processes to handle assistant manager operations.
  • Manager processes have a single thrift server, and the thrift services hosted on that server are categorized into primary manager and assistant manager services. When an assistant manager receives an RPC for a primary manager thrift service, it will not execute the request; it will either throw an error or ignore it.
  • The primary manager process delegates manager responsibility via RPCs to assistant managers.
  • Any management responsibility not delegated runs on the primary manager.
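
The primary-only service check described above can be sketched as a small guard. This is a hypothetical illustration: `PrimaryServiceGuard` and `NotPrimaryException` are invented names for this sketch, not Accumulo's actual API.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class PrimaryServiceGuard {

  /** Thrown when a primary-only RPC arrives at an assistant manager (illustrative name). */
  public static class NotPrimaryException extends RuntimeException {
    public NotPrimaryException(String msg) {
      super(msg);
    }
  }

  private final BooleanSupplier holdsPrimaryLock;

  public PrimaryServiceGuard(BooleanSupplier holdsPrimaryLock) {
    this.holdsPrimaryLock = holdsPrimaryLock;
  }

  /** Runs the RPC handler only if this process currently holds the primary lock. */
  public <T> T callPrimaryOnly(Supplier<T> handler) {
    if (!holdsPrimaryLock.getAsBoolean()) {
      // Assistant managers refuse primary-only requests rather than executing them.
      throw new NotPrimaryException("this manager is an assistant; retry against the primary");
    }
    return handler.get();
  }
}
```

In this sketch an assistant-only process would construct the guard with a supplier that returns false, so every primary-only call fails fast instead of partially executing.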

Using the changes above, fate is now distributed across all manager processes. In the future these changes should make it easy to delegate other responsibilities to assistant managers. The following is an outline of the fate changes.

  • New FateWorker class. This runs in every manager and handles requests from the primary manager to adjust what range of the fate table it is currently responsible for. FateWorker implements a new thrift service used to assign it ranges.
  • New FateManager class that is run by the primary manager and is responsible for partitioning fate processing across all assistant managers. As manager processes come and go this will repartition the fate table evenly across all available managers. The FateManager communicates with FateWorkers via thrift.
  • Some new RPCs for best effort notifications. Before these changes there were in-memory notification systems that made the manager more responsive; for example, they would allow a fate operation to signal the Tablet Group Watcher to take action sooner. FateWorkerEnv sends these notifications to the primary manager over a new RPC. It does not matter if they are lost; things will still eventually happen.
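
The even repartitioning that the FateManager performs can be illustrated with a minimal sketch. The bucket-based keyspace, the `Partition` record, and all names here are assumptions for illustration, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class FatePartitioner {

  /** A contiguous slice of the fate keyspace (illustrative type). */
  public record Partition(int startInclusive, int endExclusive) {}

  /** Evenly splits [0, buckets) into one contiguous range per available manager. */
  public static List<Partition> partition(int buckets, int managers) {
    List<Partition> parts = new ArrayList<>();
    int base = buckets / managers;
    int extra = buckets % managers; // the first `extra` managers get one more bucket
    int start = 0;
    for (int i = 0; i < managers; i++) {
      int size = base + (i < extra ? 1 : 0);
      parts.add(new Partition(start, start + size));
      start += size;
    }
    return parts;
  }
}
```

Under this sketch, when a manager process comes or goes the primary would recompute the partitions for the new manager count and push the updated ranges to each FateWorker over thrift.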

Other than fate, the primary manager process does everything the current manager does. This change pulls from #3262 and #6139.


Co-authored-by: Dave Marion <dlmarion@apache.org>
@keith-turner
Contributor Author

Worked up a design document at https://cwiki.apache.org/confluence/display/ACCUMULO/Multiple+Managers+Foundation

Pulled most of the text from that into the commit message.

Contributor

@dlmarion dlmarion left a comment

First pass just looking at multiple manager changes, not looking at fate changes.


metricsInfo.init(MetricsInfo.serviceTags(getContext().getInstanceName(), getApplicationName(),
getAdvertiseAddress(), getResourceGroup()));

Contributor

Do we want to wait here for some minimum set of Managers before proceeding like we do for TabletServers? Wondering if it might reduce some churn in the FateManager at startup. I have the code for this already in #3262 in Manager at line 1057.

Contributor Author

That seems like a good change, could be done in a follow on PR

Contributor Author

Opened #6186

metricsInfo
.addMetricsProducers(fateWorker.getMetricsProducers().toArray(new MetricsProducer[0]));

metricsInfo.init(MetricsInfo.serviceTags(getContext().getInstanceName(), getApplicationName(),
Contributor

Something to address later, how can we add a tag to the metrics to denote whether this Manager is primary or not.

Contributor Author

In addition to that, we probably need some command line tools to show information about manager processes. Maybe that could be an update to the service status command. Would be nice to see which manager is primary and for the non primary ones what has been delegated to them. Could also show this information on the monitor.

@keith-turner
Contributor Author

keith-turner commented Mar 5, 2026

Did some testing on a single machine using uno. Found I can start multiple managers with the following command.

ACCUMULO_CLUSTER_ARG=5 accumulo-service manager start

Found one bug during this testing where the fate operation that commits a compaction was not generating a notification. Fixed that in 365b833. Other than that, the testing went well. Running a user compaction from the shell for a table with a single tablet takes 2 seconds; most of that time seems to be spent waiting for the compactor to pick up a job. For this simple test I saw the following in the logs.

  1. Seeded a fate operation to drive the user compaction. Saw in the logs it notified the remote manager assigned that range of the fate table.
  2. The fate operation started because of the notification and wrote to the metadata table that the tablet needs compaction. Saw in the logs it notified the TGW via an RPC to the primary manager.
  3. The TGW ran because of the notification, saw the metadata entry, and queued a compaction job.
  4. Compactor eventually ran the job and seeded a fate operation to commit it. Saw in the logs it notified the remote manager assigned that range of the fate table.
  5. The fate operation to commit ran because of the notification
  6. The fate operation driving the user compaction finished.
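
The fire-and-forget notifications seen in steps 1, 2, and 4 can be sketched as follows. `BestEffortNotifier` and its `Rpc` interface are illustrative names for this sketch, not the real code; the key property is that a lost notification only delays work, since the receiver also discovers it by periodic scanning.

```java
public class BestEffortNotifier {

  /** Minimal stand-in for a thrift notification RPC (illustrative interface). */
  public interface Rpc {
    void send(String message) throws Exception;
  }

  private final Rpc rpc;

  public BestEffortNotifier(Rpc rpc) {
    this.rpc = rpc;
  }

  /** Attempts the notification once; failures are dropped, never retried. */
  public boolean notify(String message) {
    try {
      rpc.send(message);
      return true;
    } catch (Exception e) {
      // Safe to ignore: the periodic scan will eventually pick up the work anyway.
      return false;
    }
  }
}
```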

return get(Constants.ZCOMPACTORS, resourceGroupPredicate, address, withLock);
}

public Set<ServiceLockPath> getManagerAssistants(ResourceGroupPredicate resourceGroupPredicate,
Contributor

In the method above the name is 'AssistantManager' and here it is ManagerAssistants. I think we should be consistent in the naming.

Is it the case that clients will only always connect to the primary manager, so that is why it's called Manager?

Contributor Author

Improved the naming consistency in 0556afa

Is it the case that clients will only always connect to the primary manager, so that is why it's called Manager?

Currently clients only talk to the primary manager. In a follow-on I would like to create a new thrift service, maybe called AssistantManagerClientService, that is not wrapped with HighlyAvailableService, and move some things from the current manager client service to it. Could also explore making it so the fate thrift client service could be called on any manager. That would need client-side changes similar to what was done in #3262.

Speaking of naming, I am going to rename HighlyAvailableService to PrimaryManagerService or something like that. Its naming does not match its purpose very well.

Contributor Author

Opened #6183

return get(Constants.ZCOMPACTORS, resourceGroupPredicate, address, withLock);
}

public Set<ServiceLockPath> getManagerAssistants(ResourceGroupPredicate resourceGroupPredicate,
Contributor

You should be able to remove the resource group predicate here and use the DEFAULT_RG_ONLY predicate. Or, maybe this method can be removed in favor of the one below.

Contributor Author

Changed in 0556afa


// Start two more managers
getCluster().exec(Manager.class);
getCluster().exec(Manager.class);
Contributor

Should this wait until all are up and reporting in ZK before continuing?

Contributor Author

That's a good change, made it in 0556afa

return Math.max(1, deadline - System.currentTimeMillis());
}

private void getManagerLock() throws KeeperException, InterruptedException {
Contributor

Suggested change
private void getManagerLock() throws KeeperException, InterruptedException {
private void getManagerAssistantLock() throws KeeperException, InterruptedException {

Contributor Author

Changed in 0556afa

var fateCleaner = new FateCleaner<>(store, Duration.ofHours(8), this::getSteadyTime);
ThreadPools.watchCriticalScheduledTask(context.getScheduledExecutor()
.scheduleWithFixedDelay(fateCleaner::ageOff, 10, 4 * 60, MINUTES));
managerMetrics.configureFateMetrics(getConfiguration(), this);
Contributor

Might be useful to comment in the run method that this method must be called before registering the managerMetrics producer for maintenance purposes.

Contributor Author

Made a different change w/ the same goal in d74a34c

Contributor Author

Bumped into a few problems w/ the organization of the manager metrics code when making some of these changes. Opened #6181 to improve it. The comment about setup order is one of those problems and that comes from the setup code being spread all over the place.

Contributor

I also created #6182 because I noticed something metrics related while reviewing this.
