Multiple riak_repl2_fscoordinator_sup workers per cluster when adding nodes [JIRA: RIAK-2675]

A customer found multiple fullsync coordinator workers running simultaneously on each of two clusters. As a result, multiple fullsync schedules run concurrently; the fullsync operations themselves may or may not overlap, but every coordinator is active and has its own timer.

This state is reproducible as follows:
1. Set up two clusters, A & B.
2. Set up REPL and connect them (cluster manager 0.0.0.0:9080).
3. Set `fullsync_on_connect` to `true` (unclear whether this step is required).
4. Push continuous load onto cluster A.
5. Start fullsync with A as source and B as sink.
6. While fullsync is running, join one or more new nodes to A.
7. On every node, run `riak attach` and evaluate `supervisor:count_children(whereis(riak_repl2_fscoordinator_sup)).`
8. Observe that the worker count is greater than 0 on more than one node. In my test, both the original coordinator node and the newly joined node had workers.
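
The per-node check in step 7 can also be collected from a single attached shell. This is a minimal sketch, not part of the original report: it assumes that `nodes()` on the attached node returns the other members of the same cluster (it lists whichever nodes are connected over distributed Erlang), and the variable name `Counts` is only illustrative.

```erlang
%% Ask every connected node for the child counts of its local
%% riak_repl2_fscoordinator_sup. The supervisor is addressed by its
%% registered name, so the lookup happens on the remote node.
Counts = [{N, rpc:call(N, supervisor, count_children, [riak_repl2_fscoordinator_sup])}
          || N <- [node() | nodes()]].

%% Keep only the nodes reporting at least one worker. Nodes where the
%% call failed return {badrpc, _} and are skipped by the is_list/1 filter.
[{N, proplists:get_value(workers, C)}
 || {N, C} <- Counts, is_list(C), proplists:get_value(workers, C, 0) > 0].
```

If the second expression returns more than one node, the cluster is in the state described in this issue.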

The workaround for this issue is to manually kill all `riak_repl2_fscoordinator_sup` processes as follows:
1. Stop and disable fullsync.
2. Wait a few minutes.
3. On each node, run `riak attach` and evaluate `Pid = whereis(riak_repl2_fscoordinator_sup).` followed by `erlang:exit(Pid, kill).`
4. Wait a few minutes.
5. Enable and start fullsync.
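
Step 3 could also be driven from one attached shell rather than node by node. The following is only a sketch, assuming that `[node() | nodes()]` covers exactly the nodes of the cluster being repaired; the names `SupPid` and `Skipped` are illustrative. It performs the same `whereis`/`exit` pair as the manual step and should only be run once fullsync has been stopped and disabled.

```erlang
%% For each connected node, look up the local supervisor pid and kill it.
%% Nodes where the name is not registered (undefined) or the call fails
%% ({badrpc, _}) are reported as skipped and left alone.
[case rpc:call(N, erlang, whereis, [riak_repl2_fscoordinator_sup]) of
     SupPid when is_pid(SupPid) -> {N, killed, erlang:exit(SupPid, kill)};
     Skipped                    -> {N, skipped, Skipped}
 end || N <- [node() | nodes()]].
```

Each element of the result names the node and whether its supervisor was killed or skipped, which makes it easy to confirm that no node was missed.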

The symptoms of this issue are extremely slow fullsync operations, cluster overload / slowness, and fullsync activity in the logs when no fullsync ought to be running.

