Multiple riak_repl2_fscoordinator_sup workers per cluster when adding nodes [JIRA: RIAK-2675] #748

@nerophon

A customer finds multiple fullsync coordinator workers running simultaneously on each of two clusters. This causes multiple fullsync schedules to run concurrently; the actual fullsync operations may or may not overlap, but each coordinator is active and has its own timer.

This state is reproducible as follows:

  1. Set up two clusters, A & B.
  2. Set up REPL and connect them (cluster manager 0.0.0.0:9080).
  3. Set fullsync_on_connect to true (unclear whether this step is required).
  4. Push continuous load onto cluster A.
  5. Start fullsync with A as source and B as sink.
  6. While fullsync is running, join one or more new nodes to A.
  7. On every node, run riak attach and evaluate: supervisor:count_children(whereis(riak_repl2_fscoordinator_sup)).
  8. Observe that the worker count is greater than 0 on more than one node. In my test, this was the original coordinator node and the newly joined node.
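
To avoid attaching to every node in turn, the check in steps 7 and 8 can be run from a single riak attach session. This is a minimal sketch, not a tested procedure: it assumes that nodes() on the attached node lists the other cluster members, and the variable names (Nodes, Count, Counts, Active) are illustrative only.

```erlang
%% Paste into a riak attach shell on any one node.
%% Assumption: the other cluster members are visible through nodes().
Nodes = [node() | nodes()],
Count = fun(N) ->
    %% A supervisor can be addressed remotely as {RegisteredName, Node}.
    try supervisor:count_children({riak_repl2_fscoordinator_sup, N}) of
        Props -> {N, proplists:get_value(workers, Props)}
    catch
        %% The supervisor is not running, or the node is unreachable.
        exit:Reason -> {N, {error, Reason}}
    end
end,
Counts = [Count(N) || N <- Nodes],
%% Nodes that currently run at least one coordinator worker.
Active = [N || {N, W} <- Counts, is_integer(W), W > 0].
```

If Active contains more than one node, the cluster is in the state described above.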

The workaround for this issue is to manually kill all riak_repl2_fscoordinator_sup processes as follows:

  1. stop & disable fullsync
  2. wait a few minutes
  3. on each node, run riak attach and evaluate: Pid = whereis(riak_repl2_fscoordinator_sup). followed by erlang:exit(Pid, kill).
  4. wait a few minutes
  5. enable & start fullsync
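
Step 3 of the workaround can also be sketched as one loop run from a single riak attach session. Treat this as an untested sketch: it assumes that nodes() lists the other cluster members, and Kill and Results are illustrative names. It should only be run after fullsync has been stopped and disabled, as in steps 1 and 2.

```erlang
%% Paste into a riak attach shell on any one node, after fullsync is
%% stopped and disabled.
%% Assumption: the other cluster members are visible through nodes().
Kill = fun(N) ->
    %% Look up the supervisor's pid on node N.
    case rpc:call(N, erlang, whereis, [riak_repl2_fscoordinator_sup]) of
        Pid when is_pid(Pid) ->
            %% Same forced exit as in step 3 of the workaround.
            exit(Pid, kill),
            {N, killed};
        Other ->
            %% undefined (not registered) or {badrpc, Reason}.
            {N, Other}
    end
end,
Results = [Kill(N) || N <- [node() | nodes()]].
```

Results holds one tuple per node, so nodes where the supervisor was not found or could not be reached are visible before fullsync is re-enabled.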

The symptoms of this issue are extremely slow fullsync operations, cluster overload / slowness, and fullsync activity in the logs when no fullsync ought to be running.
