
Conversation

Collaborator

@ofekshenawa ofekshenawa commented Jun 30, 2025

This PR introduces support for COMMAND-based request_policy and response_policy routing of Redis commands when using the OSS cluster client.

Key Additions:

  • Command Policy Loader: parses and caches COMMAND metadata with routing/aggregation tips on first use.
  • Routing Engine Enhancements: implements support for all request policies: default (keyless), default (hashslot), all_shards, all_nodes, multi_shard, and special.
  • Response Aggregator: combines multi-shard replies based on response_policy (all_succeeded, one_succeeded, agg_sum, special, etc.), including custom handling for special commands like FT.CURSOR.
  • Raw Command Support: policies are enforced on Client.Do(ctx, args...); a usage sketch follows.
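
For illustration, here is roughly what that looks like from the user's side; a minimal sketch (the addresses are placeholders, and DBSIZE is just a convenient example of a command documented with request_policy:all_shards and response_policy:agg_sum):

```go
package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs: []string{":7000", ":7001", ":7002"}, // placeholder addresses
    })
    // DBSIZE is tipped request_policy:all_shards / response_policy:agg_sum,
    // so the client fans out to every shard and sums the replies.
    n, err := rdb.Do(ctx, "dbsize").Int64()
    fmt.Println(n, err)
}
```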

ofekshenawa and others added 7 commits May 14, 2025 21:35
feat(routing): add internal request/response policy enums
* feat: load the policy table in cluster client

* Remove comments
…or osscluster.go (#6)

* centralize cluster command routing in osscluster_router.go and refactor osscluster.go

* enable ci on all branches

* Add debug prints

* Add debug prints

* FIX: deal with nil policy

* FIX: fixing clusterClient process

* chore(osscluster): simplify switch case

* wip(command): ai generated clone method for commands

* feat: implement response aggregator for Redis cluster commands

* feat: implement response aggregator for Redis cluster commands

* fix: solve concurrency errors

* fix: solve concurrency errors

* return MaxRedirects settings

* remove locks from getCommandPolicy

* Handle MOVED errors more robustly, remove cluster reloading at executions, ensure better routing

* Fix: supports Process hook test

* Fix: remove response aggregation for single shard commands

* Add more performant type conversion for Cmd type

* Add router logic into processPipeline

---------

Co-authored-by: Nedyalko Dyakov <[email protected]>
@ofekshenawa ofekshenawa changed the title from "Load balance search commands to shards" to "Implement Request and Response Policy Based Routing in Cluster Mode" on Jun 30, 2025
@ofekshenawa ofekshenawa marked this pull request as ready for review July 6, 2025 12:54
}
if result.cmd != nil && result.err == nil {
    // For MGET, extract individual values from the array result
    if strings.ToLower(cmd.Name()) == "mget" {
Contributor

Do we actually need this special case?

}

// getCommandPolicy retrieves the routing policy for a command
func (c *ClusterClient) getCommandPolicy(ctx context.Context, cmd Cmder) *routing.CommandPolicy {
Contributor

@htemelski-redis htemelski-redis Sep 25, 2025

It seems like this will introduce a big overhead for each command execution.
We should fetch all policies during the connection handshake.
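
A rough sketch of that direction, assuming a handshake-time loader (every name below is hypothetical):

```go
// Hypothetical: populate the policy table once, not per command execution.
type policyTable struct {
    once     sync.Once
    policies map[string]*routing.CommandPolicy
}

func (t *policyTable) lookup(ctx context.Context, c *ClusterClient, name string) *routing.CommandPolicy {
    t.once.Do(func() {
        // loadCommandPolicies is a hypothetical helper that issues COMMAND
        // during the handshake and parses the request/response policy tips.
        t.policies = loadCommandPolicies(ctx, c)
    })
    return t.policies[name]
}
```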

Contributor

Note: for the first stage we should use a hard-coded policy manager that can be extended in the future to take the COMMAND command output into account.

Member

@htemelski-redis 💡 Consider implementing a PolicyResolverConfig type that users can override via the client options. This config should map module__command_name to metadata (policies, key requirements, etc.).

Set hardcoded defaults in the client options, but allow users to override policies per command as needed.
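
Sketched out, that suggestion might look like this (every name below is hypothetical; nothing here exists in the PR yet):

```go
// CommandMeta bundles per-command routing metadata.
type CommandMeta struct {
    Request  routing.RequestPolicy
    Response routing.ResponsePolicy
    FirstKey int // 0 for keyless commands
}

// PolicyResolverConfig maps "module__command_name" (e.g. "ft__search",
// "core__mget") to metadata; user overrides shadow the hardcoded defaults.
type PolicyResolverConfig struct {
    Defaults  map[string]CommandMeta
    Overrides map[string]CommandMeta
}

func (c *PolicyResolverConfig) Resolve(name string) (CommandMeta, bool) {
    if m, ok := c.Overrides[name]; ok {
        return m, true
    }
    m, ok := c.Defaults[name]
    return m, ok
}
```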

@htemelski-redis htemelski-redis force-pushed the load-balance-search-commands-to-shards branch from 6e3b627 to 1b2eaa6 on October 8, 2025 08:05
Member

@ndyakov ndyakov left a comment

Submitting partial review for the aggregators.

Comment on lines +446 to +449
// For MGET without policy, use keyed aggregator
if cmdName == "mget" {
    return routing.NewDefaultAggregator(true)
}
Member

Since we are passing cmd.Name() into routing.NewResponseAggregator, this can be handled by it. If the policy is nil for mget, maybe NewResponseAggregator can accept a policy and check for nil as well.
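
In other words, something along these lines (signatures assumed, not the PR's actual API):

```go
// Hypothetical shape: the constructor owns the nil-policy special cases
// instead of the call site.
func NewResponseAggregator(policy *CommandPolicy, cmdName string) ResponseAggregator {
    if policy == nil {
        // Keyed aggregation for multi-key defaults such as MGET.
        return NewDefaultAggregator(cmdName == "mget")
    }
    return newAggregatorForPolicy(policy, cmdName) // hypothetical helper
}
```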

Comment on lines +68 to +79
a.mu.Lock()
defer a.mu.Unlock()

if err != nil && a.firstErr == nil {
    a.firstErr = err
    return nil
}
if err == nil && !a.hasResult {
    a.result = result
    a.hasResult = true
}
return nil
Member

A couple of questions here:

  1. Should we return only the first observed error?
  2. Why are we overwriting the result?
  3. Can't we just have an atomic boolean hasError?
  4. Same here: if we can have an atomic hasResult, we can drop the mutex.

The idea behind my questions: if we are going to return on the first error, we can do it with atomics and skip the CPU cycles spent on the mutex.

Contributor

For the all_succeeded policy, we either return one of the replies if there is no error, or one of the errors if there is at least one.
So:

  1. Yes, returning only the first error is sufficient.
  2. We are setting it only once.
  3/4. I feel that using atomics will overcomplicate the aggregators, plus there are some caveats to using them. I think we should try to maximize the compatibility of the library.
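
For reference, the atomics-based variant under discussion could look roughly like this (field layout assumed, not the PR's code; one of the caveats alluded to above is real: atomic.Value panics when storing nil, so nil replies would need wrapping):

```go
// Sketch only. Requires Go 1.19+ for atomic.Bool.
type allSucceededAtomic struct {
    hasErr   atomic.Bool
    hasRes   atomic.Bool
    firstErr atomic.Value // holds error
    result   atomic.Value // holds interface{}; nil replies need a wrapper
}

func (a *allSucceededAtomic) Add(result interface{}, err error) error {
    if err != nil {
        // Only the first error wins; CompareAndSwap replaces the mutex.
        if a.hasErr.CompareAndSwap(false, true) {
            a.firstErr.Store(err)
        }
        return nil
    }
    if a.hasRes.CompareAndSwap(false, true) && result != nil {
        a.result.Store(result)
    }
    return nil
}
```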

Comment on lines +105 to +118
func (a *OneSucceededAggregator) Add(result interface{}, err error) error {
    a.mu.Lock()
    defer a.mu.Unlock()

    if err != nil && a.firstErr == nil {
        a.firstErr = err
        return nil
    }
    if err == nil && !a.hasResult {
        a.result = result
        a.hasResult = true
    }
    return nil
}
Member

Same as with AllSucceededAggregator. Maybe we can use atomics here.

Contributor

Same as above

    return nil
}
if err == nil {
    a.sum += val
Member

Again, maybe we can use atomic.Int64

Contributor

-||-

}

// AggMinAggregator returns the minimum numeric value from all shards.
type AggMinAggregator struct {
Member

Looking at https://github.com/haraldrudell/parl, there are atomic min and atomic max implementations that we can also use.

Member

P.S. I suggest copying only the needed implementation, or using it as a reference to reimplement, rather than including the whole dependency. Of course, mention the creator in the code.

Contributor

-||-

    return nil, a.firstErr
}
if !a.hasResult {
    return nil, fmt.Errorf("redis: no valid results to aggregate for min operation")
Member

Can we extract such errors into a separate file?

Comment on lines +548 to +565
func (a *SpecialAggregator) Finish() (interface{}, error) {
    a.mu.Lock()
    defer a.mu.Unlock()

    if a.aggregatorFunc != nil {
        return a.aggregatorFunc(a.results, a.errors)
    }
    // Default behavior: return first non-error result or first error
    for i, err := range a.errors {
        if err == nil {
            return a.results[i], nil
        }
    }
    if len(a.errors) > 0 {
        return nil, a.errors[0]
    }
    return nil, nil
}
Member

I do think we should be able to control the priority here. I assume for some commands the errors will have higher priority; for others, the results.

Contributor

Wouldn't this be achieved using the aggregatorFunc?
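
For example, a command that wants errors to take priority could register a custom function through the RegisterSpecialAggregator hook shown in the next hunk (the command name here is illustrative):

```go
// Errors win over results for this command.
routing.RegisterSpecialAggregator("ft.cursor", func(results []interface{}, errs []error) (interface{}, error) {
    for _, err := range errs {
        if err != nil {
            return nil, err
        }
    }
    if len(results) > 0 {
        return results[0], nil
    }
    return nil, nil
})
```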

Comment on lines +567 to +588
// SetAggregatorFunc allows setting custom aggregation logic for special commands.
func (a *SpecialAggregator) SetAggregatorFunc(fn func([]interface{}, []error) (interface{}, error)) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.aggregatorFunc = fn
}

// SpecialAggregatorRegistry holds custom aggregation functions for specific commands.
var SpecialAggregatorRegistry = make(map[string]func([]interface{}, []error) (interface{}, error))

// RegisterSpecialAggregator registers a custom aggregation function for a command.
func RegisterSpecialAggregator(cmdName string, fn func([]interface{}, []error) (interface{}, error)) {
    SpecialAggregatorRegistry[cmdName] = fn
}

// NewSpecialAggregator creates a special aggregator with command-specific logic if available.
func NewSpecialAggregator(cmdName string) *SpecialAggregator {
    agg := &SpecialAggregator{}
    if fn, exists := SpecialAggregatorRegistry[cmdName]; exists {
        agg.SetAggregatorFunc(fn)
    }
    return agg
Member

SetAggregatorFunc is only used internally in this package; I assume it can be private, if needed at all. See the next comment.

Comment on lines +583 to +588
func NewSpecialAggregator(cmdName string) *SpecialAggregator {
agg := &SpecialAggregator{}
if fn, exists := SpecialAggregatorRegistry[cmdName]; exists {
agg.SetAggregatorFunc(fn)
}
return agg
Member

Suggested change
-func NewSpecialAggregator(cmdName string) *SpecialAggregator {
-    agg := &SpecialAggregator{}
-    if fn, exists := SpecialAggregatorRegistry[cmdName]; exists {
-        agg.SetAggregatorFunc(fn)
-    }
-    return agg
+func NewSpecialAggregator(cmdName string) *SpecialAggregator {
+    fn := SpecialAggregatorRegistry[cmdName]
+    return &SpecialAggregator{aggregatorFunc: fn}

I do think this should be doable, and it would remove the need for SetAggregatorFunc and, with it, the locking and unlocking of the mutex.

Member

@ndyakov ndyakov left a comment

Submitting another partial review.

}

func (p *CommandPolicy) CanBeUsedInPipeline() bool {
    return p.Request != ReqAllNodes && p.Request != ReqAllShards && p.Request != ReqMultiShard
Member

What about special? Can it be used in a pipeline?

Comment on lines +8 to +12
// ShardPicker chooses “one arbitrary shard” when the request_policy is
// ReqDefault and the command has no keys.
type ShardPicker interface {
    Next(total int) int // returns an index in [0,total)
}
Member

These are great. Can we implement a StaticShardPicker or StickyShardPicker that will always return the same shard? I do think this can be helpful for testing. This is not a blocker by any means.
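
A possible implementation against the quoted ShardPicker interface (the name comes from the suggestion above; everything else is assumed):

```go
// StickyShardPicker always returns the same shard index, which makes
// routing deterministic in tests.
type StickyShardPicker struct {
    Index int // assumed non-negative
}

func (p StickyShardPicker) Next(total int) int {
    if total <= 0 {
        return 0
    }
    return p.Index % total // clamps into [0, total)
}
```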

Comment on lines -879 to +1073
-    return strconv.ParseBool(cmd.val)
+    return strconv.ParseBool(cmd.Val())
Member

why was this change needed?

Comment on lines +4396 to +4414
if commandInfoTips != nil {
    if v, ok := commandInfoTips[requestPolicy]; ok {
        if p, err := routing.ParseRequestPolicy(v); err == nil {
            req = p
        }
    }
    if v, ok := commandInfoTips[responsePolicy]; ok {
        if p, err := routing.ParseResponsePolicy(v); err == nil {
            resp = p
        }
    }
}
tips := make(map[string]string, len(commandInfoTips))
for k, v := range commandInfoTips {
    if k == requestPolicy || k == responsePolicy {
        continue
    }
    tips[k] = v
}
Member

can't we do both of those in a single range over commandInfoTips?
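
The single-pass version being suggested, roughly (variable names taken from the quoted snippet; note that ranging over a nil map is a no-op in Go, so the outer nil check can go away too):

```go
tips := make(map[string]string, len(commandInfoTips))
for k, v := range commandInfoTips {
    switch k {
    case requestPolicy:
        if p, err := routing.ParseRequestPolicy(v); err == nil {
            req = p
        }
    case responsePolicy:
        if p, err := routing.ParseResponsePolicy(v); err == nil {
            resp = p
        }
    default:
        tips[k] = v
    }
}
```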

Comment on lines +6840 to +6841
// ExtractCommandValue extracts the value from a command result using the fast enum-based approach
func ExtractCommandValue(cmd interface{}) interface{} {
Member

I assume all cases (types) for which interface{ Val() interface{} } is used for extracting the value can be combined together?

    return nil
}

func (cmd *MonitorCmd) Clone() Cmder {
Member

let's move this above the ExtractCommandValue function

    return nil
}

func (cmd *IntPointerSliceCmd) Clone() Cmder {
Member

It's tricky here. Do we need to return the same pointer, or do we only want the value when cloning?

Comment on lines +1864 to +1868
// cmdInfo will fetch and cache the command policies after the first execution
func (c *ClusterClient) cmdInfo(ctx context.Context, name string) *CommandInfo {
    cmdsInfo, err := c.cmdsInfoCache.Get(ctx)
    // Use a separate context that won't be canceled to ensure command info lookup
    // doesn't fail due to original context cancellation
    cmdInfoCtx := context.Background()
Member

Most of the time the cmdInfo should be cached already. Why don't we just use c.context(ctx) to determine whether the original context (with its timeout) should be used, or a background context when c.opt.ContextTimeoutEnabled is false?

Member

@ndyakov ndyakov left a comment

Final part of initial review

Overview:

  • Let's use atomics when possible.
  • Left questions related to the node selection and setting of values.

Overall the design of the solution looks good; I would have to do an additional pass over the test files once this review is addressed.

Thank you both @ofekshenawa and @htemelski-redis!

Comment on lines +23 to +38
func (c *ClusterClient) routeAndRun(ctx context.Context, cmd Cmder, node *clusterNode) error {
    policy := c.getCommandPolicy(ctx, cmd)

    switch {
    case policy != nil && policy.Request == routing.ReqAllNodes:
        return c.executeOnAllNodes(ctx, cmd, policy)
    case policy != nil && policy.Request == routing.ReqAllShards:
        return c.executeOnAllShards(ctx, cmd, policy)
    case policy != nil && policy.Request == routing.ReqMultiShard:
        return c.executeMultiShard(ctx, cmd, policy)
    case policy != nil && policy.Request == routing.ReqSpecial:
        return c.executeSpecialCommand(ctx, cmd, policy, node)
    default:
        return c.executeDefault(ctx, cmd, node)
    }
}
Member

Suggested change
-func (c *ClusterClient) routeAndRun(ctx context.Context, cmd Cmder, node *clusterNode) error {
-    policy := c.getCommandPolicy(ctx, cmd)
-    switch {
-    case policy != nil && policy.Request == routing.ReqAllNodes:
-        return c.executeOnAllNodes(ctx, cmd, policy)
-    case policy != nil && policy.Request == routing.ReqAllShards:
-        return c.executeOnAllShards(ctx, cmd, policy)
-    case policy != nil && policy.Request == routing.ReqMultiShard:
-        return c.executeMultiShard(ctx, cmd, policy)
-    case policy != nil && policy.Request == routing.ReqSpecial:
-        return c.executeSpecialCommand(ctx, cmd, policy, node)
-    default:
-        return c.executeDefault(ctx, cmd, node)
-    }
-}
+func (c *ClusterClient) routeAndRun(ctx context.Context, cmd Cmder, node *clusterNode) error {
+    policy := c.getCommandPolicy(ctx, cmd)
+    if policy == nil {
+        return c.executeDefault(ctx, cmd, node)
+    }
+    switch policy.Request {
+    case routing.ReqAllNodes:
+        return c.executeOnAllNodes(ctx, cmd, policy)
+    case routing.ReqAllShards:
+        return c.executeOnAllShards(ctx, cmd, policy)
+    case routing.ReqMultiShard:
+        return c.executeMultiShard(ctx, cmd, policy)
+    case routing.ReqSpecial:
+        return c.executeSpecialCommand(ctx, cmd, policy, node)
+    default:
+        return c.executeDefault(ctx, cmd, node)
+    }
+}

Comment on lines +50 to +53
if c.hasKeys(cmd) {
    // execute on key based shard
    return node.Client.Process(ctx, cmd)
}
Member

Do we know that this node serves the slot for the key?

Contributor

Yes, the node should've been selected based on the slot; see osscluster.go:L1906:

func (c *ClusterClient) cmdNode(

    // execute on key based shard
    return node.Client.Process(ctx, cmd)
}
return c.executeOnArbitraryShard(ctx, cmd)
Member

since it doesn't matter and there is already some node selected, why not use it?

Contributor

We have two different ways of picking an arbitrary shard: either round-robin or a random one.
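
For context, the two strategies as ShardPicker implementations (a sketch; the PR's actual types may differ, and both assume total > 0):

```go
// RoundRobinShardPicker cycles through shards; RandomShardPicker picks one
// uniformly at random. Needs "sync/atomic" and "math/rand".
type RoundRobinShardPicker struct{ n atomic.Uint32 }

func (p *RoundRobinShardPicker) Next(total int) int {
    return int((p.n.Add(1) - 1) % uint32(total))
}

type RandomShardPicker struct{}

func (RandomShardPicker) Next(total int) int { return rand.Intn(total) }
```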

case CmdTypeKeyFlags:
    return NewKeyFlagsCmd(ctx, args...)
case CmdTypeDuration:
    return NewDurationCmd(ctx, time.Second, args...)
Member

Some CmdTypeDuration commands use time.Millisecond as their precision; see PExpireTime, for example. Shouldn't we use it here so we don't lose precision?
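
One way to address this, sketched (the helper and its command list are illustrative, not the PR's code):

```go
// durationPrecision picks the reply precision per command instead of
// hard-coding time.Second.
func durationPrecision(cmdName string) time.Duration {
    switch strings.ToLower(cmdName) {
    case "pttl", "pexpiretime": // millisecond-precision replies (not exhaustive)
        return time.Millisecond
    default:
        return time.Second
    }
}

// ...and then: return NewDurationCmd(ctx, durationPrecision(name), args...)
```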

Comment on lines +498 to +500
// Command executed successfully but value extraction failed
// This is common for complex commands like CLUSTER SLOTS
// The command already has its result set correctly, so just return
Member

I do not understand the comment here. Why did the value extraction return nil? Can we make sure the cmd at least has a value set? If it doesn't, we may return a cmd with a nil value and a nil error, which doesn't make sense.

Comment on lines +748 to +759
    if c, ok := cmd.(*KeyValuesCmd); ok {
        // KeyValuesCmd needs a key string and values slice
        if key, ok := value.(string); ok {
            c.SetVal(key, []string{}) // Default empty values
        }
    }
case CmdTypeZSliceWithKey:
    if c, ok := cmd.(*ZSliceWithKeyCmd); ok {
        // ZSliceWithKeyCmd needs a key string and Z slice
        if key, ok := value.(string); ok {
            c.SetVal(key, []Z{}) // Default empty Z slice
        }
Member

why are we setting empty values here?
