Severely degraded reranking speed depending on --max-batch-requests #713

@pandysp

Description

System Info

When running text-embeddings-router locally on my MacBook Air M2, I get very different reranking speeds depending on the value of --max-batch-requests:

  • None (the default): 44s
  • 1: 11s
  • 8: 30s
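The logged timing fields (see the rerank log lines in the Reproduction section) suggest a different bottleneck in each run: with --max-batch-requests 1 most of the time is queueing, while the default spends almost all of it in inference. A quick sketch of the breakdown, with the values copied from those log lines (rounded to milliseconds):

```python
# Timing fields copied from the three rerank log lines below (seconds).
runs = {
    "1": {"total": 11.258, "queue": 5.842, "inference": 0.150},
    "8": {"total": 30.354, "queue": 13.714, "inference": 2.910},
    "None": {"total": 44.398, "queue": 3.331, "inference": 39.826},
}

for name, t in runs.items():
    # "other" covers tokenization plus unaccounted-for overhead.
    other = t["total"] - t["queue"] - t["inference"]
    print(
        f"--max-batch-requests {name}: "
        f"queue {t['queue'] / t['total']:.0%}, "
        f"inference {t['inference'] / t['total']:.0%}, "
        f"other {other / t['total']:.0%}"
    )
```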

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. cargo install --path router -F metal
  2. Run with --max-batch-requests 1:
text-embeddings-router \
    --model-id BAAI/bge-reranker-v2-m3 \
    --port 8001 \
    --max-batch-requests 1 \
    --max-client-batch-size 80 \
    --max-batch-tokens 20480
  3. Reranking roughly 75 chunks of 256 tokens or fewer completes in roughly 11 seconds:
2025-09-04T15:40:02.203910Z  INFO rerank{total_time="11.258211459s" tokenization_time="24.021326ms" queue_time="5.842368084s" inference_time="149.803875ms"}: text_embeddings_router::http::server: router/src/http/server.rs:471: Success
  4. Run again, but with --max-batch-requests 8:
text-embeddings-router \
    --model-id BAAI/bge-reranker-v2-m3 \
    --port 8001 \
    --max-batch-requests 8 \
    --max-client-batch-size 80 \
    --max-batch-tokens 20480
  5. Completes in roughly 30 seconds:
2025-09-04T16:51:03.398288Z  INFO rerank{total_time="30.35400575s" tokenization_time="174.576542ms" queue_time="13.714441825s" inference_time="2.909703679s"}: text_embeddings_router::http::server: router/src/http/server.rs:471: Success
  6. Run again, but without --max-batch-requests:
text-embeddings-router \
    --model-id BAAI/bge-reranker-v2-m3 \
    --port 8001 \
    --max-client-batch-size 80 \
    --max-batch-tokens 20480
  7. Completes in roughly 44 seconds:
2025-09-04T17:17:41.994774Z  INFO rerank{total_time="44.397919542s" tokenization_time="144.352587ms" queue_time="3.330799548s" inference_time="39.825660755s"}: text_embeddings_router::http::server: router/src/http/server.rs:471: Success
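For reference, the same rerank request was sent in each run. A minimal sketch of the kind of call involved, assuming the standard TEI /rerank JSON body (query, texts, optional raw_scores); the query and chunk texts here are placeholders, not the original data:

```python
def build_rerank_payload(query, chunks):
    """Build the JSON body for POST /rerank on the router."""
    return {"query": query, "texts": chunks, "raw_scores": False}

# ~75 chunks of <= 256 tokens each, as in the benchmark above.
chunks = [f"placeholder chunk {i} ..." for i in range(75)]
payload = build_rerank_payload("placeholder query", chunks)
print(len(payload["texts"]))

# To send it against the locally running router:
# import requests
# r = requests.post("http://localhost:8001/rerank", json=payload, timeout=120)
# scores = r.json()  # list of {"index": ..., "score": ...} entries
```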

Complete startup logs:

{"model_id":"BAAI/bge-reranker-v2-m3","model_sha":null,"model_dtype":"float16","model_type":{"reranker":{"id2label":{"0":"LABEL_0"},"label2id":{"LABEL_0":0}}},"max_concurrent_requests":512,"max_input_length":8192,"max_batch_tokens":20480,"max_batch_requests":8,"max_client_batch_size":80,"auto_truncate":false,"tokenization_workers":8,"version":"1.8.0","sha":"8b74ae598e3fa56c91ac6e1db4a59a43fb7ea5bc","docker_label":null}
2025-09-04T16:45:12.530844Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "BAA*/***-********-*2-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 20480, max_batch_requests: Some(8), max_client_batch_size: 80, auto_truncate: false, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8001, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-09-04T16:45:12.534079Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-09-04T16:45:12.534088Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-09-04T16:45:12.728837Z  WARN download_artifacts: text_embeddings_core::download: core/src/download.rs:50: Download failed: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-reranker-v2-m3/resolve/main/1_Pooling/config.json)
2025-09-04T16:45:12.728898Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-09-04T16:45:12.856926Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
2025-09-04T16:45:12.988810Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
2025-09-04T16:45:13.110311Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
2025-09-04T16:45:13.237148Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
2025-09-04T16:45:13.360954Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
2025-09-04T16:45:13.485277Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
2025-09-04T16:45:13.609977Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-09-04T16:45:13.727558Z  WARN download_artifacts: text_embeddings_core::download: core/src/download.rs:65: Download failed: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-reranker-v2-m3/resolve/main/config_sentence_transformers.json)
2025-09-04T16:45:13.727617Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-09-04T16:45:13.727852Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-09-04T16:45:13.727918Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 1.193838708s
2025-09-04T16:45:13.933689Z  WARN text_embeddings_router: router/src/lib.rs:190: Could not find a Sentence Transformers config
2025-09-04T16:45:13.933706Z  INFO text_embeddings_router: router/src/lib.rs:194: Maximum number of tokens per request: 8192
2025-09-04T16:45:13.933719Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-09-04T16:45:14.703659Z  INFO text_embeddings_router: router/src/lib.rs:236: Starting model backend
2025-09-04T16:45:14.703714Z  INFO text_embeddings_backend: backends/src/lib.rs:539: Downloading `model.safetensors`
2025-09-04T16:45:14.703793Z  INFO text_embeddings_backend: backends/src/lib.rs:407: Model weights downloaded in 79.042µs
2025-09-04T16:45:14.703814Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:638: Downloading `modules.json`
2025-09-04T16:45:14.822713Z  INFO text_embeddings_backend: backends/src/lib.rs:419: Dense modules downloaded in 118.914041ms
2025-09-04T16:45:14.842187Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:255: Starting Bert model on Metal(MetalDevice(DeviceId(1)))
2025-09-04T16:45:18.181124Z  INFO text_embeddings_router: router/src/lib.rs:254: Warming up model
2025-09-04T16:50:07.896396Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:8001
2025-09-04T16:50:07.896594Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready

Expected behavior

Reranking speed should be similar, or the same, across these settings. In any case, the default should not be the slowest configuration.
