System Info
When running text-embeddings-router locally on my MacBook Air M2, I get very different reranking speeds depending on the value of --max-batch-requests:
- None (flag unset): 44s
- 1: 11s
- 8: 30s
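The timings above are taken from the router's own rerank Success log lines (the total_time field). The exact payload is not included in this report, but the requests are of the following shape; this is a minimal sketch against the router's /rerank endpoint on the port used below, with placeholder query and texts rather than the actual ~75 chunks:

```shell
# Minimal sketch of the kind of request being timed (placeholder payload,
# not the actual ~75 chunks of <=256 tokens used in the report).
# The reported timings come from the router's "Success" log line (total_time).
curl -s http://127.0.0.1:8001/rerank \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "query": "example query",
        "texts": ["candidate passage one", "candidate passage two"]
      }'
```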
Information
- [ ] Docker
- [x] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
- cargo install --path router -F metal
- Run with --max-batch-requests 1:
text-embeddings-router \
--model-id BAAI/bge-reranker-v2-m3 \
--port 8001 \
--max-batch-requests 1 \
--max-client-batch-size 80 \
--max-batch-tokens 20480
- Reranking roughly 75 chunks of 256 tokens or fewer completes in roughly 11 seconds
2025-09-04T15:40:02.203910Z INFO rerank{total_time="11.258211459s" tokenization_time="24.021326ms" queue_time="5.842368084s" inference_time="149.803875ms"}: text_embeddings_router::http::server: router/src/http/server.rs:471: Success
- Run again, but with --max-batch-requests 8:
text-embeddings-router \
--model-id BAAI/bge-reranker-v2-m3 \
--port 8001 \
--max-batch-requests 8 \
--max-client-batch-size 80 \
--max-batch-tokens 20480
- Completes in roughly 30 seconds
2025-09-04T16:51:03.398288Z INFO rerank{total_time="30.35400575s" tokenization_time="174.576542ms" queue_time="13.714441825s" inference_time="2.909703679s"}: text_embeddings_router::http::server: router/src/http/server.rs:471: Success
- Run again, but without --max-batch-requests:
text-embeddings-router \
--model-id BAAI/bge-reranker-v2-m3 \
--port 8001 \
--max-client-batch-size 80 \
--max-batch-tokens 20480
- Completes in roughly 44 seconds
2025-09-04T17:17:41.994774Z INFO rerank{total_time="44.397919542s" tokenization_time="144.352587ms" queue_time="3.330799548s" inference_time="39.825660755s"}: text_embeddings_router::http::server: router/src/http/server.rs:471: Success
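The three runs above were compared using the router's own total_time log field. To cross-check the numbers from the client side, the same request can be timed with curl's built-in timer; a rough sketch follows (the payload is again a placeholder, and the router has to be restarted with the corresponding --max-batch-requests value before each measurement):

```shell
# Rough sketch: time the same rerank request from the client side.
# Restart the router with --max-batch-requests unset, 1, or 8 before each run,
# then compare the printed wall-clock time with total_time in the router log.
payload='{"query": "example query", "texts": ["candidate passage one", "candidate passage two"]}'

curl -s -o /dev/null \
  -w 'client-side total: %{time_total}s\n' \
  -X POST http://127.0.0.1:8001/rerank \
  -H 'Content-Type: application/json' \
  -d "$payload"
```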
Complete startup logs:
{"model_id":"BAAI/bge-reranker-v2-m3","model_sha":null,"model_dtype":"float16","model_type":{"reranker":{"id2label":{"0":"LABEL_0"},"label2id":{"LABEL_0":0}}},"max_concurrent_requests":512,"max_input_length":8192,"max_batch_tokens":20480,"max_batch_requests":8,"max_client_batch_size":80,"auto_truncate":false,"tokenization_workers":8,"version":"1.8.0","sha":"8b74ae598e3fa56c91ac6e1db4a59a43fb7ea5bc","docker_label":null}%
2025-09-04T16:45:12.530844Z INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "BAA*/***-********-*2-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 20480, max_batch_requests: Some(8), max_client_batch_size: 80, auto_truncate: false, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8001, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-09-04T16:45:12.534079Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-09-04T16:45:12.534088Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-09-04T16:45:12.728837Z WARN download_artifacts: text_embeddings_core::download: core/src/download.rs:50: Download failed: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-reranker-v2-m3/resolve/main/1_Pooling/config.json)
2025-09-04T16:45:12.728898Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-09-04T16:45:12.856926Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
2025-09-04T16:45:12.988810Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
2025-09-04T16:45:13.110311Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
2025-09-04T16:45:13.237148Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
2025-09-04T16:45:13.360954Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
2025-09-04T16:45:13.485277Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
2025-09-04T16:45:13.609977Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-09-04T16:45:13.727558Z WARN download_artifacts: text_embeddings_core::download: core/src/download.rs:65: Download failed: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/BAAI/bge-reranker-v2-m3/resolve/main/config_sentence_transformers.json)
2025-09-04T16:45:13.727617Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-09-04T16:45:13.727852Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-09-04T16:45:13.727918Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 1.193838708s
2025-09-04T16:45:13.933689Z WARN text_embeddings_router: router/src/lib.rs:190: Could not find a Sentence Transformers config
2025-09-04T16:45:13.933706Z INFO text_embeddings_router: router/src/lib.rs:194: Maximum number of tokens per request: 8192
2025-09-04T16:45:13.933719Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-09-04T16:45:14.703659Z INFO text_embeddings_router: router/src/lib.rs:236: Starting model backend
2025-09-04T16:45:14.703714Z INFO text_embeddings_backend: backends/src/lib.rs:539: Downloading `model.safetensors`
2025-09-04T16:45:14.703793Z INFO text_embeddings_backend: backends/src/lib.rs:407: Model weights downloaded in 79.042µs
2025-09-04T16:45:14.703814Z INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:638: Downloading `modules.json`
2025-09-04T16:45:14.822713Z INFO text_embeddings_backend: backends/src/lib.rs:419: Dense modules downloaded in 118.914041ms
2025-09-04T16:45:14.842187Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:255: Starting Bert model on Metal(MetalDevice(DeviceId(1)))
2025-09-04T16:45:18.181124Z INFO text_embeddings_router: router/src/lib.rs:254: Warming up model
2025-09-04T16:50:07.896396Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:8001
2025-09-04T16:50:07.896594Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready
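The JSON blob at the top of these logs appears to be the response from the router's GET /info endpoint (it lists max_batch_requests: 8, matching the second run). Assuming that endpoint, the batching limits actually in effect for a given run can be double-checked with:

```shell
# Confirm which max_batch_requests value the running router is using.
# python3 -m json.tool is only used here to pretty-print the response.
curl -s http://127.0.0.1:8001/info | python3 -m json.tool
```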
Expected behavior
Speed should be similar or the same across these settings; the default (no --max-batch-requests) should not be the slowest configuration.