Description
Summary
AIBrix has been successfully adopted as a cloud-native orchestration framework for LLM inference engines (e.g., vLLM, SGLang). To expand its scope into multimodal AI inference, new capabilities are required across the gateway, engine integration, data handling, caching, metrics, and pre/post-processing layers. This RFC motivates and details the prioritized tasks needed to extend AIBrix to fully support multimodal scenarios.
Motivation
By extending its architecture in these ways, AIBrix can become a unified inference orchestration platform that handles not just text LLMs but also richer multimodal workloads, while retaining its strengths in routing, cost-efficiency, and scalability.
Proposed Change
1. Multimodal Engine Integration
- Many multimodal models (e.g., Qwen-VL, LLaVA, GPT-4o) accept text + image input and return text output.
- OpenAI's /chat/completions API already supports mixed input payloads ({"type": "text"} + {"type": "image_url"}) and is widely adopted.
TODO
- Verify that the AIBrix API Gateway accepts different forms of modality input and fully supports the OpenAI API through vLLM (see the request sketch after this list). Refer to the examples in https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_client_for_multimodal.py for further details:
  - Image/video URL
  - Image/video/audio encoding
  - Raw data
- Deliver a demo deployment with a text+image model (e.g., Qwen-VL) running through AIBrix, covering:
  - Single image + text
  - Multiple images + text
  - Video
  - Audio
- Verify AIBrix's support for embedding input. Refer to examples in https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_embedding_client_for_multimodal.py for further details.
- Support the embeddings endpoint
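A minimal request sketch for the gateway verification above, following the pattern of the referenced vLLM example. The base URL, model name, and image URL are placeholders for an AIBrix-fronted, OpenAI-compatible vLLM deployment:

```python
# Sketch: text + image_url chat completion against an OpenAI-compatible endpoint.
# base_url and model are placeholders, not a documented AIBrix address.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```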
2. Engine Adaptation Beyond LLMs
- Current AIBrix engine integrations (vLLM, SGLang) assume an OpenAI-compatible LLM interface.
- Diffusion engines (e.g., xDiT, https://github.com/xdit-project/xDiT) expose non-OpenAI APIs with different input/output formats.
- Supporting such engines would expand AIBrix beyond text and vision-language models into generative image/video/audio domains.
TODO
- Evaluate candidate API standards: [RFC]: xDiT Video Generation API #1595
- Provide a pluggable API gateway layer to support alternative APIs when required.
- Define a per-engine capability registry (supported modalities: [text, image, audio], API type); see the registry sketch after this list. [RFC]: xDiT Video Generation API #1595
- Prototype integration with a multimodal engine (e.g., xDiT) to test the gateway + orchestration path.
- Deliver a demo deployment with a video generation model (e.g., https://huggingface.co/Wan-AI).
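A possible shape for the per-engine capability registry mentioned above. The class, field names, and example entries are illustrative assumptions rather than an existing AIBrix interface:

```python
# Sketch of a capability registry the gateway could consult at routing time.
from dataclasses import dataclass, field

@dataclass
class EngineCapability:
    engine: str                                               # e.g., "vllm", "xdit"
    api_type: str                                             # e.g., "openai-chat", "xdit-video"
    input_modalities: set[str] = field(default_factory=set)   # {"text", "image", ...}
    output_modalities: set[str] = field(default_factory=set)  # {"text", "video", ...}

REGISTRY = {
    "vllm": EngineCapability("vllm", "openai-chat", {"text", "image", "audio"}, {"text"}),
    "xdit": EngineCapability("xdit", "xdit-video", {"text"}, {"image", "video"}),
}

def engines_for(inputs: set[str], outputs: set[str]) -> list[str]:
    """Return engines whose declared capabilities cover the requested modalities."""
    return [
        name for name, cap in REGISTRY.items()
        if inputs <= cap.input_modalities and outputs <= cap.output_modalities
    ]

print(engines_for({"text"}, {"video"}))  # -> ["xdit"]
```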
3. Large File Download / Access Support
- Multimodal models often need to handle large input files (video/audio) or large outputs (image generations, video clips).
- Inline API payloads are impractical for content in the hundreds of megabytes.
- Industry practice (e.g., OpenAI file uploads, AWS Bedrock) relies on remote storage integration (S3/GCS/Azure Blob).
TODO
- Extend AIBrix Gateway to:
  - Support file references (URIs) instead of inline payloads (see the storage sketch after this list).
  - Integrate with S3-compatible object storage for upload/download.
- Define policies for temporary storage, caching, and cleanup of large multimodal assets.
- Add data movement monitoring to avoid bottlenecks.
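A minimal sketch of URI-based asset handling, assuming an S3-compatible store reached through boto3. Bucket names, URIs, and function names are illustrative:

```python
# Sketch: resolve a file reference (URI) instead of expecting an inline payload,
# and hand back a temporary URL for large generated outputs.
from urllib.parse import urlparse
import boto3

s3 = boto3.client("s3")  # endpoint_url can point at any S3-compatible store

def fetch_asset(uri: str) -> bytes:
    """Download a multimodal asset referenced by an s3:// URI (only s3:// in this sketch)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"unsupported scheme: {parsed.scheme}")
    obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    return obj["Body"].read()

def presign_output(bucket: str, key: str, ttl_seconds: int = 3600) -> str:
    """Return a temporary download URL for a large generated image/video output."""
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=ttl_seconds
    )
```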
4. KV Cache and Beyond
- AIBrix currently implements a distributed KV cache to accelerate LLM inference by storing attention states.
- LMCache added support for identifying identical images so that image reuse can be detected (and to prevent different images from reusing the same KV entries, which could otherwise happen because placeholder tokens are used in the token sequence).
- (Beyond KV cache) In multimodal models, cacheable elements may include:
  - Visual embeddings (CLIP features).
  - Latent representations for image/video/audio.
- Extending caching semantics will improve throughput and latency for repeated multimodal contexts.
TODO
- Validate and add support for AIBrix to identify image/video/audio reuse (see the content-hash sketch after this list).
- Benchmark benefits for common workloads (e.g., repeated image captioning over the same video frames).
- Extend cache API to allow multimodal engines to store/retrieve intermediate representations.
- See some techniques mentioned here: https://github.com/xdit-project/xDiT?tab=readme-ov-file#cache-acceleration
- Generalize KV cache service to support arbitrary embedding/latent types.
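One way to identify image/video/audio reuse is to key cached state on a hash of the actual media bytes rather than on placeholder tokens alone. This is a sketch of that idea, not AIBrix's current cache key scheme:

```python
# Sketch: cache key that mixes prompt tokens with content hashes of the media inputs,
# so identical media hits the cache while different media does not.
import hashlib

def multimodal_cache_key(prompt_token_ids: list[int], media_items: list[bytes]) -> str:
    h = hashlib.sha256()
    for token_id in prompt_token_ids:
        h.update(token_id.to_bytes(4, "little", signed=False))
    for blob in media_items:
        # Hash raw media content so the key changes whenever the image/video/audio
        # changes, even if the placeholder tokens are identical.
        h.update(hashlib.sha256(blob).digest())
    return h.hexdigest()

key_a = multimodal_cache_key([1, 2, 3], [b"image-bytes-A"])
key_b = multimodal_cache_key([1, 2, 3], [b"image-bytes-B"])
assert key_a != key_b  # same placeholder tokens, different images -> different keys
```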
5. Modality-Specific Metrics
- Current AIBrix metrics (e.g., tokens/sec, request latency) are LLM-centric.
- Multimodal workloads require new observability dimensions:
  - Image/video size processed (pixels/frame count).
  - Audio duration.
  - Preprocessing latency (I/O, decoding, embedding).
- These metrics are crucial for autoscaling and SLA enforcement.
TODO
- Extend the telemetry pipeline with modality-aware metrics exporters (see the exporter sketch after this list).
- Define new autoscaling triggers, for example:
  - Scale on average pixels/sec.
  - Scale on average audio seconds/sec.
- Update dashboards to visualize cross-modality utilization trends.
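A sketch of a modality-aware exporter built on prometheus_client. Metric names, labels, and the port are assumptions rather than AIBrix's existing metric schema:

```python
# Sketch: modality-aware counters/histograms an exporter could publish.
from prometheus_client import Counter, Histogram, start_http_server

PIXELS_PROCESSED = Counter(
    "aibrix_multimodal_pixels_total",
    "Total pixels processed across image/video inputs",
    ["model", "modality"],
)
AUDIO_SECONDS = Counter(
    "aibrix_multimodal_audio_seconds_total",
    "Total seconds of audio processed",
    ["model"],
)
PREPROCESS_LATENCY = Histogram(
    "aibrix_multimodal_preprocess_seconds",
    "Latency of decode/resize/embedding preprocessing",
    ["model", "modality"],
)

def record_image(model: str, width: int, height: int, frames: int = 1) -> None:
    """Record processed pixels for one image or a batch of video frames."""
    PIXELS_PROCESSED.labels(model=model, modality="image").inc(width * height * frames)

if __name__ == "__main__":
    start_http_server(9400)  # placeholder exporter port
    record_image("Qwen2-VL-7B-Instruct", 1024, 768)
```

Autoscaling triggers such as average pixels/sec or audio seconds/sec could then be expressed as rate() queries over these counters.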
6. Pre/Post Processing Pipelines
- Multimodal inference typically requires non-trivial preprocessing and postprocessing:
  - Image decoding, normalization, and resizing.
  - Audio resampling or spectrogram generation.
  - Video frame extraction.
  - Postprocessing for outputs (e.g., base64→PNG, WAV encoding).
- vLLM's separation of the preprocessing (e.g., encoding) pipeline is under development:
  - [RFC]: Prototype Separating Vision Encoder to Its Own Worker vllm#20799
  - [Core] Encoder separation for Encode-Prefill-Decode Disaggregation vllm#21740
- Three ways of serving:
  - Frozen LLM + External Encoder (Encoder → projector → embeddings as prefix tokens to frozen LLM)
  - Unified Discrete Tokenization (Non-text → discrete tokens in shared vocab; train LLM jointly)
  - Orchestration / Tool Use (LLM calls external encoder services → text results fed back in)
- Centralizing these within AIBrix prevents duplication across engines and enables standardized observability and scaling.
TODO
- Introduce pre/post-processing sidecar components or pipeline stages.
- Allow pluggable user-defined preprocessing modules for custom use cases (see the plugin interface sketch below).
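A sketch of what a pluggable preprocessing interface could look like. The Protocol, class names, and the Pillow-based resize step are illustrative assumptions, not part of AIBrix today:

```python
# Sketch: preprocessors registered per modality; a sidecar or pipeline stage
# could look one up and apply it before forwarding the request to the engine.
import io
from typing import Protocol

from PIL import Image  # assumed dependency for this image example

class Preprocessor(Protocol):
    modality: str
    def process(self, raw: bytes) -> bytes: ...

class ImageResizePreprocessor:
    """Decode an image, resize it to a target resolution, re-encode as PNG."""
    modality = "image"

    def __init__(self, size: tuple[int, int] = (336, 336)):
        self.size = size

    def process(self, raw: bytes) -> bytes:
        img = Image.open(io.BytesIO(raw)).convert("RGB").resize(self.size)
        out = io.BytesIO()
        img.save(out, format="PNG")
        return out.getvalue()

PREPROCESSORS: dict[str, Preprocessor] = {"image": ImageResizePreprocessor()}
```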
Alternatives Considered
None