Description
Summary
AIBrix has been successfully adopted as a cloud-native orchestration framework for LLM inference engines (e.g., vLLM, SGLang). To expand its scope into multimodal AI inference, new capabilities are required across the gateway, engine integration, data handling, caching, metrics, and pre/post-processing layers. This RFC motivates and details the prioritized tasks needed to extend AIBrix to fully support multimodal scenarios.
Motivation
By extending its architecture in these ways, AIBrix can become a unified inference orchestration platform that handles not just text LLMs but also richer multimodal workloads, while retaining its strengths in routing, cost-efficiency, and scalability.
Proposed Change
1. Multimodal Engine Integration
- Many multimodal models (e.g., Qwen-VL, LLaVA, GPT-4o) accept text + image input and return text output.
- OpenAI's /chat/completions API already supports mixed input payloads ({"type": "text"} + {"type": "image_url"}) and is widely adopted.
TODO
- Verify that the AIBrix API Gateway accepts different forms of modality input and fully supports the OpenAI API through vLLM (see the request sketch after this list). Refer to the examples in https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_client_for_multimodal.py for further details:
  - Image/video URL
  - Image/video/audio encoding
  - Raw data
- Deliver a demo deployment with a text+image model (e.g., Qwen-VL) running through AIBrix, covering:
  - Single image + text
  - Multiple images + text
  - Video
  - Audio
- Verify AIBrix's support for embedding input. Refer to examples in https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_embedding_client_for_multimodal.py for further details.
- Support the embeddings endpoint
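A minimal request sketch for the gateway verification above, following the pattern of the referenced vLLM example. The base URL, model name, and image URL are placeholders for an AIBrix-fronted, OpenAI-compatible vLLM deployment:

```python
# Sketch: text + image_url chat completion against an OpenAI-compatible endpoint.
# base_url and model are placeholders, not a documented AIBrix address.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```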
2. Engine Adaptation Beyond LLMs
- Current AIBrix engine integrations (vLLM, SGLang) assume an OpenAI-compatible LLM interface.
- Diffusion engines (e.g., xDiT, https://github.com/xdit-project/xDiT) expose non-OpenAI APIs with different input/output formats.
- Supporting such engines would expand AIBrix beyond text and vision-language models into generative image/video/audio domains.
TODO
- Evaluate candidate API standards: [RFC]: xDiT Video Generation API #1595
- Provide a pluggable API gateway layer to support alternative APIs when required.
- Define a per-engine capability registry (supported modalities: [text, image, audio], API type); see the registry sketch after this list. [RFC]: xDiT Video Generation API #1595
- Prototype integration with a multimodal engine (e.g., xDiT) to test the gateway + orchestration path.
- Deliver a demo deployment with a video generation model (e.g., https://huggingface.co/Wan-AI).
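A possible shape for the per-engine capability registry mentioned above. The class, field names, and example entries are illustrative assumptions rather than an existing AIBrix interface:

```python
# Sketch of a capability registry the gateway could consult at routing time.
from dataclasses import dataclass, field

@dataclass
class EngineCapability:
    engine: str                                               # e.g., "vllm", "xdit"
    api_type: str                                             # e.g., "openai-chat", "xdit-video"
    input_modalities: set[str] = field(default_factory=set)   # {"text", "image", ...}
    output_modalities: set[str] = field(default_factory=set)  # {"text", "video", ...}

REGISTRY = {
    "vllm": EngineCapability("vllm", "openai-chat", {"text", "image", "audio"}, {"text"}),
    "xdit": EngineCapability("xdit", "xdit-video", {"text"}, {"image", "video"}),
}

def engines_for(inputs: set[str], outputs: set[str]) -> list[str]:
    """Return engines whose declared capabilities cover the requested modalities."""
    return [
        name for name, cap in REGISTRY.items()
        if inputs <= cap.input_modalities and outputs <= cap.output_modalities
    ]

print(engines_for({"text"}, {"video"}))  # -> ["xdit"]
```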
3. Large File Download / Access Support
- Multimodal models often need to handle large input files (video/audio) or large outputs (image generations, video clips).
- Inline API payloads are impractical for content in the hundreds of megabytes.
- Industry practice (e.g., OpenAI file uploads, AWS Bedrock) relies on remote storage integration (S3/GCS/Azure Blob).
TODO
- Extend AIBrix Gateway to:
  - Support file references (URIs) instead of inline payloads (see the storage sketch after this list).
  - Integrate with S3-compatible object storage for upload/download.
- Define policies for temporary storage, caching, and cleanup of large multimodal assets.
- Add data movement monitoring to avoid bottlenecks.
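A minimal sketch of URI-based asset handling, assuming an S3-compatible store reached through boto3. Bucket names, URIs, and function names are illustrative:

```python
# Sketch: resolve a file reference (URI) instead of expecting an inline payload,
# and hand back a temporary URL for large generated outputs.
from urllib.parse import urlparse
import boto3

s3 = boto3.client("s3")  # endpoint_url can point at any S3-compatible store

def fetch_asset(uri: str) -> bytes:
    """Download a multimodal asset referenced by an s3:// URI (only s3:// in this sketch)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"unsupported scheme: {parsed.scheme}")
    obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    return obj["Body"].read()

def presign_output(bucket: str, key: str, ttl_seconds: int = 3600) -> str:
    """Return a temporary download URL for a large generated image/video output."""
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=ttl_seconds
    )
```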
4. KV Cache and Beyond
- AIBrix currently implements a distributed KV cache to accelerate LLM inference by storing attention states.
- LMCache added support for identifying identical images so that image reuse can be detected (and to prevent different images from reusing the same KV entries, which could otherwise happen because placeholder tokens are used in the token sequence).
- (Beyond KV cache) In multimodal models, cacheable elements may include:
  - Visual embeddings (CLIP features).
  - Latent representations for image/video/audio.
- Extending caching semantics will improve throughput and latency for repeated multimodal contexts.
TODO
- Validate and add support for AIBrix to identify image/video/audio reuse (see the content-hash sketch after this list).
- Benchmark benefits for common workloads (e.g., repeated image captioning over the same video frames).
- Extend cache API to allow multimodal engines to store/retrieve intermediate representations.
- See some techniques mentioned here: https://github.com/xdit-project/xDiT?tab=readme-ov-file#cache-acceleration
- Generalize KV cache service to support arbitrary embedding/latent types.
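One way to identify image/video/audio reuse is to key cached state on a hash of the actual media bytes rather than on placeholder tokens alone. This is a sketch of that idea, not AIBrix's current cache key scheme:

```python
# Sketch: cache key that mixes prompt tokens with content hashes of the media inputs,
# so identical media hits the cache while different media does not.
import hashlib

def multimodal_cache_key(prompt_token_ids: list[int], media_items: list[bytes]) -> str:
    h = hashlib.sha256()
    for token_id in prompt_token_ids:
        h.update(token_id.to_bytes(4, "little", signed=False))
    for blob in media_items:
        # Hash raw media content so the key changes whenever the image/video/audio
        # changes, even if the placeholder tokens are identical.
        h.update(hashlib.sha256(blob).digest())
    return h.hexdigest()

key_a = multimodal_cache_key([1, 2, 3], [b"image-bytes-A"])
key_b = multimodal_cache_key([1, 2, 3], [b"image-bytes-B"])
assert key_a != key_b  # same placeholder tokens, different images -> different keys
```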
5. Modality-Specific Metrics
- Current AIBrix metrics (e.g., tokens/sec, request latency) are LLM-centric.
- Multimodal workloads require new observability dimensions:
  - Image/video size processed (pixels/frame count).
  - Audio duration.
  - Preprocessing latency (I/O, decoding, embedding).
- These metrics are crucial for autoscaling and SLA enforcement.
TODO
- Extend the telemetry pipeline with modality-aware metrics exporters (see the exporter sketch after this list).
- Define new autoscaling triggers, for example:
  - Scale on average pixels/sec.
  - Scale on average audio seconds/sec.
- Update dashboards to visualize cross-modality utilization trends.
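A sketch of a modality-aware exporter built on prometheus_client. Metric names, labels, and the port are assumptions rather than AIBrix's existing metric schema:

```python
# Sketch: modality-aware counters/histograms an exporter could publish.
from prometheus_client import Counter, Histogram, start_http_server

PIXELS_PROCESSED = Counter(
    "aibrix_multimodal_pixels_total",
    "Total pixels processed across image/video inputs",
    ["model", "modality"],
)
AUDIO_SECONDS = Counter(
    "aibrix_multimodal_audio_seconds_total",
    "Total seconds of audio processed",
    ["model"],
)
PREPROCESS_LATENCY = Histogram(
    "aibrix_multimodal_preprocess_seconds",
    "Latency of decode/resize/embedding preprocessing",
    ["model", "modality"],
)

def record_image(model: str, width: int, height: int, frames: int = 1) -> None:
    """Record processed pixels for one image or a batch of video frames."""
    PIXELS_PROCESSED.labels(model=model, modality="image").inc(width * height * frames)

if __name__ == "__main__":
    start_http_server(9400)  # placeholder exporter port
    record_image("Qwen2-VL-7B-Instruct", 1024, 768)
```

Autoscaling triggers such as average pixels/sec or audio seconds/sec could then be expressed as rate() queries over these counters.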
6. Pre/Post Processing Pipelines
- Multimodal inference typically requires non-trivial preprocessing and postprocessing:
  - Image decoding, normalization, and resizing.
  - Audio resampling or spectrogram generation.
  - Video frame extraction.
  - Postprocessing for outputs (e.g., base64→PNG, WAV encoding).
- vLLM's separation of the preprocessing (e.g., encoding) pipeline is under development:
  - [RFC]: Prototype Separating Vision Encoder to Its Own Worker vllm#20799
  - [Core] Encoder separation for Encode-Prefill-Decode Disaggregation vllm#21740
- Three ways of serving:
  - Frozen LLM + External Encoder (Encoder → projector → embeddings as prefix tokens to frozen LLM)
  - Unified Discrete Tokenization (Non-text → discrete tokens in shared vocab; train LLM jointly)
  - Orchestration / Tool Use (LLM calls external encoder services → text results fed back in)
- Centralizing these within AIBrix prevents duplication across engines and enables standardized observability and scaling.
TODO
- Introduce pre/post-processing sidecar components or pipeline stages.
- Allow pluggable user-defined preprocessing modules for custom use cases (see the plugin interface sketch below).
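A sketch of what a pluggable preprocessing interface could look like. The Protocol, class names, and the Pillow-based resize step are illustrative assumptions, not part of AIBrix today:

```python
# Sketch: preprocessors registered per modality; a sidecar or pipeline stage
# could look one up and apply it before forwarding the request to the engine.
import io
from typing import Protocol

from PIL import Image  # assumed dependency for this image example

class Preprocessor(Protocol):
    modality: str
    def process(self, raw: bytes) -> bytes: ...

class ImageResizePreprocessor:
    """Decode an image, resize it to a target resolution, re-encode as PNG."""
    modality = "image"

    def __init__(self, size: tuple[int, int] = (336, 336)):
        self.size = size

    def process(self, raw: bytes) -> bytes:
        img = Image.open(io.BytesIO(raw)).convert("RGB").resize(self.size)
        out = io.BytesIO()
        img.save(out, format="PNG")
        return out.getvalue()

PREPROCESSORS: dict[str, Preprocessor] = {"image": ImageResizePreprocessor()}
```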
Alternatives Considered
None