
[RFC]: Supporting multi-modality models in AIBrix #1509

@happyandslow


Summary

AIBrix has been successfully adopted as a cloud-native orchestration framework for LLM inference engines (e.g., vLLM, SGLang). To expand its scope into multimodal AI inference, new capabilities are required across the gateway, engine integration, data handling, caching, metrics, and pre/post-processing layers. This RFC motivates and details the prioritized tasks needed to extend AIBrix to fully support multimodal scenarios.

Motivation

Extending AIBrix's architecture in these ways would make it a unified inference orchestration platform, handling not just text LLMs but also richer multimodal workloads, while retaining its strengths in routing, cost-efficiency, and scalability.

Proposed Change

1. Multimodal Engine Integration

  • Many multimodal models (e.g., Qwen-VL, LLaVA, GPT-4o) accept text + image input and return text output.
  • OpenAI’s /chat/completions and /completions APIs already support mixed input payloads ({"type": "text"} + {"type": "image_url"}), which are widely adopted.
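
For concreteness, here is a minimal sketch of such a mixed text + image request as the gateway would receive it; the base URL, API key, and model name below are placeholders for illustration, not existing AIBrix values:

```python
# A mixed-content /chat/completions request in the OpenAI-compatible format
# that the AIBrix gateway would need to parse and route.
from openai import OpenAI

client = OpenAI(base_url="http://aibrix-gateway/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen-vl",  # hypothetical deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```
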
TODO

2. Engine Adaptation Beyond LLMs

  • Current AIBrix engine integrations (vLLM, SGLang) assume an OpenAI-compatible LLM interface.
  • Diffusion models (e.g., xDiT, https://github.com/xdit-project/xDiT) expose non-OpenAI APIs with different input/output formats.
  • Supporting such engines would expand AIBrix beyond text and vision-language models into generative image/video/audio domains.
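
As a sketch of what an adapter layer could look like, the snippet below maps a gateway-level image-generation request onto a hypothetical engine-native HTTP endpoint; the /generate route and payload fields are assumptions for illustration, not xDiT's actual API:

```python
# Engine adapter sketch: translate a gateway-level image request into a
# call against an assumed engine-native /generate endpoint, so the gateway
# can route LLM and diffusion workloads through one uniform layer.
import requests


def generate_image(prompt: str, engine_url: str = "http://xdit-svc:8000") -> bytes:
    """Return raw image bytes produced by the diffusion engine."""
    payload = {
        "prompt": prompt,           # field names here are assumptions
        "num_inference_steps": 30,
        "height": 1024,
        "width": 1024,
    }
    resp = requests.post(f"{engine_url}/generate", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.content  # e.g., PNG bytes for the gateway to post-process
```
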
TODO

3. Large File Download / Access Support

  • Multimodal models often need to handle large input files (video/audio) or large outputs (image generations, video clips).
  • Current inline API payloads are impractical for content running to hundreds of megabytes.
  • Industry practice (e.g., OpenAI file uploads, AWS Bedrock) relies on remote storage integration (S3/GCS/Azure Blob).
TODO
  • Extend AIBrix Gateway to:
    • Support file references (URIs) instead of inline payloads.
    • Integrate with S3-compatible object storage (e.g., AWS S3, GCS, Azure Blob) for upload/download.
  • Define policies for temporary storage, caching, and cleanup of large multimodal assets.
  • Add data movement monitoring to avoid bottlenecks.
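
A minimal sketch of the URI-based flow, assuming S3-compatible storage accessed via boto3; the endpoint, bucket, and URL-rewriting behavior are illustrative:

```python
# Gateway-side file-reference handling: resolve a stored object into a
# short-lived presigned URL so the engine fetches the asset directly,
# keeping multi-100MB content out of the request body.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.com")


def presign_download(bucket: str, key: str, ttl_seconds: int = 3600) -> str:
    """Return a time-limited download URL for a stored multimodal asset."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,  # bounds how long the temporary access lives
    )


# The gateway would rewrite a reference like s3://inputs/video.mp4 in the
# request payload into the presigned HTTPS URL before forwarding to the engine.
url = presign_download("inputs", "video.mp4")
```
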

4. KV Cache and Beyond

  • AIBrix currently implements a distributed KV cache to accelerate LLM inference by storing attention states.
  • LMCache added support for detecting identical images, so that image reuse can be exploited while different images are prevented from sharing the same KV entries (a risk introduced by placeholder tokens).
  • Beyond the KV cache, multimodal models expose additional cacheable elements:
    • Visual embeddings (e.g., CLIP features).
    • Latent representations for image/video/audio.
  • Extending caching semantics will improve throughput and latency for repeated multimodal contexts.
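
A minimal sketch of content-addressed embedding caching, which keys on a hash of the image bytes rather than on placeholder-token positions; the encoder callback and cache structure are assumptions, not the existing AIBrix KV cache API:

```python
# Cache visual embeddings by content hash: identical images hit the cache,
# while visually different images can never collide the way position-based
# placeholder tokens could.
import hashlib
from typing import Callable

import numpy as np

_embedding_cache: dict[str, np.ndarray] = {}


def get_visual_embedding(
    image_bytes: bytes, encode_fn: Callable[[bytes], np.ndarray]
) -> np.ndarray:
    """encode_fn is the expensive vision encoder, e.g., a CLIP forward pass."""
    key = hashlib.sha256(image_bytes).hexdigest()  # content-addressed key
    if key not in _embedding_cache:
        _embedding_cache[key] = encode_fn(image_bytes)
    return _embedding_cache[key]
```
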
TODO

5. Modality-Specific Metrics

  • Current AIBrix metrics (e.g., tokens/sec, request latency) are LLM-centric.
  • Multimodal workloads require new observability dimensions:
    • Image/video size processed (pixels/frame count).
    • Audio duration.
    • Preprocessing latency (I/O, decoding, embedding).
  • These metrics are crucial for autoscaling and SLA enforcement.
TODO
  • Extend telemetry pipeline with modality-aware metrics exporters.
  • Define new autoscaling triggers, for example:
    • Scale on average pixels/sec.
    • Scale on average audio seconds/sec.
  • Update dashboards to visualize cross-modality utilization trends.
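
A minimal sketch of such exporters using prometheus_client; the metric names and label sets below are proposals, not metrics AIBrix exposes today:

```python
# Modality-aware metrics: counters for pixels and audio seconds feed
# rate-based autoscaling triggers (pixels/sec, audio seconds/sec), and a
# histogram captures preprocessing latency for SLA enforcement.
from prometheus_client import Counter, Histogram

PIXELS_PROCESSED = Counter(
    "aibrix_pixels_processed_total",
    "Total image/video pixels processed",
    ["model", "modality"],
)
AUDIO_SECONDS = Counter(
    "aibrix_audio_seconds_total",
    "Total seconds of audio processed",
    ["model"],
)
PREPROCESS_LATENCY = Histogram(
    "aibrix_preprocess_latency_seconds",
    "Latency of I/O, decoding, and embedding preprocessing",
    ["model", "stage"],
)

# Example: record one 1024x768 image for a vision-language deployment.
PIXELS_PROCESSED.labels(model="qwen-vl", modality="image").inc(1024 * 768)
```
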

6. Pre/Post Processing Pipelines

  • Multimodal inference typically requires non-trivial preprocessing and postprocessing:
    • Image decoding/normalization, resizing.
    • Audio resampling or spectrogram generation.
    • Video frame extraction.
    • Postprocessing for outputs (e.g., base64→PNG, WAV encoding).

vLLM's work to separate out the preprocessing (e.g., encoding) pipeline is still under development.

TODO
  • Introduce pre/post-processing sidecar components or pipeline stages.
  • Allow pluggable user-defined preprocessing modules for custom use cases.
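
A minimal sketch of a pluggable registry for user-defined stages; the decorator API and the Pillow-based resize example are assumptions for illustration, not an existing AIBrix interface:

```python
# Pluggable preprocessing stages: users register named bytes -> bytes
# transforms, and the gateway/sidecar composes them into a pipeline.
import io
from typing import Callable

from PIL import Image

_PREPROCESSORS: dict[str, Callable[[bytes], bytes]] = {}


def preprocessor(name: str):
    """Register a user-defined preprocessing stage under a modality name."""
    def wrap(fn: Callable[[bytes], bytes]) -> Callable[[bytes], bytes]:
        _PREPROCESSORS[name] = fn
        return fn
    return wrap


@preprocessor("image-resize")
def resize_image(data: bytes) -> bytes:
    """Decode, resize to the model's expected resolution, re-encode as PNG."""
    img = Image.open(io.BytesIO(data)).convert("RGB").resize((336, 336))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def run_pipeline(stage_names: list[str], data: bytes) -> bytes:
    """Apply registered stages in order, e.g., run_pipeline(["image-resize"], raw)."""
    for name in stage_names:
        data = _PREPROCESSORS[name](data)
    return data
```
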

Alternatives Considered

None

Metadata


Labels

area/inference-engine, kind/enhancement, kind/feature, priority/important-soon
