# Vision add-on system for MLX Engine
The Vision add-on system is a modular plugin architecture within MLX Engine designed to extend standard text model pipelines to support Vision-Language Models (VLMs)^[001-TODO__mlx-engine.md]. It allows specific models to leverage high-performance engine features like KV caching and speculative decoding, while maintaining a fallback path for generic vision models.
This system resolves the architectural conflict between the highly optimized ModelKit (designed for text-only models) and the diverse requirements of visual inputs^[001-TODO__mlx-engine.md].
## Architecture
The system operates by mapping specific model identifiers (model_type in config.json) to dedicated handler classes known as Vision Add-ons^[001-TODO__mlx-engine.md].
### ModelKit vs. VisionModelKit
MLX Engine selects one of two execution paths based on the presence of a supported add-on^[001-TODO__mlx-engine.md]:
- **ModelKit + Add-on**: Used when the `model_type` matches a key in the `VISION_ADD_ON_MAP`. This path integrates the add-on into the main engine.
  - Supports: KV cache quantization, cross-prompt caching, and speculative decoding^[001-TODO__mlx-engine.md].
- **VisionModelKit**: A generic wrapper around `mlx-vlm` used when no specific add-on exists.
  - Restrictions: Does not support KV cache quantization, cross-prompt caching, or speculative decoding^[001-TODO__mlx-engine.md].
## Supported Add-ons
The system currently includes specific add-ons for the following model architectures^[001-TODO__mlx-engine.md]:
| `model_type` | Add-on Class |
|---|---|
| `gemma3` | `Gemma3VisionAddOn` |
| `gemma3n` | `Gemma3nVisionAddOn` |
| `pixtral` | `PixtralVisionAddOn` |
| `mistral3` | `Mistral3VisionAddOn` |
| `lfm2-vl` | `LFM2VisionAddOn` |
> **Note:** The `qwen2_vl` series (Qwen2-VL, Qwen2.5-VL) is currently disabled due to a known porting bug (Issue #237)^[001-TODO__mlx-engine.md].
## Functionality
The primary role of a Vision Add-on is to inject visual processing capabilities into the ModelKit pipeline^[001-TODO__mlx-engine.md].
- **Visual Input Processing**: When a model is loaded via `load_model()` with a matching `model_type`, the corresponding add-on is attached. This enables the pipeline to accept `images_b64` (base64-encoded images) during generation^[001-TODO__mlx-engine.md].
- **Feature Preservation**: By using a dedicated add-on, the model retains access to engine-level optimizations. For example, a model using the `PixtralVisionAddOn` can utilize [[Speculative Decoding]], whereas a model falling back to `VisionModelKit` cannot^[001-TODO__mlx-engine.md].
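The attach-at-load behavior can be illustrated with a toy loader. Everything here is a simplified assumption for illustration: `ToyModelKit`, `ADD_ON_MAP`, and `supports_images()` are hypothetical names, and the real `load_model()` does considerably more.

```python
# Toy model of the load path: a matching model_type gets its add-on
# attached, which is what later allows images_b64 to be accepted
# during generation.
class PixtralVisionAddOn:
    pass

ADD_ON_MAP = {"pixtral": PixtralVisionAddOn}

class ToyModelKit:
    def __init__(self, model_type: str):
        add_on_cls = ADD_ON_MAP.get(model_type)
        # Attach the add-on instance only when the model_type is mapped
        self.vision_add_on = add_on_cls() if add_on_cls else None

    def supports_images(self) -> bool:
        # Only models with an attached add-on can take images_b64
        return self.vision_add_on is not None
```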
## API Usage
The system is transparent to the end-user. If a downloaded model has a supported model_type, the vision capabilities are automatically enabled^[001-TODO__mlx-engine.md].
```python
from mlx_engine import load_model, create_generator, tokenize

# Loading Pixtral (uses PixtralVisionAddOn)
model_kit = load_model("mlx-community/pixtral-12b-4bit")

# Tokenize the text prompt (prompt string is illustrative)
prompt_tokens = tokenize(model_kit, "Describe this image.")

# Generation automatically handles images if the add-on is present
for result in create_generator(
    model_kit,
    prompt_tokens,
    images_b64=["<base64_string>"],
):
    print(result.text)
```
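The `images_b64` argument expects base64-encoded image bytes as strings. A small standard-library helper for producing them (the helper name and file path are illustrative, not part of MLX Engine's API):

```python
import base64
from pathlib import Path

def image_to_b64(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

# Example: images_b64=[image_to_b64("photo.jpg")] in create_generator(...)
```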
## Limitations
- **Fallback Constraints**: If a vision model is not explicitly mapped in the add-on system, it defaults to the generic `VisionModelKit`. In this mode, KV cache quantization and cross-prompt caching are unavailable, which may impact performance and memory usage^[001-TODO__mlx-engine.md].
- **Draft Models**: As with the generic `VisionModelKit`, draft models used in speculative decoding do not support vision inputs^[001-TODO__mlx-engine.md].
## Related Concepts
- [[Apple MLX]]
- [[KV Cache]]
- [[Speculative Decoding]]
## Sources
001-TODO__mlx-engine.md