Vision add-on system for MLX Engine

The Vision add-on system is a modular plugin architecture within MLX Engine designed to extend standard text model pipelines to support Vision-Language Models (VLMs)^[001-TODO__mlx-engine.md]. It allows specific models to leverage high-performance engine features like KV caching and speculative decoding, while maintaining a fallback path for generic vision models.

This system resolves the architectural conflict between the highly optimized ModelKit (designed for text-only models) and the diverse requirements of visual inputs^[001-TODO__mlx-engine.md].

Architecture

The system operates by mapping specific model identifiers (model_type in config.json) to dedicated handler classes known as Vision Add-ons^[001-TODO__mlx-engine.md].

ModelKit vs. VisionModelKit

MLX Engine selects one of two execution paths based on the presence of a supported add-on^[001-TODO__mlx-engine.md]:

  • ModelKit + Add-on: Used when the model_type matches a key in the VISION_ADD_ON_MAP. This path integrates the add-on into the main engine.
    • Supports: KV cache quantization, cross-prompt caching, and speculative decoding^[001-TODO__mlx-engine.md].
  • VisionModelKit: A generic wrapper around mlx-vlm used when no specific add-on exists.
    • Restrictions: Does not support KV cache quantization, cross-prompt caching, or speculative decoding^[001-TODO__mlx-engine.md].
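The two paths and their feature sets can be summarized in code form (a summary of the comparison above, not an engine API; the dictionary name is illustrative):

```python
# Feature support by execution path, as described above.
FEATURE_SUPPORT = {
    "ModelKit+AddOn": {         # model_type found in VISION_ADD_ON_MAP
        "kv_cache_quantization": True,
        "cross_prompt_caching": True,
        "speculative_decoding": True,
    },
    "VisionModelKit": {         # generic mlx-vlm fallback
        "kv_cache_quantization": False,
        "cross_prompt_caching": False,
        "speculative_decoding": False,
    },
}
```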

Supported Add-ons

The system currently includes specific add-ons for the following model architectures^[001-TODO__mlx-engine.md]:

model_type    Add-on Class
----------    -------------------
gemma3        Gemma3VisionAddOn
gemma3n       Gemma3nVisionAddOn
pixtral       PixtralVisionAddOn
mistral3      Mistral3VisionAddOn
lfm2-vl       LFM2VisionAddOn

Note: The qwen2_vl series (Qwen2-VL, Qwen2.5-VL) is currently disabled due to a known porting bug (Issue #237)^[001-TODO__mlx-engine.md].
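A minimal sketch of the dispatch this table drives, assuming VISION_ADD_ON_MAP is a dictionary keyed by model_type (the function name `select_kit` is illustrative; the qwen2_vl entries are omitted per Issue #237):

```python
# Map of supported model_type values to their add-on classes,
# mirroring the table above (class names shown as strings for brevity).
VISION_ADD_ON_MAP = {
    "gemma3": "Gemma3VisionAddOn",
    "gemma3n": "Gemma3nVisionAddOn",
    "pixtral": "PixtralVisionAddOn",
    "mistral3": "Mistral3VisionAddOn",
    "lfm2-vl": "LFM2VisionAddOn",
}

def select_kit(model_type: str) -> str:
    """Pick the execution path for a given config.json model_type."""
    if model_type in VISION_ADD_ON_MAP:
        return "ModelKit"        # add-on attached; full engine features retained
    return "VisionModelKit"      # generic mlx-vlm fallback; reduced feature set
```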

Functionality

The primary role of a Vision Add-on is to inject visual processing capabilities into the ModelKit pipeline^[001-TODO__mlx-engine.md].

  • Visual Input Processing: When a model is loaded via load_model() with a matching model_type, the corresponding add-on is attached. This enables the pipeline to accept images_b64 (base64-encoded images) during generation^[001-TODO__mlx-engine.md].
  • Feature Preservation: By using a dedicated add-on, the model retains access to engine-level optimizations. For example, a model using the PixtralVisionAddOn can utilize [[Speculative Decoding]], whereas a model falling back to VisionModelKit cannot^[001-TODO__mlx-engine.md].
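The shape of an add-on can be sketched as follows (a hypothetical class, not the engine's actual interface): it converts base64 image payloads into inputs the text pipeline can consume, while leaving the KV cache and decoding machinery untouched.

```python
import base64

class PixtralVisionAddOnSketch:
    """Hypothetical shape of a vision add-on; names are illustrative."""

    def compute_embeddings(self, images_b64: list[str]) -> list[bytes]:
        # A real add-on would run a vision tower over the decoded images
        # here; this sketch only decodes the payloads to show where
        # visual inputs enter the ModelKit pipeline.
        return [base64.b64decode(img) for img in images_b64]
```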

API Usage

The system is transparent to the end-user. If a downloaded model has a supported model_type, the vision capabilities are automatically enabled^[001-TODO__mlx-engine.md].

from mlx_engine import load_model, create_generator, tokenize

# Loading Pixtral (uses PixtralVisionAddOn)
model_kit = load_model("mlx-community/pixtral-12b-4bit")

# Tokenize the text prompt before generation
prompt_tokens = tokenize(model_kit, "Describe this image.")

# Generation automatically handles images if the add-on is present
for result in create_generator(
    model_kit,
    prompt_tokens,
    images_b64=["<base64_string>"]
):
    print(result.text)

Limitations

  • Fallback Constraints: If a vision model is not explicitly mapped in the add-on system, it defaults to the generic VisionModelKit. In this mode, KV Cache Quantization and cross-prompt caching are unavailable, which may impact performance and memory usage^[001-TODO__mlx-engine.md].
  • Draft Models: Similar to the generic VisionModelKit, draft models used in speculative decoding do not support vision inputs^[001-TODO__mlx-engine.md].

Related

  • [[Apple MLX]]
  • [[KV Cache]]
  • [[Speculative Decoding]]

Sources

  • 001-TODO__mlx-engine.md