Vision add-on system for MLX Engine

The Vision add-on system is a modular plugin architecture within MLX Engine designed to extend standard text model pipelines to support Vision-Language Models (VLMs)^[001-TODO__mlx-engine.md]. It allows specific models to leverage high-performance engine features like KV caching and speculative decoding, while maintaining a fallback path for generic vision models.

This system resolves the architectural conflict between the highly optimized ModelKit (designed for text-only models) and the diverse requirements of visual inputs^[001-TODO__mlx-engine.md].

Architecture

The system operates by mapping specific model identifiers (model_type in config.json) to dedicated handler classes known as Vision Add-ons^[001-TODO__mlx-engine.md].

ModelKit vs. VisionModelKit

MLX Engine selects one of two execution paths based on the presence of a supported add-on^[001-TODO__mlx-engine.md]:

  • ModelKit + Add-on: Used when the model_type matches a key in the VISION_ADD_ON_MAP. This path integrates the add-on into the main engine.
    • Supports: KV cache quantization, cross-prompt caching, and speculative decoding^[001-TODO__mlx-engine.md].
  • VisionModelKit: A generic wrapper around mlx-vlm used when no specific add-on exists.
    • Restrictions: Does not support KV cache quantization, cross-prompt caching, or speculative decoding^[001-TODO__mlx-engine.md].
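The two paths and their feature sets can be summarized in code form (a summary of the comparison above, not an engine API; the dictionary name is illustrative):

```python
# Feature support by execution path, as described above.
FEATURE_SUPPORT = {
    "ModelKit+AddOn": {         # model_type found in VISION_ADD_ON_MAP
        "kv_cache_quantization": True,
        "cross_prompt_caching": True,
        "speculative_decoding": True,
    },
    "VisionModelKit": {         # generic mlx-vlm fallback
        "kv_cache_quantization": False,
        "cross_prompt_caching": False,
        "speculative_decoding": False,
    },
}
```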

Supported Add-ons

The system currently includes specific add-ons for the following model architectures^[001-TODO__mlx-engine.md]:

model_type    Add-on Class
----------    -------------------
gemma3        Gemma3VisionAddOn
gemma3n       Gemma3nVisionAddOn
pixtral       PixtralVisionAddOn
mistral3      Mistral3VisionAddOn
lfm2-vl       LFM2VisionAddOn

Note: The qwen2_vl series (Qwen2-VL, Qwen2.5-VL) is currently disabled due to a known porting bug (Issue #237)^[001-TODO__mlx-engine.md].
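A minimal sketch of the dispatch this table drives, assuming VISION_ADD_ON_MAP is a dictionary keyed by model_type (the function name `select_kit` is illustrative; the qwen2_vl entries are omitted per Issue #237):

```python
# Map of supported model_type values to their add-on classes,
# mirroring the table above (class names shown as strings for brevity).
VISION_ADD_ON_MAP = {
    "gemma3": "Gemma3VisionAddOn",
    "gemma3n": "Gemma3nVisionAddOn",
    "pixtral": "PixtralVisionAddOn",
    "mistral3": "Mistral3VisionAddOn",
    "lfm2-vl": "LFM2VisionAddOn",
}

def select_kit(model_type: str) -> str:
    """Pick the execution path for a given config.json model_type."""
    if model_type in VISION_ADD_ON_MAP:
        return "ModelKit"        # add-on attached; full engine features retained
    return "VisionModelKit"      # generic mlx-vlm fallback; reduced feature set
```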

Functionality

The primary role of a Vision Add-on is to inject visual processing capabilities into the ModelKit pipeline^[001-TODO__mlx-engine.md].

  • Visual Input Processing: When a model is loaded via load_model() with a matching model_type, the corresponding add-on is attached. This enables the pipeline to accept images_b64 (base64-encoded images) during generation^[001-TODO__mlx-engine.md].
  • Feature Preservation: By using a dedicated add-on, the model retains access to engine-level optimizations. For example, a model using the PixtralVisionAddOn can utilize [[Speculative Decoding]], whereas a model falling back to VisionModelKit cannot^[001-TODO__mlx-engine.md].
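The shape of an add-on can be sketched as follows (a hypothetical class, not the engine's actual interface): it converts base64 image payloads into inputs the text pipeline can consume, while leaving the KV cache and decoding machinery untouched.

```python
import base64

class PixtralVisionAddOnSketch:
    """Hypothetical shape of a vision add-on; names are illustrative."""

    def compute_embeddings(self, images_b64: list[str]) -> list[bytes]:
        # A real add-on would run a vision tower over the decoded images
        # here; this sketch only decodes the payloads to show where
        # visual inputs enter the ModelKit pipeline.
        return [base64.b64decode(img) for img in images_b64]
```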

API Usage

The system is transparent to the end-user. If a downloaded model has a supported model_type, the vision capabilities are automatically enabled^[001-TODO__mlx-engine.md].

from mlx_engine import load_model, create_generator, tokenize

# Loading Pixtral (uses PixtralVisionAddOn)
model_kit = load_model("mlx-community/pixtral-12b-4bit")

# Tokenize the text prompt before generation
prompt_tokens = tokenize(model_kit, "Describe this image.")

# Generation automatically handles images if the add-on is present
for result in create_generator(
    model_kit,
    prompt_tokens,
    images_b64=["<base64_string>"]
):
    print(result.text)

Limitations

  • Fallback Constraints: If a vision model is not explicitly mapped in the add-on system, it defaults to the generic VisionModelKit. In this mode, KV Cache Quantization and cross-prompt caching are unavailable, which may impact performance and memory usage^[001-TODO__mlx-engine.md].
  • Draft Models: Similar to the generic VisionModelKit, draft models used in speculative decoding do not support vision inputs^[001-TODO__mlx-engine.md].

Related

  • [[Apple MLX]]
  • [[KV Cache]]
  • [[Speculative Decoding]]

Sources

  • 001-TODO__mlx-engine.md