MLX Engine Architecture

MLX Engine is a modular Python framework designed for high-performance LLM inference on Apple Silicon (M-series chips). It wraps core libraries such as mlx-lm and mlx-vlm, providing a unified API for loading models, managing KV caches, and optimizing generation through techniques like speculative decoding and quantization^[001-TODO__mlx-engine.md].

The architecture is structured to maximize hardware utilization on macOS while abstracting away the complexity of underlying model implementations^[001-TODO__mlx-engine.md].

Directory Structure

The codebase is organized into distinct functional modules^[001-TODO__mlx-engine.md]:

  • mlx_engine/:
    • __init__.py: Public API entry points (e.g., load_model, create_generator).
    • generate.py: Core generation pipeline wrapper around mlx-lm.
    • cache_wrapper.py: Logic for KV cache management and cross-prompt caching.
    • processors/: Logit processors (e.g., repetition penalty, stop strings).
  • model_kit/: Contains ModelKit for text/vision models and vision_add_ons/ for specific vision model plugins (e.g., Pixtral, Gemma3).
  • vision_model_kit/: VisionModelKit wrapper for generic mlx-vlm models.
  • utils/: Optimization utilities, including speculative decoding, KV cache quantization, and prompt processing.

Dual-Path Architecture

The engine employs a dual-path initialization strategy depending on the model's capabilities, determined by the model_type in config.json^[001-TODO__mlx-engine.md].
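
In outline, the dispatch resembles the sketch below; the ADD_ON_MODEL_TYPES set, the select_path() helper, and the vision_config check are illustrative names for this note, not the engine's actual identifiers:

```python
import json
from pathlib import Path

# Hypothetical sketch of the dual-path dispatch described above.
ADD_ON_MODEL_TYPES = {"gemma3", "pixtral", "mistral3"}  # vision add-on plugins

def select_path(model_dir: str) -> str:
    """Pick ModelKit or VisionModelKit based on config.json's model_type."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type", "")
    is_vision = "vision_config" in config  # assumed marker for VLM configs

    if not is_vision or model_type in ADD_ON_MODEL_TYPES:
        return "ModelKit"        # full optimization suite
    return "VisionModelKit"      # generic mlx-vlm wrapper
```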

1. ModelKit Path

This is the high-performance path, used for text models and for vision models with specific "add-on" plugins. It enables the full suite of engine optimizations^[001-TODO__mlx-engine.md]:

  • KV Cache Quantization: Reduces memory usage via kv_bits and kv_group_size.
  • Cross-Prompt Caching: Reuses computation from previous prompts.
  • Speculative Decoding: Accelerates generation using a draft model.

Supported model_type values for this path include gemma3, pixtral, and mistral3^[001-TODO__mlx-engine.md].
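
A minimal loading sketch, assuming load_model accepts the cache options named above as keyword arguments; the exact signature lives in mlx_engine/__init__.py and may differ:

```python
from mlx_engine import load_model

# Illustrative only: these keyword names mirror the options described above
# and may not match the actual load_model signature.
model_kit = load_model(
    "models/gemma3-12b",   # a model_type on the ModelKit path
    max_kv_size=4096,      # cap on KV cache length
    kv_bits=8,             # quantize cached keys/values to 8 bits...
    kv_group_size=64,      # ...in groups of 64 elements
)
```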

2. VisionModelKit Path

Used for generic vision models, i.e., those without a specific add-on. It wraps mlx-vlm for broader compatibility but lacks the advanced optimizations of the ModelKit path^[001-TODO__mlx-engine.md]:

  • No KV cache quantization.
  • No cross-prompt caching.
  • No speculative decoding.

Key Architectural Components

CacheWrapper & Caching Strategy

The CacheWrapper is responsible for managing the Key-Value (KV) cache state^[001-TODO__mlx-engine.md]:

  • Incremental Updates: It maintains cache state across generations.
  • Common Prefix Optimization: The _find_common_prefix() method identifies shared tokens between the current and previous prompts to avoid recomputing them.
  • Chunked Prefill: The _prefill() method processes prompts in chunks (default 512 tokens), supporting progress callbacks and graceful cancellation^[001-TODO__mlx-engine.md].
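
A minimal sketch of these two ideas, with illustrative function bodies (model.process() is a hypothetical stand-in for advancing the KV cache, not an engine API):

```python
def find_common_prefix(prev_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix between the cached and incoming prompts."""
    n = min(len(prev_tokens), len(new_tokens))
    for i in range(n):
        if prev_tokens[i] != new_tokens[i]:
            return i
    return n

def prefill(model, tokens: list[int], chunk_size: int = 512, on_progress=None):
    """Feed uncached tokens to the model in fixed-size chunks."""
    total = len(tokens)
    for start in range(0, total, chunk_size):
        model.process(tokens[start:start + chunk_size])  # hypothetical call
        if on_progress is not None:
            if on_progress(min(start + chunk_size, total), total) is False:
                return  # graceful cancellation: callback returned False
```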

Speculative Decoding

This utility lets a smaller "draft" model predict tokens that are then verified in parallel by the main model^[001-TODO__mlx-engine.md]:

  • Execution: The draft model generates candidate tokens, and the main model validates them.
  • Cache Merging: The set_draft_model() function merges the draft model's cache into the main model's cache^[001-TODO__mlx-engine.md].
  • Limitation: Draft models are not supported for vision models.
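
A schematic of the general draft-and-verify technique under greedy decoding; next_token(), next_tokens(), and the accept loop are illustrative, not the engine's code in utils/:

```python
def speculative_generate(main_model, draft_model, tokens, num_draft=4, max_tokens=256):
    """Schematic greedy draft-and-verify loop (model methods are hypothetical)."""
    while len(tokens) < max_tokens:
        # 1. The cheap draft model proposes num_draft tokens autoregressively.
        proposed = []
        for _ in range(num_draft):
            proposed.append(draft_model.next_token(tokens + proposed))

        # 2. The main model scores all proposed positions in one forward pass.
        verified = main_model.next_tokens(tokens, proposed)

        # 3. Accept the longest agreeing prefix; at the first mismatch, keep
        #    the main model's token and discard the rest of the draft.
        for draft_tok, main_tok in zip(proposed, verified):
            tokens.append(main_tok)
            if main_tok != draft_tok:
                break
    return tokens
```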

KV Cache Quantization

The architecture supports dynamic quantization of the KV cache to reduce its memory footprint^[001-TODO__mlx-engine.md]:

  • Configuration: Controlled by the kv_bits and kv_group_size parameters.
  • Trigger: Applied via maybe_quantize_kv_cache() during the prefill phase.
  • Constraint: Once the cache is quantized, the max_kv_size limit is effectively ignored, because the quantized cache cannot be rotated.
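
A sketch of a quantize-on-prefill check in the spirit of maybe_quantize_kv_cache(); the is_quantized attribute and to_quantized() conversion are assumed names for illustration:

```python
def maybe_quantize_kv_cache(cache, kv_bits=None, kv_group_size=64):
    """Illustrative quantize-once check (not the engine's actual code)."""
    if kv_bits is None:
        return cache  # quantization disabled
    for i, layer_cache in enumerate(cache):
        if not getattr(layer_cache, "is_quantized", False):
            # Hypothetical conversion: pack keys/values into kv_bits-wide
            # values, scaled per group of kv_group_size elements.
            cache[i] = layer_cache.to_quantized(bits=kv_bits,
                                                group_size=kv_group_size)
    return cache
```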

Data Flow Pipeline

The generation process follows a structured pipeline from input to output^[001-TODO__mlx-engine.md] (an end-to-end usage sketch follows the list):

  1. Input: User prompt (and optionally images) plus sampling configuration (temp, top_p).
  2. Tokenization: tokenize() converts input to tokens.
  3. Prompt Processing:
    • Check cross-prompt cache for common prefixes.
    • Run prefill on unprocessed tokens via CacheWrapper.
  4. Generation: stream_generate() (via mlx-lm) yields raw token results.
  5. Post-Processing:
    • Logit Processors: Apply penalties or filters.
    • Stop Conditions: StopStringProcessor checks for custom stop strings or EOS tokens.
  6. Output: GenerationResult objects containing text, tokens, and logprobs.
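
Tying the stages together, a minimal end-to-end sketch; keyword names such as stop_strings are assumptions here, and generate.py defines the actual parameters:

```python
from mlx_engine import load_model, create_generator, tokenize

model_kit = load_model("models/gemma3-12b")

# Stages 1-2: input and tokenization.
prompt_tokens = tokenize(model_kit, "Summarize the dual-path architecture.")

# Stages 3-6 happen inside create_generator: prefix-cache lookup, chunked
# prefill, streaming generation, logit processing, and stop-string checks.
# It yields GenerationResult objects carrying text, tokens, and logprobs.
for result in create_generator(
    model_kit,
    prompt_tokens,
    stop_strings=["</answer>"],  # assumed keyword name
    temp=0.7,
    top_p=0.9,
):
    print(result.text, end="", flush=True)
```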

Supported Model Types

The architecture differentiates between standard text models and Vision-Language Models (VLMs)^[001-TODO__mlx-engine.md].

  • Text Models: Full support for all features (quantization, caching, speculative decoding).
  • Vision Models (Add-on): Models like Pixtral or Gemma3 use specific plugins in model_kit/vision_add_ons/ and support most optimizations.
  • Generic Vision Models: Handled via VisionModelKit, optimized for compatibility over raw performance.

Sources

  • 001-TODO__mlx-engine.md
  • [[LLM Inference]]
  • [[KV Cache]]
  • [[Speculative Decoding]]
  • [[Quantization]]
  • [[Apple MLX]]