MLX Engine Architecture

MLX Engine is a modular Python framework designed for high-performance LLM inference on Apple Silicon (M-series chips). It wraps core libraries such as mlx-lm and mlx-vlm, providing a unified API for loading models, managing KV caches, and optimizing generation through techniques like speculative decoding and quantization^[001-TODO__mlx-engine.md].

The architecture is structured to maximize hardware utilization on macOS while abstracting away the complexity of underlying model implementations^[001-TODO__mlx-engine.md].

Directory Structure

The codebase is organized into distinct functional modules^[001-TODO__mlx-engine.md]:

  • mlx_engine/:
    • __init__.py: Public API entry points (e.g., load_model, create_generator).
    • generate.py: Core generation pipeline wrapper around mlx-lm.
    • cache_wrapper.py: Logic for KV cache management and cross-prompt caching.
    • processors/: Logit processors (e.g., repetition penalty, stop strings).
  • model_kit/: Contains ModelKit for text/vision models and vision_add_ons/ for specific vision model plugins (e.g., Pixtral, Gemma3).
  • vision_model_kit/: VisionModelKit wrapper for generic mlx-vlm models.
  • utils/: Optimization utilities, including speculative decoding, KV cache quantization, and prompt processing.

Dual-Path Architecture

The engine employs a dual-path initialization strategy depending on the model's capabilities, determined by the model_type in config.json^[001-TODO__mlx-engine.md].
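
In outline, the dispatch resembles the sketch below; the ADD_ON_MODEL_TYPES set, the select_path() helper, and the vision_config check are illustrative names for this note, not the engine's actual identifiers:

```python
import json
from pathlib import Path

# Hypothetical sketch of the dual-path dispatch described above.
ADD_ON_MODEL_TYPES = {"gemma3", "pixtral", "mistral3"}  # vision add-on plugins

def select_path(model_dir: str) -> str:
    """Pick ModelKit or VisionModelKit based on config.json's model_type."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type", "")
    is_vision = "vision_config" in config  # assumed marker for VLM configs

    if not is_vision or model_type in ADD_ON_MODEL_TYPES:
        return "ModelKit"        # full optimization suite
    return "VisionModelKit"      # generic mlx-vlm wrapper
```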

1. ModelKit Path

This is the high-performance path, used for text models and for vision models with specific "add-on" plugins. It enables the full suite of engine optimizations^[001-TODO__mlx-engine.md]:

  • KV Cache Quantization: Reduces memory usage via kv_bits and kv_group_size.
  • Cross-Prompt Caching: Reuses computation from previous prompts.
  • Speculative Decoding: Accelerates generation using a draft model.

Supported model_type values for this path include gemma3, pixtral, and mistral3^[001-TODO__mlx-engine.md].
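
A minimal loading sketch, assuming load_model accepts the cache options named above as keyword arguments; the exact signature lives in mlx_engine/__init__.py and may differ:

```python
from mlx_engine import load_model

# Illustrative only: these keyword names mirror the options described above
# and may not match the actual load_model signature.
model_kit = load_model(
    "models/gemma3-12b",   # a model_type on the ModelKit path
    max_kv_size=4096,      # cap on KV cache length
    kv_bits=8,             # quantize cached keys/values to 8 bits...
    kv_group_size=64,      # ...in groups of 64 elements
)
```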

2. VisionModelKit Path

Used for generic vision models, i.e., those without a specific add-on. It wraps mlx-vlm for broader compatibility but lacks the advanced optimizations of the ModelKit path^[001-TODO__mlx-engine.md]:

  • No KV cache quantization.
  • No cross-prompt caching.
  • No speculative decoding.

Key Architectural Components

CacheWrapper & Caching Strategy

The CacheWrapper is responsible for managing the Key-Value (KV) cache state^[001-TODO__mlx-engine.md]:

  • Incremental Updates: It maintains cache state across generations.
  • Common Prefix Optimization: The _find_common_prefix() method identifies shared tokens between the current and previous prompts to avoid recomputing them.
  • Chunked Prefill: The _prefill() method processes prompts in chunks (default 512 tokens), supporting progress callbacks and graceful cancellation^[001-TODO__mlx-engine.md].
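
A minimal sketch of these two ideas, with illustrative function bodies (model.process() is a hypothetical stand-in for advancing the KV cache, not an engine API):

```python
def find_common_prefix(prev_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix between the cached and incoming prompts."""
    n = min(len(prev_tokens), len(new_tokens))
    for i in range(n):
        if prev_tokens[i] != new_tokens[i]:
            return i
    return n

def prefill(model, tokens: list[int], chunk_size: int = 512, on_progress=None):
    """Feed uncached tokens to the model in fixed-size chunks."""
    total = len(tokens)
    for start in range(0, total, chunk_size):
        model.process(tokens[start:start + chunk_size])  # hypothetical call
        if on_progress is not None:
            if on_progress(min(start + chunk_size, total), total) is False:
                return  # graceful cancellation: callback returned False
```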

Speculative Decoding

This utility lets a smaller "draft" model predict tokens that are then verified in parallel by the main model^[001-TODO__mlx-engine.md]:

  • Execution: The draft model generates candidate tokens, and the main model validates them.
  • Cache Merging: The set_draft_model() function merges the draft model's cache into the main model's cache^[001-TODO__mlx-engine.md].
  • Limitation: Draft models are not supported for vision models.
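
A schematic of the general draft-and-verify technique under greedy decoding; next_token(), next_tokens(), and the accept loop are illustrative, not the engine's code in utils/:

```python
def speculative_generate(main_model, draft_model, tokens, num_draft=4, max_tokens=256):
    """Schematic greedy draft-and-verify loop (model methods are hypothetical)."""
    while len(tokens) < max_tokens:
        # 1. The cheap draft model proposes num_draft tokens autoregressively.
        proposed = []
        for _ in range(num_draft):
            proposed.append(draft_model.next_token(tokens + proposed))

        # 2. The main model scores all proposed positions in one forward pass.
        verified = main_model.next_tokens(tokens, proposed)

        # 3. Accept the longest agreeing prefix; at the first mismatch, keep
        #    the main model's token and discard the rest of the draft.
        for draft_tok, main_tok in zip(proposed, verified):
            tokens.append(main_tok)
            if main_tok != draft_tok:
                break
    return tokens
```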

KV Cache Quantization

The architecture supports dynamic quantization of the KV cache to reduce its memory footprint^[001-TODO__mlx-engine.md]:

  • Configuration: Controlled by the kv_bits and kv_group_size parameters.
  • Trigger: Applied via maybe_quantize_kv_cache() during the prefill phase.
  • Constraint: Once the cache is quantized, the max_kv_size limit is effectively ignored, because the quantized cache cannot be rotated.
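
A sketch of a quantize-on-prefill check in the spirit of maybe_quantize_kv_cache(); the is_quantized attribute and to_quantized() conversion are assumed names for illustration:

```python
def maybe_quantize_kv_cache(cache, kv_bits=None, kv_group_size=64):
    """Illustrative quantize-once check (not the engine's actual code)."""
    if kv_bits is None:
        return cache  # quantization disabled
    for i, layer_cache in enumerate(cache):
        if not getattr(layer_cache, "is_quantized", False):
            # Hypothetical conversion: pack keys/values into kv_bits-wide
            # values, scaled per group of kv_group_size elements.
            cache[i] = layer_cache.to_quantized(bits=kv_bits,
                                                group_size=kv_group_size)
    return cache
```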

Data Flow Pipeline

The generation process follows a structured pipeline from input to output^[001-TODO__mlx-engine.md] (an end-to-end usage sketch follows the list):

  1. Input: User prompt (and optionally images) plus sampling configuration (temp, top_p).
  2. Tokenization: tokenize() converts input to tokens.
  3. Prompt Processing:
    • Check cross-prompt cache for common prefixes.
    • Run prefill on unprocessed tokens via CacheWrapper.
  4. Generation: stream_generate() (via mlx-lm) yields raw token results.
  5. Post-Processing:
    • Logit Processors: Apply penalties or filters.
    • Stop Conditions: StopStringProcessor checks for custom stop strings or EOS tokens.
  6. Output: GenerationResult objects containing text, tokens, and logprobs.
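
Tying the stages together, a minimal end-to-end sketch; keyword names such as stop_strings are assumptions here, and generate.py defines the actual parameters:

```python
from mlx_engine import load_model, create_generator, tokenize

model_kit = load_model("models/gemma3-12b")

# Stages 1-2: input and tokenization.
prompt_tokens = tokenize(model_kit, "Summarize the dual-path architecture.")

# Stages 3-6 happen inside create_generator: prefix-cache lookup, chunked
# prefill, streaming generation, logit processing, and stop-string checks.
# It yields GenerationResult objects carrying text, tokens, and logprobs.
for result in create_generator(
    model_kit,
    prompt_tokens,
    stop_strings=["</answer>"],  # assumed keyword name
    temp=0.7,
    top_p=0.9,
):
    print(result.text, end="", flush=True)
```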

Supported Model Types

The architecture differentiates between standard text models and Vision-Language Models (VLMs)^[001-TODO__mlx-engine.md].

  • Text Models: Full support for all features (quantization, caching, speculative decoding).
  • Vision Models (Add-on): Models like Pixtral or Gemma3 use specific plugins in model_kit/vision_add_ons/ and support most optimizations.
  • Generic Vision Models: Handled via VisionModelKit, optimized for compatibility over raw performance.

Sources

  • 001-TODO__mlx-engine.md
  • [[LLM Inference]]
  • [[KV Cache]]
  • [[Speculative Decoding]]
  • [[Quantization]]
  • [[Apple MLX]]