# MLX Engine architecture
MLX Engine is a modular Python framework designed for high-performance LLM inference on Apple Silicon (M-series chips). It acts as a wrapper around core libraries such as `mlx-lm` and `mlx-vlm`, providing a unified API for loading models, managing KV caches, and optimizing generation via techniques like speculative decoding and quantization^[001-TODO__mlx-engine.md].

The architecture is structured to maximize hardware utilization on macOS while abstracting away the complexity of the underlying model implementations^[001-TODO__mlx-engine.md].
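For orientation, here is a minimal usage sketch of that API. `load_model`, `create_generator`, and `tokenize` are the entry points named in this document; the parameter names (`max_tokens`, the model path) and the exact shape of the streamed results are illustrative assumptions.

```python
# Minimal usage sketch. load_model, create_generator, and tokenize are the
# documented entry points; parameter names and the GenerationResult fields
# used here are illustrative assumptions.
from mlx_engine import load_model, create_generator, tokenize

model_kit = load_model("/path/to/mlx-model")  # directory with config.json + weights
prompt_tokens = tokenize(model_kit, "Explain KV caching in one sentence.")

# The generator streams GenerationResult objects (text, tokens, logprobs).
for result in create_generator(model_kit, prompt_tokens, max_tokens=128):
    print(result.text, end="", flush=True)
```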
## Directory Structure
The codebase is organized into distinct functional modules^[001-TODO__mlx-engine.md]:
- `mlx_engine/`
  - `__init__.py`: Public API entry points (e.g., `load_model`, `create_generator`).
  - `generate.py`: Core generation pipeline; wraps `mlx-lm`.
  - `cache_wrapper.py`: KV cache management and cross-prompt caching logic.
  - `processors/`: Logit processors (e.g., repetition penalty, stop strings).
  - `model_kit/`: Contains `ModelKit` for text/vision models and `vision_add_ons/` for specific vision model plugins (e.g., Pixtral, Gemma3).
  - `vision_model_kit/`: `VisionModelKit` wrapper for generic `mlx-vlm` models.
  - `utils/`: Optimization utilities, including speculative decoding, KV cache quantization, and prompt processing.
## Dual-Path Architecture
The engine employs a dual-path initialization strategy depending on the model's capabilities, determined by the `model_type` field in `config.json`^[001-TODO__mlx-engine.md].
### 1. ModelKit Path
This is the high-performance path used for text models and vision models with specific "add-on" plugins. It enables the full suite of engine optimizations^[001-TODO__mlx-engine.md]:
* KV Cache Quantization: Reduces memory usage via `kv_bits` and `kv_group_size`.
* Cross-Prompt Caching: Reuses computation from previous prompts.
* Speculative Decoding: Accelerates generation using a draft model.
Supported `model_type` values for this path include `gemma3`, `pixtral`, and `mistral3`^[001-TODO__mlx-engine.md].
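To make the dispatch concrete, here is a sketch of how the branching could look, assuming a hypothetical `choose_path()` helper. The `model_type` values come from the source; the vision-detection heuristic is an assumption.

```python
import json
from pathlib import Path

# model_type values with dedicated vision add-ons (from the source).
ADD_ON_MODEL_TYPES = {"gemma3", "pixtral", "mistral3"}

def choose_path(model_path: str) -> str:
    """Hypothetical dispatch between the two initialization paths."""
    config = json.loads((Path(model_path) / "config.json").read_text())
    model_type = config.get("model_type", "")
    is_vision = "vision_config" in config  # assumed heuristic for VLMs

    if not is_vision or model_type in ADD_ON_MODEL_TYPES:
        # ModelKit path: KV cache quantization, cross-prompt caching,
        # and speculative decoding are all available.
        return "ModelKit"
    # VisionModelKit path: generic mlx-vlm wrapper, compatibility over speed.
    return "VisionModelKit"
```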
### 2. VisionModelKit Path
This path is used for generic vision models, i.e., those without a dedicated add-on plugin. It wraps `mlx-vlm` for broader compatibility but lacks the advanced optimizations of the `ModelKit` path^[001-TODO__mlx-engine.md]:
* No KV cache quantization.
* No cross-prompt caching.
* No speculative decoding.
## Key Architectural Components

### CacheWrapper & Caching Strategy
The `CacheWrapper` manages the Key-Value (KV) cache state^[001-TODO__mlx-engine.md].
* Incremental Updates: It maintains cache state across generations, so later prompts can reuse earlier computation.
* Common Prefix Optimization: The `_find_common_prefix()` method identifies the tokens shared between the current and previous prompts to avoid recomputing them.
* Chunked Prefill: The `_prefill()` method processes prompts in chunks (512 tokens by default), supporting progress callbacks and graceful cancellation^[001-TODO__mlx-engine.md].
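A simplified sketch of these two mechanisms follows. Only the method names `_find_common_prefix()` and `_prefill()` and the 512-token default come from the source; the bodies, including the mlx-lm-style `model(inputs, cache=...)` call, are assumptions.

```python
import mlx.core as mx

# Simplified CacheWrapper sketch; method names and the 512-token default
# come from the source, the implementations below are assumed.
class CacheWrapperSketch:
    def __init__(self, cache, chunk_size: int = 512):
        self.cache = cache                  # per-layer KV caches (mlx-lm style)
        self.chunk_size = chunk_size
        self.cached_tokens: list[int] = []  # tokens already in the cache

    def _find_common_prefix(self, prompt_tokens: list[int]) -> int:
        """Count leading tokens shared with the previously cached prompt."""
        n = 0
        for old, new in zip(self.cached_tokens, prompt_tokens):
            if old != new:
                break
            n += 1
        return n

    def _prefill(self, model, tokens: list[int], progress_callback=None) -> bool:
        """Run uncached tokens through the model in chunks to populate the cache."""
        for start in range(0, len(tokens), self.chunk_size):
            chunk = tokens[start:start + self.chunk_size]
            logits = model(mx.array(chunk)[None], cache=self.cache)
            mx.eval(logits)                 # force computation of this chunk now
            if progress_callback is not None:
                if progress_callback((start + len(chunk)) / len(tokens)) is False:
                    return False            # graceful mid-prefill cancellation
        self.cached_tokens.extend(tokens)
        return True
```

In use, the wrapper would call `_find_common_prefix()` first and pass only the suffix of tokens beyond the shared prefix to `_prefill()`.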
### Speculative Decoding
This utility uses a smaller "draft" model to propose tokens that are then verified in parallel by the main model^[001-TODO__mlx-engine.md].
* Execution: The draft model generates candidate tokens; the main model validates them in a single pass.
* Cache Merging: The `set_draft_model()` function merges the draft model's cache into the main model's cache^[001-TODO__mlx-engine.md].
* Limitation: Draft models are not supported for vision models.
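The draft-and-verify idea can be sketched as follows. This is a schematic greedy variant written for clarity (no KV cache, fixed draft length), not the engine's implementation:

```python
import mlx.core as mx

def speculative_step(main_model, draft_model, tokens: list[int], num_draft: int = 4):
    """One schematic speculative-decoding step: draft, then verify in parallel."""
    # 1. Draft: the small model proposes candidates autoregressively.
    draft, context = [], list(tokens)
    for _ in range(num_draft):
        logits = draft_model(mx.array(context)[None])
        tok = mx.argmax(logits[0, -1]).item()
        draft.append(tok)
        context.append(tok)

    # 2. Verify: a single main-model forward pass scores all candidates at once.
    logits = main_model(mx.array(tokens + draft)[None])
    accepted = []
    for i, tok in enumerate(draft):
        # The main model's prediction for this position comes from the
        # logits at the preceding position.
        pred = mx.argmax(logits[0, len(tokens) + i - 1]).item()
        if pred != tok:
            accepted.append(pred)   # take the main model's token and stop
            break
        accepted.append(tok)
    return accepted                 # at least one token is always produced
```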
### KV Cache Quantization
The architecture supports dynamic quantization of the KV cache to reduce its memory footprint^[001-TODO__mlx-engine.md].
* Configuration: Controlled by the `kv_bits` and `kv_group_size` parameters.
* Trigger: Applied via `maybe_quantize_kv_cache()` during the prefill phase.
* Constraint: Once quantized, the cache can no longer be rotated, so the `max_kv_size` limit is effectively ignored.
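A sketch of what this trigger could look like: the function name and the `kv_bits`/`kv_group_size` knobs come from the source, the `to_quantized()` conversion mirrors mlx-lm's cache API, and the offset threshold is an illustrative assumption.

```python
from mlx_lm.models.cache import KVCache

def maybe_quantize_kv_cache(cache, kv_bits=None, kv_group_size=64, start=0):
    """In place, swap full-precision layer caches for quantized ones once
    the cache has grown past `start` tokens. Threshold logic is assumed."""
    if kv_bits is None:              # quantization not requested
        return
    for i, layer_cache in enumerate(cache):
        if isinstance(layer_cache, KVCache) and layer_cache.offset > start:
            cache[i] = layer_cache.to_quantized(
                group_size=kv_group_size, bits=kv_bits
            )
```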
## Data Flow Pipeline

The generation process follows a structured pipeline from input to output^[001-TODO__mlx-engine.md]:

1. Input: User prompt and/or images, plus sampling configuration (e.g., `temp`, `top_p`).
2. Tokenization: `tokenize()` converts the input into tokens.
3. Prompt Processing:
   - Check the cross-prompt cache for a common prefix.
   - Run prefill on the unprocessed tokens via `CacheWrapper`.
4. Generation: `stream_generate()` (via `mlx-lm`) yields raw token results.
5. Post-Processing:
   - Logit Processors: Apply penalties or filters.
   - Stop Conditions: `StopStringProcessor` checks for custom stop strings or EOS tokens.
6. Output: `GenerationResult` objects containing text, tokens, and logprobs.
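The subtle part of step 5 is that a stop string can arrive split across several streamed tokens. The stand-in class below (the real `StopStringProcessor` interface is not shown in the source) illustrates the buffering this requires:

```python
# Stand-in for StopStringProcessor; demonstrates holding back streamed text
# that might be the beginning of a stop string.
class StopStringSketch:
    def __init__(self, stop_strings: list[str]):
        self.stop_strings = stop_strings
        self.buffer = ""

    def process(self, fragment: str) -> tuple[str, bool]:
        """Return (text_safe_to_emit, stopped)."""
        self.buffer += fragment
        for stop in self.stop_strings:
            idx = self.buffer.find(stop)
            if idx != -1:
                return self.buffer[:idx], True   # emit text before the stop string
        # Hold back a tail that could still grow into a stop string.
        hold = max((len(s) - 1 for s in self.stop_strings), default=0)
        cut = max(len(self.buffer) - hold, 0)
        emit, self.buffer = self.buffer[:cut], self.buffer[cut:]
        return emit, False

proc = StopStringSketch(["</answer>"])
for fragment in ["Hello </an", "swer> trailing text"]:
    text, stopped = proc.process(fragment)
    print(text, end="")     # prints "Hello " and stops at the stop string
    if stopped:
        break
```

A real implementation would also flush the held-back buffer if the stream ends without a match.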
## Supported Model Types

The architecture differentiates between standard text models and Vision-Language Models (VLMs)^[001-TODO__mlx-engine.md].

- Text Models: Full support for all features (quantization, caching, speculative decoding).
- Vision Models (Add-on): Models like Pixtral or Gemma3 use dedicated plugins in `model_kit/vision_add_ons/` and support most optimizations.
- Generic Vision Models: Handled via `VisionModelKit`, optimized for compatibility over raw performance.
## Sources
001-TODO__mlx-engine.md
## Related Concepts
- [[LLM Inference]]
- [[KV Cache]]
- [[Speculative Decoding]]
- [[Quantization]]
- [[Apple MLX]]