# Speculative decoding in MLX Engine
Speculative decoding is an optimization technique in MLX Engine that accelerates text generation by using a smaller "draft" model to propose tokens, which a larger main model then verifies in parallel^[001-TODO__mlx-engine.md].
This feature is exclusively available for text models processed via the ModelKit architecture^[001-TODO__mlx-engine.md].
## Architecture & Compatibility
Speculative decoding relies on a dual-model setup consisting of a main model and a draft model^[001-TODO__mlx-engine.md].
- Main Model: The target model (e.g., Llama-3.1-8B) used for final inference^[001-TODO__mlx-engine.md].
- Draft Model: A significantly smaller and faster model (e.g., Qwen2.5-0.5B) responsible for proposing candidate tokens^[001-TODO__mlx-engine.md].
### Compatibility Constraints
MLX Engine enforces strict compatibility checks to ensure the draft model's vocabulary aligns with the main model^[001-TODO__mlx-engine.md].
- Mechanism: The `is_draft_model_compatible()` function verifies whether a draft model can be used with the currently loaded main model^[001-TODO__mlx-engine.md] (a sketch of such a check follows this list).
- Visual Models: Speculative decoding is not supported for visual models (e.g., VisionModelKit) or models using specific vision add-ons (like Pixtral or Gemma3)^[001-TODO__mlx-engine.md].
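For intuition only, here is a minimal sketch of the kind of vocabulary comparison such a check might perform. The `vocabs_match()` helper is hypothetical, and `get_vocab()` is assumed to return a token-to-id mapping as Hugging Face tokenizers do; the engine's actual criteria may differ.

```python
# Illustrative sketch only: the real is_draft_model_compatible() in
# mlx_engine may use different criteria. get_vocab() is assumed to
# return a token -> id mapping, as Hugging Face tokenizers do.
def vocabs_match(main_tokenizer, draft_tokenizer) -> bool:
    main_vocab = main_tokenizer.get_vocab()
    draft_vocab = draft_tokenizer.get_vocab()
    # A draft token id is only safe if the main model assigns the same
    # id to the same token string.
    return all(main_vocab.get(tok) == idx for tok, idx in draft_vocab.items())
```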
Workflow¶
The implementation involves three main steps^[001-TODO__mlx-engine.md]:
- Verification: Before loading, `is_draft_model_compatible()` checks whether the draft model's tokenizer and architecture match the main model.
- Loading: If compatible, `load_draft_model()` is called to attach the draft model to the main model instance^[001-TODO__mlx-engine.md].
- Generation (a toy sketch of this loop follows the list):
  - The draft model rapidly generates a sequence of candidate tokens (speculating ahead).
  - The main model verifies these tokens in a single pass.
  - If a token is rejected, the main model corrects it; accepted tokens are kept, effectively increasing the tokens generated per step^[001-TODO__mlx-engine.md].
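The sketch below illustrates this accept/reject loop for greedy decoding. It is a toy, not mlx_engine code: `main_model` and `draft_model` are assumed to be callables mapping a `(1, seq_len)` token array to `(1, seq_len, vocab)` logits, and `k` is the number of speculated tokens.

```python
import mlx.core as mx

# Toy sketch of one speculative decoding step with greedy decoding.
# `main_model` and `draft_model` are illustrative stand-ins, not part
# of the mlx_engine API.
def speculative_step(main_model, draft_model, tokens, k=4):
    # Draft phase: the small model proposes k tokens autoregressively.
    draft = list(tokens)
    for _ in range(k):
        logits = draft_model(mx.expand_dims(mx.array(draft), 0))
        draft.append(int(mx.argmax(logits[0, -1]).item()))
    proposed = draft[len(tokens):]

    # Verify phase: a single main-model pass scores every proposed position.
    logits = main_model(mx.expand_dims(mx.array(draft), 0))
    accepted = []
    for i, tok in enumerate(proposed):
        # The main model's prediction at position p is the token it
        # would have generated after draft[:p + 1].
        expected = int(mx.argmax(logits[0, len(tokens) + i - 1]).item())
        if tok == expected:
            accepted.append(tok)        # draft guessed correctly: keep it
        else:
            accepted.append(expected)   # first mismatch: take the correction
            break                       # and discard the rest of the draft
    else:
        # All k draft tokens accepted: the main model's last logits
        # yield one additional token for free.
        accepted.append(int(mx.argmax(logits[0, -1]).item()))
    return tokens + accepted
```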
## Implementation Details
- Cache Management: The `set_draft_model()` method is responsible for merging the draft model's KV cache with the main model's cache, ensuring the context is preserved during the hand-off^[001-TODO__mlx-engine.md].
- Acceleration Goal: By offloading sequential token generation to the smaller model, the system reduces the computational load on the large model, leading to faster inference^[001-TODO__mlx-engine.md]. A back-of-the-envelope estimate of the gain follows this list.
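As rough intuition for the speedup (an illustrative estimate, not a figure from the source): if the draft proposes k tokens per step and each is accepted independently with probability α, the expected number of tokens produced per main-model pass is (1 - α^(k+1)) / (1 - α), the standard speculative-decoding result. The snippet below evaluates this for a few illustrative values of α.

```python
# Back-of-the-envelope estimate of tokens produced per main-model pass.
# alpha (per-token acceptance rate) and k (draft length) are
# illustrative numbers, not measurements from MLX Engine.
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Counts accepted draft tokens plus the main model's own token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens/pass")
```

Any value above 1.0 means fewer large-model forward passes per generated token, which is where the acceleration comes from.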
## Usage Example
```python
from mlx_engine import load_model, load_draft_model, is_draft_model_compatible

# 1. Load the main model
model_kit = load_model("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# 2. Check compatibility and load the draft model
draft_id = "lmstudio-community/Qwen2.5-0.5B-Instruct-MLX-8bit"
if is_draft_model_compatible(model_kit, draft_id):
    load_draft_model(model_kit, draft_id)

# Subsequent generation calls will now use speculative decoding
```
## Related Concepts
- [[MLX Engine]]
- [[KV Cache]]
- [[ModelKit]]
## Sources
- 001-TODO__mlx-engine.md