KV cache quantization

KV cache quantization is a memory optimization technique used in Large Language Model (LLM) inference. It reduces the memory footprint of the Key-Value (KV) cache by storing cached vectors in a lower-precision numerical format (e.g., 4-bit or 8-bit integers) rather than full precision (typically 16-bit floating point)^[001-TODO__mlx-engine.md].

This process is applied during the prefill phase of the generation cycle^[001-TODO__mlx-engine.md].

Function and Benefits

The primary purpose of KV cache quantization is to lower the VRAM (Video RAM) consumed by the KV cache, allowing larger context windows or larger batch sizes on hardware with limited memory^[001-TODO__mlx-engine.md]. By reducing the number of bits allocated to each KV cache entry (kv_bits), the system trades a marginal amount of precision for significant memory savings.
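The scale of those savings is easy to estimate. The sketch below uses illustrative model dimensions (not any specific model) and ignores the small per-group scale/offset overhead that quantized formats add:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> int:
    """Bytes for keys + values across all layers at the given bit width.

    Ignores the per-group scale/offset metadata a real quantized
    cache stores alongside the packed integers.
    """
    # Factor of 2: one key vector and one value vector per position.
    entries = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return entries * bits // 8

# Illustrative: 32 layers, 8 KV heads, head_dim 128, 32k-token context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)
int4 = kv_cache_bytes(32, 8, 128, 32_768, 4)
print(fp16 // 2**20, "MiB at 16-bit")  # 4096 MiB
print(int4 // 2**20, "MiB at 4-bit")   # 1024 MiB
```

Dropping from 16-bit floats to 4-bit integers cuts the cache to a quarter of its size, before accounting for the modest metadata overhead.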

Configuration

In mlx-engine, this feature is configured via parameters passed during the model loading phase^[001-TODO__mlx-engine.md].

  • kv_bits: Specifies the number of bits to use for quantization (e.g., 4 or 8). Setting this to None disables quantization^[001-TODO__mlx-engine.md].
  • kv_group_size: Defines the group size used for the quantization algorithm^[001-TODO__mlx-engine.md].
  • quantized_kv_start: Determines the specific step (token count) at which quantization should begin^[001-TODO__mlx-engine.md].
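How these parameters interact can be sketched as a simple predicate. The function below is an illustrative assumption, not mlx-engine's actual API; only the parameter names mirror the real config options:

```python
def should_quantize(step: int, kv_bits, quantized_kv_start: int) -> bool:
    """Hypothetical gate: quantize the cache only if quantization is
    enabled (kv_bits is not None) and the cached token count has
    reached the configured starting step."""
    return kv_bits is not None and step >= quantized_kv_start

# kv_bits=None disables quantization regardless of the other settings:
print(should_quantize(1000, None, 0))   # False
# With kv_bits=4, quantization begins once the threshold is reached:
print(should_quantize(99, 4, 100))      # False
print(should_quantize(100, 4, 100))     # True
```

Deferring quantization via quantized_kv_start lets the earliest tokens stay at full precision, which is a common way to limit quality loss on short prompts.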

Implementation Details

  • Quantization Method: The quantization is applied using the maybe_quantize_kv_cache() utility^[001-TODO__mlx-engine.md].
  • Context Window Constraint: When KV cache quantization is enabled, the max_kv_size parameter (which normally caps the number of cached tokens) is ignored^[001-TODO__mlx-engine.md].
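Conceptually, the quantization scheme behind parameters like kv_bits and kv_group_size is group-wise affine quantization: each run of kv_group_size values shares one scale and offset. The pure-Python illustration below shows that scheme in miniature; it is not the MLX implementation, which operates on arrays:

```python
def quantize_group(values: list[float], bits: int):
    """Map one group of floats to ints in [0, 2**bits - 1],
    returning the ints plus the (scale, zero_point) needed to invert."""
    lo, hi = min(values), max(values)
    levels = 2**bits - 1
    scale = (hi - lo) / levels or 1.0  # avoid 0 for constant groups
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_group(q: list[int], scale: float, zero: float) -> list[float]:
    """Approximate reconstruction of the original group."""
    return [zero + qi * scale for qi in q]

# One 4-value "group": reconstruction error is bounded by scale / 2.
group = [0.0, 0.5, 1.0, -1.0]
q, scale, zero = quantize_group(group, bits=4)
approx = dequantize_group(q, scale, zero)
print(max(abs(a - b) for a, b in zip(group, approx)) <= scale / 2 + 1e-9)
```

Smaller groups track local value ranges more tightly (lower error) at the cost of storing more scale/offset metadata; kv_group_size tunes that trade-off.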

Model Support

KV cache quantization is supported primarily in standard text-based models and vision models that utilize the ModelKit architecture (such as Pixtral)^[001-TODO__mlx-engine.md].

It is not supported for generic vision models handled by VisionModelKit, which wraps mlx-vlm^[001-TODO__mlx-engine.md].

Related

  • [[LLM Inference]]
  • [[Speculative Decoding]]
  • [[Cross-prompt Caching]]
  • [[Prefill Phase]]

Sources

  • 001-TODO__mlx-engine.md