KV cache quantization
KV cache quantization is a memory optimization technique used in Large Language Model (LLM) inference. It reduces the memory footprint of the Key-Value (KV) cache by storing cached vectors in a lower-precision numerical format (e.g., 4-bit or 8-bit integers) rather than full precision (typically 16-bit floating point)^[001-TODO__mlx-engine.md].
This process is applied during the prefill phase of the generation cycle^[001-TODO__mlx-engine.md].
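As a minimal illustration of the underlying idea, MLX exposes generic quantization primitives (`mx.quantize` / `mx.dequantize`) that pack groups of floating-point values into low-bit integers plus a per-group scale and bias. The sketch below applies them to a dummy key tensor; the tensor shape and group size are illustrative, not drawn from the source.

```python
# Illustrative only: mlx-engine's KV cache machinery wraps this kind of
# operation internally; the shape here stands in for one layer/head's keys.
import mlx.core as mx

keys = mx.random.normal(shape=(1024, 128))  # (cached tokens, head_dim)

# Pack each group of 64 values into 4-bit integers plus a per-group
# scale and bias, shrinking storage roughly 4x.
q, scales, biases = mx.quantize(keys, group_size=64, bits=4)

# Dequantize to recover an approximation of the original tensor.
approx = mx.dequantize(q, scales, biases, group_size=64, bits=4)
print(mx.abs(keys - approx).max())  # small quantization error
```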
Function and Benefits
The primary purpose of KV cache quantization is to lower the VRAM (Video RAM) consumption of the model, allowing for larger context windows or larger batch sizes on hardware with limited memory^[001-TODO__mlx-engine.md]. By reducing the number of bits allocated to each KV cache entry (`kv_bits`), the system trades a marginal amount of precision for significant memory savings.
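A back-of-the-envelope calculation makes the trade-off concrete. The model dimensions below and the assumption of an fp16 scale and bias per quantization group are illustrative, not taken from the source:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits, group_size=None):
    """Approximate KV cache footprint per cached token, in bytes."""
    elements = 2 * n_layers * n_kv_heads * head_dim  # keys + values
    bits_per_element = bits
    if group_size is not None:
        # Assume an fp16 scale and fp16 bias stored per group of values.
        bits_per_element += 32 / group_size
    return elements * bits_per_element / 8

# Hypothetical Llama-style dimensions: 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(32, 8, 128, bits=16)
q4 = kv_bytes_per_token(32, 8, 128, bits=4, group_size=64)
print(f"fp16: {fp16 / 1024:.0f} KiB/token, 4-bit: {q4 / 1024:.0f} KiB/token")
# fp16: 128 KiB/token, 4-bit: 36 KiB/token -> roughly 3.6x smaller
```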
Configuration
In mlx-engine, this feature is configured via parameters passed during the model loading phase^[001-TODO__mlx-engine.md]. A usage sketch follows the parameter list below.
- `kv_bits`: Specifies the number of bits to use for quantization (e.g., 4 or 8). Setting this to `None` disables quantization^[001-TODO__mlx-engine.md].
- `kv_group_size`: Defines the group size used for the quantization algorithm^[001-TODO__mlx-engine.md].
- `quantized_kv_start`: Determines the specific step (token count) at which quantization should begin^[001-TODO__mlx-engine.md].
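A hedged sketch of what passing these parameters at load time might look like; the import path and the exact `load_model` signature are assumptions based on the description above, not confirmed by the source:

```python
# Illustrative only: the parameter names match the description above, but
# the import path and signature should be checked against the mlx-engine repo.
from mlx_engine.generate import load_model

model_kit = load_model(
    model_path="/path/to/model",  # hypothetical local model directory
    kv_bits=4,                    # store KV cache entries as 4-bit integers
    kv_group_size=64,             # elements per quantization group
    quantized_kv_start=1024,      # begin quantizing after 1024 cached tokens
)
```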
Implementation Details
- Quantization Method: Quantization is applied using the `maybe_quantize_kv_cache()` utility^[001-TODO__mlx-engine.md]; a sketch of its logic follows this list.
- Context Window Constraint: When KV cache quantization is enabled, the `max_kv_size` parameter (which normally defines the fixed context window size for non-rotating caches) is ignored^[001-TODO__mlx-engine.md].
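The general shape of such a helper, modeled on mlx-lm's cache utilities, is roughly the following. This is an illustrative sketch rather than the exact source code; the `KVCache` class and its `to_quantized()` method follow mlx-lm conventions but are assumptions here:

```python
# Illustrative sketch of the kind of check such a utility performs; class
# and method names are assumptions, not verbatim mlx-engine source.
from mlx_lm.models.cache import KVCache

def maybe_quantize(prompt_cache, kv_bits, kv_group_size, quantized_kv_start):
    # No-op when quantization is disabled or the cache has not yet grown
    # past the configured starting offset.
    if kv_bits is None or prompt_cache[0].offset <= quantized_kv_start:
        return
    # Swap each full-precision per-layer cache for a quantized equivalent.
    for i, layer_cache in enumerate(prompt_cache):
        if isinstance(layer_cache, KVCache):
            prompt_cache[i] = layer_cache.to_quantized(
                group_size=kv_group_size, bits=kv_bits
            )
```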
Model Support
KV cache quantization is supported in standard text models and in vision models that use the ModelKit architecture (such as Pixtral)^[001-TODO__mlx-engine.md].
It is not supported for generic vision models handled by VisionModelKit, which wraps mlx-vlm^[001-TODO__mlx-engine.md].
Related Concepts
- [[LLM Inference]]
- [[Speculative Decoding]]
- [[Cross-prompt Caching]]
- [[Prefill Phase]]
Sources
001-TODO__mlx-engine.md