Cross-prompt KV caching¶
Cross-prompt KV caching is an optimization technique used in Large Language Model (LLM) inference to improve processing efficiency for consecutive prompts. It functions by identifying and reusing the Key-Value (KV) cache entries from the common prefix shared between the current prompt and the previous prompt^[001-TODO__mlx-engine.md].
This mechanism is a core feature of the CacheWrapper component, primarily utilized within the ModelKit architecture path^[001-TODO__mlx-engine.md]. It allows the engine to incrementally update the KV cache rather than recomputing states for the entire context from scratch for every new request^[001-TODO__mlx-engine.md].
Mechanism¶
The process relies on comparing the current input sequence with the cached state of the preceding interaction^[001-TODO__mlx-engine.md].
- Common Prefix Identification: The system calculates the length of the shared prefix between the current prompt tokens and the cached KV data^[001-TODO__mlx-engine.md].
- Incremental Processing: It identifies the "unprocessed tokens"—the portion of the current prompt that extends beyond the common prefix^[001-TODO__mlx-engine.md]. Only these tokens are passed to the model for prefilling^[001-TODO__mlx-engine.md].
- State Continuity: The existing KV cache is retained and extended with the new data, preserving the computational work done for the overlapping parts^[001-TODO__mlx-engine.md].
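The steps above can be sketched as a short Python example. This is an illustrative reconstruction, not the engine's actual API: the function names (`common_prefix_length`, `tokens_to_prefill`) are hypothetical, and a real implementation would also trim or extend the underlying KV cache tensors.

```python
# Hypothetical sketch of common-prefix reuse between consecutive prompts.
# Function names are illustrative, not mlx-engine's real API.

def common_prefix_length(prev_tokens: list, new_tokens: list) -> int:
    """Count how many leading tokens the previous and current prompts share."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(prev_tokens: list, new_tokens: list):
    """Split the new prompt into (reusable prefix length, unprocessed suffix).

    Only the suffix needs to be passed to the model for prefilling; the KV
    cache entries for the shared prefix are kept as-is."""
    keep = common_prefix_length(prev_tokens, new_tokens)
    return keep, new_tokens[keep:]

prev = [1, 2, 3, 4, 5]        # tokens whose KV states are already cached
new = [1, 2, 3, 9, 10, 11]    # next prompt: shares the prefix [1, 2, 3]
keep, unprocessed = tokens_to_prefill(prev, new)
print(keep, unprocessed)      # → 3 [9, 10, 11]
```

Only the three tokens `[9, 10, 11]` would be prefilled; the cached states for the first three tokens are retained and extended, which is the "state continuity" described above.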
Features¶
- User Cancellation Support: If a user cancels a request during processing, the engine preserves the cache state computed up to the point of interruption^[001-TODO__mlx-engine.md]. Subsequent requests can then reuse this partial cache rather than starting from scratch.
- Chunked Prefilling: Unprocessed tokens are handled in chunks (e.g., `chunk_size=512`), which facilitates progress reporting and responsiveness during long prefill phases^[001-TODO__mlx-engine.md].
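Both features can be illustrated with one hedged sketch: a chunked prefill loop that checks a cancellation flag between chunks and preserves whatever cache work has been done. The names `model`, `cache`, and `cancelled` are stand-ins for the engine's real objects; only the chunk size of 512 comes from the description above.

```python
# Illustrative sketch of chunked prefill with cancellation support.
# Objects and names are hypothetical stand-ins, not mlx-engine's real API.

CHUNK_SIZE = 512

def chunked_prefill(model, cache, tokens, cancelled=lambda: False,
                    chunk_size=CHUNK_SIZE, on_progress=None):
    """Process tokens in fixed-size chunks, extending the KV cache as we go.

    If cancelled() becomes true between chunks, we stop early but keep the
    cache entries computed so far, so a later request can reuse them.
    Returns the number of tokens actually processed."""
    done = 0
    for start in range(0, len(tokens), chunk_size):
        if cancelled():
            break                         # cache for `done` tokens is preserved
        chunk = tokens[start:start + chunk_size]
        model(chunk, cache=cache)         # the model call extends the cache
        done += len(chunk)
        if on_progress:
            on_progress(done, len(tokens))  # progress report after each chunk
    return done

class _StubModel:
    """Minimal stand-in that just counts the tokens it has processed."""
    def __init__(self):
        self.seen = 0
    def __call__(self, chunk, cache=None):
        self.seen += len(chunk)

# Full run: 1200 tokens → three chunks of 512 + 512 + 176.
stub = _StubModel()
processed = chunked_prefill(stub, cache=None, tokens=list(range(1200)))
print(processed)   # → 1200

# Cancellation before the second chunk keeps the partial work.
stub2 = _StubModel()
checks = iter([False, True])              # cancel check runs before each chunk
processed2 = chunked_prefill(stub2, cache=None, tokens=list(range(1200)),
                             cancelled=lambda: next(checks))
print(processed2)  # → 512
```

The key design point is that cancellation breaks out of the loop without invalidating the cache, matching the behavior described for user cancellation above.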
Availability and Constraints¶
Support for cross-prompt caching depends on the specific model architecture and the initialization path used within the engine^[001-TODO__mlx-engine.md].
- Supported: Fully supported in the ModelKit path for text models and vision models with specific add-ons (e.g., Gemma3, Pixtral, Mistral3)^[001-TODO__mlx-engine.md].
- Unsupported: Not available for models initialized via the VisionModelKit path (generic vision models)^[001-TODO__mlx-engine.md].
Related Concepts¶
- [[KV Cache]]
- [[Speculative Decoding]]
- KV Cache Quantization
Sources¶
001-TODO__mlx-engine.md