Cross-prompt KV caching¶
Cross-prompt KV caching is an optimization technique used in Large Language Model (LLM) inference to improve processing efficiency for consecutive prompts. It functions by identifying and reusing the Key-Value (KV) cache entries from the common prefix shared between the current prompt and the previous prompt^[001-TODO__mlx-engine.md].
This mechanism is a core feature of the CacheWrapper component, primarily utilized within the ModelKit architecture path^[001-TODO__mlx-engine.md]. It allows the engine to incrementally update the KV cache rather than recomputing states for the entire context from scratch for every new request^[001-TODO__mlx-engine.md].
Mechanism¶
The process relies on comparing the current input sequence with the cached state of the preceding interaction^[001-TODO__mlx-engine.md].
- Common Prefix Identification: The system calculates the length of the shared prefix between the current prompt tokens and the cached KV data^[001-TODO__mlx-engine.md].
- Incremental Processing: It identifies the "unprocessed tokens"—the portion of the current prompt that extends beyond the common prefix^[001-TODO__mlx-engine.md]. Only these tokens are passed to the model for prefilling^[001-TODO__mlx-engine.md].
- State Continuity: The existing KV cache is retained and extended with the new data, preserving the computational work done for the overlapping parts^[001-TODO__mlx-engine.md].
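The steps above can be sketched as a short Python example. This is an illustrative reconstruction, not the engine's actual API: the function names (`common_prefix_length`, `tokens_to_prefill`) are hypothetical, and a real implementation would also trim or extend the underlying KV cache tensors.

```python
# Hypothetical sketch of common-prefix reuse between consecutive prompts.
# Function names are illustrative, not mlx-engine's real API.

def common_prefix_length(prev_tokens: list, new_tokens: list) -> int:
    """Count how many leading tokens the previous and current prompts share."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(prev_tokens: list, new_tokens: list):
    """Split the new prompt into (reusable prefix length, unprocessed suffix).

    Only the suffix needs to be passed to the model for prefilling; the KV
    cache entries for the shared prefix are kept as-is."""
    keep = common_prefix_length(prev_tokens, new_tokens)
    return keep, new_tokens[keep:]

prev = [1, 2, 3, 4, 5]        # tokens whose KV states are already cached
new = [1, 2, 3, 9, 10, 11]    # next prompt: shares the prefix [1, 2, 3]
keep, unprocessed = tokens_to_prefill(prev, new)
print(keep, unprocessed)      # → 3 [9, 10, 11]
```

Only the three tokens `[9, 10, 11]` would be prefilled; the cached states for the first three tokens are retained and extended, which is the "state continuity" described above.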
Features¶
- User Cancellation Support: If a user cancels a request during processing, the engine preserves the cache state computed up to the point of interruption^[001-TODO__mlx-engine.md]. Subsequent requests can then reuse this partial cache rather than starting from scratch.
- Chunked Prefilling: Unprocessed tokens are handled in chunks (e.g., `chunk_size=512`), which facilitates progress reporting and responsiveness during long prefill phases^[001-TODO__mlx-engine.md].
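Both features can be illustrated with one hedged sketch: a chunked prefill loop that checks a cancellation flag between chunks and preserves whatever cache work has been done. The names `model`, `cache`, and `cancelled` are stand-ins for the engine's real objects; only the chunk size of 512 comes from the description above.

```python
# Illustrative sketch of chunked prefill with cancellation support.
# Objects and names are hypothetical stand-ins, not mlx-engine's real API.

CHUNK_SIZE = 512

def chunked_prefill(model, cache, tokens, cancelled=lambda: False,
                    chunk_size=CHUNK_SIZE, on_progress=None):
    """Process tokens in fixed-size chunks, extending the KV cache as we go.

    If cancelled() becomes true between chunks, we stop early but keep the
    cache entries computed so far, so a later request can reuse them.
    Returns the number of tokens actually processed."""
    done = 0
    for start in range(0, len(tokens), chunk_size):
        if cancelled():
            break                         # cache for `done` tokens is preserved
        chunk = tokens[start:start + chunk_size]
        model(chunk, cache=cache)         # the model call extends the cache
        done += len(chunk)
        if on_progress:
            on_progress(done, len(tokens))  # progress report after each chunk
    return done

class _StubModel:
    """Minimal stand-in that just counts the tokens it has processed."""
    def __init__(self):
        self.seen = 0
    def __call__(self, chunk, cache=None):
        self.seen += len(chunk)

# Full run: 1200 tokens → three chunks of 512 + 512 + 176.
stub = _StubModel()
processed = chunked_prefill(stub, cache=None, tokens=list(range(1200)))
print(processed)   # → 1200

# Cancellation before the second chunk keeps the partial work.
stub2 = _StubModel()
checks = iter([False, True])              # cancel check runs before each chunk
processed2 = chunked_prefill(stub2, cache=None, tokens=list(range(1200)),
                             cancelled=lambda: next(checks))
print(processed2)  # → 512
```

The key design point is that cancellation breaks out of the loop without invalidating the cache, matching the behavior described for user cancellation above.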
Availability and Constraints¶
Support for cross-prompt caching depends on the specific model architecture and the initialization path used within the engine^[001-TODO__mlx-engine.md].
- Supported: Fully supported in the ModelKit path for text models and vision models with specific add-ons (e.g., Gemma3, Pixtral, Mistral3)^[001-TODO__mlx-engine.md].
- Unsupported: Not available for models initialized via the VisionModelKit path (generic vision models)^[001-TODO__mlx-engine.md].
Related Concepts¶
- [[KV Cache]]
- [[Speculative Decoding]]
- KV Cache Quantization
Sources¶
001-TODO__mlx-engine.md