NVFP4 quantization format

NVFP4 is a 4-bit quantization format used to compress Large Language Models (LLMs) for efficient inference^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md]. By cutting the storage footprint of model weights to roughly a quarter of 16-bit precision, it lets large models that would otherwise exceed local memory run on consumer hardware^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md].
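The core mechanism can be illustrated with a generic block-scaled 4-bit quantizer: values are grouped into small blocks, each block shares one scale factor, and every value is stored as a 4-bit code. The sketch below is a minimal Python illustration under that assumption; the block size of 16 and the FP32 scales are simplifications for clarity, not details taken from the source note (NVFP4 itself reportedly uses FP8 block scales, which lowers the per-value overhead further).

```python
import numpy as np

BLOCK = 16  # per-block scale granularity (assumption; NVFP4 reportedly uses 16-element blocks)

def quantize_4bit(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Quantize a 1-D float32 array to signed 4-bit codes with one scale per block."""
    padded = np.pad(weights, (0, -len(weights) % BLOCK))
    blocks = padded.reshape(-1, BLOCK)
    # One scale per block, chosen so the block's largest magnitude maps to the 4-bit max (7).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_4bit(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float32 weights from codes and block scales."""
    return (codes * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales)[: len(w)]
# 4 bits per value plus one FP32 scale per block, vs. 16 bits per value for FP16:
bits_per_value = 4 + 32 / BLOCK  # 6 bits/value in this simplified sketch
print(f"bits/value: {bits_per_value}, max abs error: {np.max(np.abs(w - w_hat)):.4f}")
```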

While the "NV" prefix suggests an association with NVIDIA standards, this format is effectively supported by Apple Silicon's MLX engine via the Ollama platform, allowing it to leverage unified memory for high-performance local inference^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md].

Usage and Performance

For local LLM deployment, NVFP4 is used to create model variants that balance size and speed^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md].

Example Case: Qwen 3.5 35B

When the Qwen 3.5 35B model is quantized using NVFP4:

  • Compressed Size: the model is reduced to approximately 21 GB (18.66 GB as stored on disk)^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md].
  • Hardware Requirements: a minimum of 32 GB of unified memory is recommended^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md].
  • Inference Speed: on an Apple M3 Max (36 GB RAM), the NVFP4-quantized model generated roughly 65–66 tokens/second, with prompt processing at 5.3 tokens/second^[001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md].
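These figures line up with a simple back-of-envelope estimate, sketched below; the effective bits-per-parameter value is an assumption for illustration (4-bit values plus block-scale overhead), not a figure from the source note.

```python
# Back-of-envelope memory estimate for a 35B-parameter model.
# 4.5 effective bits/parameter (4-bit codes plus scale overhead) is an
# illustrative assumption, not a figure from the source note.
params = 35e9
for bits_per_param, label in [(16, "FP16"), (4.5, "4-bit + scales")]:
    gb = params * bits_per_param / 8 / 1e9
    print(f"{label:>15}: {gb:6.1f} GB")
# FP16 comes to ~70 GB, far beyond a 36 GB machine; the 4-bit estimate
# (~19.7 GB) lands near the reported ~21 GB, with the gap plausibly due to
# metadata and layers kept at higher precision.
```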

  • [[Quantization (LLM)]]
  • [[Ollama]]
  • [[Apple Silicon]]

Sources

  • 001-TODO__Ollama_MLX_Support_MacBook_Local_LLM.md