# Long-context model tuning parameters
Long-context model tuning parameters refer to specific hyperparameter configurations applied to Large Language Models (LLMs) to optimize their performance and stability when processing inputs that exceed standard context lengths (often 100,000+ tokens). When models operate near their memory or retrieval limits, they may exhibit issues such as infinite repetition ("looping"), slowed inference, or decreased coherence. Tuning these parameters allows the model to handle massive contexts (e.g., analyzing entire codebases or long books) without "analysis paralysis"^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
## Key Parameters
The following parameters are commonly adjusted to stabilize long-context models:
- **temperature** (`--temp`): Typically set to 1^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
  - Purpose: Controls creativity and divergence. In long-context scenarios, temperatures above 1 can cause the model to hallucinate or drift away from the source material, so the value is kept at or below 1.
- **top-p**: Typically set to 0.9^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
  - Purpose: Nucleus sampling parameter that restricts the model to the smallest set of tokens whose cumulative probability exceeds the threshold. A value of 0.9 narrows the selection range, improving focus.
- **top-k**: Typically set to 20^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
  - Purpose: Limits the next-token selection to the 20 most probable candidates. This strict filter prevents the model from choosing low-probability "outlier" tokens that might trigger incoherence in long generations.
- **min-p**: Typically set to 0.1^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
  - Purpose: Defines a minimum probability threshold relative to the most probable token. It acts as a floor that removes extremely unlikely options.
- **repeat-penalty**: Typically set to 1.05^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
  - Purpose: Penalizes tokens that have already appeared in the output. This is crucial for preventing "infinite loops" or "babbling," where the model gets stuck in a repetitive cycle during long inference tasks.
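How these samplers interact can be sketched in Python. This is an illustrative toy with the typical values above as defaults, not the actual llama.cpp implementation, and the function names are hypothetical:

```python
import math

def apply_repeat_penalty(logits, prior_tokens, penalty=1.05):
    """Discourage already-seen tokens: shrink positive logits, grow negative ones."""
    out = list(logits)
    for t in set(prior_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def sample_filter(logits, temp=1.0, top_k=20, top_p=0.9, min_p=0.1):
    """Return the token ids that survive temperature, top-k, top-p and min-p."""
    # Temperature scaling, then softmax (numerically stabilized).
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top-p: smallest prefix whose cumulative probability reaches the threshold.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # min-p: drop tokens below min_p times the best token's probability.
    floor = min_p * probs[ranked[0]]
    return [i for i in kept if probs[i] >= floor]
```

With logits `[5.0, 4.0, 0.0, -3.0]`, the two leading tokens already cover more than 90% of the mass, so top-p prunes the tail before min-p even applies. This is why the combination keeps long generations focused: each stage removes a different kind of unlikely candidate.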
## Visual Task Settings
For long-context models capable of multimodal tasks (images), specific token limits are often required to prevent severe hallucinations^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
- `image-min-tokens`: Set to 300^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
- `image-max-tokens`: Set to 512^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
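The effect of such a budget is simply to clamp an image's token count into the configured window. A minimal sketch (the function name is hypothetical, not a library API):

```python
def clamp_image_tokens(requested, min_tokens=300, max_tokens=512):
    """Force an image's token budget into the [min_tokens, max_tokens] window."""
    return max(min_tokens, min(max_tokens, requested))
```

Too few tokens starves the vision encoder of detail (inviting hallucination); too many inflates the context the text decoder must attend over.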
## Use Cases
These parameters are critical when using [[MoE Architecture]] models or highly quantized local models for:
- Document Analysis: Ingesting entire financial reports or legal briefs (>200k tokens) where the model must extract specific facts without getting lost^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
- Codebase Understanding: "Needle-in-a-haystack" retrieval where the model must find a specific function in a massive project directory^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
- Local AI Agents: Reducing latency for [[AI Agents]] that require dozens of self-correction loops, where cloud latency would be detrimental^[001-TODO__Gemma_4_26B_本地AI模型深度解析.md].
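A "needle-in-a-haystack" check like the one above can be scaffolded in a few lines of Python. The helper names are illustrative, and no actual model call is included:

```python
def build_needle_prompt(filler_line, needle, n_lines, needle_pos):
    """Bury `needle` at line `needle_pos` inside `n_lines` of repeated filler."""
    lines = [filler_line] * n_lines
    lines[needle_pos] = needle
    return "\n".join(lines)

def found_needle(model_answer, secret):
    """Score the model's answer: did the buried fact survive retrieval?"""
    return secret in model_answer
```

Sweeping `needle_pos` across depths (start, middle, end of the context) is the usual way to verify that a given parameter set retrieves facts reliably at every position.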
## Related Concepts
- [[Turbo Quant]]
- [[KV Cache]]
- [[MoE Architecture]]
- 20/80 Learning Principle
## Sources
001-TODO__Gemma_4_26B_本地AI模型深度解析.md