Token predictability scoring¶
Token predictability scoring is a method for estimating how likely specific tokens (words or sub-words) are to appear in a sequence, typically based on how easily a language model can reconstruct them from context^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. In the context of Large Language Model (LLM) optimization, it serves as a mechanism to distinguish predictable grammatical structure from unpredictable factual content^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
This concept is central to compression techniques like Caveman Compression, where tokens identified as highly predictable (and therefore redundant) are removed to reduce context window usage without losing semantic meaning^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
Core Principle¶
The fundamental premise of token predictability scoring is that LLMs are effective at filling in linguistic gaps^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. A language model can reliably infer syntax, articles, and connector words from the surrounding context. Tokens that are easily predicted therefore contribute little unique information density to a prompt or document^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
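The notion of "easily predicted" can be made concrete as surprisal: the negative log-probability of a token given its context. The sketch below is a minimal illustration using a toy bigram model estimated from the text itself as a stand-in for a real LLM; the function name `bigram_predictability` is hypothetical, not from the source.

```python
import math
from collections import Counter

def bigram_predictability(tokens):
    """Score each token by -log p(token | previous token), estimated
    from bigram counts over the text itself. Lower surprisal means
    more predictable. A toy stand-in for a real language model."""
    pairs = Counter(zip(tokens, tokens[1:]))
    prev_totals = Counter(tokens[:-1])
    scores = []
    for prev, tok in zip(tokens, tokens[1:]):
        p = pairs[(prev, tok)] / prev_totals[prev]
        scores.append((tok, -math.log(p)))
    return scores
```

For example, in `"the cat sat on the mat the cat ran"`, "cat" after "the" scores as more predictable (lower surprisal) than "mat" after "the", because the bigram "the cat" is more frequent.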
Scoring predictability allows systems to identify:

- High-predictability tokens: grammar glue, articles ("the", "a"), passive-voice markers, and common connectors that can be safely omitted^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
- Low-predictability tokens: specific entities, data points, technical constraints, and unique facts that must be preserved to maintain information integrity^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
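The high/low split can be approximated without a model at all, using a stoplist of function words as a crude proxy for high predictability. This is a hypothetical rule-of-thumb sketch, not the source's method; the `GRAMMAR_GLUE` set is an illustrative assumption.

```python
# Hypothetical heuristic: function words and connectors count as
# high-predictability "grammar glue"; everything else (entities,
# numbers, technical terms) counts as low-predictability content.
GRAMMAR_GLUE = {
    "the", "a", "an", "is", "are", "was", "were", "be", "been",
    "of", "to", "in", "on", "that", "which", "and", "or",
    "therefore", "however", "furthermore",
}

def classify(tokens):
    """Split tokens into (high-predictability, low-predictability)."""
    high = [t for t in tokens if t.lower() in GRAMMAR_GLUE]
    low = [t for t in tokens if t.lower() not in GRAMMAR_GLUE]
    return high, low
```

Applied to "The deadline is 2024-06-01 and the budget is 50k", the low-predictability remainder is exactly the factual payload: `["deadline", "2024-06-01", "budget", "50k"]`.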
Application in Compression¶
In semantic compression workflows, token predictability is used to filter content rather than simply truncating it^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
- MLM-based Compression: Masked Language Models (such as RoBERTa) are utilized to calculate token probabilities. The system removes the top-k most predictable tokens (e.g., top 30%), effectively stripping away the "reconstructable" parts of the text^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
- Rule-based Filtering: Techniques like NLP-based compression use fixed grammatical rules to target parts of speech that statistically rank high in predictability (e.g., "therefore", "however")^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
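The MLM-based top-k removal step above can be sketched as follows. In practice the per-position probabilities would come from a masked language model such as RoBERTa (masking each position and reading off p(token | rest)); here they are supplied directly so the sketch stays self-contained, and `compress_top_k` is a hypothetical name.

```python
def compress_top_k(tokens, probabilities, drop_fraction=0.30):
    """Remove the drop_fraction most predictable tokens.

    `probabilities[i]` is the masked-LM probability of tokens[i]
    given the rest of the sequence; higher means more predictable
    and therefore more safely reconstructable.
    """
    n_drop = int(len(tokens) * drop_fraction)
    # Rank positions by probability, most predictable first.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: probabilities[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [t for i, t in enumerate(tokens) if i not in dropped]
```

With illustrative probabilities for `["the", "server", "returned", "a", "500", "error"]`, dropping the most predictable half strips the grammar glue while the unpredictable facts ("server", "500", "error") survive.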
Strategic Value¶
By using predictability as a scoring metric, developers can achieve significant token savings (reported to range from 15% to 58%) while keeping the text semantically "lossless"^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. This approach contrasts with raw text summarization: the original meaning and the "unpredictable" facts remain fully intact for the model to process^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
Related Concepts¶
- [[Caveman Compression]]
- [[Token 优化]]
- [[Context Window]]
Sources¶
001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md