Semantic token compression

Semantic token compression is a technique for optimizing [[Large Language Model]] inputs by removing predictable linguistic elements while preserving critical semantic information^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. The method rests on the principle that LLMs can reliably reconstruct standard grammar, syntax, and cohesive structure during inference^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. By retaining only "unpredictable" content, such as specific data points, technical terms, and constraints, users can significantly reduce token usage without losing factual integrity^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

This approach is often exemplified by "Caveman Compression," which creates a compressed text style that resembles telegraphic speech but remains human-readable^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

Core Principles

The fundamental strategy trades linguistic fluency for information density^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. Instead of natural, flowing sentences, the compressed output prioritizes atomic facts and explicit data; a before/after example follows the list below.

  • Removal of Predictable Elements: The process strips out "function words" and syntactic glue that LLMs can predict with high probability^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. This includes:
    • Articles and auxiliary verbs: "a", "the", "is", "are".
    • Connectors: "therefore", "however", "because", "in order to".
    • Passive Voice: Phrases like "is calculated by".
    • Filler Words: "very", "quite", "essentially"^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Preservation of Unpredictable Facts: The core content that must be retained includes:
    • Factual Data: Numbers, dates, names, and specific values.
    • Technical Terms: Domain-specific jargon (e.g., "O(log n)", "binary search").
    • Constraints: Specific conditions or limits (e.g., "medium-large", "frequently accessed")^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Structural Simplification:
    • Sentences are broken down to 2–5 words per atomic thought^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
    • Simple, active verbs are used ("do", "make", "fix") instead of abstract nominalizations ("facilitate", "optimize")^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
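
For illustration, a hypothetical before/after pair in this style (the wording is invented here, not drawn from the source):

```text
Original:   The binary search algorithm is very efficient because it is able
            to locate an element in O(log n) time, provided the input is sorted.
Compressed: Binary search: O(log n). Requires sorted input.
```

The compressed form drops the filler ("very"), the connector ("because"), and the articles, while keeping the technical terms ("binary search", "O(log n)") and the constraint ("sorted input") intact.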

Compression Methods

Implementation can vary based on the required compression rate, cost, and latency constraints. Three primary approaches are documented^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]:

| Method | Mechanism | Compression Rate | Cost | Speed | Privacy/Offline |
| --- | --- | --- | --- | --- | --- |
| LLM-based | Context-aware rewriting via an LLM API | 40–58% | Paid (API) | ~2 s/request | No |
| MLM-based | Masked Language Model (e.g., RoBERTa) removes top-k predictable tokens | 20–30% | Free | ~1–5 s/doc | Yes |
| NLP-based | Rule-based NLP (e.g., spaCy) strips grammatical categories | 15–30% | Free | <100 ms | Yes |
  • LLM-based: Offers the highest compression and quality by understanding context, but requires an API key and incurs latency^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • MLM-based: Balances cost and performance by using local models to predict and remove redundant tokens^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • NLP-based: The fastest and most versatile method, supporting 15+ languages via rigid grammatical rules, though it typically achieves lower compression ratios than the LLM-based method; a minimal sketch follows this list^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
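
A minimal sketch of the NLP-based route, assuming spaCy with its en_core_web_sm model; the part-of-speech categories stripped here are illustrative choices, not the source's exact ruleset:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# POS categories treated as "predictable" and stripped out. This set is an
# illustrative guess; a production ruleset would be tuned per language.
PREDICTABLE_POS = {"DET", "AUX", "CCONJ", "SCONJ", "PART"}

def compress(text: str) -> str:
    """Drop tokens whose part of speech an LLM can reliably reconstruct."""
    doc = nlp(text)
    return " ".join(t.text for t in doc if t.pos_ not in PREDICTABLE_POS)

print(compress("The result is cached because the query is expensive."))
# Possible output (model-dependent): "result cached query expensive ."
```

Because the rules key off part-of-speech tags rather than meaning, this route runs in milliseconds, but it cannot judge when a nominally "predictable" word actually carries a constraint.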

Efficacy and Benchmarks

Testing indicates that semantic compression can reduce token counts by an average of 40% across various text types^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]; a measurement sketch follows the list below.

  • System Prompts: High compression rates (up to 58%) are often achievable because system prompts frequently contain repetitive instructional language^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • API Documentation: Technical documentation, which is dense with facts but often padded with explanatory text, sees compression rates around 42%^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Fidelity: In a benchmark test involving fact retrieval, compressed text retained 100% of critical facts (13/13), indicating that the "lossy" compression of syntax need not entail a loss of semantic meaning^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
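
Such rates can be reproduced for any corpus with a simple token count. A sketch using tiktoken; the cl100k_base encoding is an assumption, not something the source specifies:

```python
import tiktoken

# cl100k_base is assumed here; substitute the encoding of your target model.
enc = tiktoken.get_encoding("cl100k_base")

def compression_rate(original: str, compressed: str) -> float:
    """Fraction of tokens saved, e.g. 0.40 for a 40% reduction."""
    before = len(enc.encode(original))
    after = len(enc.encode(compressed))
    return 1 - after / before
```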

Applications

Semantic token compression is particularly beneficial for scenarios where context window size is a limiting factor or where input token costs are high^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

  • RAG Knowledge Bases: Compressing documents before storing them in a vector database fits more relevant context into the prompt window; see the sketch after this list^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Chain-of-Thought (CoT) Reasoning: Agent reasoning chains and thinking blocks can be extremely verbose; compressing these internal monologues saves tokens without degrading the final output quality^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Internal Documentation: Technical wikis and instructions intended for AI consumption can be stored in a compressed format to optimize processing costs.
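
As a sketch of the RAG use case, a hypothetical ingestion step that compresses documents before embedding; compress is the spaCy sketch above, while embed and store stand in for whatever embedding model and vector database are in use:

```python
from typing import Callable, Sequence

def ingest(
    documents: Sequence[str],
    compress: Callable[[str], str],             # e.g. the spaCy sketch above
    embed: Callable[[str], list[float]],        # any embedding model
    store: Callable[[list[float], str], None],  # any vector-database writer
) -> None:
    """Compress each document before embedding so more context fits per token."""
    for text in documents:
        compact = compress(text)        # semantic token compression
        store(embed(compact), compact)  # index and later retrieve the compact form
```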
See also

  • [[Prompt Engineering]]
  • [[Token 优化]]
  • [[RAG 系统]]

Sources

  • 001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md