Semantic token compression

Semantic token compression is a technique for optimizing [[Large Language Model]] inputs by removing predictable linguistic elements while preserving critical semantic information^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. The method rests on the principle that LLMs can reliably reconstruct standard grammar, syntax, and cohesive structure during inference^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. By retaining only "unpredictable" content, such as specific data points, technical terms, and constraints, users can significantly reduce token usage without losing factual integrity^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

This approach is often exemplified by "Caveman Compression," which creates a compressed text style that resembles telegraphic speech but remains human-readable^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

Core Principles

The fundamental strategy trades linguistic fluency for information density^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. Instead of natural, flowing sentences, the compressed output prioritizes atomic facts and explicit data; a before/after example follows the list below.

  • Removal of Predictable Elements: The process strips out "function words" and syntactic glue that LLMs can predict with high probability^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. This includes:
    • Articles and auxiliary verbs: "a", "the", "is", "are".
    • Connectors: "therefore", "however", "because", "in order to".
    • Passive Voice: Phrases like "is calculated by".
    • Filler Words: "very", "quite", "essentially"^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Preservation of Unpredictable Facts: The core content that must be retained includes:
    • Factual Data: Numbers, dates, names, and specific values.
    • Technical Terms: Domain-specific jargon (e.g., "O(log n)", "binary search").
    • Constraints: Specific conditions or limits (e.g., "medium-large", "frequently accessed")^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Structural Simplification:
    • Sentences are broken down to 2–5 words per atomic thought^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
    • Simple, active verbs are used ("do", "make", "fix") instead of abstract nominalizations ("facilitate", "optimize")^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
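
For illustration, a hypothetical before/after pair in this style (the wording is invented here, not drawn from the source):

```text
Original:   The binary search algorithm is very efficient because it is able
            to locate an element in O(log n) time, provided the input is sorted.
Compressed: Binary search: O(log n). Requires sorted input.
```

The compressed form drops the filler ("very"), the connector ("because"), and the articles, while keeping the technical terms ("binary search", "O(log n)") and the constraint ("sorted input") intact.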

Compression Methods

Implementation can vary based on the required compression rate, cost, and latency constraints. Three primary approaches are documented^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]:

| Method | Mechanism | Compression Rate | Cost | Speed | Privacy/Offline |
| --- | --- | --- | --- | --- | --- |
| LLM-based | Context-aware rewriting via an LLM API | 40–58% | Paid (API) | ~2 s/request | No |
| MLM-based | Masked Language Model (e.g., RoBERTa) removes top-k predictable tokens | 20–30% | Free | ~1–5 s/doc | Yes |
| NLP-based | Rule-based NLP (e.g., spaCy) strips grammatical categories | 15–30% | Free | <100 ms | Yes |
  • LLM-based: Offers the highest compression and quality by understanding context, but requires an API key and incurs latency^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • MLM-based: Balances cost and performance by using local models to predict and remove redundant tokens^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • NLP-based: The fastest and most versatile method, supporting 15+ languages via rigid grammatical rules, though it typically achieves lower compression ratios than the LLM-based method; a minimal sketch follows this list^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
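
A minimal sketch of the NLP-based route, assuming spaCy with its en_core_web_sm model; the part-of-speech categories stripped here are illustrative choices, not the source's exact ruleset:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# POS categories treated as "predictable" and stripped out. This set is an
# illustrative guess; a production ruleset would be tuned per language.
PREDICTABLE_POS = {"DET", "AUX", "CCONJ", "SCONJ", "PART"}

def compress(text: str) -> str:
    """Drop tokens whose part of speech an LLM can reliably reconstruct."""
    doc = nlp(text)
    return " ".join(t.text for t in doc if t.pos_ not in PREDICTABLE_POS)

print(compress("The result is cached because the query is expensive."))
# Possible output (model-dependent): "result cached query expensive ."
```

Because the rules key off part-of-speech tags rather than meaning, this route runs in milliseconds, but it cannot judge when a nominally "predictable" word actually carries a constraint.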

Efficacy and Benchmarks

Testing indicates that semantic compression can reduce token counts by an average of 40% across various text types^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]; a measurement sketch follows the list below.

  • System Prompts: High compression rates (up to 58%) are often achievable because system prompts frequently contain repetitive instructional language^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • API Documentation: Technical documentation, which is dense with facts but often padded with explanatory text, sees compression rates around 42%^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Fidelity: In a benchmark test involving fact retrieval, compressed text retained 100% of critical facts (13/13), indicating that the "lossy" compression of syntax need not entail a loss of semantic meaning^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
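
Such rates can be reproduced for any corpus with a simple token count. A sketch using tiktoken; the cl100k_base encoding is an assumption, not something the source specifies:

```python
import tiktoken

# cl100k_base is assumed here; substitute the encoding of your target model.
enc = tiktoken.get_encoding("cl100k_base")

def compression_rate(original: str, compressed: str) -> float:
    """Fraction of tokens saved, e.g. 0.40 for a 40% reduction."""
    before = len(enc.encode(original))
    after = len(enc.encode(compressed))
    return 1 - after / before
```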

Applications

Semantic token compression is particularly beneficial for scenarios where context window size is a limiting factor or where input token costs are high^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

  • RAG Knowledge Bases: Compressing documents before storing them in a vector database fits more relevant context into the prompt window; see the sketch after this list^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Chain-of-Thought (CoT) Reasoning: Agent reasoning chains and thinking blocks can be extremely verbose; compressing these internal monologues saves tokens without degrading the final output quality^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Internal Documentation: Technical wikis and instructions intended for AI consumption can be stored in a compressed format to optimize processing costs.
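
As a sketch of the RAG use case, a hypothetical ingestion step that compresses documents before embedding; compress is the spaCy sketch above, while embed and store stand in for whatever embedding model and vector database are in use:

```python
from typing import Callable, Sequence

def ingest(
    documents: Sequence[str],
    compress: Callable[[str], str],             # e.g. the spaCy sketch above
    embed: Callable[[str], list[float]],        # any embedding model
    store: Callable[[list[float], str], None],  # any vector-database writer
) -> None:
    """Compress each document before embedding so more context fits per token."""
    for text in documents:
        compact = compress(text)        # semantic token compression
        store(embed(compact), compact)  # index and later retrieve the compact form
```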
See also

  • [[Prompt Engineering]]
  • [[Token 优化]]
  • [[RAG 系统]]

Sources

  • 001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md