Grammar-preserving compression rules

Grammar-preserving compression rules are a set of linguistic heuristics used to reduce token count in text inputs for Large Language Models (LLMs) while maintaining semantic fidelity^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]. The core principle relies on the observation that LLMs can reliably reconstruct predictable syntactic elements, such as articles, auxiliary verbs, and complex conjunctions, whereas the actual semantic value resides in unpredictable facts, entities, and constraints^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

This approach can significantly reduce context window usage, typically by 15% to 58%, without losing the information required for reasoning or instruction following^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
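
As a rough illustration of how such a reduction might be measured, the sketch below compares word counts before and after compression. It assumes whitespace-delimited words are an acceptable stand-in for tokens, which is only an approximation; the function name reduction_percent and the sample sentences are hypothetical and do not come from the source.

```python
def reduction_percent(original: str, compressed: str) -> float:
    """Return the percentage of whitespace-delimited words removed.

    Word count is only a rough proxy for LLM token count (assumption).
    """
    original_count = len(original.split())
    compressed_count = len(compressed.split())
    if original_count == 0:
        return 0.0
    return 100.0 * (original_count - compressed_count) / original_count


# Hypothetical sample pair, not taken from the source.
original = "The model is able to infer the missing grammar"
compressed = "Model infer missing grammar"
print(round(reduction_percent(original, compressed), 1))  # 55.6
```

For real measurements, the word split would be replaced by the tokenizer of the target model, since token boundaries rarely match word boundaries exactly.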

Core Principles

The methodology distinguishes between content that is predictable for an LLM (grammar) and content that is not (facts). By removing the former, the text becomes more dense but remains comprehensible to both the model and human readers^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

  • Predictability: LLMs are trained on vast corpora and can infer missing grammar. Therefore, standard syntactic glue words are redundant in prompts^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Information Density: The goal is to retain "unpredictable" information, specifically numbers, proper names, technical constraints, and logic-defining terms^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].

Syntactic Removal

These rules identify categories of words that act primarily as structural filler and can be excised^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]:

  • Grammar Words: Definite and indefinite articles (a, an, the) and basic forms of the verb to be (is, are, was, were)^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Connectives: Logical transition words like therefore, however, because, and phrases like in order to^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Passive Voice Markers: Phrases indicating passive voice (e.g., is calculated by) are replaced with direct actions^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Filler Words: Qualifiers that add little semantic weight, such as very, quite, or essentially, are removed^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
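
The word-level removals above can be approximated with a simple filter. The following is a minimal sketch, assuming the word lists are limited to the examples named in this section; the names DROP_WORDS, DROP_PHRASES, and strip_filler are hypothetical. It does not attempt the passive-to-active rewrite, which requires restructuring the sentence rather than deleting words.

```python
import re

# Word lists limited to the examples given in this section (assumption:
# a fuller rule set would contain more entries).
ARTICLES = {"a", "an", "the"}
TO_BE = {"is", "are", "was", "were"}
CONNECTIVES = {"therefore", "however", "because"}
FILLERS = {"very", "quite", "essentially"}
DROP_WORDS = ARTICLES | TO_BE | CONNECTIVES | FILLERS
DROP_PHRASES = ["in order to"]


def strip_filler(text: str) -> str:
    """Remove listed phrases and words; keep everything else in order."""
    # Multi-word phrases go first, while the words are still adjacent.
    for phrase in DROP_PHRASES:
        text = re.sub(rf"\b{re.escape(phrase)}\b", " ", text, flags=re.IGNORECASE)
    # Compare each word without surrounding punctuation, case-insensitively.
    kept = [w for w in text.split() if w.strip(".,;:!?").lower() not in DROP_WORDS]
    return " ".join(kept)


print(strip_filler("The cache is very fast, however the index was essentially stale."))
# cache fast, index stale.
```

One limitation of this sketch: when a dropped word carries trailing punctuation, that punctuation is dropped with it, so sentence boundaries may need repair afterwards.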

Stylistic Constraints

To ensure the compressed text remains machine-readable and unambiguous, specific stylistic rules are applied^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]:

  • Sentence Length: Keep sentences short, typically 2-5 words. Each sentence then conveys a single "atomic thought," which reduces the parsing effort for the reader^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Active Voice: Use active verbs (e.g., calculate value) instead of passive constructions (e.g., value is calculated)^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Simple Verbs: Prefer concrete, common verbs (e.g., do, make, fix, check) over abstract corporate language (e.g., facilitate, optimize)^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Explicit Enumeration: Avoid ranges when specific values are meant. Instead of "test values 5-6", use "test five, test six" to prevent ambiguity^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
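
The sentence-length rule lends itself to an automated check. The sketch below flags sentences that exceed a word limit, assuming the upper bound of 5 words mentioned above and a naive split on sentence-ending punctuation; the function name overlong_sentences and the sample text are hypothetical.

```python
import re


def overlong_sentences(text: str, max_words: int = 5) -> list[str]:
    """Return the sentences in text that contain more than max_words words."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "e.g. value" would be split incorrectly.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]


# Hypothetical sample text, not taken from the source.
sample = (
    "Cache stale. Rebuild index now. "
    "The nightly job should probably rebuild the whole index."
)
print(overlong_sentences(sample))
# ['The nightly job should probably rebuild the whole index.']
```

A check like this only reports violations; splitting a flagged sentence into atomic statements still has to be done by the writer or by a separate rewriting step.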

Information Retention

The "preserving" aspect of these rules dictates exactly what must be kept to avoid data loss^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]:

  • Factual Data: All numbers, dates, and metrics must be preserved^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Entities: Proper names (e.g., people, places, organizations) are retained^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
  • Technical Constraints: Specific conditions (e.g., medium-large, 99.9% uptime, O(log n)) are deemed high-value content and are never compressed^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md].
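
Retention of numeric data can be verified mechanically after compression. The following is a minimal sketch, assuming numbers are written with digits and an optional decimal part or percent sign; the names NUMBER_PATTERN and missing_numbers and the sample sentences are hypothetical. Proper names and non-numeric constraints such as O(log n) are outside its scope and would need their own checks.

```python
import re

# Matches integers, decimals, and percentages such as 3, 99.9, or 99.9%.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?%?")


def missing_numbers(original: str, compressed: str) -> list[str]:
    """Return numbers that appear in original but not in compressed."""
    kept = set(NUMBER_PATTERN.findall(compressed))
    return [n for n in NUMBER_PATTERN.findall(original) if n not in kept]


# Hypothetical sample sentences, not taken from the source.
original = "The service must maintain 99.9% uptime across 3 regions."
print(missing_numbers(original, "Service keep 99.9% uptime. 3 regions."))  # []
print(missing_numbers(original, "Service keep high uptime. 3 regions."))   # ['99.9%']
```

An empty result means every number from the original survived; a non-empty result lists the values that were lost and should be restored.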

Example Transformation

The following example demonstrates the application of these rules to a technical instruction^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]:

  • Original: "In order to optimize the database query performance, we should consider implementing an index on the frequently accessed columns..."
  • Compressed: "Need fast queries. Check which columns used most. Add index to those columns..."

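In this example, the connecting phrase (in order to) is removed, the hedged suggestion (we should consider implementing) becomes a direct imperative (add index), the abstract verb optimize gives way to plain wording, and the single long sentence is split into short atomic instructions.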

Sources

  • 001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md