MLM-based compression¶
MLM-based compression is a text optimization technique designed to reduce the number of tokens used in Large Language Model (LLM) contexts while preserving semantic meaning.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md] It utilizes Masked Language Models (MLMs), such as RoBERTa, to identify and remove text elements that are highly predictable, thereby keeping only the unpredictable, information-dense content.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
This approach is one of three specific methods implemented within the "Caveman Compression" framework, offering a balance between compression rate, processing cost, and privacy by operating locally.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
Mechanism¶
The core principle of MLM-based compression relies on the predictive capabilities of Masked Language Models.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md] By analyzing the input text, the model determines the likelihood (probability) of specific tokens appearing in a given context.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Predictability Analysis: The MLM assigns a probability score to each token, indicating how predictable it is based on its surrounding context.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Selective Removal: The algorithm identifies the "top-k" most predictable tokens (e.g., the most predictable 30% of words) and removes them.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Semantic Retention: Because LLMs are inherently good at predicting syntactic gaps and common language patterns, they can reliably reconstruct the removed grammar and structure during inference, effectively "decompressing" the text internally.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
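The selection step above can be sketched in Python. This is a minimal illustration, not the framework's actual code: the `predictability` function is a toy stand-in that scores common function words as highly predictable, whereas a real implementation would mask each token in turn and read its probability from an MLM such as RoBERTa.

```python
# Toy stand-in for an MLM: function words are treated as highly
# predictable. A real scorer would mask each token and take the
# probability the MLM assigns to it in context.
COMMON = {"the", "a", "an", "is", "are", "of", "to", "and", "that", "it", "in"}

def predictability(token: str) -> float:
    """Return a predictability score in [0, 1] for a token."""
    return 0.9 if token.lower() in COMMON else 0.1

def compress(text: str, k: int = 30) -> str:
    """Drop the k% most predictable tokens, preserving original order."""
    tokens = text.split()
    n_drop = len(tokens) * k // 100
    # Indices of the n_drop highest-scoring (most predictable) tokens.
    by_score = sorted(range(len(tokens)),
                      key=lambda i: predictability(tokens[i]),
                      reverse=True)
    drop = set(by_score[:n_drop])
    return " ".join(t for i, t in enumerate(tokens) if i not in drop)

print(compress("The model is able to reconstruct the grammar of the text", k=30))
```

Because the removed tokens are exactly the ones an LLM can infer from context, the compressed output remains interpretable at inference time.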
Characteristics¶
Compared to other compression strategies, MLM-based compression offers a specific set of advantages and trade-offs:
- Cost: It is free to operate, as it runs locally rather than relying on paid API endpoints.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Performance: Compression rates typically range from 20% to 30%.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Speed: Processing is relatively fast, taking approximately 1–5 seconds per document depending on hardware.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Privacy: The method is fully offline, ensuring that data does not need to be sent to external servers.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Resource Requirements: It requires a local model download (approximately 500MB).^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Language Support: Primarily supports English.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
Usage¶
To use MLM-based compression within the Caveman Compression framework, install the following dependencies and run the commands below:^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
```shell
# Install dependencies and the spaCy English model
pip install -r requirements-mlm.txt
python -m spacy download en_core_web_sm

# Compress inline text, or a file with an explicit compression level
python caveman_compress_mlm.py compress "Your verbose text here"
python caveman_compress_mlm.py compress -f input.txt -k 30
```
The -k parameter adjusts the compression level by setting the threshold for token removal (e.g., -k 30 targets the most predictable 30% of tokens).^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
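As a rough illustration of how -k scales removal, assuming it names the percentage of most-predictable tokens to drop (a plausible reading based on the 30% example; consult the script's help output for its exact semantics):

```python
# Hypothetical arithmetic for the -k setting: the number of tokens
# dropped from a document of n_tokens at a given k percentage.
def tokens_dropped(n_tokens: int, k: int) -> int:
    return n_tokens * k // 100

print(tokens_dropped(1000, 30))  # 300 of 1000 tokens removed at -k 30
```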
Related Concepts¶
- Caveman Compression: The overarching framework and methodology for semantic compression.
- LLM-based compression: An alternative method that offers higher compression rates (40–58%) but requires API usage.
- NLP-based compression: A rule-based alternative that supports more languages but offers lower compression quality.
Sources¶
001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md