MLM-based compression¶
MLM-based compression is a text optimization technique designed to reduce the number of tokens used in Large Language Model (LLM) contexts while preserving semantic meaning.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md] It utilizes Masked Language Models (MLMs), such as RoBERTa, to identify and remove text elements that are highly predictable, thereby keeping only the unpredictable, information-dense content.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
This approach is one of three specific methods implemented within the "Caveman Compression" framework, offering a balance between compression rate, processing cost, and privacy by operating locally.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
Mechanism¶
The core principle of MLM-based compression relies on the predictive capabilities of Masked Language Models.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md] By analyzing the input text, the model determines the likelihood (probability) of specific tokens appearing in a given context.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Predictability Analysis: The MLM assigns a probability score to each token, indicating how predictable it is based on its surrounding context.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Selective Removal: The algorithm identifies the "top-k" most predictable tokens (e.g., the most predictable 30% of words) and removes them.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Semantic Retention: Because LLMs are inherently good at predicting syntactic gaps and common language patterns, they can reliably reconstruct the removed grammar and structure during inference, effectively "decompressing" the text internally.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
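The selection step above can be sketched in Python. This is a minimal illustration, not the framework's actual code: the `predictability` function is a toy stand-in that scores common function words as highly predictable, whereas a real implementation would mask each token in turn and read its probability from an MLM such as RoBERTa.

```python
# Toy stand-in for an MLM: function words are treated as highly
# predictable. A real scorer would mask each token and take the
# probability the MLM assigns to it in context.
COMMON = {"the", "a", "an", "is", "are", "of", "to", "and", "that", "it", "in"}

def predictability(token: str) -> float:
    """Return a predictability score in [0, 1] for a token."""
    return 0.9 if token.lower() in COMMON else 0.1

def compress(text: str, k: int = 30) -> str:
    """Drop the k% most predictable tokens, preserving original order."""
    tokens = text.split()
    n_drop = len(tokens) * k // 100
    # Indices of the n_drop highest-scoring (most predictable) tokens.
    by_score = sorted(range(len(tokens)),
                      key=lambda i: predictability(tokens[i]),
                      reverse=True)
    drop = set(by_score[:n_drop])
    return " ".join(t for i, t in enumerate(tokens) if i not in drop)

print(compress("The model is able to reconstruct the grammar of the text", k=30))
```

Because the removed tokens are exactly the ones an LLM can infer from context, the compressed output remains interpretable at inference time.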
Characteristics¶
Compared to other compression strategies, MLM-based compression offers a specific set of advantages and trade-offs:
- Cost: It is free to operate, as it runs locally rather than relying on paid API endpoints.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Performance: Compression rates typically range from 20% to 30%.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Speed: Processing is relatively fast, taking approximately 1–5 seconds per document depending on hardware.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Privacy: The method is fully offline, ensuring that data does not need to be sent to external servers.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Resource Requirements: It requires a local model download (approximately 500MB).^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
- Language Support: Primarily supports English.^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
Usage¶
To use MLM-based compression within the Caveman Compression framework, install the following dependencies and run the commands below:^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
```shell
# Install dependencies and the spaCy English model
pip install -r requirements-mlm.txt
python -m spacy download en_core_web_sm

# Compress inline text, or a file with an explicit compression level
python caveman_compress_mlm.py compress "Your verbose text here"
python caveman_compress_mlm.py compress -f input.txt -k 30
```
The -k parameter adjusts the compression level by setting the threshold for token removal (e.g., -k 30 targets the most predictable 30% of tokens).^[001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md]
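As a rough illustration of how -k scales removal, assuming it names the percentage of most-predictable tokens to drop (a plausible reading based on the 30% example; consult the script's help output for its exact semantics):

```python
# Hypothetical arithmetic for the -k setting: the number of tokens
# dropped from a document of n_tokens at a given k percentage.
def tokens_dropped(n_tokens: int, k: int) -> int:
    return n_tokens * k // 100

print(tokens_dropped(1000, 30))  # 300 of 1000 tokens removed at -k 30
```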
Related Concepts¶
- Caveman Compression: The overarching framework and methodology for semantic compression.
- LLM-based compression: An alternative method that offers higher compression rates (40–58%) but requires API usage.
- NLP-based compression: A rule-based alternative that supports more languages but offers lower compression quality.
Sources¶
001-TODO__Caveman_Compression_-_LLM_语义压缩方法.md