Language Modeling Is Compression
It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
"while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%)" <-- wow.
Summary By LLM:
This paper discusses the connection between language modeling and lossless data compression. Some key points:
There is an equivalence between predictive modeling and lossless compression, based on information theory principles like Shannon's source coding theorem. Maximizing the log-likelihood of a model is equivalent to minimizing the expected code length when using that model for compression via arithmetic coding.
Modern large language models (Transformers), when used with arithmetic coding, can act as powerful general-purpose compressors. The authors show that models like Chinchilla, even though trained primarily on text, can achieve highly competitive compression rates on datasets like ImageNet images and LibriSpeech audio when used in an offline, in-context setting. This demonstrates their strong in-context learning capabilities.
However, model size matters for compression. When taking the model size into account, there are diminishing returns in compression performance as models scale up over a fixed dataset size. The authors show there is an optimal model-dataset size tradeoff.
Tokenization acts as a form of compression and helps Transformers by packing more information into a fixed context length. But simpler tokenizers like ASCII lead to better final compression rates, illustrating a tradeoff.
The prediction-compression equivalence allows using any compressor (like gzip) as a conditional generative model. The authors visualize samples from gzip vs a Transformer this way.
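As a rough illustration of this last point, here is a minimal sketch of turning a general-purpose compressor into a next-byte predictor. The idea is to score each candidate continuation by how many extra bits the compressor needs for context plus candidate, convert those code lengths into probabilities via 2^(-bits), and then generate greedily. This is a toy in the spirit of the paper, not the authors' exact procedure: it assumes `zlib` (the DEFLATE implementation that gzip also uses), and the function names are hypothetical. Because zlib output is byte-aligned, the scores are coarse and ties are common.

```python
import zlib
from typing import Dict

def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data (a coarse code-length estimate)."""
    return 8 * len(zlib.compress(data, 9))

def next_byte_distribution(context: bytes, alphabet: bytes) -> Dict[int, float]:
    """Assign p(b | context) proportional to 2^-(extra bits needed to encode b)."""
    base = compressed_bits(context)
    weights = {}
    for b in alphabet:  # iterating over bytes yields integer byte values
        extra = compressed_bits(context + bytes([b])) - base
        weights[b] = 2.0 ** (-extra)
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def generate(context: bytes, alphabet: bytes, n: int) -> bytes:
    """Greedy generation: repeatedly append the most probable next byte."""
    out = bytearray(context)
    for _ in range(n):
        dist = next_byte_distribution(bytes(out), alphabet)
        out.append(max(dist, key=dist.get))  # ties resolve to the first candidate
    return bytes(out)
```

Sampling from the distribution instead of taking the maximum would give a stochastic generator; the greedy version is only the simplest choice.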
In summary, the paper provides a compression viewpoint on language modeling and large models, demonstrating connections to in-context learning, scaling laws, tokenization, and generation. Framing prediction as compression encompasses generalization and provides a useful lens.
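The equivalence in the first key point can be made concrete. An arithmetic coder driven by a model's probabilities needs roughly the sum of -log2 p(x_i | x_<i) bits to encode a sequence, so the average code length per symbol is the model's log loss measured in bits. The sketch below computes that ideal code length; it does not implement an arithmetic coder. The adaptive order-0 byte model is a stand-in assumption, not a language model, and the function names are hypothetical.

```python
import math
from collections import Counter

def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2 p(x_i | x_<i) under an adaptive order-0 byte model.

    The model is an add-one-smoothed byte-frequency count that is updated
    after each symbol, so a decoder could reproduce the same probabilities.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        p = (counts[byte] + 1) / (seen + 256)  # smoothing over 256 byte values
        total_bits += -math.log2(p)
        counts[byte] += 1
        seen += 1
    return total_bits

def compression_rate(data: bytes) -> float:
    """Ideal compressed size divided by raw size (lower is better)."""
    return ideal_code_length_bits(data) / (8 * len(data))
```

A better predictor assigns higher probability to what actually comes next, which lowers the sum and therefore the compressed size. Swapping in a stronger model changes only how `p` is computed.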
I am not sure it's useful to compare even lossy compression to what AI models are doing, which is just straight-up deleting content (and possibly making up content to fill in the gaps when needed). Not all of the training data is usably available in the finished model.
conclusion: dataset size and optimal model size are inextricably linked. No more scaling up, without more data.
Perhaps this is why GPT5 isn't planned. They need data, not GPUs.
I feel the need to point out three issues with the paper.
First, much of the power of PNG doesn't actually show up when used on grayscale, 8-bit color images.
Notice how a modern gzip encoder can greatly outperform a typical PNG encoder's defaults (despite PNG also using DEFLATE) on grayscale images; and how even the obscure (though state of the art) PNG encoder I used to get the best grayscale PNG results can't do that much better than gzip.
So when the paper argues that PNG is a good example of a specialized codec (even ignoring that it's 26 years old and obsolete), bear in mind that they specifically put the encoder into a mode where the specialized components don't actually do much, and it's gzip doing most of the heavy lifting.
Worse, I don't think the paper actually makes clear how they encoded the pictures. PNG encoder performance can vary a lot, and many PNG optimization tools, like zopflipng, have a subtle failure mode with grayscale images: they internally convert to 3 channels, find that they can't do better than the original input (which could be way smaller if not for the 3-channel conversion), and declare that your input can't be made smaller. It's hard to tell whether errors like that have been made without more details on the procedures.
Finally, while my area of expertise is far closer to image coding than audio coding, I am under the impression that FLAC is specialized far more toward fulfilling the following goals:
- Being unencumbered by patents and licensing issues
- Being a useful audio container
- Being extremely simple to write an ultra efficient hardware decoder for, without the issues e.g. Opus has where small bit differences are allowed based on implementation.
So, the specialization was less toward highly efficient compression in the storage sense (that's what Xiph worked on Opus for!) and more in the energy efficiency, licensing, and silicon-budget sense. Remember, FLAC has to be decoded by cheap audio players with short battery lives, while still being seekable, capable of presenting stereo, etc., and without infringing on patents.
Since there has never been that much interest in storage-efficient lossless audio, the impression this paper gives about FLAC might come off as misleading.
Also, are the LZMA2 and gzip parameters missing? Those make a huge difference. For sure, the citation year on LZMA2 is off.
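To show why the missing parameters matter, here is a small sketch using Python's standard `zlib` and `lzma` modules to compare compressed sizes across a few settings. The sample data is synthetic and the helper name is hypothetical; the settings the paper actually used are not known from this discussion, and real results depend on the dataset.

```python
import lzma
import zlib

def size_by_setting(data: bytes) -> dict:
    """Compressed size in bytes for a few DEFLATE levels and LZMA2 (.xz) presets."""
    sizes = {}
    for level in (0, 1, 6, 9):  # level 0 stores the data without compressing it
        sizes[f"deflate-{level}"] = len(zlib.compress(data, level))
    for preset in (0, 6, 9):  # the default .xz container uses the LZMA2 filter
        sizes[f"xz-{preset}"] = len(lzma.compress(data, preset=preset))
    return sizes

if __name__ == "__main__":
    # Synthetic, highly repetitive input; swap in real image or audio bytes to compare.
    sample = b"language modeling is compression. " * 2000
    for name, size in size_by_setting(sample).items():
        print(f"{name:>10}: {size:>7} bytes ({size / len(sample):.2%} of raw)")
```

Reporting the exact level or preset alongside each baseline number is what makes a comparison like the paper's reproducible.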