Squeeze-Release: Iterative Pruning with Exact Structural Minimization
Abstract
The Squeeze-Release compression method combines pruning with structural minimization to produce significantly smaller neural networks while maintaining accuracy, and extends to transformer architectures through CompensatedLayerNorm.
Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release compresses the deployable network by 39x relative to the unpruned model on a fully connected network and by 14.8x on a modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition, we prove that the rewrite can be extended to transformer architectures.
Community
Neural networks are often far bigger than they need to be, and "pruning" refers to removing the components that add little to a model's performance. The catch: the most common pruning methods report a large number of disabled parameters, but the model you actually deploy is often no smaller, because the tensors keep their original dense shape for better hardware compatibility (this is also how pruning is implemented in stock PyTorch).
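To make the gap concrete, here is a minimal sketch in plain Python of mask-based magnitude pruning on a single weight matrix. It is an illustration only, not code from the preprint: the helper `magnitude_mask`, the 64x32 shape, and the 90% sparsity level are all assumptions chosen for the example. The point is that the mask zeroes most entries while the number of stored values stays the same.

```python
import random

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that disables the smallest-magnitude entries.

    Hypothetical helper for illustration; `sparsity` is the fraction to disable.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [[1] * len(row) for row in weights]
    threshold = flat[k - 1]
    return [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]

random.seed(0)
rows, cols = 64, 32  # assumed layer shape
weights = [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

mask = magnitude_mask(weights, sparsity=0.9)
masked = [[w * m for w, m in zip(w_row, m_row)]
          for w_row, m_row in zip(weights, mask)]

# The tensor still stores rows * cols values; only their contents changed.
stored = sum(len(row) for row in masked)
nonzero = sum(1 for row in masked for w in row if w != 0.0)
print(f"stored values: {stored}, nonzero values: {nonzero}")
```

Roughly 90% of the entries are now zero, yet the dense tensor that would be saved and shipped has exactly as many elements as before.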
Our new preprint, "Squeeze-Release: Iterative Pruning with Exact Structural Minimization," closes that gap. We rebuild a pruned network as a genuinely smaller dense one with the same output, then iterate to keep finding redundancy a single pass would miss. In practice this compresses the deployable model up to ~39× on a fully-connected network and ~14.8× on ConvNeXt-Tiny, at comparable accuracy.
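The idea of rebuilding a masked network as a smaller dense one can be sketched for a two-layer ReLU MLP. This is our simplified illustration under stated assumptions, not the algorithm from the preprint: the function `minimize` and the toy weights are hypothetical. It drops hidden units whose outgoing weights are all zero, and folds units whose incoming weights are all zero (so they output a constant) into the next layer's bias.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, x):
    # W is a list of rows, one per output unit; computes W @ x + b.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(W1, b1, W2, b2, x):
    return linear(W2, b2, relu(linear(W1, b1, x)))

def minimize(W1, b1, W2, b2):
    """Hypothetical sketch: remove hidden units that cannot vary the output."""
    b2_new = list(b2)
    keep = []
    for j in range(len(W1)):
        no_input = all(w == 0.0 for w in W1[j])
        no_output = all(row[j] == 0.0 for row in W2)
        if no_output:
            continue  # nothing downstream reads this unit
        if no_input:
            # The unit always outputs relu(b1[j]); fold that constant into b2.
            const = max(0.0, b1[j])
            for i, row in enumerate(W2):
                b2_new[i] += row[j] * const
            continue
        keep.append(j)
    W1_new = [W1[j] for j in keep]
    b1_new = [b1[j] for j in keep]
    W2_new = [[row[j] for j in keep] for row in W2]
    return W1_new, b1_new, W2_new, b2_new

# Toy masked network: 3 inputs, 4 hidden units, 2 outputs (values are made up).
W1 = [[0.5, 0.0, -1.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.5, -0.5, 0.0]]
b1 = [0.1, 0.7, -0.2, 0.0]
W2 = [[1.0, -2.0, 0.0, 0.0], [0.0, 0.5, 0.0, 3.0]]
b2 = [0.0, 0.25]

small = minimize(W1, b1, W2, b2)
x = [0.3, -1.2, 0.8]
y_full = forward(W1, b1, W2, b2, x)
y_small = forward(*small, x)
print(len(W1), "->", len(small[0]), "hidden units")
```

In this toy case the hidden layer shrinks from four units to two, and the outputs agree up to floating-point rounding, since folding a constant into the bias changes the order of summation.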
We also propose CompensatedLayerNorm, a modified LayerNorm that allows connections passing through LayerNorm to be pruned in a function-preserving way.
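Why LayerNorm needs special handling can be shown with a small sketch. LayerNorm's mean and variance are computed over all channels, so simply deleting channels changes the statistics and therefore the outputs of the channels that remain. The code below shows one plausible way to compensate, which is our assumption and not necessarily the formulation in the preprint: the hypothetical `compensated_layer_norm` keeps only the surviving channels but computes statistics as if the removed channels were still present with value exactly zero.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv + b for v, g, b in zip(x, gamma, beta)]

def compensated_layer_norm(x_kept, gamma_kept, beta_kept, full_width, eps=1e-5):
    """Hypothetical sketch: normalize kept channels using full-width statistics.

    Assumes every removed channel was exactly zero at the LayerNorm input.
    """
    removed = full_width - len(x_kept)
    mean = sum(x_kept) / full_width
    # Each removed zero channel contributes (0 - mean)^2 to the variance sum.
    sq = sum((v - mean) ** 2 for v in x_kept) + removed * mean ** 2
    inv = 1.0 / math.sqrt(sq / full_width + eps)
    return [g * (v - mean) * inv + b
            for v, g, b in zip(x_kept, gamma_kept, beta_kept)]

# Toy input with two exactly-zero channels (indices 1 and 3); values are made up.
x_full = [0.4, 0.0, -1.3, 0.0, 2.1, 0.7]
gamma = [1.0, 0.9, 1.1, 1.2, 0.8, 1.0]
beta = [0.0, 0.1, -0.1, 0.2, 0.0, 0.05]
keep = [0, 2, 4, 5]

reference = layer_norm(x_full, gamma, beta)
reduced = compensated_layer_norm(
    [x_full[i] for i in keep],
    [gamma[i] for i in keep],
    [beta[i] for i in keep],
    full_width=len(x_full),
)
naive = layer_norm([x_full[i] for i in keep],
                   [gamma[i] for i in keep],
                   [beta[i] for i in keep])
```

Here `reduced` matches the original outputs at the kept channels, while `naive` does not. Note that a standard LayerNorm generally produces nonzero outputs even at zero-input channels, so this sketch preserves the overall function only if whatever consumes the removed channels downstream has been pruned as well.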
Get this paper in your agent:
hf papers read 2606.14346
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash