arxiv:2606.14346

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Published on Jun 12

Submitted by Roman Denkin on Jun 15

Uppsala University


Authors: Roman Denkin

Abstract

The Squeeze-Release compression method combines pruning with exact structural minimization to produce substantially smaller neural networks while maintaining accuracy, and extends to transformer architectures through CompensatedLayerNorm.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy that a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. At comparable accuracy, Squeeze-Release makes the deployable network 39× smaller than the unpruned model on a fully-connected network and 14.8× smaller on a modern CNN (ConvNeXt-Tiny). In addition, we prove that the rewrite can be extended to transformer architectures.
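
A minimal sketch of what an exact structural rewrite can look like, assuming a two-layer ReLU perceptron stored as plain Python lists. The helper names (`affine`, `mlp`, `minimize`) and the two removal rules are illustrative assumptions, not the paper's algorithm: a hidden unit whose outgoing weights are all zero is deleted, and a unit whose incoming weights are all zero always emits the constant `relu(b1[j])`, which is folded into the next layer's bias before the unit is deleted. Both rules leave the forward function unchanged up to floating-point rounding.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def affine(w: Matrix, x: Vector, b: Vector) -> Vector:
    """Return w @ x + b for a row-major weight matrix w."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(w, b)]


def mlp(w1: Matrix, b1: Vector, w2: Matrix, b2: Vector, x: Vector) -> Vector:
    """Two-layer perceptron: y = W2 relu(W1 x + b1) + b2."""
    return affine(w2, relu(affine(w1, x, b1)), b2)


def minimize(w1: Matrix, b1: Vector, w2: Matrix,
             b2: Vector) -> Tuple[Matrix, Vector, Matrix, Vector]:
    """Rebuild a masked MLP as a smaller dense MLP with the same forward function."""
    new_b2 = list(b2)
    keep = []
    for j in range(len(w1)):
        out_col = [row[j] for row in w2]
        if all(v == 0.0 for v in out_col):
            continue  # unit j is never read downstream: delete it
        if all(v == 0.0 for v in w1[j]):
            # Unit j ignores the input and always emits relu(b1[j]): fold it into b2.
            const = max(0.0, b1[j])
            for i, v in enumerate(out_col):
                new_b2[i] += v * const
            continue
        keep.append(j)
    new_w1 = [list(w1[j]) for j in keep]
    new_b1 = [b1[j] for j in keep]
    new_w2 = [[row[j] for j in keep] for row in w2]
    return new_w1, new_b1, new_w2, new_b2


# Masked network: unit 1 has no incoming weights, unit 2 has no outgoing weights.
w1 = [[1.0, -2.0], [0.0, 0.0], [0.5, 0.5]]
b1 = [0.1, 0.3, 0.0]
w2 = [[2.0, 1.0, 0.0], [-1.0, 4.0, 0.0]]
b2 = [0.0, 0.5]

small = minimize(w1, b1, w2, b2)
print(len(w1), "->", len(small[0]))  # 3 -> 1
for x in ([0.7, -0.2], [-1.0, 3.0]):
    print(mlp(w1, b1, w2, b2, x), mlp(*small, x))
```

The hidden width shrinks from 3 to 1 while both networks produce the same outputs. A complete rewrite would also have to cover input and output channels, convolutions, and residual connections; this sketch only shows the core idea on a single hidden layer.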


Community

gluck3d

Paper author · Paper submitter · about 8 hours ago

Neural networks are often far bigger than they need to be, and "pruning" refers to removing the components that contribute little to a model's performance. The catch: the most common pruning methods report a large number of disabled parameters, but the model you actually deploy is often no smaller, because the tensors keep their original dense shape for hardware compatibility (this is also how PyTorch's built-in pruning utilities implement it).
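
To make the size gap concrete, here is a small illustration, assuming magnitude-based unstructured pruning on a plain-Python weight matrix (`magnitude_mask` and `apply_mask` are hypothetical helpers). Like mask-based pruning in frameworks, it zeroes entries but keeps every slot in storage. PyTorch's `torch.nn.utils.prune` keeps the original tensor as `weight_orig` plus a `weight_mask` buffer of the same shape until `prune.remove` is called, and even after that the weight stays dense.

```python
from typing import List

Matrix = List[List[float]]
Mask = List[List[int]]


def magnitude_mask(w: Matrix, sparsity: float) -> Mask:
    """Mask the smallest-magnitude `sparsity` fraction of entries (ties at the cutoff too)."""
    flat = sorted(abs(v) for row in w for v in row)
    k = int(sparsity * len(flat))  # number of entries to prune
    if k == 0:
        return [[1] * len(row) for row in w]
    threshold = flat[k - 1]
    return [[0 if abs(v) <= threshold else 1 for v in row] for row in w]


def apply_mask(w: Matrix, mask: Mask) -> Matrix:
    """Zero masked entries but keep the tensor shape, as mask-based pruning does."""
    return [[v * m for v, m in zip(row, mrow)] for row, mrow in zip(w, mask)]


w = [[0.9, -0.05, 0.3, 0.01],
     [-0.02, 0.7, -0.04, 0.6],
     [0.03, -0.8, 0.5, 0.02]]
pruned = apply_mask(w, magnitude_mask(w, sparsity=0.5))

stored = sum(len(row) for row in pruned)                      # values held in memory
nonzero = sum(1 for row in pruned for v in row if v != 0.0)  # values that still matter
print(stored, nonzero)  # 12 6
```

Half of the entries are zero, yet all 12 values are still stored; only a structural rewrite changes the shape.
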
Our new preprint, "Squeeze-Release: Iterative Pruning with Exact Structural Minimization," closes that gap. We rebuild a pruned network as a genuinely smaller dense one with the same output, then iterate to keep finding redundancy that a single pass would miss. In practice this compresses the deployable model by up to ~39× on a fully-connected network and ~14.8× on ConvNeXt-Tiny, at comparable accuracy.
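
The release step and the surrounding cycle can be sketched as follows. The noise calibration (a standard deviation of `scale` times the spread of the surviving weights), the ordering of training, pruning, and minimization inside each cycle, and the callables `train_fn`, `prune_fn`, and `minimize_fn` are all hypothetical placeholders; the paper's actual calibration and schedule may differ.

```python
import random
import statistics
from typing import Callable, List

Matrix = List[List[float]]
Net = List[Matrix]


def release(w: Matrix, scale: float = 0.01, seed: int = 0) -> Matrix:
    """Replace exact zeros with small Gaussian noise; nonzero weights are untouched.

    Hypothetical calibration: noise std = scale * population std of the nonzero weights.
    """
    rng = random.Random(seed)
    nonzero = [v for row in w for v in row if v != 0.0]
    ref = statistics.pstdev(nonzero) if len(nonzero) > 1 else 0.0
    sigma = scale * (ref if ref > 0.0 else 1.0)
    return [[rng.gauss(0.0, sigma) if v == 0.0 else v for v in row] for row in w]


def squeeze_release(
    net: Net,
    cycles: int,
    train_fn: Callable[[Net], Net],
    prune_fn: Callable[[Net], Net],
    minimize_fn: Callable[[Net], Net],
) -> Net:
    """Hypothetical Squeeze-Release loop over a list of weight matrices."""
    for c in range(cycles):
        if c > 0:
            # Release: leftover zeros inside the compacted tensors become trainable again.
            net = [release(w, seed=c) for w in net]
        net = train_fn(net)               # fine-tuning (placeholder)
        net = minimize_fn(prune_fn(net))  # squeeze: prune, then exact minimization
    return net
```

Ending each cycle with minimization means the returned network is always the compacted one; release only happens at the start of the next cycle, so the noise is never shipped untrained.
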
We also propose CompensatedLayerNorm, a modified LayerNorm that allows connections passing through LayerNorm to be pruned in a function-preserving way.
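
One way a LayerNorm can stay function-preserving after channel removal is sketched below; this is an assumption about the general idea, not the paper's definition of CompensatedLayerNorm. If a removed channel is exactly zero at the LayerNorm input, it adds nothing to the sum or to the sum of squares, so normalizing the kept channels by the original width `full_width` reproduces the original mean and variance. The removed channels' LayerNorm outputs (generally `beta_j - gamma_j * mean / std`, not zero) are assumed to be unused downstream.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], gamma: Sequence[float], beta: Sequence[float],
               eps: float = 1e-5) -> List[float]:
    """Standard LayerNorm over one feature vector."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv + b for v, g, b in zip(x, gamma, beta)]


def compensated_layer_norm(x_kept: Sequence[float], gamma_kept: Sequence[float],
                           beta_kept: Sequence[float], full_width: int,
                           eps: float = 1e-5) -> List[float]:
    """LayerNorm on the kept channels, with statistics taken over `full_width` channels.

    Assumes every removed channel is exactly zero at the input, so it contributes
    nothing to the sum or to the sum of squares.
    """
    mean = sum(x_kept) / full_width
    mean_sq = sum(v * v for v in x_kept) / full_width
    var = max(mean_sq - mean * mean, 0.0)  # E[x^2] - E[x]^2, clamped against rounding
    inv = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv + b for v, g, b in zip(x_kept, gamma_kept, beta_kept)]


x_full = [0.4, 0.0, -1.2, 0.0, 2.5]  # channels 1 and 3 are always zero
gamma = [1.0, 0.7, 1.5, 0.9, 0.8]
beta = [0.1, -0.2, 0.0, 0.3, -0.1]
kept = [0, 2, 4]

ref = layer_norm(x_full, gamma, beta)
out = compensated_layer_norm([x_full[i] for i in kept], [gamma[i] for i in kept],
                             [beta[i] for i in kept], full_width=len(x_full))
naive = layer_norm([x_full[i] for i in kept], [gamma[i] for i in kept],
                   [beta[i] for i in kept])
print([ref[i] for i in kept])
print(out)    # matches the line above up to rounding
print(naive)  # differs: statistics over 3 channels instead of 5
```

Applying a plain LayerNorm to the shortened vector would shift both statistics and change every kept output, which is why some form of compensation is needed before channels can be removed from a LayerNorm-equipped stream.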
