# SuperBlock

SuperBlock combines two techniques for efficient neural network training and inference: Supermask and Block Compressed Sparse Row (BSR).

### Supermask
[Supermask](https://arxiv.org/abs/2207.00670) is a technique for applying structured sparsity to neural networks using a learned mask. It works by learning a continuous mask (scores) that is applied element-wise to the weights of a neural network layer. The mask scores are learned separately from the weights and are thresholded based on a target sparsity level to obtain a binary mask. The mask, which is learned during training, determines which weights are kept and which are pruned.

During inference, the binary mask is applied element-wise to the weights, pruning the weights that correspond to a 0 in the mask and resulting in a sparse network that can be computed efficiently.
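
The following is a minimal conceptual sketch of how a block-wise binary mask could be derived from learned scores. The shapes, names, and top-k thresholding are illustrative assumptions, not the repo's actual training code:

```
import torch

# Illustrative only: one learned score per 32x32 tile of a 768x768 weight.
sparsity, tile = 0.9, 32
weight = torch.randn(768, 768)
scores = torch.rand(768 // tile, 768 // tile)

# Keep the highest-scoring (1 - sparsity) fraction of tiles, prune the rest.
k = int((1 - sparsity) * scores.numel())
threshold = scores.flatten().topk(k).values.min()
tile_mask = (scores >= threshold).float()

# Expand the tile mask to element granularity and apply it to the weights.
mask = tile_mask.repeat_interleave(tile, 0).repeat_interleave(tile, 1)
pruned_weight = weight * mask
```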

### Block Compressed Sparse Row Format (BSR)
[The BSR format](https://pytorch.org/docs/main/sparse.html#sparse-bsr-tensor) is a sparse matrix representation that stores dense sub-blocks of non-zero elements instead of individual non-zero elements. The matrix is divided into equal-sized blocks, and only the non-zero blocks are stored.

The BSR format is efficient for sparse matrices with a block structure, where non-zero elements tend to cluster in dense sub-blocks. It reduces storage requirements and enables efficient matrix operations on the non-zero blocks.

Currently, the BSR format is optimized for NVIDIA A100 GPUs only.
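
The short, self-contained example below shows PyTorch's built-in BSR conversion on a block-sparse tensor; the matrix size and 32x32 block size are arbitrary choices for illustration:

```
import torch

# A 768x768 matrix in which roughly 90% of the 32x32 tiles are zeroed out.
dense = torch.randn(768, 768)
tile_mask = (torch.rand(24, 24) > 0.9).float()
dense = dense * tile_mask.repeat_interleave(32, 0).repeat_interleave(32, 1)

# Convert to BSR: only the non-zero 32x32 blocks are stored.
bsr = dense.to_sparse_bsr(blocksize=(32, 32))
print(bsr.values().shape)                      # (num_nonzero_blocks, 32, 32)
print(torch.allclose(bsr.to_dense(), dense))   # True: round-trips to the dense tensor
```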

## Setup
To use SuperBlock, you will need:
* [PyTorch](https://pytorch.org/get-started/locally/)

To train the model or evaluate accuracy, you will need:
* ImageNet2012-blurred dataset

At least one GPU:
* A100 or H100

## Installation
* Clone this repo
    ```
    git clone https://github.com/pytorch-labs/superblock.git
    cd superblock
    ```
* Create a new conda environment
    ```
    conda create -n superblock
    conda activate superblock
    ```
* Install PyTorch. For best performance, we recommend the `2.3.0.dev20240305+cu121` nightly build
    ```
    pip install --pre torch==2.3.0.dev20240305+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
    pip install --pre torchvision==0.18.0 --no-deps
    ```

## Benchmarking
Baseline:
```
python benchmark.py \
    --model vit_b_16 \
    --batch-size 256 \
    > /dev/null
```
Result:
```
532.1160546875 ms
```

80% sparsity, block size 64 (random weights):
```
python benchmark.py --model vit_b_16 \
    --batch-size 256 \
    --sparsity-linear 0.8 \
    --sp-linear-tile-size 64 \
    --sparsify-weights \
    --bsr 64 \
    > /dev/null
```
Result:
```
393.864453125 ms
```

## Training
Please refer to [TRAINING.md](TRAINING.md) for training from scratch. We use [Torchvision](https://github.com/pytorch/vision/tree/main/references/classification) as our training framework. Supermask can be applied during training.

To apply Supermask, the following arguments are available:

* Apply Supermask to linear layers:
    ```
    --sparsity-linear
    --sp-linear-tile-size
    ```
* Apply Supermask to conv1x1 layers:
    ```
    --sparsity-conv1x1
    --sp-conv1x1-tile-size
    ```
* Apply Supermask to all other convolutional layers:
    ```
    --sparsity-conv
    --sp-conv-tile-size
    ```
* Skip the first transformer layer and/or last linear layer (ViT only):
    ```
    --skip-last-layer-sparsity
    --skip-first-transformer-sparsity
    ```

For example, if you would like to train a `vit_b_16` from scratch using Supermask, you can use the respective torchvision command found in [TRAINING.md](TRAINING.md) and append the Supermask arguments:
```
torchrun --nproc_per_node=8 train.py \
    --model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra \
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema \
    --sparsity-linear 0.9 --sp-linear-tile-size 32
```
This command trains a `vit_b_16` with 90% sparsity applied to linear layers using 32x32 tiles.

Please run `python train.py --help` for a full list of available arguments.

## Evaluation

To run an evaluation of a Supermask-trained model, you can use [evaluate.py](evaluate.py). Our current version achieves a significant speedup with float32 only (not float16); hence, to illustrate the speedup, we do not pass `--amp` in the example commands below.

```
MODEL_PATH=<put the path of the trained checkpoint here>
IMAGENET_PATH=<put the path of ImageNet dataset here>
NGPUS=1 # put number of available GPUs here
```

* Offline sparsification with BSR:
    ```
    torchrun --nproc_per_node=${NGPUS} evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear 0.9 --sp-linear-tile-size 32 --weights-path ${MODEL_PATH} --data-path ${IMAGENET_PATH} --sparsify-weights --bsr 32
    ```
    This command applies 90% sparsity to linear layers using 32x32 tiles, loads the model weights from `${MODEL_PATH}`, loads the ImageNet validation set from the specified path, applies offline sparsification to the weights, and converts the sparse weights to BSR format with a block size of 32. It is recommended to set `--bsr` to the same value as the tile size.

* Online sparsification without BSR:
    ```
    torchrun --nproc_per_node=${NGPUS} evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear 0.9 --sp-linear-tile-size 32 --weights-path ${MODEL_PATH} --data-path ${IMAGENET_PATH}
    ```
    This is similar to the previous command, but it does not apply offline sparsification or BSR conversion; instead, sparsity is applied on the fly during evaluation (see the conceptual sketch after this list).
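
The difference between the two modes can be pictured with plain tensors. This is only a rough, self-contained sketch under assumed shapes, names, and a 32x32 block size; it is not the actual `evaluate.py` internals:

```
import torch

weight = torch.randn(768, 768)
# A binary Supermask that zeroes roughly 90% of the 32x32 tiles.
mask = (torch.rand(24, 24) > 0.9).float().repeat_interleave(32, 0).repeat_interleave(32, 1)

# Offline sparsification: bake the mask into the stored weight once,
# then (optionally) convert it to BSR before running evaluation.
offline_weight = (weight * mask).to_sparse_bsr(blocksize=(32, 32))

# Online sparsification: keep the dense weight and re-apply the mask
# inside every forward pass.
def online_forward(x):
    return x @ (weight * mask).t()
```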

Please run `python evaluate.py --help` for a full list of available arguments.

Results (1x A100):
* Baseline
    ```
    Test: Total time: 0:02:11
    Test: Acc@1 78.392 Acc@5 93.592
    ```

* Sparsity = 0.9, Tile Size = 32, Online Sparsification, BSR = None
    ```
    Test: Total time: 0:01:52
    Test: Acc@1 76.092 Acc@5 92.656
    ```

* Sparsity = 0.9, Tile Size = 32, Offline Sparsification, BSR = None
    ```
    Test: Total time: 0:01:54
    Test: Acc@1 76.092 Acc@5 92.656
    ```

* Sparsity = 0.9, Tile Size = 32, Offline Sparsification, BSR = 32
    ```
    Test: Total time: 0:01:25
    Test: Acc@1 76.092 Acc@5 92.656
    ```

## Pretrained Weights

### Download:
Instead of training from scratch, if you'd like to use the Supermask weights of `vit_b_16` trained on privacy-mitigated ImageNet-blurred, you can download them here:
```
SPARSITY=0.80 # Checkpoints available for: 0.70, 0.80, 0.82, 0.84, 0.86, 0.88, 0.90
BLOCK_SIZE=32 # Checkpoints available for: 16, 32, 64
```

```
mkdir checkpoints
# For the baseline,
wget https://huggingface.co/facebook/superblock-vit-b-16/resolve/main/checkpoints/baseline.pth -P checkpoints/
# For sparsified checkpoints,
wget https://huggingface.co/facebook/superblock-vit-b-16/resolve/main/checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth -P checkpoints/
```

### Benchmark:
```
python benchmark.py --model vit_b_16 \
    --batch-size 256 \
    --sparsity-linear ${SPARSITY} \
    --sp-linear-tile-size ${BLOCK_SIZE} \
    --sparsify-weights \
    --bsr ${BLOCK_SIZE} \
    --weights-path ./checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth \
    > /dev/null
```
Result:
```
530.342578125 ms
```

### Evaluate:
8 x A100 GPUs:
```
torchrun --nproc_per_node=8 evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear ${SPARSITY} --sp-linear-tile-size ${BLOCK_SIZE} --bsr ${BLOCK_SIZE} --sparsify-weights --weights-path checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth --data-path ${IMAGENET_PATH}
```
Result:
```
Test: Total time: 0:01:01
Test: Acc@1 77.644 Acc@5 93.554
```

1 x A100 GPU:
```
torchrun --nproc_per_node=1 evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear ${SPARSITY} --sp-linear-tile-size ${BLOCK_SIZE} --bsr ${BLOCK_SIZE} --sparsify-weights --weights-path checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth --data-path ${IMAGENET_PATH}
```
Result:
```
Test: Total time: 0:01:51
Test: Acc@1 77.644 Acc@5 93.554
```

## License
SuperBlock is released under the [MIT license](https://github.com/pytorch-labs/superblock?tab=MIT-1-ov-file#readme).