---
license: llama3
---

I wanted to be able to go from the Meta model weights to an AWQ quantised model myself,
rather than grab the weights from [casperhansen](https://huggingface.co/casperhansen/llama-3-70b-instruct-awq) or elsewhere.

# First Attempt

Initially I tried running AutoAWQ on an AWS g5.12xlarge instance (4x A10), Ubuntu 22, CUDA 12.2, NVIDIA 535.113.01 drivers.

I tried different combinations of torch (2.1.2, 2.2.2), AutoAWQ (0.2.4, 0.2.5) and transformers (4.38.2, 4.40.0, 4.41.2), but I couldn't get it to work, even with the 8B model (all of the errors below are from 8B runs). I kept getting errors like:

| AutoAWQ | Transformers | Torch | Device map | Failed at | Error |
|---------|--------------|-------|------------|-----------|-------|
| 0.2.4 | 4.38.2 | 2.1.2 | None | 3% | `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)` |
| 0.2.4 | 4.38.2 | 2.1.2 | `"auto"` | 16% | `index < sizes[i] && "index out of bounds"` assertion failed |
| 0.2.5 | 4.40.0 | 2.1.2 | None | 3% | `assert torch.isnan(w).sum() == 0` in `awq/quantize/quantizer.py`, line 69, in `pseudo_quantize_tensor` |
| 0.2.4 | 4.38.2 | 2.2.2 | None | 3% | `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)` |

  
The only thing that worked was setting `CUDA_VISIBLE_DEVICES=0` (i.e. restricting the run to a single device), but that would not work for the 70B model, which doesn't fit in a single A10's VRAM.
That said, the comment from casper [here](https://github.com/casper-hansen/AutoAWQ/issues/450#issuecomment-2065870629) makes me think quantising Llama 3 70B with multiple GPUs should be possible.
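
For reference, the single-GPU workaround can also be applied from inside the script rather than on the command line; a minimal sketch (my own, not from the AutoAWQ examples):

```
# Sketch of the single-GPU workaround described above (not from the AutoAWQ examples).
# Hiding all but one GPU must happen before torch is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1
```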

# Working Approach

The following worked for me:

* Machine: a vast.ai 2x A100 PCIe instance with an AMD EPYC 9554, CUDA 12.2 (~half the price of the g5.12xlarge!)
* Container: the `pytorch:2.2.0-cuda12.1-cudnn8-devel` image
* AutoAWQ @ [5f3785dc](https://github.com/casper-hansen/AutoAWQ/commit/5f3785dcaa107ca76f5fa5355f459370c86f82d6)


Followed commands in the readme:
```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```
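
Before kicking off the long run it's worth checking that the container's torch build can actually see both A100s; a minimal sketch (my own addition, not part of the AutoAWQ README):

```
# Quick environment check (my own addition, not part of the AutoAWQ README).
import torch

print(torch.__version__)          # expect 2.2.0 from the container image
print(torch.version.cuda)         # expect 12.1 from the container image
print(torch.cuda.device_count())  # expect 2 on the 2x A100 instance
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))
```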

Installed vim to edit the example script: `apt install vim`, `vi examples/quantize.py`

Changed model path to:
`meta-llama/Meta-Llama-3-70B-Instruct`

Changed output path to:
`Meta-Llama-3-70B-Instruct-awq`
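
After those two edits, `examples/quantize.py` looks roughly like the sketch below. This follows the standard AutoAWQ example structure; the exact contents at the pinned commit may differ slightly, and the `quant_config` values are the example's usual defaults rather than anything I changed.

```
# examples/quantize.py after the two path edits (sketch; the script at the
# pinned commit may differ slightly in detail).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"  # changed model path
quant_path = "Meta-Llama-3-70B-Instruct-awq"         # changed output path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model and tokenizer from the Hub.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantisation (this is the ~100 minute step for 70B).
model.quantize(tokenizer, quant_config=quant_config)

# Write out the 4-bit weights and the tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```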
 
Used a small wrapper script to set the Hugging Face token so the Llama 3 weights can be pulled:

```
#!/usr/bin/env bash

export HF_TOKEN="<your token here>"  # used to grab the Llama 3 weights

python quantize.py
```

This worked, taking ~100 minutes for the 70B model to quantise. I'm not sure whether the second A100 was actually used:
once I set the run going I couldn't figure out how to open a second SSH session to run `nvidia-smi`
or similar without joining the same tmux session running the quantisation, so I just left it to it.