mattmalcher2
committed on
Commit
•
c37a7bc
1
Parent(s):
db37fcc
Update README.md
README.md
CHANGED
@@ -1,3 +1,63 @@
|
|
1 |
-
---
|
2 |
-
license: llama3
|
3 |
-
---
|
1 |
+
---
|
2 |
+
license: llama3
|
3 |
+
---
|
4 |
+
|
5 |
+
I wanted to be able to go from the Meta model weights to an AWQ quantised model myself,
|
6 |
+
rather than grab the weights from [casperhansen](https://huggingface.co/casperhansen/llama-3-70b-instruct-awq) or elsewhere.
|
7 |
+
|
8 |
+
# First Attempt
|
9 |
+
|
10 |
+
Initially I tried running AutoAWQ on an AWS g5.12xlarge instance (4xA10G), Ubuntu 22, CUDA 12.2, NVIDIA 535.113.01 drivers.
|
11 |
+
|
12 |
+
I tried different combinations of torch (2.1.2, 2.2.2), autoawq (0.2.4, 0.2.5) and transformers (4.38.2, 4.41.2), but I couldn't get it to work, even with the 8B model (which all of the errors below are for). Each entry lists the autoawq, transformers and torch versions, in that order. I kept getting errors like:
|
13 |
+
|
14 |
+
* 0.2.4 4.38.2 2.1.2, No device map, failed at 3% `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)`
|
15 |
+
* 0.2.4 4.38.2 2.1.2, device map `"auto"`, failed at 16% `& index < sizes[i] && "index out of bounds"` failed.
|
16 |
+
* 0.2.5 4.40.0 2.1.2, No device map, failed at 3% `File "{redacted}/.venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor assert torch.isnan(w).sum() == 0`
|
17 |
+
* 0.2.4 4.38.2 2.2.2, No device map, failed at 3% `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)`
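
When trying combinations like the ones above, it helps to record exactly which versions are installed before each run. A minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package names are assumed to be the PyPI distribution names:

```python
from importlib import metadata

# Assumption: PyPI distribution names, in the same order as the entries above.
PACKAGES = ("autoawq", "transformers", "torch")


def installed_versions(packages=PACKAGES):
    """Return {package: version}, with None for anything not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so it can be pasted next to an error report.
for name, version in installed_versions().items():
    print(f"{name}=={version}")
```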
|
18 |
+
|
19 |
+
|
20 |
+
The only thing that worked was setting `CUDA_VISIBLE_DEVICES=0` (or any other single device), but this would not work for the 70B model, which needs more VRAM than a single GPU provides.
|
21 |
+
Though the comment from casper [here](https://github.com/casper-hansen/AutoAWQ/issues/450#issuecomment-2065870629) makes me think quantising llama 3 70B with multiple GPUs should be possible.
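
For reference, the single-device workaround can also be applied from inside Python rather than in the shell. This is a rough sketch, not something taken from AutoAWQ; the helper `restrict_to_gpus` is hypothetical. The variable has to be set before anything initialises CUDA, so it goes ahead of the torch and awq imports:

```python
import os


def restrict_to_gpus(*indices):
    """Expose only the given GPU indices to CUDA in this process."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before torch / awq are imported, otherwise CUDA may already
# have enumerated every device on the machine.
restrict_to_gpus(0)

# import torch, awq, ... only after this point
```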
|
22 |
+
|
23 |
+
# Working Approach
|
24 |
+
|
25 |
+
The following worked for me:
|
26 |
+
|
27 |
+
Machine: vast.ai 2xA100 PCIe instance with AMD EPYC 9554, CUDA 12.2 (~ half the price of the g5.12xlarge!)
|
28 |
+
Container: `pytorch:2.2.0-cuda12.1-cudnn8-devel` image
|
29 |
+
|
30 |
+
AutoAWQ @ [5f3785dc](https://github.com/casper-hansen/AutoAWQ/commit/5f3785dcaa107ca76f5fa5355f459370c86f82d6)
|
31 |
+
|
32 |
+
|
33 |
+
Followed commands in the readme:
|
34 |
+
```
|
35 |
+
git clone https://github.com/casper-hansen/AutoAWQ
|
36 |
+
cd AutoAWQ
|
37 |
+
pip install -e .
|
38 |
+
```
|
39 |
+
|
40 |
+
Installed vim to edit the example script: `apt install vim`, `vi examples/quantize.py`
|
41 |
+
|
42 |
+
Changed model path to:
|
43 |
+
`meta-llama/Meta-Llama-3-70B-Instruct`
|
44 |
+
|
45 |
+
Changed output path to:
|
46 |
+
`Meta-Llama-3-70B-Instruct-awq`
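
With those two changes, the edited script looks roughly like the sketch below. The two paths are the ones given above; everything else (the `QUANT_CONFIG` values, the `from_pretrained` keyword arguments, and wrapping the steps in a `quantize` function) is my recollection of the upstream example and should be treated as an assumption, so check it against `examples/quantize.py` in your checkout:

```python
MODEL_PATH = "meta-llama/Meta-Llama-3-70B-Instruct"
QUANT_PATH = "Meta-Llama-3-70B-Instruct-awq"

# Assumption: the upstream example's default 4-bit settings, left unchanged.
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def quantize(model_path=MODEL_PATH, quant_path=QUANT_PATH, quant_config=QUANT_CONFIG):
    """Load the full-precision model, quantise it with AWQ and save the result."""
    # Imported lazily so this file can be read or imported without the GPU stack.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(
        model_path, low_cpu_mem_usage=True, use_cache=False
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Runs calibration and replaces the linear layers with quantised versions.
    model.quantize(tokenizer, quant_config=quant_config)

    # Write the quantised weights and the tokenizer side by side.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

Calling `quantize()` starts the download and the quantisation run; nothing happens on import.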
|
47 |
+
|
48 |
+
Used a script to set the token so we can pull Llama 3:
|
49 |
+
|
50 |
+
```
|
51 |
+
#!/usr/bin/env bash
|
52 |
+
|
53 |
+
export HF_TOKEN="<your token here>"  # used to grab the llama weights
|
54 |
+
|
55 |
+
python quantize.py
|
56 |
+
```
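
Since the Llama 3 repo is gated, a missing token only shows up once the download starts. A small check at the top of the Python script can fail earlier with a clearer message. This is a sketch; `require_hf_token` is a hypothetical helper, not part of AutoAWQ or huggingface_hub:

```python
import os


def require_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment, or fail early."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before running the quantisation script."
        )
    return token
```

Calling `require_hf_token()` before loading the model stops the run immediately if the variable was never exported.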
|
57 |
+
|
58 |
+
This worked, and took ~100 mins for the 70B model to quantise. I'm not sure if the second A100 was used:
|
59 |
+
once I set the thing running I couldn't figure out how to open a second ssh session to run nvidia-smi
|
60 |
+
or similar without joining the same tmux session running the quantisation, so just left it to it.
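
One way to answer the "was the second GPU used?" question without a second session would be to have the quantisation script log GPU usage itself. The sketch below is untested on that machine; the function names and the log file name are made up for illustration, and it assumes `nvidia-smi` is on the PATH and reports plain integers for both fields:

```python
import subprocess
import threading
import time

# nvidia-smi query that prints one "index, utilisation %, memory MiB" row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_used_mib": int(mem)})
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def log_gpu_usage(path, interval_s, stop):
    """Append a timestamped sample to `path` until `stop` is set."""
    with open(path, "a") as log:
        while not stop.is_set():
            log.write(f"{time.strftime('%H:%M:%S')} {sample_gpus()}\n")
            log.flush()
            stop.wait(interval_s)


# Usage inside the quantisation script (hypothetical file name):
# stop = threading.Event()
# threading.Thread(target=log_gpu_usage, args=("gpu_usage.log", 60, stop), daemon=True).start()
# ... run the quantisation ...
# stop.set()
```

The log can then be read after the run to see whether memory on GPU 1 ever rose.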
|
61 |
+
|
62 |
+
|
63 |
+
|