mattmalcher mattmalcher2 committed on
Commit
638d86f
1 Parent(s): db37fcc

Update README.md (#1)


- Update README.md (c37a7bcc1365f129054a18a8a091ca53cf49d550)


Co-authored-by: Matthew Malcher <mattmalcher2@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +63 -3
README.md CHANGED
@@ -1,3 +1,63 @@
- ---
- license: llama3
- ---
+ ---
+ license: llama3
+ ---
+
+ I wanted to be able to go from the Meta model weights to an AWQ quantised model myself,
+ rather than grab the weights from [casperhansen](https://huggingface.co/casperhansen/llama-3-70b-instruct-awq) or elsewhere.
+
+ # First Attempt
+
+ Initially I tried running AutoAWQ on an AWS g5.12xlarge instance (4x A10), Ubuntu 22, CUDA 12.2, NVIDIA 535.113.01 drivers.
+
+ I tried different combinations of torch (2.1.2, 2.2.2), autoawq (0.2.4, 0.2.5) and transformers (4.38.2, 4.41.2), but I couldn't get it to work, even with the 8B model (which all the errors below are for). I kept getting errors like:
+
+ * autoawq 0.2.4, transformers 4.38.2, torch 2.1.2, no device map: failed at 3% with `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)`
+ * autoawq 0.2.4, transformers 4.38.2, torch 2.1.2, device map `"auto"`: failed at 16% with `& index < sizes[i] && "index out of bounds" failed.`
+ * autoawq 0.2.5, transformers 4.40.0, torch 2.1.2, no device map: failed at 3% with `File "{redacted}/.venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor assert torch.isnan(w).sum() == 0`
+ * autoawq 0.2.4, transformers 4.38.2, torch 2.2.2, no device map: failed at 3% with `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)`
+
+ The only thing that worked was setting `CUDA_VISIBLE_DEVICES=0` (i.e. restricting the run to a single GPU), but that is not an option for the 70B model, which does not fit in one GPU's VRAM.
+ Though the comment from casper [here](https://github.com/casper-hansen/AutoAWQ/issues/450#issuecomment-2065870629) makes me think quantising llama 3 70B with multiple GPUs should be possible.
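+
+ For reference, a minimal sketch of that single-device workaround (setting the variable in the shell before launching the script is equivalent; the key point is that it must happen before torch initialises CUDA):
+
+ ```python
+ import os
+
+ # Hide all but the first GPU; this must happen before torch/CUDA is initialised.
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
+ import torch  # imported after setting the variable on purpose
+
+ print(torch.cuda.device_count())  # should now report 1
+ ```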
+
+ # Working Approach
+
+ The following worked for me:
+
+ * Machine: vast.ai 2x A100 PCIe instance with AMD EPYC 9554, CUDA 12.2 (~half the price of the g5.12xlarge!)
+ * Container: `pytorch:2.2.0-cuda12.1-cudnn8-devel` image
+ * AutoAWQ @ [5f3785dc](https://github.com/casper-hansen/AutoAWQ/commit/5f3785dcaa107ca76f5fa5355f459370c86f82d6)
+
+ Followed the commands in the readme:
+
+ ```
+ git clone https://github.com/casper-hansen/AutoAWQ
+ cd AutoAWQ
+ pip install -e .
+ ```
+
+ Installed vim to edit the example script: `apt install vim`, then `vi examples/quantize.py`
+
+ Changed the model path to:
+ `meta-llama/Meta-Llama-3-70B-Instruct`
+
+ Changed the output path to:
+ `Meta-Llama-3-70B-Instruct-awq`
+
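+ For reference, the edited `examples/quantize.py` ends up looking roughly like the stock AutoAWQ example with those two paths swapped in. This is a sketch, with the `quant_config` values assumed to be the usual example defaults (4-bit, group size 128, GEMM kernel) rather than copied from that commit:
+
+ ```python
+ from awq import AutoAWQForCausalLM
+ from transformers import AutoTokenizer
+
+ model_path = "meta-llama/Meta-Llama-3-70B-Instruct"  # changed from the example default
+ quant_path = "Meta-Llama-3-70B-Instruct-awq"         # changed from the example default
+ quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
+
+ # Load the fp16 model and tokenizer (pulled from the Hub, hence the HF_TOKEN script below)
+ model = AutoAWQForCausalLM.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+ # Quantise against AutoAWQ's default calibration data, then save the AWQ weights
+ model.quantize(tokenizer, quant_config=quant_config)
+ model.save_quantized(quant_path)
+ tokenizer.save_pretrained(quant_path)
+ ```
+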
+ Used a small wrapper script to set the Hugging Face token so the llama 3 weights can be pulled:
+
+ ```
+ #!/usr/bin/env bash
+
+ export HF_TOKEN=${your token here - used to grab llama weights}
+
+ python quantize.py
+ ```
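+
+ If you would rather not put the token in a shell script, the same authentication can be done from Python with `huggingface_hub` (a sketch; `login()` is the standard `huggingface_hub` helper, not something from the AutoAWQ example):
+
+ ```python
+ from huggingface_hub import login
+
+ # Authenticate so the gated meta-llama weights can be downloaded
+ login(token="your token here")  # or call login() with no token to be prompted interactively
+ ```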
+
+ This worked and took ~100 minutes for the 70B model to quantise. I'm not sure if the second A100 was used:
+ once I set the thing running I couldn't figure out how to open a second SSH session to run `nvidia-smi`
+ or similar without joining the same tmux session running the quantisation, so I just left it to it.
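+
+ One way to answer the "was the second GPU used?" question without a second terminal would be to log per-device memory from inside the script once quantisation finishes (a sketch using standard `torch.cuda` calls; this is not part of the AutoAWQ example):
+
+ ```python
+ import torch
+
+ # Peak memory allocated on each visible GPU over the life of the process;
+ # a non-zero figure on cuda:1 would confirm the second A100 was actually used.
+ for i in range(torch.cuda.device_count()):
+     peak_gib = torch.cuda.max_memory_allocated(i) / 1024**3
+     print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): peak {peak_gib:.1f} GiB")
+ ```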