license: llama3
I wanted to be able to go from the Meta model weights to an AWQ quantised model myself, rather than grab the weights from casperhansen or elsewhere.
First Attempt
Initially I tried running AutoAWQ on an AWS g5.12xlarge instance (4xA10), Ubuntu 22, CUDA 12.2, NVIDIA 535.113.01 drivers.
I tried different combinations of torch (2.1.2, 2.2.2), autoawq (0.2.4, 0.2.5) and transformers (4.38.2, 4.41.2), but I couldn't get it to work, even with the 8B model (which all of the errors below are for). Each entry lists the versions in the order autoawq, transformers, torch. I kept getting errors like:
- 0.2.4 4.38.2 2.1.2, no device map, failed at 3%
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
- 0.2.4 4.38.2 2.1.2, device map "auto", failed at 16%
[...] && index < sizes[i] && "index out of bounds" failed.
- 0.2.5 4.40.0 2.1.2, no device map, failed at 3%
File "{redacted}/.venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor
assert torch.isnan(w).sum() == 0
- 0.2.4 4.38.2 2.2.2, no device map, failed at 3%
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
The only thing that worked was setting CUDA_VISIBLE_DEVICES=0 (or any other single device), but this would not work for the 70B model, because a single GPU does not have enough VRAM. That said, the comment from casper here makes me think quantising llama 3 70B with multiple GPUs should be possible.
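A minimal sketch of the single-device workaround, assuming you would rather pin the GPU from inside the Python script than in the shell. The variable has to be set before CUDA is initialised in the process, so doing it before importing torch is the safe option. The helper name `pin_to_single_gpu` is made up for this example.

```python
import os


def pin_to_single_gpu(index: int = 0) -> str:
    """Expose only one GPU to CUDA by setting CUDA_VISIBLE_DEVICES.

    Call this before CUDA is initialised in the process; the simplest way
    to guarantee that is to call it before importing torch or awq.
    """
    if index < 0:
        raise ValueError("GPU index must be non-negative")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = pin_to_single_gpu(0)
print(f"CUDA_VISIBLE_DEVICES={visible}")
# Only import torch / awq after this point, so they see a single device.
```

This has the same effect as prefixing the command with CUDA_VISIBLE_DEVICES=0, and the same limitation: the whole model has to fit on that one card.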
Working Approach
The following worked for me:
Machine: vast.ai 2xA100 PCIe instance with AMD EPYC 9554, CUDA 12.2 (roughly half the price of the g5.12xlarge!)
Container: pytorch:2.2.0-cuda12.1-cudnn8-devel image
AutoAWQ @ 5f3785dc
Followed commands in the readme:
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
Installed vim to edit the example script: apt install vim, then vi examples/quantize.py
Changed model path to:
meta-llama/Meta-Llama-3-70B-Instruct
Changed output path to:
Meta-Llama-3-70B-Instruct-awq
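With those two edits, the example script looks roughly like the sketch below. It is reconstructed from memory of AutoAWQ's examples/quantize.py rather than copied from that commit, so treat the quant_config values and the from_pretrained keyword arguments as assumptions and check them against your own checkout. The function name `quantize_model` and the upper-case constants are made up for illustration; the imports sit inside the function so the file can be loaded on a machine without awq installed.

```python
MODEL_PATH = "meta-llama/Meta-Llama-3-70B-Instruct"
QUANT_PATH = "Meta-Llama-3-70B-Instruct-awq"

# Assumed to match the example script's defaults: 4-bit weights,
# group size 128, zero-point quantisation, GEMM kernels.
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def quantize_model(model_path: str = MODEL_PATH, quant_path: str = QUANT_PATH) -> None:
    # Imported lazily because both packages are heavy to load.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    # Load the full-precision model and tokenizer (llama 3 needs HF_TOKEN set).
    model = AutoAWQForCausalLM.from_pretrained(
        model_path, low_cpu_mem_usage=True, use_cache=False
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Run AWQ calibration and quantisation, then write the result to disk.
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

Calling quantize_model() with no arguments runs the whole pipeline with the two paths above.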
Used a script to set the token so we can pull llama 3:
#!/usr/bin/env bash
export HF_TOKEN="<your token here>"  # used to grab llama weights
python quantize.py
This worked; the 70B model took ~100 mins to quantise. I'm not sure whether the second A100 was used: once I set it running I couldn't figure out how to open a second ssh session to run nvidia-smi or similar without joining the same tmux session that was running the quantisation, so I just left it to it.
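One way around the tmux problem, sketched under the assumption that nvidia-smi is on the PATH inside the container: poll it from a background thread in the same process and print per-GPU utilisation and memory use, so the log itself shows whether the second A100 is doing anything. The function names and the polling interval are made up for this example; the query flags are standard nvidia-smi options.

```python
import subprocess
import threading
from typing import Dict, List, Optional

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output: str) -> List[Dict[str, int]]:
    """Parse 'index, util, mem' CSV lines (util in %, memory in MiB)."""
    stats = []
    for line in output.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def log_gpu_stats(interval_s: float = 60.0, stop: Optional[threading.Event] = None) -> None:
    """Print GPU stats every interval_s seconds until the stop event is set."""
    stop = stop or threading.Event()
    while not stop.is_set():
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        print(parse_gpu_stats(out), flush=True)
        stop.wait(interval_s)


def start_gpu_logger(interval_s: float = 60.0) -> threading.Event:
    """Start logging in a daemon thread; set the returned event to stop it."""
    stop = threading.Event()
    threading.Thread(target=log_gpu_stats, args=(interval_s, stop), daemon=True).start()
    return stop
```

Calling start_gpu_logger() near the top of quantize.py would interleave these lines with the quantisation progress output. If the second GPU's memory stays near zero for the whole run, it was not used.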