amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm

Introduction

This is a vLLM-compatible fp8 post-training quantization (PTQ) model based on Meta-Llama-3.1-405B-Instruct. For the detailed quantization scheme, refer to the official documentation of the AMD Quark 0.2.0 quantizer.

Quickstart

To run this fp8 model on the vLLM framework, follow the steps below.

Model Preparation

Build the rocm-vllm Docker image using this Dockerfile and launch a vLLM Docker container.

docker build -f Dockerfile.rocm -t vllm_test .
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest

Clone the baseline Meta-Llama-3.1-405B-Instruct.
Clone this fp8 model, then run the following inside the fp8 model folder to merge the split llama-*.safetensors files into a single llama.safetensors.

python merge.py
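
As an optional sanity check after the merge, the sketch below compares the tensor names in the split files with those in the merged file. It does not reproduce merge.py; it only reads the safetensors headers (an 8-byte little-endian length followed by a JSON header). The helper names `read_tensor_names` and `check_merge` are hypothetical, and the sketch assumes the shards and the merged file sit in the same folder.

```python
import json
import struct
from pathlib import Path


def read_tensor_names(path):
    """Return the set of tensor names listed in a safetensors file header."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the JSON header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {name for name in header if name != "__metadata__"}


def check_merge(model_dir, merged_name="llama.safetensors",
                shard_glob="llama-*.safetensors"):
    """Compare tensor names in the shards against the merged file.

    Returns (missing, extra): names present only in the shards, and
    names present only in the merged file.
    """
    model_dir = Path(model_dir)
    shard_names = set()
    for shard in sorted(model_dir.glob(shard_glob)):
        shard_names |= read_tensor_names(shard)
    merged_names = read_tensor_names(model_dir / merged_name)
    return shard_names - merged_names, merged_names - shard_names
```

Calling `check_merge(".")` from the fp8 model folder should return two empty sets if every tensor made it into the merged file. Only names are compared, not tensor data.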

Once the merged llama.safetensors has been created, copy it and llama.json into the snapshot directory of Meta-Llama-3.1-405B-Instruct with the commands below. The snapshot commit hash (069992c75aed59df00ec06c17177e76c63296a26 here) may differ on your machine.

cp llama.json ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
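
Because the snapshot hash can differ, the copy can also be scripted so that the hash is looked up rather than typed. The following is a minimal sketch, assuming the baseline model folder contains exactly one directory under `snapshots/`; the function names `find_snapshot_dir` and `copy_fp8_files` are hypothetical.

```python
import shutil
from pathlib import Path


def find_snapshot_dir(model_root):
    """Return the single snapshot directory under model_root/snapshots."""
    snapshots = [p for p in (Path(model_root) / "snapshots").iterdir() if p.is_dir()]
    if len(snapshots) != 1:
        # Refuse to guess when zero or several snapshots are present.
        raise RuntimeError(f"expected exactly one snapshot, found {len(snapshots)}")
    return snapshots[0]


def copy_fp8_files(fp8_dir, model_root, names=("llama.json", "llama.safetensors")):
    """Copy the fp8 files into the snapshot directory and return the new paths."""
    target = find_snapshot_dir(model_root)
    return [Path(shutil.copy(Path(fp8_dir) / name, target / name)) for name in names]
```

For example, `copy_fp8_files(".", Path.home() / "models--meta-llama--Meta-Llama-3.1-405B-Instruct")` run from the fp8 model folder would mirror the two `cp` commands above.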

Running fp8 model

# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py

# run_vllm_fp8.py
from vllm import LLM, SamplingParams
prompt = "Write me an essay about bear and knight"

model_name="models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp=8 # 8 GPUs

model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="float16", quantization="fp8", quantized_weights_path="/llama.safetensors")
sampling_params = SamplingParams(
    top_k=1,  # greedy decoding; top_k expects an integer
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)

Running fp16 model (For comparison)

# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py

# run_vllm_fp16.py
from vllm import LLM, SamplingParams
prompt = "Write me an essay about bear and knight"

model_name="models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp=8 # 8 GPUs
model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="bfloat16")
sampling_params = SamplingParams(
    top_k=1,  # greedy decoding; top_k expects an integer
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
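
Since both scripts decode greedily for a fixed 200 tokens, one simple way to compare the two runs is to check how far the generated token sequences agree. The sketch below is an illustration only: `common_prefix_len` and `agreement_report` are hypothetical helpers, and it assumes you have already extracted the token ids from each run as plain lists (in vLLM these are typically available as `result[0].outputs[0].token_ids`).

```python
def common_prefix_len(a, b):
    """Number of leading positions at which the two sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def agreement_report(fp8_tokens, ref_tokens):
    """Summarize how closely the fp8 output tracks the reference output."""
    compared = min(len(fp8_tokens), len(ref_tokens))
    matches = sum(x == y for x, y in zip(fp8_tokens, ref_tokens))
    return {
        "compared": compared,
        "common_prefix": common_prefix_len(fp8_tokens, ref_tokens),
        "match_rate": matches / compared if compared else 0.0,
    }
```

The common prefix is usually the more meaningful figure: once the two runs diverge, later tokens are conditioned on different text, so position-wise matches after that point say little about quantization error.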

fp8 gemm_tuning

To be updated soon.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.