Update README.md
README.md
````diff
@@ -79,7 +79,7 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 In order to run inference with Llama 3.1 405B Instruct AWQ in INT4, both `torch` and `autoawq` need to be installed as:
 
 ```bash
-pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
+pip install "torch>=2.2.0,<2.3.0" torchvision autoawq --upgrade
 ```
 
 Then, the latest version of `transformers` needs to be installed, 4.43.0 or higher, as:
````
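The hunk cuts off just before the `transformers` install command itself, which is not visible in this diff. A plausible sketch, assuming it follows the same `pip` pattern as the command above and pins the minimum version stated in the prose:

```bash
pip install "transformers>=4.43.0" --upgrade
```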
````diff
@@ -107,7 +107,6 @@ model = AutoAWQForCausalLM.from_pretrained(
     torch_dtype=torch.float16,
     low_cpu_mem_usage=True,
     device_map="auto",
-    fuse_layers=True,
 )
 
 inputs = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to('cuda')
````
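Putting the two hunks together, the updated README snippet presumably reads as the sketch below. The imports, the model ID, the example prompt, and the `generate` call are assumptions filled in from context (the hunk headers show the `AutoAWQForCausalLM.from_pretrained(` call and the final `batch_decode` print); the rest is taken verbatim from the diff.

```python
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed model ID -- the diff does not show it
model_id = "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    # fuse_layers=True was removed by this commit
)

# Assumed example prompt in chat format
prompt = [{"role": "user", "content": "What's Deep Learning?"}]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to('cuda')

# Assumed generation call; `outputs` feeds the print shown in the first hunk header
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```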