The meta-llama/Llama-3.1-70B-Instruct model has been quantized using AutoRound and serialized in the GPTQ format at 4-bit precision.
Quantization reduced the model size by 70% while retaining 99% of the original model's accuracy.
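As a rough sanity check on the size figure, the sketch below estimates the storage cost per weight of a group-wise 4-bit format relative to a 16-bit baseline. The group size of 128, the 16-bit scale, and the 4-bit zero point per group are assumptions chosen for illustration; the actual settings used for this checkpoint are not stated here, and the function names are hypothetical.

```python
def effective_bits_per_weight(weight_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per quantized weight, including per-group metadata.

    group_size, scale_bits and zero_bits are assumed values for illustration,
    not the confirmed settings of this checkpoint.
    """
    # Each group of weights shares one scale and one zero point.
    return weight_bits + (scale_bits + zero_bits) / group_size


def size_reduction(baseline_bits=16, **kwargs):
    """Fractional size reduction relative to a baseline_bits-per-weight model."""
    return 1.0 - effective_bits_per_weight(**kwargs) / baseline_bits


if __name__ == "__main__":
    print(f"{effective_bits_per_weight():.3f} bits per weight")
    print(f"{size_reduction():.1%} smaller than the 16-bit baseline")
```

Under these assumptions the quantized layers shrink by a little under 75%. Any tensors kept at 16-bit precision would lower the overall figure, which is broadly in line with the 70% reduction reported above.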

How to run

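from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# Load the 4-bit GPTQ checkpoint; device_map="auto" places layers on the available devices.
model = AutoModelForCausalLM.from_pretrained(quantized_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

# Tokenize the prompt and move the tensors to the model's device.
text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate up to 50 new tokens, then decode and print the result.
generated = model.generate(**inputs, max_new_tokens=50)
output = tokenizer.decode(generated[0], skip_special_tokens=True)
print(output)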

This quantization process was conducted by Sofya to make large-scale language models more accessible.
