LLaMA 3.2 1B GPTQ Quantized Model
This repository hosts the LLaMA 3.2 1B Instruct model, quantized with GPTQ, a post-training quantization method, using the C4 dataset for calibration. The quantized model targets faster inference, lower memory consumption, and efficient deployment in resource-constrained environments.
Model Details
- Base Model: LLaMA 3.2 1B Instruct
- Quantization Method: GPTQ (4-bit groupwise quantization)
- Calibration Dataset: C4 Dataset
- File Format: `safetensors` for secure and efficient storage
- Purpose: Fine-tuned for instruction-following tasks and quantized for efficient inference
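Once downloaded, the quantization settings can be inspected directly from the repository's config. A quick check, assuming the repo id `AIAlbus/LlamaGPTQ` used in the usage example below and that its `config.json` records the GPTQ settings:

```python
from transformers import AutoConfig

# Repo id matches the usage example below; adjust if the repository is renamed.
config = AutoConfig.from_pretrained("AIAlbus/LlamaGPTQ")

# GPTQ checkpoints exported with Optimum store their settings in `quantization_config`.
print(config.quantization_config)  # e.g. bits, group_size, calibration dataset
```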
How the Model Was Quantized
The quantization was performed using the `GPTQQuantizer` from the Hugging Face Optimum library, with the following parameters:
- Bits: 4 (for reduced memory footprint)
- Block Quantization: Applied to `model.layers`
- Calibration Dataset: C4, used to capture representative activation statistics during calibration
- Maximum Sequence Length: 2048 tokens
- Hardware Used: 2 NVIDIA H100 GPUs
Quantization Script
The quantization process involved:
- Model Loading: The model and tokenizer were loaded from Hugging Face.
- GPTQ Quantization: The `GPTQQuantizer` was configured with the sequence length, calibration dataset, and target layers listed above.
- Saving Artifacts: The quantized model and tokenizer were saved to the `quantized_model/` directory.
View the quantization script here ().
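As a reference for the steps above, here is a minimal sketch of the flow using Optimum's `GPTQQuantizer`, assuming the base checkpoint `meta-llama/Llama-3.2-1B-Instruct` and the parameters listed earlier (4 bits, C4 calibration, `model.layers`, sequence length 2048); consult the linked script for the exact settings used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

# Base model id is an assumption; substitute the checkpoint actually used.
base_model = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

# Configure GPTQ with the parameters listed above.
quantizer = GPTQQuantizer(
    bits=4,                                 # 4-bit weights
    dataset="c4",                           # C4 calibration data
    block_name_to_quantize="model.layers",  # quantize the transformer blocks
    model_seqlen=2048,                      # max calibration sequence length
)

# Run calibration and quantize the weights.
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized model and tokenizer.
quantizer.save(quantized_model, "quantized_model/")
tokenizer.save_pretrained("quantized_model/")
```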
Intended Use
This model is designed for:
- Instruction-Following Tasks: Dialogue generation, question answering, and text completion
- Research: Ideal for studying the trade-offs between quantization and performance in LLMs
- Edge Deployment: Efficient for running on low-resource hardware, such as consumer-grade GPUs or edge devices
Usage Example
To use the model with the Hugging Face Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your model repo name
model_name = "AIAlbus/LlamaGPTQ"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the quantized model (GPTQ kernels, e.g. the auto-gptq or gptqmodel package, must be installed)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Perform inference
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
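Because the base model is an Instruct variant, prompts generally work better when formatted with the chat template. Reusing the `model` and `tokenizer` loaded above, a minimal sketch (assuming the repository ships the LLaMA 3.2 chat template in its tokenizer config):

```python
# Format the request as a chat turn and let the template add the generation prompt.
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=64)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```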