LLaMA 3.2 1B GPTQ Quantized Model
This repository hosts the LLaMA 3.2 1B Instruct model, quantized with GPTQ, a post-training quantization method, using the C4 dataset for calibration. The quantized model targets faster inference, lower memory consumption, and efficient deployment in resource-constrained environments.
Model Details
- Base Model: LLaMA 3.2 1B Instruct
- Quantization Method: GPTQ (4-bit groupwise quantization)
- Calibration Dataset: C4 Dataset
- File Format: `safetensors` for secure and efficient storage
- Purpose: Fine-tuned for instruction-following tasks and quantized for efficient inference
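Once downloaded, the quantization settings can be inspected directly from the repository's config. A quick check, assuming the repo id `AIAlbus/LlamaGPTQ` used in the usage example below and that its `config.json` records the GPTQ settings:

```python
from transformers import AutoConfig

# Repo id matches the usage example below; adjust if the repository is renamed.
config = AutoConfig.from_pretrained("AIAlbus/LlamaGPTQ")

# GPTQ checkpoints exported with Optimum store their settings in `quantization_config`.
print(config.quantization_config)  # e.g. bits, group_size, calibration dataset
```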
How the Model Was Quantized
The quantization was performed using the `GPTQQuantizer` from the Hugging Face Optimum library, with the following parameters:
- Bits: 4 (for reduced memory footprint)
- Block Quantization: Applied to `model.layers`
- Calibration Dataset: C4, used to capture representative activation statistics during calibration
- Maximum Sequence Length: 2048 tokens
- Hardware Used: 2 NVIDIA H100 GPUs
Quantization Script
The quantization process involved:
- Model Loading: The model and tokenizer were loaded from Hugging Face.
- GPTQ Quantization: The `GPTQQuantizer` was configured with the sequence length, calibration dataset, and target layers listed above.
- Saving Artifacts: The quantized model and tokenizer were saved to the `quantized_model/` directory.
View the quantization script here ().
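As a reference for the steps above, here is a minimal sketch of the flow using Optimum's `GPTQQuantizer`, assuming the base checkpoint `meta-llama/Llama-3.2-1B-Instruct` and the parameters listed earlier (4 bits, C4 calibration, `model.layers`, sequence length 2048); consult the linked script for the exact settings used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

# Base model id is an assumption; substitute the checkpoint actually used.
base_model = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

# Configure GPTQ with the parameters listed above.
quantizer = GPTQQuantizer(
    bits=4,                                 # 4-bit weights
    dataset="c4",                           # C4 calibration data
    block_name_to_quantize="model.layers",  # quantize the transformer blocks
    model_seqlen=2048,                      # max calibration sequence length
)

# Run calibration and quantize the weights.
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized model and tokenizer.
quantizer.save(quantized_model, "quantized_model/")
tokenizer.save_pretrained("quantized_model/")
```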
Intended Use
This model is designed for:
- Instruction-Following Tasks: Dialogue generation, question answering, and text completion
- Research: Ideal for studying the trade-offs between quantization and performance in LLMs
- Edge Deployment: Efficient for running on low-resource hardware, such as consumer-grade GPUs or edge devices
Usage Example
To use the model with the Hugging Face Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your model repo name
model_name = "AIAlbus/LlamaGPTQ"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the quantized model (GPTQ kernels, e.g. the auto-gptq or gptqmodel package, must be installed)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Perform inference
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
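Because the base model is an Instruct variant, prompts generally work better when formatted with the chat template. Reusing the `model` and `tokenizer` loaded above, a minimal sketch (assuming the repository ships the LLaMA 3.2 chat template in its tokenizer config):

```python
# Format the request as a chat turn and let the template add the generation prompt.
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=64)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```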