
4-bit Quantized Llama 3 Model

Description

This repository hosts a 4-bit quantized version of the Llama 3 8B model. Quantization reduces memory usage and can speed up inference, which makes the model suitable for deployment in environments where computational resources are limited.

Model Details

  • Model Type: Transformer-based language model.
  • Quantization: 4-bit precision.
  • Advantages:
    • Memory Efficiency: 4-bit weights occupy roughly a quarter of the space of 16-bit weights, allowing deployment on devices with limited RAM.
    • Inference Speed: Can accelerate inference, depending on the hardware's ability to process low-bit computations.
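To put the memory claim in numbers, here is a rough back-of-the-envelope sketch. It assumes the roughly 8.03B parameters reported for this model and counts only the weights. The helper name `weight_memory_gib` is made up for illustration. Real usage will be higher, since quantization constants, activations, the KV cache, and any layers kept in higher precision are ignored here.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8.03e9  # assumption: parameter count reported for this model

for label, bits in [("FP16", 16), ("4-bit", 4)]:
    # Weights only; runtime overhead is not included.
    print(f"{label}: about {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the FP16 footprint, which is the difference between fitting on a small consumer GPU and not fitting at all.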

How to Use

To utilize this model efficiently, follow the steps below:

Loading the Quantized Model

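Pass load_in_4bit=True so that the weights are loaded in 4-bit precision (this path relies on the bitsandbytes and accelerate packages being installed); device_map="auto" places the layers on the available devices: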

from transformers import AutoModelForCausalLM

model_4bit = AutoModelForCausalLM.from_pretrained("SweatyCrayfish/llama-3-8b-quantized", device_map="auto", load_in_4bit=True)
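Once the model is loaded, generating text follows the usual Transformers pattern. The sketch below is one possible way to do it, not the author's documented workflow. It assumes that the repository also ships tokenizer files, that bitsandbytes and a CUDA-capable GPU are available, and it expresses the 4-bit request through a `BitsAndBytesConfig` object, which recent Transformers releases recommend over the bare `load_in_4bit` keyword. The helper name `generate_text` and its defaults are hypothetical.

```python
def generate_text(prompt: str,
                  repo_id: str = "SweatyCrayfish/llama-3-8b-quantized",
                  max_new_tokens: int = 64) -> str:
    """Load the model in 4-bit and return a completion for `prompt`."""
    # Imported lazily so that defining this helper does not need the GPU stack.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)

    # Assumption: float16 compute dtype; adjust for your hardware.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    # Assumption: the repository includes tokenizer files.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, device_map="auto", quantization_config=quant_config)

    # Move the encoded prompt to the device that holds the model's inputs.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate_text("Explain 4-bit quantization in one sentence."))
```

The helper reloads the model on every call to keep the sketch short; in practice you would load the model and tokenizer once and reuse them.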

Adjusting Precision of Components

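Modules that are not quantized, such as the layer norms, are converted to torch.float16 by default. Pass torch_dtype to choose a different precision for them:

import torch
from transformers import AutoModelForCausalLM

model_4bit = AutoModelForCausalLM.from_pretrained("SweatyCrayfish/llama-3-8b-quantized", load_in_4bit=True, torch_dtype=torch.float32)
# Llama decoder layers expose input_layernorm and post_attention_layernorm
# (there is no model.decoder or final_layer_norm attribute, which belong to OPT-style models).
print(model_4bit.model.layers[-1].input_layernorm.weight.dtype)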
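To check precision across the whole model rather than a single layer, a small tally of parameter dtypes can help. This is a minimal sketch: `dtype_summary` is a hypothetical helper, and it only assumes the object exposes `named_parameters()`, as PyTorch modules do. With bitsandbytes, the packed 4-bit weights typically show up as `torch.uint8`, while the non-quantized modules report the dtype chosen through `torch_dtype`.

```python
from collections import Counter


def dtype_summary(model) -> Counter:
    """Count parameter tensors per dtype for any object with named_parameters()."""
    return Counter(str(param.dtype) for _, param in model.named_parameters())


# Usage with the model loaded above:
# for dtype, count in dtype_summary(model_4bit).most_common():
#     print(dtype, count)
```

The tally counts tensors, not elements, so it shows which precisions are present rather than how much memory each one takes.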

Citation

Original repository and citations:

@article{llama3modelcard,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}

Model size: 8.03B params (Safetensors)
Tensor type: FP16