Intro

First of all, we would like to thank PartAI for their work on advancing large language models for Persian by releasing the "Dorna" model.

The quantized version of the "Dorna" language model requires only 6GB of GPU memory for loading, while the original model requires 40GB of GPU memory.

This model is quantized with the AWQ (Activation-aware Weight Quantization) method, which reduces the size of the model with minimal loss of accuracy by storing the weights in a lower-precision data type.
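To make the idea of "changing the type of the weights" concrete, here is a minimal sketch of plain group-wise 4-bit quantization in pure Python. It is an illustration under stated assumptions, not the AWQ algorithm itself: AWQ additionally rescales important weight channels using activation statistics before rounding, and that step is omitted here. The 4-bit width, the packing of eight codes into one 32-bit integer, the sample weights, and the helper names (`quantize_group`, `dequantize_group`, `pack_int4`) are all assumptions made for the example.

```python
# Illustrative only: plain group-wise 4-bit weight quantization.
# This is NOT the full AWQ algorithm, which also rescales salient
# weight channels based on activation statistics before rounding.


def quantize_group(weights, bits=4):
    """Map a group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not divide by zero.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


def pack_int4(codes):
    """Pack eight 4-bit codes into each 32-bit integer word."""
    packed = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, c in enumerate(codes[i:i + 8]):
            word |= (c & 0xF) << (4 * j)
        packed.append(word)
    return packed


# Hypothetical weights for one quantization group.
weights = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.02]

codes, scale, w_min = quantize_group(weights)
restored = dequantize_group(codes, scale, w_min)
packed = pack_int4(codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print("packed 32-bit words:", len(packed))
print("max abs rounding error:", max_error)
```

In this sketch, eight weights end up in a single 32-bit word plus one scale and one offset per group, and the rounding error of each weight is bounded by half the scale. That is the basic trade-off between size and accuracy that low-bit weight quantization makes.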

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the AWQ-quantized weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("amir-ma71/Dorna-Llama3-8B-Instruct-AWQ")
model = AutoModelForCausalLM.from_pretrained(
    "amir-ma71/Dorna-Llama3-8B-Instruct-AWQ",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Conversation: a system prompt plus one user question in Persian
# ("What is the capital of Iran?").
messages = [
    {"role": "system",
     "content": "You are a helpful Persian assistant. Please answer questions in the asked language."},
    {"role": "user", "content": "پایتخت ایران کجاست؟"},
]

# Render the messages with the model's chat template and move them to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the regular EOS token or the Llama 3 end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Sample up to 256 new tokens.
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Drop the prompt tokens and decode only the generated reply.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

Output

پایتخت ایران، تهران است. تهران شهر بزرگ و مهمی در مرکز ایران است که جمعیت قابل توجهی دارد و از مهم‌ترین شهرهای این کشور است.

(English translation: "The capital of Iran is Tehran. Tehran is a large and important city in the center of Iran that has a considerable population and is one of the most important cities in the country.")

Contributing

Feel free to contact me:
