Tuti 🦜

This is a Gemma 2 9B model, fine-tuned with Unsloth using 4-bit quantization and LoRA (QLoRA) on Persian literature datasets that I curated, created, or found.

Use cases and datasets

Word IPA Detection

I fine-tuned this model with QLoRA and uploaded only the LoRA adapter, so it can be used like this:

# pip install unsloth
from unsloth import FastLanguageModel
from transformers import TextStreamer

model_name = "cnababaie/tuti"
max_seq_length = 4096  # Adjust as needed
dtype = None  # None lets Unsloth auto-detect the dtype
load_in_4bit = True  # Load the base model in 4-bit, matching the QLoRA setup

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
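FastLanguageModel.for_inference(model)  # Switch the model to inference mode

# Alpaca-style prompt: instruction, input, and an empty response slot
alpaca_prompt_template = """### Instruction: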
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "IPA این کلمه چیست؟",  # instruction: "What is the IPA of this word?"
        "جوینده",  # input: the word to transcribe ("seeker")
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

This outputs the correct IPA: "/d͡ʒuːjænde/ (juyande)".
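Since `TextStreamer` prints the prompt along with the completion by default, it can be convenient to wrap prompt construction and answer extraction in small helpers. The following is a minimal sketch, not part of Unsloth or Transformers: `build_prompt` and `extract_response` are hypothetical names, and the sketch assumes the decoded text contains the `### Response:` marker from the template above and that the end-of-sequence token is rendered as `<eos>`.

```python
# Same Alpaca-style template as in the loading example above.
ALPACA_PROMPT_TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str, text: str) -> str:
    """Fill the template, leaving the response slot empty for generation."""
    return ALPACA_PROMPT_TEMPLATE.format(instruction, text, "")


def extract_response(decoded: str, eos_token: str = "<eos>") -> str:
    """Return the text after the last response marker, without the EOS token.

    If the marker is missing, the whole decoded string is returned (stripped).
    The "<eos>" default is an assumption about how the tokenizer renders EOS.
    """
    _, _, answer = decoded.rpartition(RESPONSE_MARKER)
    return answer.replace(eos_token, "").strip()


prompt = build_prompt("IPA این کلمه چیست؟", "جوینده")

# Simulated decoded output (prompt + completion) standing in for real model output.
decoded = prompt + "/d͡ʒuːjænde/ (juyande)<eos>"
print(extract_response(decoded))
```

With the real model, the string passed to `extract_response` would be something like `tokenizer.batch_decode(model.generate(**inputs, max_new_tokens=64))[0]` in place of the simulated `decoded` value.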

IPA Sources

IPA-dict: Monolingual wordlists with pronunciation information in IPA
Wiktionary: The Persian edition does not include IPA, but the English edition (which covers many words and phrases from languages other than English) contains many Persian words with their IPA

Persian Text Romanization

inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "این متن چه تلفظی داره؟",  # instruction: "How is this text pronounced?"
        "خاک به خاطر بارش زیاد باران گل شد.",  # input: "The soil turned to mud because of heavy rainfall."
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

This outputs the romanized pronunciation: "Xāk be xāter-e bāreš-e ziyād-e bārān gel šod."

Romanization Sources

http://alefbaye2om.org/: Contains PDFs of romanized Persian text

Persian Poem Translation

inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "ترجمه",  # instruction: "Translate"
        "برخیز بتا بیا ز بهر دل ما\r\nحل کن به جمال خویشتن مشکل ما\r\nیک کوزه شراب تا به هم نوش کن\r\nزآن پیش که کوزه‌ها کنند از گل ما",
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

This outputs a rhymed English translation that preserves the content of the original poem:

"Arise, O idol, for our heart's sake, Solve our troubles with your beauty's make. One pot of wine, let's drink it all, Before they make pots from our clay's fall.".

Poem Translation Sources

A list I created of randomly selected poems from Ganjoor, paired with their translations
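The poem input in the translation example separates its verses with `\r\n`. Assuming that this mirrors how the poem and translation pairs were formatted (an inference from the example, not something stated here), a small helper can build the input string from a list of verses. `join_verses` is a hypothetical name used only for illustration.

```python
def join_verses(verses: list[str], separator: str = "\r\n") -> str:
    """Join verse lines with the separator seen in the translation example.

    Blank entries are dropped and surrounding whitespace is stripped, so the
    result never starts, ends, or doubles up on the separator.
    """
    cleaned = [verse.strip() for verse in verses if verse.strip()]
    return separator.join(cleaned)


# The four verses from the translation example, one per list item.
poem = join_verses([
    "برخیز بتا بیا ز بهر دل ما",
    "حل کن به جمال خویشتن مشکل ما",
    "یک کوزه شراب تا به هم نوش کن",
    "زآن پیش که کوزه‌ها کنند از گل ما",
])
print(repr(poem))
```

The resulting `poem` string can be passed as the second argument to `alpaca_prompt_template.format(...)` in place of the hand-written literal.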
