YutaLM-M2-bnb-4bit

YutaLM-M2-bnb-4bit is a lightweight 350M-parameter model with a hybrid architecture, fine-tuned specifically for Arabic chat and roleplay.

Developed under the LiteMind initiative, this model targets a well-known weakness of small language models: handling complex Arabic morphology and dialogue. In addition to tuning the selected target modules, the token embeddings (embed_tokens) and the output head (lm_head) were trained directly on Arabic data, so YutaLM-M2 delivers fluent, context-aware, and culturally nuanced Arabic conversation while keeping a very small hardware footprint.


🚀 Key Features

| Feature | Description |
|---|---|
| Tailored for Arabic Roleplay & Chat | Fine-tuned on conversational datasets to produce engaging, expressive, and grammatically sound Arabic. |
| Hybrid Architecture Optimization | Trained with gradient handling adapted to the model's specialized convolution (conv) and recurrence layers. |
| Enhanced Arabic Tokenization | Unlike naive fine-tunes, the vocabulary embeddings (embed_tokens) and the language-model head (lm_head) were explicitly trained to prevent letter-mixing and broken Arabic script in the output. |
| Ultra-Low Resource Footprint | Merged and pre-quantized to 4-bit with bitsandbytes (forced_merged_4bit), so it can be deployed and run quickly on consumer GPUs or free cloud tiers such as a Google Colab T4. |
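
As a rough check on the footprint claim: weight memory is approximately parameters × bits per weight ÷ 8 bytes. The sketch below applies that rule to the stated 350M parameters. It is a simplification that assumes every weight is stored at a single precision and ignores quantization constants, layers that bitsandbytes leaves unquantized (such as the token embeddings), activations, and runtime caches, so real memory use will be higher. The function name weight_memory_mb is hypothetical.

```python
def weight_memory_mb(num_params, bits_per_weight):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes).

    Simplification: assumes all weights use the same precision and ignores
    quantization metadata, unquantized layers, activations and caches.
    """
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 350_000_000  # parameter count stated in this card

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_mb(NUM_PARAMS, bits):,.0f} MB")
```

This prints roughly 1,400 MB for fp32, 700 MB for fp16, and 175 MB for 4-bit weights.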

📝 Prompt Template & Format

The model uses the standard ChatML (LFM) prompt format. For the best results in chat or roleplay, structure your inputs as follows:

<|im_start|>user
{Your Prompt Here}<|im_end|>
<|im_start|>assistant
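
For reference, here is a minimal Python sketch that renders a list of messages in the standard ChatML layout, where each turn is `<|im_start|>{role}`, a newline, the content immediately followed by `<|im_end|>`, and a newline, with an open assistant turn appended at the end. The function name build_chatml_prompt is hypothetical and not part of any library. In practice, tokenizer.apply_chat_template is the authoritative way to build prompts, because the model's bundled template may add tokens (for example a beginning-of-sequence token) that this sketch omits.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}, ...] as a ChatML string.

    Hypothetical helper; it mirrors only the ChatML markers and does not
    add any model-specific special tokens.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Arabic: "Hello Yuta"
print(build_chatml_prompt([{"role": "user", "content": "مرحباً يا يوتا"}]))
```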

⚠️ Note: Small models are prone to repetitive or broken output. To reduce these artifacts, it is recommended to apply a slight repetition penalty (1.05) and a moderate sampling temperature during inference.
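
To make the recommended penalty concrete, here is a minimal pure-Python sketch of the CTRL-style rule that transformers' RepetitionPenaltyLogitsProcessor applies when repetition_penalty is set: for every token id already present in the context (prompt plus generated text), a positive logit is divided by the penalty and a negative logit is multiplied by it, so both become less likely. The function name is hypothetical, and plain lists stand in for tensors.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.05):
    """Penalize every token id that already appears in the context.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it. Each seen id is penalized once, however often it
    occurred, because duplicates are collapsed with set().
    """
    penalized = list(logits)  # leave the caller's list untouched
    for token_id in set(seen_token_ids):
        score = penalized[token_id]
        penalized[token_id] = score * penalty if score < 0 else score / penalty
    return penalized


# Tokens 0 (positive logit) and 1 (negative logit) were already seen; token 2 was not.
print(apply_repetition_penalty([2.1, -0.5, 1.0], seen_token_ids=[0, 1, 0]))
# → [2.0, -0.525, 1.0]
```

With a penalty of 1.05 the adjustment is gentle (about 5%), which nudges the model away from loops without forbidding words it legitimately needs to repeat.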


💻 Quick Start & Inference

You can easily run this model using the Hugging Face transformers library. Ensure you have bitsandbytes and accelerate installed.

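import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "LiteMind/YutaLM-M2-bnb-4bit"

# Load the tokenizer and the model.
# The checkpoint is already quantized to 4-bit with bitsandbytes,
# so no extra quantization config is passed here.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # dtype for the layers that are not quantized
    device_map="auto"
)

# Prepare the conversation.
# Prompt (Arabic): "Hello Yuta, how can I program a small AI model?"
messages = [
    {"role": "user", "content": "مرحباً يا يوتا، كيف يمكنني برمجة نموذج ذكاء اصطناعي صغير؟"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)  # move inputs to wherever device_map placed the model

# Stream tokens to stdout as they are generated (prompt and special tokens hidden)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Generate a response.
# do_sample=True is required; otherwise temperature/top_k/top_p are ignored
# and generation falls back to greedy decoding.
print("Assistant: ")
_ = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.73,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.05,
    streamer=streamer
)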
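
When sampling is enabled (do_sample=True), transformers reshapes each step's next-token distribution with temperature, then top-k, then top-p before drawing a token. The pure-Python sketch below walks through those three steps on a toy vocabulary. The function names are hypothetical, and it illustrates the idea rather than reproducing the library's code; for example, it assumes top_k >= 1, whereas transformers treats top_k=0 as "disabled".

```python
import math
import random


def filter_probs(logits, temperature=0.73, top_k=50, top_p=0.9):
    """Next-token distribution after temperature, top-k and top-p filtering.

    Assumes temperature > 0 and top_k >= 1. Returns {token_id: probability}.
    """
    # 1. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep only the k highest-scoring ids, sorted best first.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors; subtracting the max avoids overflow.
    best = scaled[top[0]]
    weights = {i: math.exp(scaled[i] - best) for i in top}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # 3. Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in top:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


rng = random.Random(0)
toy_logits = [5.0, 4.0, 1.0, -2.0]  # hypothetical 4-token vocabulary
print(filter_probs(toy_logits))
print(sample_next_token(toy_logits, rng))
```

On this toy input, top-p = 0.9 keeps only the two strongest tokens, which is how these settings trim the unlikely tail that tends to produce garbled output in small models.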

⚙️ Development & Optimization

The model was fine-tuned with the Unsloth framework on a single GPU.

To preserve the model's general linguistic ability during adaptation, the training loss was computed only on the assistant's Arabic responses, with the prompt tokens masked out. In addition, the vocabulary layers (embed_tokens and lm_head) were trained so the model could absorb Arabic style and semantics, countering the degradation typically seen in small models after quantization.
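
The card does not include its training code. As an illustration of the masking idea only, the sketch below assigns the label -100 (the default ignore_index of PyTorch's cross-entropy loss) to every token outside the assistant's turns, so only response tokens contribute to the loss. The function name, the (role, token_ids) segment format, and the token ids are hypothetical; a real pipeline would derive segment boundaries from the ChatML markers, and whether the marker tokens themselves are masked is a design choice this card does not specify.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Flatten (role, token_ids) segments into input_ids and labels.

    Tokens from non-assistant turns get IGNORE_INDEX, so only the assistant's
    response tokens are scored. Causal-LM models in transformers shift the
    labels internally, so labels stay aligned with input_ids here.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Hypothetical token ids for one user turn and one assistant reply.
ids, labels = build_labels([("user", [11, 12, 13]), ("assistant", [21, 22])])
print(ids)     # → [11, 12, 13, 21, 22]
print(labels)  # → [-100, -100, -100, 21, 22]
```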


🛑 Limitations & Biases

  • Given its compact 350M-parameter size, YutaLM-M2 is best treated as a specialized creative and conversational assistant rather than a source of factual information.
  • It may hallucinate when given complex logical or mathematical problems.
  • The model is optimized for Arabic; its performance on mixed-language conversations or on English-language and coding tasks may be weaker than that of general-purpose base models.

📜 License

This model is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). You are free to use, modify, and host this model, provided that derivative works and modified versions, including modified versions made available to users over a network (for example, as a web service), are released under the same AGPL-3.0 terms.


🤝 Acknowledgements

Special thanks to the Unsloth team for providing the memory-efficient frameworks that make fine-tuning hybrid, low-parameter models accessible and highly performant.
