Ministral-8B-Instruct-2410-4bit (Optimized via Unsloth)

This repository contains a 4-bit quantized, text-only version of Mistral AI's Ministral-8B-Instruct-2410. The model was quantized and patched locally with Unsloth to reduce its VRAM footprint and make training and inference practical on consumer-grade hardware (for example, 16 GB GPUs such as the RTX 4070 Ti SUPER).

🦥 Key Features & Optimization

Architecture: Text-only causal LM (MistralForCausalLM). Unlike multi-modal variants, this model has its vision configuration stripped out to avoid extra VRAM overhead and corrupted text generation ("word salad").
Quantization: 4-bit NormalFloat (nf4) via bitsandbytes embedded directly into the shards.
Memory Footprint: Reduced from 32 GB to **5.75 GB**, which fits on 16 GB GPUs for both reasoning-oriented fine-tuning and inference.
Vocabulary Size: 131,072 tokens, offering excellent multi-lingual compression, particularly for non-Latin scripts like Pashto.
Attention Mechanism: Mixes full_attention and sliding_attention layers (36 layers total) to retain long-range context over long sequences.
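
As a rough sanity check on the memory figures above, the following sketch estimates weight storage from a parameter count and the number of bits per parameter. It assumes a round 8 billion parameters and decimal gigabytes, and the function name estimate_weight_gb is illustrative, not part of any library. The ideal 4-bit estimate comes out below the reported 5.75 GB; a plausible reason is that some tensors stay at higher precision and nf4 stores per-block scaling constants, but that breakdown is an assumption and was not measured here.

```python
def estimate_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumption: a round 8B parameters

fp32_gb = estimate_weight_gb(N_PARAMS, 32)  # full-precision baseline
nf4_gb = estimate_weight_gb(N_PARAMS, 4)    # ideal 4-bit lower bound, no overhead

print(f"fp32: {fp32_gb:.1f} GB, ideal 4-bit: {nf4_gb:.1f} GB")
```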

🚀 Quick Start (Inference & Fine-Tuning)

To load this model without triggering the weight-reversion error (NotImplementedError) in Hugging Face transformers, load it directly through Unsloth's FastLanguageModel, which applies its patches at load time.

Prerequisites

Make sure you have unsloth, torch, and transformers installed in your environment:

pip install unsloth

1. Fast Inference Code

import torch
from unsloth import FastLanguageModel

max_seq_length = 4096
dtype = None # Auto-detects (bfloat16 for modern GPUs)
load_in_4bit = True

# Load optimized 4-bit model directly from this Hub repo
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "nassimjp/Ministral-8B-Instruct-2410-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    device_map = "auto"
)

FastLanguageModel.for_inference(model)

# Standard chat template test
# Prompt (Pashto): "Hello, say in the Pashto language who you are."
messages = [{"role": "user", "content": "سلام، په پښتو ژبه ووایه چې ته څوک یې؟"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The chat template normally already includes the BOS token, so do not add special tokens again
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
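
The decode call above prints the prompt together with the reply, because generate returns the prompt tokens followed by the newly generated ones. If only the reply is wanted, one option is to slice off the prompt before decoding. The helper below is a hypothetical, framework-free sketch (new_token_ids is not part of Unsloth or transformers) that shows the slicing on plain lists; with real tensors the equivalent would be outputs[0][inputs["input_ids"].shape[1]:].

```python
from typing import List, Sequence

def new_token_ids(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the tokens generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return list(output_ids[prompt_length:])

# Toy example: 3 prompt tokens followed by 2 generated tokens
prompt = [1, 17, 42]
generated = prompt + [99, 2]
reply_ids = new_token_ids(generated, len(prompt))
print(reply_ids)
```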

2. Fine-Tuning Prep (PEFT/LoRA Setup)

If you are preparing this model for downstream tasks (such as fine-tuning on specialized Pashto reasoning or chain-of-thought data), attach LoRA adapters to the attention and MLP projections like this:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Crucial for 16GB VRAM hardware safety
    random_state = 3407,
    use_rslora = False,
)
print("✅ Ready for fine-tuning.")
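
To get a feel for how many parameters this LoRA configuration trains, the sketch below counts adapter weights: each adapted linear layer with input size d_in and output size d_out adds r * (d_in + d_out) parameters. The layer dimensions used here (hidden size 4096, MLP size 12288, 32 query heads, 8 key/value heads, head dimension 128) are assumptions about the base model and should be checked against its config.json; only r = 16, the target module names, and the 36-layer count come from this card.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

# Assumed dimensions; verify against the model's config.json
HIDDEN, MLP, HEAD_DIM = 4096, 12288, 128
N_HEADS, N_KV_HEADS, N_LAYERS, R = 32, 8, 36, 16

# (input size, output size) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, N_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, N_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, N_KV_HEADS * HEAD_DIM),
    "o_proj": (N_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
print(f"{total:,} trainable LoRA parameters ({total / 8e9:.2%} of ~8B)")
```

Under these assumed shapes the adapters amount to well under 1% of the base weights, which is why LoRA fine-tuning can fit alongside the 4-bit model on a 16 GB GPU. If the object returned by get_peft_model is a PEFT model, its print_trainable_parameters() method reports the actual figure.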

⚠️ Important Configuration Notes

Padding Token: The base Mistral models do not define a default padding token. When the model is loaded via Unsloth, pad_token is automatically set to <pad> so that sequences of different lengths can be batched safely.
Model Type: Fixed to model_type: "mistral". To keep loading and generation stable, avoid manually converting the model to vision/conditional blocks.
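
To illustrate why the padding token noted above matters for batching, here is a minimal, framework-free sketch: sequences of different lengths are padded to a common length, and an attention mask marks which positions hold real tokens. The function name pad_batch and the pad id 0 are placeholders for illustration; in practice the tokenizer does this, and the real id of <pad> comes from tokenizer.pad_token_id.

```python
from typing import List, Tuple

def pad_batch(seqs: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token id lists to equal length and build attention masks."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

PAD_ID = 0  # placeholder; the real id is tokenizer.pad_token_id
ids, mask = pad_batch([[5, 6, 7], [8]], PAD_ID)
print(ids)
print(mask)
```

This sketch pads on the right for readability. For batched generation with decoder-only models, left padding is the usual choice so that new tokens directly follow the prompt.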

📜 Acknowledgements & License

Base Model: Developed by Mistral AI. Released under the mistral-research license.
Quantization Pipeline: Powered by Unsloth AI.
