Mistral-7B

This repository contains the complete configuration, tokenizer, and optimized model weights (model.safetensors) for the Mistral-7B architecture. It is ready for deployment, inference, and downstream fine-tuning tasks.


1. About Mistral-7B

Mistral-7B is a highly efficient 7-billion parameter language model engineered for high performance and low latency.

Key Features:

  • Sliding Window Attention (SWA): Handles longer sequences with a lower memory footprint ($8k$ context length).
  • Grouped-query Attention (GQA): Enables faster inference times and reduces cache size during generation.
  • Byte-fallback BPE Tokenizer: Ensures that unknown characters never break the text processing pipeline.

Repository Structure:

  • model.safetensors: The primary tensor weights ($\approx 7.34$ GB optimized format).
  • config.json & generation_config.json: Architecture settings and text generation parameters (temperature, top_p).
  • tokenizer.json & tokenizer_config.json: The vocabulary mapping and tokenization configurations.
  • chat_template.jinja: Built-in template for structuring conversational inputs.

2. Setup & Installation

Follow these steps to configure your environment and run the model locally.

Hardware Requirements:

  • GPU: Minimum 12GB VRAM (e.g., RTX 3060 12GB, RTX 4060 Ti 16GB, or T4/A10G on Cloud).
  • RAM: 16GB system memory minimum.

Step 1: Install Dependencies

Ensure you have Python installed, then run the following command to install the required libraries:

pip install transformers torch accelerate safetensors

Step 2: Python Implementation Script Create a python file (e.g., run_inference.py) and use the official Hugging Face transformers pipeline to run the model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sentinelapex/Mistral-7B"

print("Loading tokenizer and model...")
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load Model with FP16 precision for VRAM optimization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define your prompt
prompt = "Explain the concept of Artificial Intelligence in three simple sentences."

# Format input tokens
inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")

print("Generating response...")
# Generate tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n--- AI Response ---")
print(response)

‌3. Recommended Generation Parameters

For optimal results, use the following parameters during generation setup:

Parameter             Value                          Description
-------------------------------------------------------------------------------------------------
temperature            0.7               Balances creativity and factual consistency.
-------------------------------------------------------------------------------------------------
top_p	               0.9	             Filters out low-probability words for smoother sentences.
-------------------------------------------------------------------------------------------------
do_sample	           True	             Enables probabilistic sampling instead of greedy decoding.

Mistral-7B Throughput

Pic3

MMLU - KNOWLEDGE

Pic1

ACCURACY

Pic2

License

This model is distributed under the Apache-2.0 License. You are free to use, modify, and distribute it for both commercial and non-commercial applications.

Downloads last month
117
Safetensors
Model size
7B params
Tensor type
F32
·
BF16
·
I8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mengleap-stnap/Mistral-7B

Quantized
(189)
this model