Mistral-7B

This repository contains the complete configuration, tokenizer, and optimized model weights (model.safetensors) for the Mistral-7B architecture. It is ready for deployment, inference, and downstream fine-tuning tasks.

1. About Mistral-7B

Mistral-7B is a highly efficient 7-billion parameter language model engineered for high performance and low latency.

Key Features:

Sliding Window Attention (SWA): Handles longer sequences with a lower memory footprint ($8k$ context length).
Grouped-query Attention (GQA): Enables faster inference times and reduces cache size during generation.
Byte-fallback BPE Tokenizer: Ensures that unknown characters never break the text processing pipeline.

Repository Structure:

model.safetensors: The primary tensor weights ($\approx 7.34$ GB optimized format).
config.json & generation_config.json: Architecture settings and text generation parameters (temperature, top_p).
tokenizer.json & tokenizer_config.json: The vocabulary mapping and tokenization configurations.
chat_template.jinja: Built-in template for structuring conversational inputs.

2. Setup & Installation

Follow these steps to configure your environment and run the model locally.

Hardware Requirements:

GPU: Minimum 12GB VRAM (e.g., RTX 3060 12GB, RTX 4060 Ti 16GB, or T4/A10G on Cloud).
RAM: 16GB system memory minimum.

Step 1: Install Dependencies

Ensure you have Python installed, then run the following command to install the required libraries:

pip install transformers torch accelerate safetensors

Step 2: Python Implementation Script Create a python file (e.g., run_inference.py) and use the official Hugging Face transformers pipeline to run the model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sentinelapex/Mistral-7B"

print("Loading tokenizer and model...")
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load Model with FP16 precision for VRAM optimization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define your prompt
prompt = "Explain the concept of Artificial Intelligence in three simple sentences."

# Format input tokens
inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")

print("Generating response...")
# Generate tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n--- AI Response ---")
print(response)

‌3. Recommended Generation Parameters

For optimal results, use the following parameters during generation setup:

Parameter             Value                          Description
-------------------------------------------------------------------------------------------------
temperature            0.7               Balances creativity and factual consistency.
-------------------------------------------------------------------------------------------------
top_p	               0.9	             Filters out low-probability words for smoother sentences.
-------------------------------------------------------------------------------------------------
do_sample	           True	             Enables probabilistic sampling instead of greedy decoding.

Mistral-7B Throughput

MMLU - KNOWLEDGE

ACCURACY

License

This model is distributed under the Apache-2.0 License. You are free to use, modify, and distribute it for both commercial and non-commercial applications.

Downloads last month: 117

Safetensors

Model size

7B params

Tensor type

F32

BF16

Model tree for mengleap-stnap/Mistral-7B

Base model

mistralai/Mistral-7B-v0.1

Quantized

(189)

this model