
This is the 8-bit quantized (Q8_0, GGUF) version of Ministral-8B by Mistral AI. Please follow the instructions below to run the model on your device.

There are multiple ways to run inference with this model. First, let's build llama.cpp and use its CLI:

1. Install

   ```shell
   git clone https://github.com/ggerganov/llama.cpp
   mkdir llama.cpp/build && cd llama.cpp/build && cmake .. && cmake --build . --config Release
   ```

2. Inference

   ```shell
   ./llama.cpp/build/bin/llama-cli -m ./ministral-8b_Q8_0.gguf -cnv -p "You are a helpful assistant"
   ```

Here, you can interact with the model from your terminal.

Alternatively, we can use the Python bindings for llama.cpp (`llama-cpp-python`) to run the model on either CPU or GPU.

1. Install

   ```shell
   pip install --no-cache-dir llama-cpp-python==0.2.85 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
   ```
2. Inference on CPU

   ```python
   from llama_cpp import Llama

   model_path = "./ministral-8b_Q8_0.gguf"
   llm = Llama(model_path=model_path, n_threads=8, verbose=False)

   prompt = "What should I do when my eyes are dry?"
   # Ministral follows Mistral's [INST] chat format; llama.cpp prepends the BOS token itself
   output = llm(
       prompt=f"[INST] {prompt} [/INST]",
       max_tokens=4096,
       stop=["</s>"],
       echo=False,  # don't echo the prompt back in the output
   )
   print(output)
   ```
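Rather than printing the whole response dict, you will usually want just the generated text, which `llama-cpp-python` places under `choices[0]["text"]`. A minimal sketch of extracting it, using a mocked response with the same shape as the real completion output (the field values here are illustrative, not real model output) so it runs without the model file:

```python
# Mocked llama-cpp-python completion response; same structure as the dict
# returned by llm(...), but with made-up values for illustration.
output = {
    "id": "cmpl-xxxx",
    "object": "text_completion",
    "choices": [
        {
            "text": "Try artificial tears and take regular screen breaks.",
            "index": 0,
            "finish_reason": "stop",
        },
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 11, "total_tokens": 23},
}

# Pull out only the generated text
answer = output["choices"][0]["text"].strip()
print(answer)
```

The same extraction works unchanged for the GPU example below, since only the backend differs, not the response format.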
3. Inference on GPU

   ```python
   from llama_cpp import Llama

   model_path = "./ministral-8b_Q8_0.gguf"
   # n_gpu_layers=-1 offloads all model layers to the GPU
   llm = Llama(model_path=model_path, n_threads=8, n_gpu_layers=-1, verbose=False)

   prompt = "What should I do when my eyes are dry?"
   # Ministral follows Mistral's [INST] chat format; llama.cpp prepends the BOS token itself
   output = llm(
       prompt=f"[INST] {prompt} [/INST]",
       max_tokens=4096,
       stop=["</s>"],
       echo=False,  # don't echo the prompt back in the output
   )
   print(output)
   ```
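For long answers you may prefer streaming: passing `stream=True` to the same call makes `llama-cpp-python` yield chunks, each carrying a text fragment under `choices[0]["text"]`. A sketch of consuming such a stream, with a stand-in generator replacing the real call (the fragments are made up) so it runs without the model file:

```python
def fake_stream():
    # Stand-in for: llm(prompt=..., max_tokens=..., stream=True)
    # Each chunk mirrors llama-cpp-python's streaming completion shape.
    for piece in ["Use ", "artificial ", "tears."]:
        yield {"choices": [{"text": piece, "finish_reason": None}]}

pieces = []
for chunk in fake_stream():
    fragment = chunk["choices"][0]["text"]
    pieces.append(fragment)
    print(fragment, end="", flush=True)  # print fragments as they arrive
print()

full_text = "".join(pieces)
```

With the real model, simply replace `fake_stream()` with the streaming `llm(...)` call; the loop body stays the same.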