license: apache-2.0
language:
- 'no'
- nb
library_name: transformers
tags:
- noramistral
- finetune
- norwegian
NoraMistral-7B-warm-instruct-GGUF
- Original model: NorMistral-7b-warm-instruct
- Base model: NorMistral-7b-warm, an LLM initialized from Mistral-7b-v0.1
This repo contains GGUF-formatted files for the instruction model, with the added pre-tokenizer ggml-vocab-normistral-7b-warm.gguf. This should remove the warning you would otherwise get in llama.cpp with the original GGUF files.
⚠️⚠️ To take advantage of this workaround, changes also need to be made directly in llama.cpp.
- See an example of this on GitHub, or clone the fork.
- I recommend using the server provided by llama.cpp to get an OpenAI-compatible API endpoint (see the sketch below).
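As a rough sketch of that setup (binary names, flags, and the GGUF filename below are placeholders and vary between llama.cpp versions and builds), starting the server and querying its OpenAI-compatible endpoint could look like this:

```bash
# Start the llama.cpp server with a GGUF file from this repo
# (adjust the path, context size, GPU layers and port for your system)
./llama-server -m noramistral-7b-warm-instruct.Q4_K_M.gguf -c 4096 -ngl 33 --port 8080

# Query the OpenAI-compatible chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Hva er hovedstaden i Norge?"}
        ],
        "max_tokens": 256,
        "temperature": 0.3
      }'
```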
Norwegian Large Language Models (from original card)
This is a model instruction-tuned on open datasets released under the most permissive Apache 2.0 license (in other words, we don't use any datasets generated by ChatGPT), so we can release this model under the same license and make it openly available for commercial applications. The model has been finetuned with a 4096-token context length, twice that of the base model.
The released weights are still a work in progress and they might change in the future. This is the first iteration of instruction-tuning our NorMistral models and it currently uses only the SFT phase without any preference optimization. Please let us know your feedback to improve the model in future releases.
How to run the model?
1. Prompt format
NorMistral uses a ChatML-like format for structuring (multi-turn) conversations. An example of a prompt in this format looks like the following (note the special <|im_start|> and <|im_end|> tokens):
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
<|im_start|> user
Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
<|im_start|> assistant
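If you need to assemble this prompt string yourself (for example when using the raw completion API shown further down), a minimal sketch could look like the following; the helper name format_chatml is purely illustrative, and the whitespace follows the example above:

```python
def format_chatml(messages):
    """Assemble a ChatML-style prompt from a list of {"role": ..., "content": ...} dicts."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|> {message['role']}\n{message['content']}<|im_end|>\n"
    # End with an opening assistant tag so the model continues from there
    prompt += "<|im_start|> assistant\n"
    return prompt

messages = [{"role": "user", "content": "Hva er hovedstaden i Norge?"}]
print(format_chatml(messages))
```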
How to run from Python code
You can use GGUF models from Python with, for example, the llama-cpp-python library.
How to load this model in Python code, using llama-cpp-python
For full documentation, please see: llama-cpp-python docs.
First install the package
Run one of the following commands, according to your system:
# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# On Windows, to set the CMAKE_ARGS variable in PowerShell, follow this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python
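Optionally, a quick sanity check that the package installed and imports correctly (assuming the standard llama_cpp package layout):

```bash
python -c "import llama_cpp; print(llama_cpp.__version__)"
```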
Simple llama-cpp-python example code
from llama_cpp import Llama
# Directly from huggingface-hub (requires huggingface-hub to be installed)
# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama.from_pretrained(
repo_id="MagnusSa/noramistral-7B-warm-instruct-GGUF-not-official", # HuggingFace repository containing the GGUF files.
filename="*Q4_K_M.gguf", # suffix of the filename containing the level of quantization.
n_ctx=32768, # The max sequence length to use - note that longer sequence lengths require much more resources
n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
n_gpu_layers=33,  # The number of layers to offload to GPU, if you have GPU acceleration available
chat_format="chatml"  # The chat format that will be used for chat completions
)
# Simple inference example
output = llm(
"""<s><|im_start|> user
Hva kan jeg bruke einstape til?<|im_end|>
<|im_start|> assistant
""", # Prompt
max_tokens=512, # Generate up to 512 tokens
stop=["<|im_end|>"], # Example stop token
echo=True, # Whether to echo the prompt
temperature=0.3 # Temperature to set, for Q3_K_M, Q4_K_M, Q5_K_M, and Q6_0 it is recommended to set it relatively low.
)
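# The call returns an OpenAI-style completion dict; with echo=True the returned text includes the prompt
print(output["choices"][0]["text"])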
# Chat Completion API
response = llm.create_chat_completion(
messages = [
{
"role": "user",
"content": "Hva kan jeg bruke einstape til?"
}
]
)
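# create_chat_completion also returns an OpenAI-style dict; the reply text is in the message content
print(response["choices"][0]["message"]["content"])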