LLaVA
Currently this implementation supports llava-v1.5 variants as well as llava-v1.6 variants.
Pre-converted 7b and 13b models are available. For llava-1.6, a variety of prepared gguf models (7b to 34b) are available as well.
After the API is confirmed, more models will be supported and uploaded.
Usage
Build the llama-mtmd-cli binary.
After building, run: ./llama-mtmd-cli to see the usage. For example:
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
--mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
--chat-template vicuna
Note: A lower temperature such as 0.1 is recommended for better quality. Add --temp 0.1 to the command to set it.
Note: For GPU offloading, use the -ngl flag as usual.
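The usage notes above can be combined into one invocation. The following is a minimal sketch, not part of llama.cpp, that assembles the argument list using only the flags named in this section (-m, --mmproj, --chat-template, --temp, -ngl). The helper name build_mtmd_command and the -ngl value of 99 are assumptions chosen for illustration.

```python
from typing import List, Optional


def build_mtmd_command(
    model: str,
    mmproj: str,
    chat_template: str = "vicuna",
    temp: Optional[float] = 0.1,
    n_gpu_layers: Optional[int] = None,
    binary: str = "./llama-mtmd-cli",
) -> List[str]:
    """Assemble an argv list for llama-mtmd-cli (hypothetical helper)."""
    cmd = [binary, "-m", model, "--mmproj", mmproj, "--chat-template", chat_template]
    if temp is not None:
        # A low temperature such as 0.1 is the recommendation from the notes.
        cmd += ["--temp", str(temp)]
    if n_gpu_layers is not None:
        # Number of layers to offload to the GPU; the right value depends on your hardware.
        cmd += ["-ngl", str(n_gpu_layers)]
    return cmd


cmd = build_mtmd_command(
    "../llava-v1.5-7b/ggml-model-f16.gguf",
    "../llava-v1.5-7b/mmproj-model-f16.gguf",
    n_gpu_layers=99,  # assumed value for illustration
)
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once the binary and model files exist.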
LLaVA 1.5
- Clone a LLaVA and a CLIP model (available options). For example:
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
git clone https://huggingface.co/openai/clip-vit-large-patch14-336
- Install the required Python packages:
pip install -r tools/mtmd/requirements.txt
- Use llava_surgery.py to split the LLaVA model into its LLaMA and multimodal projector constituents:
python ./tools/mtmd/llava_surgery.py -m ../llava-v1.5-7b
- Use convert_image_encoder_to_gguf.py to convert the LLaVA image encoder to GGUF:
python ./tools/mtmd/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
- Use examples/convert_legacy_llama.py to convert the LLaMA part of LLaVA to GGUF:
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
Now both the LLaMA part and the image encoder are in the llava-v1.5-7b directory.
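The three conversion commands above can be collected into one script. This is a minimal sketch under the assumption that it is run from the llama.cpp repository root with the same relative paths as the examples; the function name llava15_conversion_commands is hypothetical, and the script only prints the commands rather than executing them.

```python
from typing import List


def llava15_conversion_commands(llava_dir: str, clip_dir: str) -> List[List[str]]:
    """Return the LLaVA 1.5 conversion commands from the steps above, in order."""
    return [
        # Split the LLaVA checkpoint into the LLaMA part and the projector.
        ["python", "./tools/mtmd/llava_surgery.py", "-m", llava_dir],
        # Convert the CLIP image encoder plus projector to GGUF.
        ["python", "./tools/mtmd/convert_image_encoder_to_gguf.py", "-m", clip_dir,
         "--llava-projector", f"{llava_dir}/llava.projector", "--output-dir", llava_dir],
        # Convert the LLaMA part to GGUF.
        ["python", "./examples/convert_legacy_llama.py", llava_dir, "--skip-unknown"],
    ]


for cmd in llava15_conversion_commands("../llava-v1.5-7b", "../clip-vit-large-patch14-336"):
    print(" ".join(cmd))
```

To run the steps for real, each list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the sequence.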
LLaVA 1.6 gguf conversion
- First clone a LLaVA 1.6 model:
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
- Install the required Python packages:
pip install -r tools/mtmd/requirements.txt
- Use llava_surgery_v2.py, which also supports llava-1.5 variants, for both PyTorch and safetensors models:
python tools/mtmd/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
- You will find a llava.projector and a llava.clip file in your model directory.
- Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
- Create the visual gguf model:
python ./tools/mtmd/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
- This is similar to llava-1.5; the difference is that we tell the encoder that we are working with the pure vision model part of CLIP.
- Then convert the model to gguf format:
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
- And finally we can run the llava cli using the 1.6 model version:
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
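The step that prepares the vit directory (the mkdir and cp commands above) can also be scripted. The following is a minimal sketch using only the Python standard library; the function name prepare_vit_dir is hypothetical, and it does not download config.json, which still has to be fetched as shown in the curl command.

```python
import shutil
from pathlib import Path


def prepare_vit_dir(model_dir: str, vit_dir: str = "vit") -> Path:
    """Copy llava.clip and llava.projector into vit_dir, mirroring the cp steps."""
    src = Path(model_dir)
    dst = Path(vit_dir)
    # Both files are produced by llava_surgery_v2.py; fail early if they are missing.
    for name in ("llava.clip", "llava.projector"):
        if not (src / name).is_file():
            raise FileNotFoundError(f"{name} not found in {src}; run llava_surgery_v2.py first")
    dst.mkdir(parents=True, exist_ok=True)
    # llava.clip is renamed to pytorch_model.bin, as the conversion script expects.
    shutil.copyfile(src / "llava.clip", dst / "pytorch_model.bin")
    shutil.copyfile(src / "llava.projector", dst / "llava.projector")
    return dst
```

With the paths used in this section, the call would be prepare_vit_dir("../llava-v1.6-vicuna-7b").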
Note: llava-1.6 needs more context than llava-1.5; at least 3000 is needed (just run it with -c 4096).
Note: llava-1.6 greatly benefits from batched prompt processing (the defaults work).
Note: if the language model in the GGUF conversion step above (step 6) is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the llava-next model:
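import transformers
# Path to the llava-next checkpoint and the directory for the exported LLM (fill in both).
model_path = ...
llm_export_path = ...
# Load the tokenizer and the full vision-language model.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)
# Save only the tokenizer and the language-model submodule.
tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)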
Then, you can convert the LLM using the convert_hf_to_gguf.py script, which handles more LLM architectures.
Chat template
For llava-1.5 and llava-1.6, you need to use the vicuna chat template. Add --chat-template vicuna to the command to activate it.
How to know if you are running in llava-1.5 or llava-1.6 mode
When running llava-cli, you will see information about the image embedding right before the prompt is processed:
Llava-1.5:
encode_image_with_clip: image embedding created: 576 tokens
Llava-1.6 (anything above 576):
encode_image_with_clip: image embedding created: 2880 tokens
Alternatively, pay attention to how many "tokens" have been used for your prompt; it will also show 1000+ tokens for llava-1.6.
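The check described above can be automated against captured output. This is a minimal sketch that assumes the log line has exactly the format shown in the two examples; the function name detect_llava_mode is hypothetical, and any count of 576 or fewer is treated as llava-1.5.

```python
import re
from typing import Optional

# Matches the token count in lines like
# "encode_image_with_clip: image embedding created: 576 tokens".
_EMBED_RE = re.compile(r"image embedding created: (\d+) tokens")


def detect_llava_mode(log_text: str) -> Optional[str]:
    """Classify a run as llava-1.5 or llava-1.6 from the image embedding size."""
    match = _EMBED_RE.search(log_text)
    if match is None:
        return None  # no embedding line found in the log
    tokens = int(match.group(1))
    # Anything above 576 tokens indicates llava-1.6.
    return "llava-1.6" if tokens > 576 else "llava-1.5"


print(detect_llava_mode("encode_image_with_clip: image embedding created: 576 tokens"))
print(detect_llava_mode("encode_image_with_clip: image embedding created: 2880 tokens"))
```

The two calls print llava-1.5 and llava-1.6 respectively.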