GGUFs of 397B
Will GGUFs of the 397B be uploaded? Would they contain MTP layers?
There's no MTP layer shipped with the model. I can try to graft the Qwen 3.5 MTP layer onto it and see what happens. Probably works fine tbh.
They ship it with vision layers btw, so that should work.
Any ETA for an Ornith quant like this one? https://huggingface.co/paragon-of-brah/Qwen3.5-397B-A17B-IQ5_k-MTP-GGUF
My upload speed is bad: I need around 20 hours to upload a model. I'll probably upload the first Ornith 397B model without MTP within ~30 hours. After that I'll upload the same model with MTP. That will take much less time, as Hugging Face intelligently reuses the weights it already has on its servers instead of making me upload them again. I'll only have to upload the additional MTP layer, which should take less than an hour.
Expect Ornith IQ2_M without MTP in ~30 hours, and with MTP less than an hour later. Then Q5 without MTP in ~50 hours, and with MTP an hour later.
Unless you want smaller quants? I upload smaller quants first on request.
Btw that Q5 is ik_llama only. Do you use llama.cpp or ik_llama.cpp?
I use both, but mostly ik_llama with llama-swap. @ubergarm used to provide models for us, but it seems he is busy now.
Right. I think I should focus on ik_llama.cpp more, then. Indeed, I don't see many quants for ik either, especially for lesser-known models.
Tbh there's always someone making mainline quants, so maybe I should focus only on ik_llama.
I'll make a mainline IQ2_M since it has been requested, then I'll do a variety of ik quants.
Thanks, sir. I only have 256GB, so I can't make quants of 400B models, but if a model is within my reach I will try to quantize it.
You totally can. I have 256GB too, and a 16-core processor. Making one quant only takes a few hours. You don't load the whole model into RAM to do it; you only load one tensor at a time, so peak memory usage is around 20GB. The only thing you are going to struggle with is the imatrix.
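To put the 256GB question in numbers, here is a rough back-of-the-envelope sketch. It assumes one flat bits-per-weight figure for the whole model (16 for BF16, 8.5 for Q8_0, about 2.7 for IQ2_M), which is a simplification because real GGUF files mix tensor types. `estimate_gb` is a hypothetical helper written for this illustration, not part of llama.cpp.

```shell
#!/bin/bash
# Rough size estimate: parameters (in billions) x bits per weight / 8 = gigabytes.
# Illustrative only: real GGUF files mix several tensor types, so actual sizes differ.
estimate_gb() {
  local params_billion="$1"
  local bits_per_weight="$2"
  awk -v p="$params_billion" -v b="$bits_per_weight" \
    'BEGIN { printf "%.0f\n", p * b / 8 }'
}

echo "BF16  : $(estimate_gb 397 16) GB"
echo "Q8_0  : $(estimate_gb 397 8.5) GB"
echo "IQ2_M : $(estimate_gb 397 2.7) GB"
```

For a 397B model the BF16 weights alone come to roughly 794 GB, so they never fit in 256GB of RAM anyway. That is why quantizing one tensor at a time matters, and why free disk space is likely the tighter constraint.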
Really? Can you share the script or command you are using? I might try one myself.
Tutorial for quantizing Qwen 3.5 (or any of its finetunes):
1 - Download mainline llama.cpp and the model's BF16 safetensors files.
2 - Build it, then install the Python requirements listed in llama.cpp/requirements/requirements-convert_hf_to_gguf.txt inside a venv.
3 - Run llama.cpp/convert_hf_to_gguf.py to convert the BF16 safetensors into a BF16 GGUF with:
#!/bin/bash
cd ../llama.cpp
source .venv/bin/activate
# Add --dry-run to only output tensor names
# Recommended conversion command for BF16
python convert_hf_to_gguf.py \
/path/to/gguf-BF16/BF16safetensors/ \
--outfile /path/to/gguf-BF16/BF16.gguf \
--outtype bf16 \
--split-max-size 20G \
--verbose
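Because of `--split-max-size`, the conversion writes several numbered shards rather than one file, and the quantization step is pointed at the first shard. A minimal sketch for locating it, assuming the shards follow the `-00001-of-000NN.gguf` naming pattern; `first_shard` is a hypothetical helper, not a llama.cpp tool:

```shell
#!/bin/bash
# Hypothetical helper: print the path of the first shard of a split GGUF
# found in the given directory, or fail if there is none.
first_shard() {
  local dir="$1"
  local match
  for match in "$dir"/*-00001-of-*.gguf; do
    # If the glob matched nothing, $match is the literal pattern and -e fails.
    if [ -e "$match" ]; then
      printf '%s\n' "$match"
      return 0
    fi
  done
  echo "no split GGUF found in $dir" >&2
  return 1
}
```

It can be called as `shard=$(first_shard /path/to/gguf-BF16)`, and the result passed to llama-quantize as the input file.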
4 - Use ik_llama.cpp to quantize to whatever quant you want with ik_llama.cpp/build/bin/llama-quantize, using this script:
#!/bin/bash
custom="
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=bf16
blk\..*\.attn_qkv\.weight=bf16
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_ks
# MTP [60]
blk\.60\.attn_k\.weight=q8_0
blk\.60\.attn_q\.weight=q8_0
blk\.60\.attn_v\.weight=q8_0
blk\.60\.attn_output\.weight=q8_0
blk\.60\.ffn.*shexp\.weight=q8_0
blk\.60\.ffn.*exps\.weight=iq3_xxs
blk\.60\.ffn.*inp\.weight=bf16
blk\.60\.nextn.*\.weight=q8_0
blk\.60\..*norm\.weight=f32
# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
ik_llama.cpp/build/bin/llama-quantize \
--custom-q "$custom" \
./gguf-bf16/model-00001-of-00045.gguf \
./Output-model-IQ4_KSS.gguf \
IQ4_KSS \
16
# The trailing 16 is the thread count; set it to your number of cores.
# Optional additional arguments (options go before the input file):
#--imatrix ./imatrix-your-imatrix-BF16.dat \
#--extra-output-tensor iq4_ks \
This assumes the base BF16 has MTP layers (block 60 in the recipe above).
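The grep/sed step in the quantization script above is easy to misread, so here is a small sketch of what it does on a toy recipe: comment lines are dropped and the remaining rules are joined into one comma-separated string for `--custom-q`. The two rules are shortened examples taken from the recipe, and the `-z` flag assumes GNU sed.

```shell
#!/bin/bash
# Toy recipe with the same shape as the full one (only two rules, for illustration).
recipe="
# Attention
blk\..*\.attn_q\.weight=q8_0

# Routed experts
blk\..*\.ffn_down_exps\.weight=iq5_ks
"

# Drop comment lines, then join what is left with commas.
# sed -z (GNU sed) reads the whole input as one record, so \n can be matched;
# the last two substitutions trim the leading and trailing comma.
collapsed=$(
  echo "$recipe" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$collapsed"
```

Printing the collapsed string before launching a multi-hour quantization run is a cheap way to catch a malformed rule.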