Instructions to use deepreinforce-ai/Ornith-1.0-397B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use deepreinforce-ai/Ornith-1.0-397B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="deepreinforce-ai/Ornith-1.0-397B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("deepreinforce-ai/Ornith-1.0-397B")
model = AutoModelForMultimodalLM.from_pretrained("deepreinforce-ai/Ornith-1.0-397B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use deepreinforce-ai/Ornith-1.0-397B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepreinforce-ai/Ornith-1.0-397B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepreinforce-ai/Ornith-1.0-397B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
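
The same request can also be built from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `send` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="deepreinforce-ai/Ornith-1.0-397B",
                       base_url="http://localhost:8000"):
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON reply (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("What is the capital of France?"))` should return the same JSON body that the curl call prints.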

Use Docker

docker model run hf.co/deepreinforce-ai/Ornith-1.0-397B

SGLang

How to use deepreinforce-ai/Ornith-1.0-397B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepreinforce-ai/Ornith-1.0-397B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepreinforce-ai/Ornith-1.0-397B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepreinforce-ai/Ornith-1.0-397B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepreinforce-ai/Ornith-1.0-397B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepreinforce-ai/Ornith-1.0-397B with Docker Model Runner:
```
docker model run hf.co/deepreinforce-ai/Ornith-1.0-397B
```

GGUFs of 397B

by DrRos - opened 2 days ago

Discussion

DrRos

2 days ago

Will GGUFs of the 397B be uploaded? Would they contain MTP layers?

paragon-of-brah

2 days ago

There's no MTP layer shipped with the model. I can try to graft the Qwen 3.5 MTP layer onto it and see what happens. Probably works fine tbh.

They ship it with vision layers btw, so that should work.

gopi87

1 day ago

Any ETA for an Ornith quant like this one? https://huggingface.co/paragon-of-brah/Qwen3.5-397B-A17B-IQ5_k-MTP-GGUF

paragon-of-brah

1 day ago

My upload speed is bad; I need around 20 hours to upload a model. I'll probably upload the first Ornith 397B model without MTP within ~30 hours. After that I'll upload the same model with MTP. That will take much less time, as Hugging Face copies the weights it already has on its servers instead of making me upload them again. I'll only have to upload the additional MTP layer, which should take less than an hour.

Expect Ornith IQ2_M without MTP in ~30 hours, with MTP less than an hour later. Then Q5 without MTP in ~50 hours, with MTP an hour later.

Unless you want smaller quants? I upload smaller quants first on request.

Btw that Q5 is ik_llama only. Do you use llama.cpp or ik_llama.cpp?

gopi87

1 day ago

I use both, but mostly ik_llama with llama-swap. @ubergarm used to give us models, but it seems like he is busy now.

paragon-of-brah

1 day ago

Right. I think I should focus on ik_llama.cpp more then. Indeed I don't see many quants for ik either, especially for lesser-known models.

Tbh there's always someone making mainline quants, so maybe I should only focus on ik_llama.

I'll make a mainline IQ2_M since it has been requested, then I'll do a variety of ik quants.

gopi87

1 day ago

Thanks, sir. I only have 256 GB, so I can't make quants for 400B models, but if one is within my reach I will try to make a quant.

paragon-of-brah

1 day ago

You totally can. I have 256 GB too, and a 16-core processor. Making one quant only takes a few hours. You don't load the whole model in RAM to do it; you only load one tensor at a time, and peak memory usage is around 20 GB. The only thing you are going to struggle with is the imatrix.
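
A rough back-of-the-envelope sketch of why the tensor-at-a-time approach matters here. It assumes the parameter count is the 397e9 implied by the model name and that every weight is stored at 16 bits; real files also carry metadata and any vision layers, so treat the number as an estimate only. The function name `model_size_gb` is made up for this example.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (10**9 bytes) for a uniform bit width."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 397e9  # assumption: parameter count read off the "397B" model name

bf16_gb = model_size_gb(N_PARAMS, 16)  # BF16 stores 16 bits per weight
print(f"BF16 weights: ~{bf16_gb:.0f} GB")                  # → BF16 weights: ~794 GB
print(f"Fits in 256 GB of RAM at once: {bf16_gb <= 256}")  # → False
```

So the BF16 source is roughly three times larger than 256 GB of RAM, which is why quantizing only works if the tool streams tensors instead of loading everything.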

gopi87

1 day ago

Really? Can you give me the script or command that you use? I might try one myself.

paragon-of-brah

1 day ago

Tutorial for Qwen 3.5 quantization (or any finetunes):

1 - Download mainline llama.cpp and the BF16 safetensors files.
2 - Build it, then install the Python requirements from llama.cpp/requirements/requirements-convert_hf_to_gguf.txt inside a venv.
3 - Run llama.cpp/convert_hf_to_gguf.py to convert the BF16 safetensors into a BF16 GGUF with:

#!/bin/bash

cd ../llama.cpp

source .venv/bin/activate

# Add --dry-run to only output tensor names

# Recommended conversion command for BF16 (output split into shards of at most 20G)
python convert_hf_to_gguf.py \
  /path/to/gguf-BF16/BF16safetensors/ \
  --outfile /path/to/gguf-BF16/BF16.gguf \
  --outtype bf16 \
  --split-max-size 20G \
  --verbose
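
Because of `--split-max-size`, the conversion writes several shards, and the quantize step further down reads the first one (named like `model-00001-of-00045.gguf`). Before quantizing, it can be worth checking that no shard is missing. A minimal sketch, assuming the shards follow that `-NNNNN-of-NNNNN.gguf` suffix; `missing_shards` is a hypothetical helper, not part of llama.cpp.

```python
import re

# Shard suffix as seen in names like "model-00001-of-00045.gguf".
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def missing_shards(filenames):
    """Return the sorted shard indices absent from a split GGUF set."""
    seen, total = set(), None
    for name in filenames:
        match = SHARD_RE.search(name)
        if match is None:
            continue  # not a split-GGUF shard, ignore it
        index, count = int(match.group(1)), int(match.group(2))
        if total is not None and count != total:
            raise ValueError(f"inconsistent shard count in {name!r}")
        total = count
        seen.add(index)
    if total is None:
        raise ValueError("no split GGUF shards found")
    return sorted(set(range(1, total + 1)) - seen)


# Synthetic example: 45 shards with number 7 left out on purpose.
names = [f"model-{i:05d}-of-00045.gguf" for i in range(1, 46) if i != 7]
print(missing_shards(names))  # → [7]
```

In practice you would pass it `os.listdir("/path/to/gguf-BF16")` and expect an empty list.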

4 - Use ik_llama.cpp to quantize to whatever quant you want with /ik_llama.cpp/build/bin/llama-quantize, via this script:

#!/bin/bash

custom="
# 60 Repeating Layers [0-59] + MTP

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=bf16
blk\..*\.attn_qkv\.weight=bf16
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_ks

# MTP [60]
blk\.60\.attn_k\.weight=q8_0
blk\.60\.attn_q\.weight=q8_0
blk\.60\.attn_v\.weight=q8_0
blk\.60\.attn_output\.weight=q8_0

blk\.60\.ffn.*shexp\.weight=q8_0
blk\.60\.ffn.*exps\.weight=iq3_xxs
blk\.60\.ffn.*inp\.weight=bf16

blk\.60\.nextn.*\.weight=q8_0

blk\.60\..*norm\.weight=f32

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

ik_llama/build/bin/llama-quantize \
    --custom-q "$custom" \
    ./gguf-bf16/model-00001-of-00045.gguf \
    ./Output-model-IQ4_KSS.gguf \
    IQ4_KSS \
    16

# The trailing 16 is the thread count; set it to your number of cores.

# Optional additional arguments (they go before the input file):
# --imatrix ./imatrix-your-imatrix-BF16.dat \
# --extra-output-tensor iq4_ks \

This assumes the base BF16 has MTP layers.
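
To see what the `grep`/`sed` step in the script hands to `--custom-q`, here is a small Python sketch that mimics it: drop the `#` comment lines, collapse blank lines, and join the rules with commas. It also lists which rules match a given tensor name. Both helper names are made up for illustration, and matching with `re.search` is an assumption about how the tool applies the patterns, not something taken from ik_llama.cpp.

```python
import re


def collapse_rules(text):
    """Drop '#' comment lines and blank lines, then join rules with commas."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return ",".join(lines)


def matching_rules(rules, tensor_name):
    """List the (pattern, type) pairs whose regex matches a tensor name."""
    hits = []
    for rule in rules.split(","):
        pattern, qtype = rule.rsplit("=", 1)
        if re.search(pattern, tensor_name):  # assumption: search-style matching
            hits.append((pattern, qtype))
    return hits


# A three-rule excerpt of the list in the script above.
custom = r"""
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_ks

# MTP [60]
blk\.60\.ffn.*exps\.weight=iq3_xxs
"""

rules = collapse_rules(custom)
print(rules)
print(matching_rules(rules, "blk.60.ffn_down_exps.weight"))
```

For the layer 60 tensor, both the generic routed-expert rule and the MTP rule match under this sketch, so it may be worth checking which one `llama-quantize` actually applies when patterns overlap.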
