
Libraries

How to use kyr0/Ornith-35B-FP8-E4M3-MTP with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="kyr0/Ornith-35B-FP8-E4M3-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("kyr0/Ornith-35B-FP8-E4M3-MTP")
model = AutoModelForMultimodalLM.from_pretrained("kyr0/Ornith-35B-FP8-E4M3-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use kyr0/Ornith-35B-FP8-E4M3-MTP with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "kyr0/Ornith-35B-FP8-E4M3-MTP"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kyr0/Ornith-35B-FP8-E4M3-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/kyr0/Ornith-35B-FP8-E4M3-MTP

SGLang

How to use kyr0/Ornith-35B-FP8-E4M3-MTP with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kyr0/Ornith-35B-FP8-E4M3-MTP" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kyr0/Ornith-35B-FP8-E4M3-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kyr0/Ornith-35B-FP8-E4M3-MTP" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kyr0/Ornith-35B-FP8-E4M3-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use kyr0/Ornith-35B-FP8-E4M3-MTP with Docker Model Runner:
```
docker model run hf.co/kyr0/Ornith-35B-FP8-E4M3-MTP
```

kyr0/Ornith-35B-FP8-E4M3-MTP

The same great deepreinforce-ai/Ornith-1.0-35B model, but:

~18% higher completion throughput thanks to speculative decoding with an MTP model (only about +800 MB of VRAM at runtime)
E4M3 FP8 weights (1 sign bit + 4 exponent bits + 3 mantissa bits): the FP8 format with one more mantissa bit than E5M2, for slightly higher precision
KV cache quantized to E4M3 FP8 as well
FP8 model weights: protoLabsAI/Ornith-1.0-35B-FP8 (E4M3, 35.8 GB on disk)
FP8 MTP weights: Capicua25x/Ornith-1.0-35B-MXFP4-Vision-MTP (1.6 GB on disk)
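
To make the E4M3 bit layout concrete, here is a minimal sketch that decodes a single FP8 byte into a Python float. It assumes the widely used "FN" flavour of E4M3 (exponent bias 7, no infinities, a single NaN pattern per sign); the model card does not say which flavour the checkpoint uses, and `decode_e4m3` is a name chosen here for illustration only.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (assumed 'FN' variant: bias 7, no infinities)."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> 3) & 0xF  # 4 exponent bits
    mantissa = byte & 0x7         # 3 mantissa bits
    bias = 7

    # In the assumed FN variant, all-ones exponent plus all-ones mantissa is NaN.
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")
    if exponent == 0:
        # Subnormal numbers have no implicit leading 1.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - bias)


print(decode_e4m3(0x38))  # sign=0, exponent=0111, mantissa=000
print(decode_e4m3(0x7E))  # largest finite value in this variant
```

With only three mantissa bits, each power-of-two interval holds eight representable values, which is why per-tensor or per-block scale factors usually accompany FP8 weights.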

How does it work?

My graft script does a bit of magic:

downloads the original model weights (target) and the donor MTP sidecar weights via hf
grafts the MTP sidecar into the target trunk
extracts 785 mtp.* tensors and adds them to model.safetensors.index.json
skips the MTP FP8 scale tensors
marks 778 MTP Linear modules as unquantized/ignored
sets num_nextn_predict_layers=1
leaves the original target safetensors shards unchanged
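
The index-merging step above can be sketched roughly as follows. This is not the actual graft script: the function name `graft_index`, the shard file name `model-mtp.safetensors`, and the `_scale` / `_scale_inv` suffixes used to detect FP8 scale tensors are all assumptions made for illustration. It only shows the idea of adding `mtp.*` entries to the `weight_map` of a `model.safetensors.index.json`-style dictionary while leaving the target's own entries untouched.

```python
import json


def graft_index(target_index: dict, donor_index: dict, mtp_shard: str) -> dict:
    """Return a copy of the target index with the donor's mtp.* tensors added."""
    merged = json.loads(json.dumps(target_index))  # cheap deep copy
    for name in donor_index["weight_map"]:
        if not name.startswith("mtp."):
            continue  # only the MTP sidecar is grafted
        if name.endswith(("_scale", "_scale_inv")):
            continue  # assumed naming of FP8 scale tensors, which are skipped
        merged["weight_map"][name] = mtp_shard
    return merged


# Tiny in-memory stand-ins for the two index files (hypothetical tensor names).
target = {
    "metadata": {},
    "weight_map": {"model.layers.0.mlp.weight": "model-00001-of-00002.safetensors"},
}
donor = {
    "weight_map": {
        "mtp.fc.weight": "donor.safetensors",
        "mtp.fc.weight_scale": "donor.safetensors",
        "model.embed_tokens.weight": "donor.safetensors",
    }
}

merged = graft_index(target, donor, "model-mtp.safetensors")
print(json.dumps(merged["weight_map"], indent=2))
```

Pointing the new entries at a separate shard is one way to keep the original target shards byte-for-byte unchanged, as the list above describes.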

Generation Performance (tokens/sec generated) vs. Baseline

MTP / Speculative Decoding ENABLED

Metric	Value
Successful / Failed Requests	32 / 0
Total Wall Time	159.47 sec
Average Request Time	19.11 sec
Max Request Time	27.48 sec
Total Prompt Tokens	1,944
Total Completion Tokens	119,769
Total Tokens Processed	121,713
Completion Throughput	751.06 tok/sec
Total Throughput	763.25 tok/sec

SERVED_NAME="ornith-35b-fp8-e4m3-mtp" \
REQS="32" \
PARALLEL="4" \
MAX_TOKENS="4096" \
TEMP="0.6" \
TOP_P="0.95" \
TOP_K="20" \
PROMPTS_FILE="" \
SHUFFLE_PROMPTS="1" \
RANDOM_NONCE="1" \
./stress.sh
==> Stress test
    url=http://127.0.0.1:8998/v1/chat/completions
    model=ornith-35b-fp8-e4m3-mtp
    reqs=32 parallel=4 max_tokens=4096
    prompts=/tmp/tmp.Lx42dinyo3/default-prompts.txt count=20 shuffle=1 nonce=1

id status seconds prompt_tokens completion_tokens total_tokens
10  200  15.468765  65  3458  3523
11  200  17.042949  57  3243  3300
12  200  17.226863  57  3213  3270
13  200  19.818083  62  3677  3739
14  200  20.236654  77  4096  4173
15  200  18.975290  68  4096  4164
16  200  19.780207  62  4096  4158
17  200  17.789587  58  3381  3439
18  200  19.140477  57  3906  3963
19  200  19.095998  66  4096  4162
1   200  26.982888  61  3764  3825
20  200  21.414041  62  4096  4158
21  200  18.708545  53  4096  4149
22  200  18.280602  64  4096  4160
23  200  16.905085  55  3445  3500
24  200  19.210371  53  4096  4149
25  200  20.722689  77  4096  4173
26  200  18.112546  59  4096  4155
27  200  16.745625  60  3269  3329
28  200  19.964553  57  4096  4153
29  200  20.624861  77  4096  4173
2   200  24.681820  56  3746  3802
30  200  18.958027  62  4096  4158
31  200  16.092866  58  3277  3335
32  200  14.789030  65  3922  3987
3   200  27.475965  55  4096  4151
4   200  23.244902  55  3343  3398
5   200  19.629941  52  4096  4148
6   200  12.735978  62  2328  2390
7   200  20.976643  61  4096  4157
8   200  19.030427  52  4096  4148
9   200  11.501070  59  2165  2224

==> Summary
ok=32 fail=0
wall_sec=159.467
avg_req_sec=19.105 max_req_sec=27.476
prompt_tokens=1944 completion_tokens=119769 total_tokens=121713
completion_tok_per_sec=751.06
total_tok_per_sec=763.25

Baseline: Speculative Decoding DISABLED

Metric	Value
Successful / Failed Requests	32 / 0
Total Wall Time	163.17 sec
Average Request Time	19.56 sec
Max Request Time	26.79 sec
Total Prompt Tokens	2,028
Total Completion Tokens	103,571
Total Tokens Processed	105,599
Completion Throughput	634.73 tok/sec
Total Throughput	647.16 tok/sec

make stress
chmod +x ./stress.sh
PORT="8998" \
API_KEY="local-dev-key" \
SERVED_NAME="ornith-35b-fp8-e4m3-mtp" \
REQS="32" \
PARALLEL="4" \
MAX_TOKENS="4096" \
TEMP="0.6" \
TOP_P="0.95" \
TOP_K="20" \
PROMPTS_FILE="" \
SHUFFLE_PROMPTS="1" \
RANDOM_NONCE="1" \
./stress.sh
==> Stress test
    url=http://127.0.0.1:8998/v1/chat/completions
    model=ornith-35b-fp8-e4m3-mtp
    reqs=32 parallel=4 max_tokens=4096
    prompts=/tmp/tmp.0nT1f4XYq0/default-prompts.txt count=20 shuffle=1 nonce=1

id status seconds prompt_tokens completion_tokens total_tokens
10  200  24.386988  57  4096  4153
11  200  24.412515  60  4096  4156
12  200  24.397616  76  4096  4172
13  200  14.089454  68  2362  2430
14  200  21.222193  67  3552  3619
15  200  13.490376  56  2276  2332
16  200  21.625402  62  3628  3690
17  200  18.498888  57  3095  3152
18  200  24.454264  61  4096  4157
19  200  17.620515  59  2972  3031
1   200  17.782799  61  2586  2647
20  200  19.214265  56  3233  3289
21  200  18.161814  74  3055  3129
22  200  24.359344  67  4096  4163
23  200  24.411447  57  4096  4153
24  200  24.442081  63  4096  4159
25  200  14.531877  60  2440  2500
26  200  13.009990  63  2177  2240
27  200  6.885480   58  1152  1210
28  200  24.296608  76  4096  4172
29  200  15.699833  60  2643  2703
2   200  18.248555  56  2656  2712
30  200  6.860422   59  1154  1213
31  200  23.504682  63  4096  4159
32  200  17.236493  72  3049  3121
3   200  26.789070  67  4096  4163
4   200  26.788966  76  4096  4172
5   200  21.046680  73  3479  3552
6   200  15.875098  62  2615  2677
7   200  15.088136  62  2481  2543
8   200  24.676287  54  4096  4150
9   200  22.677346  66  3814  3880

==> Summary
ok=32 fail=0
wall_sec=163.173
avg_req_sec=19.556 max_req_sec=26.789
prompt_tokens=2028 completion_tokens=103571 total_tokens=105599
completion_tok_per_sec=634.73
total_tok_per_sec=647.16
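
The headline speedup can be recomputed from the two summaries above. The sketch below only uses the completion token counts and wall times reported by the stress script; the helper name `completion_throughput` is introduced here for illustration. Note that both runs sample at temperature 0.6 and therefore generate different amounts of text, so the comparison is approximate.

```python
def completion_throughput(completion_tokens: int, wall_sec: float) -> float:
    """Completion tokens generated per second of total wall time."""
    return completion_tokens / wall_sec


# Values copied from the two "==> Summary" blocks.
mtp_tps = completion_throughput(119_769, 159.467)
baseline_tps = completion_throughput(103_571, 163.173)

speedup = mtp_tps / baseline_tps - 1.0
print(f"MTP: {mtp_tps:.2f} tok/sec")
print(f"Baseline: {baseline_tps:.2f} tok/sec")
print(f"Relative gain: {speedup:.1%}")
```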

MTP / Speculative Decoding performance report

The following measurements were conducted on a single NVIDIA H200 NVL 141GB with --speculative-config '{"method":"mtp","num_speculative_tokens":2}'. See the Makefile for the exact parameters used to run the vLLM server with the benchmarked configuration.

Metric	Samples	Min	Max	Mean
Mean Acceptance Length	16	2.24	2.67	2.38
Avg Draft Acceptance Rate	16	62.1%	83.4%	69.2%
Per-position Acceptance Rate P1	16	73.1%	89.4%	78.7%
Per-position Acceptance Rate P2	16	51.2%	77.5%	59.7%
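
The table's mean values are consistent with each other under a simple reading, sketched below. It assumes that the acceptance length counts the one token the target model always emits per step plus every accepted draft token, and that the average draft acceptance rate is the mean of the per-position rates; these definitions are an assumption here, not something the report states.

```python
# Mean per-position acceptance rates from the table (2 speculative tokens).
p1 = 0.787
p2 = 0.597

# One guaranteed token per step, plus the expected number of accepted drafts.
mean_acceptance_length = 1.0 + p1 + p2

# Accepted draft tokens divided by proposed draft tokens.
avg_draft_acceptance_rate = (p1 + p2) / 2

print(f"{mean_acceptance_length:.2f}")
print(f"{avg_draft_acceptance_rate:.1%}")
```

Both results land on the reported means (2.38 and 69.2%) after rounding.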

Serving

vllm serve ./Ornith-1.0-35B-FP8-MTP \
  --served-model-name ornith-35b-fp8-mtp \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
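
Once the server is up, it can be queried through the OpenAI-compatible chat endpoint. The following is a minimal sketch using only the Python standard library; it assumes vLLM's default port 8000 and the served model name from the command above, and `build_chat_request` is a helper name made up for this example. The actual network call is left commented out so the snippet runs without a live server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key=None, max_tokens=256):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed if the server was started with an API key.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",  # assumed: vLLM default port
    "ornith-35b-fp8-mtp",     # matches --served-model-name above
    "What is the capital of France?",
)

# With the server running, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```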

Safetensors

Model size: 36B params
Tensor types: BF16, F8_E4M3

Model tree for kyr0/Ornith-35B-FP8-E4M3-MTP

Base model: deepreinforce-ai/Ornith-1.0-35B
Quantized: protoLabsAI/Ornith-1.0-35B-FP8
Quantized (1): this model