kyr0/Ornith-35B-FP8-E4M3-MTP

The same great deepreinforce-ai/Ornith-1.0-35B model, but:

  • ~18% faster because of Speculative Decoding (MTP) model (only +800 MB VRAM consumption at runtime)
  • in E4M3 FP8 precision (1 sign bit + 4 exponent bits + 3 mantissa bits) - smarter FP8 precision for slightly higher precision
  • KV cache quantization using E4M3 FP8 as well
  • FP8 model weights: protoLabsAI/Ornith-1.0-35B-FP8 (E4M3, 35.8 GB on disk)
  • FP8 MTP weights: Capicua25x/Ornith-1.0-35B-MXFP4-Vision-MTP (1.6 GB on disk)

How does it work?

My graft script does a bit of magic:

  • downloaded orignal model weights (target) and donor MTP sidecar weights via hf
  • grafts MTP sidecar into target trunk
  • extracts and adds 785 mtp.* tensors to model.safetensors.index.json
  • skips MTP FP8 scale tensors
  • marks 778 MTP Linear modules as unquantized/ignored
  • sets num_nextn_predict_layers=1
  • leaves original target safetensors shards unchanged

Generation Performance (Token/sec generated) vs. Baseline

MTP / Speculative Decoding ENABLED

Metric Value
Successful / Failed Requests 32 / 0
Total Wall Time 159.47 sec
Average Request Time 19.11 sec
Max Request Time 27.48 sec
Total Prompt Tokens 1,944
Total Completion Tokens 119,769
Total Tokens Processed 121,713
Completion Throughput 751.06 tok/sec
Total Throughput 763.25 tok/sec
SERVED_NAME="ornith-35b-fp8-e4m3-mtp" \
REQS="32" \
PARALLEL="4" \
MAX_TOKENS="4096" \
TEMP="0.6" \
TOP_P="0.95" \
TOP_K="20" \
PROMPTS_FILE="" \
SHUFFLE_PROMPTS="1" \
RANDOM_NONCE="1" \
./stress.sh
==> Stress test
    url=http://127.0.0.1:8998/v1/chat/completions
    model=ornith-35b-fp8-e4m3-mtp
    reqs=32 parallel=4 max_tokens=4096
    prompts=/tmp/tmp.Lx42dinyo3/default-prompts.txt count=20 shuffle=1 nonce=1

id status seconds prompt_tokens completion_tokens total_tokens
10  200  15.468765  65  3458  3523
11  200  17.042949  57  3243  3300
12  200  17.226863  57  3213  3270
13  200  19.818083  62  3677  3739
14  200  20.236654  77  4096  4173
15  200  18.975290  68  4096  4164
16  200  19.780207  62  4096  4158
17  200  17.789587  58  3381  3439
18  200  19.140477  57  3906  3963
19  200  19.095998  66  4096  4162
1   200  26.982888  61  3764  3825
20  200  21.414041  62  4096  4158
21  200  18.708545  53  4096  4149
22  200  18.280602  64  4096  4160
23  200  16.905085  55  3445  3500
24  200  19.210371  53  4096  4149
25  200  20.722689  77  4096  4173
26  200  18.112546  59  4096  4155
27  200  16.745625  60  3269  3329
28  200  19.964553  57  4096  4153
29  200  20.624861  77  4096  4173
2   200  24.681820  56  3746  3802
30  200  18.958027  62  4096  4158
31  200  16.092866  58  3277  3335
32  200  14.789030  65  3922  3987
3   200  27.475965  55  4096  4151
4   200  23.244902  55  3343  3398
5   200  19.629941  52  4096  4148
6   200  12.735978  62  2328  2390
7   200  20.976643  61  4096  4157
8   200  19.030427  52  4096  4148
9   200  11.501070  59  2165  2224

==> Summary
ok=32 fail=0
wall_sec=159.467
avg_req_sec=19.105 max_req_sec=27.476
prompt_tokens=1944 completion_tokens=119769 total_tokens=121713
completion_tok_per_sec=751.06
total_tok_per_sec=763.25

Baseline: Speculative Decoding DISABLED

Metric Value
Successful / Failed Requests 32 / 0
Total Wall Time 163.17 sec
Average Request Time 19.56 sec
Max Request Time 26.79 sec
Total Prompt Tokens 2,028
Total Completion Tokens 103,571
Total Tokens Processed 105,599
Completion Throughput 634.73 tok/sec
Total Throughput 647.16 tok/sec
make stress
chmod +x ./stress.sh
PORT="8998" \
API_KEY="local-dev-key" \
SERVED_NAME="ornith-35b-fp8-e4m3-mtp" \
REQS="32" \
PARALLEL="4" \
MAX_TOKENS="4096" \
TEMP="0.6" \
TOP_P="0.95" \
TOP_K="20" \
PROMPTS_FILE="" \
SHUFFLE_PROMPTS="1" \
RANDOM_NONCE="1" \
./stress.sh
==> Stress test
    url=http://127.0.0.1:8998/v1/chat/completions
    model=ornith-35b-fp8-e4m3-mtp
    reqs=32 parallel=4 max_tokens=4096
    prompts=/tmp/tmp.0nT1f4XYq0/default-prompts.txt count=20 shuffle=1 nonce=1

id status seconds prompt_tokens completion_tokens total_tokens
10  200  24.386988  57  4096  4153
11  200  24.412515  60  4096  4156
12  200  24.397616  76  4096  4172
13  200  14.089454  68  2362  2430
14  200  21.222193  67  3552  3619
15  200  13.490376  56  2276  2332
16  200  21.625402  62  3628  3690
17  200  18.498888  57  3095  3152
18  200  24.454264  61  4096  4157
19  200  17.620515  59  2972  3031
1   200  17.782799  61  2586  2647
20  200  19.214265  56  3233  3289
21  200  18.161814  74  3055  3129
22  200  24.359344  67  4096  4163
23  200  24.411447  57  4096  4153
24  200  24.442081  63  4096  4159
25  200  14.531877  60  2440  2500
26  200  13.009990  63  2177  2240
27  200  6.885480   58  1152  1210
28  200  24.296608  76  4096  4172
29  200  15.699833  60  2643  2703
2   200  18.248555  56  2656  2712
30  200  6.860422   59  1154  1213
31  200  23.504682  63  4096  4159
32  200  17.236493  72  3049  3121
3   200  26.789070  67  4096  4163
4   200  26.788966  76  4096  4172
5   200  21.046680  73  3479  3552
6   200  15.875098  62  2615  2677
7   200  15.088136  62  2481  2543
8   200  24.676287  54  4096  4150
9   200  22.677346  66  3814  3880

==> Summary
ok=32 fail=0
wall_sec=163.173
avg_req_sec=19.556 max_req_sec=26.789
prompt_tokens=2028 completion_tokens=103571 total_tokens=105599
completion_tok_per_sec=634.73
total_tok_per_sec=647.16

MTP / Speculative Decoding performance report

The following measurements were conducted on a single NVIDIA H200 NVL 141GB with --speculative-config '{"method":"mtp","num_speculative_tokens":2}'. See the Makefile for the exact parameters used to run the vLLM server with the benchmarked configuration.

Metric Samples Min Max Mean
Mean Acceptance Length 16 2.24 2.67 2.38
Avg Draft Acceptance Rate 16 62.1% 83.4% 69.2%
Per-position Acceptance Rate P1 16 73.1% 89.4% 78.7%
Per-position Acceptance Rate P2 16 51.2% 77.5% 59.7%

Serving

vllm serve ./Ornith-1.0-35B-FP8-MTP \
  --served-model-name ornith-35b-fp8-mtp \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Downloads last month
-
Safetensors
Model size
36B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for kyr0/Ornith-35B-FP8-E4M3-MTP

Quantized
(1)
this model