kyr0/Ornith-35B-FP8-E4M3-MTP
The same great deepreinforce-ai/Ornith-1.0-35B model, but:
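- ~18% higher completion throughput in the benchmark below, thanks to speculative decoding with a grafted MTP (multi-token prediction) head (only about +800 MB VRAM consumption at runtime)
- in E4M3 FP8 precision (1 sign bit + 4 exponent bits + 3 mantissa bits), the FP8 format that spends one more bit on the mantissa than E5M2 and so gives slightly higher precision at the cost of dynamic range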
- KV cache quantization using E4M3 FP8 as well
- FP8 model weights: protoLabsAI/Ornith-1.0-35B-FP8 (E4M3, 35.8 GB on disk)
- FP8 MTP weights: Capicua25x/Ornith-1.0-35B-MXFP4-Vision-MTP (1.6 GB on disk)
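To make the E4M3 bit layout mentioned above concrete, here is a minimal sketch of decoding a single FP8 byte. It assumes the common "E4M3FN" variant (exponent bias 7, no infinities, one NaN bit pattern per sign), which is what `torch.float8_e4m3fn` implements; this card does not state which variant the checkpoint uses, and `decode_e4m3` is a hypothetical helper name used only for illustration.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3FN byte (assumed variant) into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0   # 1 sign bit
    exponent = (byte >> 3) & 0x0F         # 4 exponent bits, bias 7
    mantissa = byte & 0x07                # 3 mantissa bits
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan                   # the only NaN pattern; no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias = -6
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


print(decode_e4m3(0x38))  # 1.0
print(decode_e4m3(0x7E))  # 448.0, the largest finite value in this variant
```

With only three mantissa bits, neighbouring representable values differ by 1/8 of the current power of two, which is why FP8 checkpoints usually ship per-tensor or per-block scale factors alongside the weights.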
How does it work?
My graft script does a bit of magic:
- downloads the original model weights (target) and the donor MTP sidecar weights via `hf`
- grafts the MTP sidecar into the target trunk
- extracts and adds 785 `mtp.*` tensors to `model.safetensors.index.json`
- skips MTP FP8 scale tensors
- marks 778 MTP Linear modules as unquantized/ignored
- sets `num_nextn_predict_layers=1`
- leaves the original target safetensors shards unchanged
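The index-merging step above can be sketched roughly as follows. This is not the actual graft script, only an illustration of the idea under stated assumptions: it relies on the standard sharded-safetensors index layout (a `weight_map` dict from tensor name to shard file), while the function name `graft_mtp_index`, the shard name `mtp.safetensors`, and the scale-tensor suffixes are hypothetical.

```python
import json


def graft_mtp_index(target_index, donor_index, donor_shard,
                    prefix="mtp.", scale_suffixes=("_scale", "_scale_inv")):
    """Return a new index whose weight_map also lists the donor's MTP tensors.

    The target index is copied, not mutated, mirroring the idea that the
    original shards stay untouched and only the index gains new entries.
    The scale suffixes are an assumption about how FP8 scales are named.
    """
    merged = {
        "metadata": dict(target_index.get("metadata", {})),
        "weight_map": dict(target_index["weight_map"]),
    }
    added = []
    for name in sorted(donor_index["weight_map"]):
        if not name.startswith(prefix):
            continue  # only graft MTP tensors
        if name.endswith(scale_suffixes):
            continue  # skip FP8 scale tensors
        merged["weight_map"][name] = donor_shard
        added.append(name)
    return merged, added


# Tiny in-memory example with made-up tensor names.
target = {"metadata": {}, "weight_map": {
    "model.layers.0.mlp.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
}}
donor = {"metadata": {}, "weight_map": {
    "mtp.fc.weight": "donor.safetensors",
    "mtp.fc.weight_scale": "donor.safetensors",
    "model.embed_tokens.weight": "donor.safetensors",
}}

merged, added = graft_mtp_index(target, donor, donor_shard="mtp.safetensors")
print(json.dumps(added))
```

A real graft would also need to write the selected tensors into the new shard file and update the model config; the sketch covers only the bookkeeping in the index.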
Generation Performance (Token/sec generated) vs. Baseline
MTP / Speculative Decoding ENABLED
| Metric | Value |
|---|---|
| Successful / Failed Requests | 32 / 0 |
| Total Wall Time | 159.47 sec |
| Average Request Time | 19.11 sec |
| Max Request Time | 27.48 sec |
| Total Prompt Tokens | 1,944 |
| Total Completion Tokens | 119,769 |
| Total Tokens Processed | 121,713 |
| Completion Throughput | 751.06 tok/sec |
| Total Throughput | 763.25 tok/sec |
SERVED_NAME="ornith-35b-fp8-e4m3-mtp" \
REQS="32" \
PARALLEL="4" \
MAX_TOKENS="4096" \
TEMP="0.6" \
TOP_P="0.95" \
TOP_K="20" \
PROMPTS_FILE="" \
SHUFFLE_PROMPTS="1" \
RANDOM_NONCE="1" \
./stress.sh
==> Stress test
url=http://127.0.0.1:8998/v1/chat/completions
model=ornith-35b-fp8-e4m3-mtp
reqs=32 parallel=4 max_tokens=4096
prompts=/tmp/tmp.Lx42dinyo3/default-prompts.txt count=20 shuffle=1 nonce=1
id status seconds prompt_tokens completion_tokens total_tokens
10 200 15.468765 65 3458 3523
11 200 17.042949 57 3243 3300
12 200 17.226863 57 3213 3270
13 200 19.818083 62 3677 3739
14 200 20.236654 77 4096 4173
15 200 18.975290 68 4096 4164
16 200 19.780207 62 4096 4158
17 200 17.789587 58 3381 3439
18 200 19.140477 57 3906 3963
19 200 19.095998 66 4096 4162
1 200 26.982888 61 3764 3825
20 200 21.414041 62 4096 4158
21 200 18.708545 53 4096 4149
22 200 18.280602 64 4096 4160
23 200 16.905085 55 3445 3500
24 200 19.210371 53 4096 4149
25 200 20.722689 77 4096 4173
26 200 18.112546 59 4096 4155
27 200 16.745625 60 3269 3329
28 200 19.964553 57 4096 4153
29 200 20.624861 77 4096 4173
2 200 24.681820 56 3746 3802
30 200 18.958027 62 4096 4158
31 200 16.092866 58 3277 3335
32 200 14.789030 65 3922 3987
3 200 27.475965 55 4096 4151
4 200 23.244902 55 3343 3398
5 200 19.629941 52 4096 4148
6 200 12.735978 62 2328 2390
7 200 20.976643 61 4096 4157
8 200 19.030427 52 4096 4148
9 200 11.501070 59 2165 2224
==> Summary
ok=32 fail=0
wall_sec=159.467
avg_req_sec=19.105 max_req_sec=27.476
prompt_tokens=1944 completion_tokens=119769 total_tokens=121713
completion_tok_per_sec=751.06
total_tok_per_sec=763.25
Baseline: Speculative Decoding DISABLED
| Metric | Value |
|---|---|
| Successful / Failed Requests | 32 / 0 |
| Total Wall Time | 163.17 sec |
| Average Request Time | 19.56 sec |
| Max Request Time | 26.79 sec |
| Total Prompt Tokens | 2,028 |
| Total Completion Tokens | 103,571 |
| Total Tokens Processed | 105,599 |
| Completion Throughput | 634.73 tok/sec |
| Total Throughput | 647.16 tok/sec |
make stress
chmod +x ./stress.sh
PORT="8998" \
API_KEY="local-dev-key" \
SERVED_NAME="ornith-35b-fp8-e4m3-mtp" \
REQS="32" \
PARALLEL="4" \
MAX_TOKENS="4096" \
TEMP="0.6" \
TOP_P="0.95" \
TOP_K="20" \
PROMPTS_FILE="" \
SHUFFLE_PROMPTS="1" \
RANDOM_NONCE="1" \
./stress.sh
==> Stress test
url=http://127.0.0.1:8998/v1/chat/completions
model=ornith-35b-fp8-e4m3-mtp
reqs=32 parallel=4 max_tokens=4096
prompts=/tmp/tmp.0nT1f4XYq0/default-prompts.txt count=20 shuffle=1 nonce=1
id status seconds prompt_tokens completion_tokens total_tokens
10 200 24.386988 57 4096 4153
11 200 24.412515 60 4096 4156
12 200 24.397616 76 4096 4172
13 200 14.089454 68 2362 2430
14 200 21.222193 67 3552 3619
15 200 13.490376 56 2276 2332
16 200 21.625402 62 3628 3690
17 200 18.498888 57 3095 3152
18 200 24.454264 61 4096 4157
19 200 17.620515 59 2972 3031
1 200 17.782799 61 2586 2647
20 200 19.214265 56 3233 3289
21 200 18.161814 74 3055 3129
22 200 24.359344 67 4096 4163
23 200 24.411447 57 4096 4153
24 200 24.442081 63 4096 4159
25 200 14.531877 60 2440 2500
26 200 13.009990 63 2177 2240
27 200 6.885480 58 1152 1210
28 200 24.296608 76 4096 4172
29 200 15.699833 60 2643 2703
2 200 18.248555 56 2656 2712
30 200 6.860422 59 1154 1213
31 200 23.504682 63 4096 4159
32 200 17.236493 72 3049 3121
3 200 26.789070 67 4096 4163
4 200 26.788966 76 4096 4172
5 200 21.046680 73 3479 3552
6 200 15.875098 62 2615 2677
7 200 15.088136 62 2481 2543
8 200 24.676287 54 4096 4150
9 200 22.677346 66 3814 3880
==> Summary
ok=32 fail=0
wall_sec=163.173
avg_req_sec=19.556 max_req_sec=26.789
prompt_tokens=2028 completion_tokens=103571 total_tokens=105599
completion_tok_per_sec=634.73
total_tok_per_sec=647.16
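As a sanity check on the two summaries above, the reported throughput figures and the roughly 18% speedup can be recomputed from the logged token counts and wall times. The helper name `throughput` is just for illustration; all numbers are copied from the two runs.

```python
def throughput(completion_tokens: int, wall_sec: float) -> float:
    """Completion throughput in tokens per second."""
    return completion_tokens / wall_sec


# Values taken from the "==> Summary" sections of the two runs.
mtp = throughput(119_769, 159.467)       # speculative decoding enabled
baseline = throughput(103_571, 163.173)  # speculative decoding disabled
speedup = mtp / baseline

print(f"MTP:      {mtp:.2f} tok/sec")
print(f"Baseline: {baseline:.2f} tok/sec")
print(f"Speedup:  {(speedup - 1) * 100:.1f}%")
```

Note that the two runs used shuffled prompts with sampling enabled, so they did not generate the same number of tokens; the comparison is between aggregate throughputs, not between identical workloads.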
MTP / Speculative Decoding performance report
The following measurements were taken on a single NVIDIA H200 NVL (141 GB) with `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'`. See the Makefile for the exact parameters used to run the vLLM server with the benchmarked configuration.
| Metric | Samples | Min | Max | Mean |
|---|---|---|---|---|
| Mean Acceptance Length | 16 | 2.24 | 2.67 | 2.38 |
| Avg Draft Acceptance Rate | 16 | 62.1% | 83.4% | 69.2% |
| Per-position Acceptance Rate P1 | 16 | 73.1% | 89.4% | 78.7% |
| Per-position Acceptance Rate P2 | 16 | 51.2% | 77.5% | 59.7% |
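The mean values in the table are consistent with each other, which the short check below illustrates. It assumes the usual definitions for these metrics (mean acceptance length counts the one token the target model always emits per step plus the accepted draft tokens, and the average draft acceptance rate is the mean over the two draft positions); the card itself does not spell these definitions out.

```python
# Mean per-position acceptance rates from the table (2 speculative tokens).
p1 = 0.787
p2 = 0.597

# Assumed definition: 1 token from the target model per step,
# plus the expected number of accepted draft tokens.
mean_acceptance_length = 1 + p1 + p2

# Assumed definition: accepted draft tokens / proposed draft tokens.
avg_draft_acceptance_rate = (p1 + p2) / 2

print(f"mean acceptance length:    {mean_acceptance_length:.2f}")
print(f"avg draft acceptance rate: {avg_draft_acceptance_rate:.1%}")
```

Under these assumptions the result matches the reported means of 2.38 and 69.2%. The first draft position is accepted more often than the second, as expected, since the second position can only be accepted when the first one was.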
Serving
vllm serve ./Ornith-1.0-35B-FP8-MTP \
--served-model-name ornith-35b-fp8-mtp \
--trust-remote-code \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
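Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal standard-library sketch: the base URL assumes vLLM's default port 8000 (the benchmark above used port 8998 and an API key instead), the sampling values mirror the stress-test settings, and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="ornith-35b-fp8-mtp", api_key=None,
                       max_tokens=4096):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Sampling settings copied from the stress test above.
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant message content."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("Write a binary search in Python.")` would return the model's answer as a string; pass `base_url` and `api_key` if the server was started on a different port or with an API key.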
Model tree for kyr0/Ornith-35B-FP8-E4M3-MTP
- Base model: deepreinforce-ai/Ornith-1.0-35B
- Quantized: protoLabsAI/Ornith-1.0-35B-FP8