Instructions for using protoLabsAI/Ornith-1.0-35B-FP8 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use protoLabsAI/Ornith-1.0-35B-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="protoLabsAI/Ornith-1.0-35B-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

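processor = AutoProcessor.from_pretrained("protoLabsAI/Ornith-1.0-35B-FP8")
# device_map="auto" places the FP8 weights on the available GPUs (requires `accelerate`)
model = AutoModelForMultimodalLM.from_pretrained("protoLabsAI/Ornith-1.0-35B-FP8", device_map="auto")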
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

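outputs = model.generate(**inputs, max_new_tokens=40)
# generate() returns the prompt followed by the new tokens; slice off the prompt before decoding
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))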
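
Both snippets above build the same chat-format `messages` list, in which each user turn holds a list of typed content parts. The following is a minimal sketch of a helper that assembles that structure from a text prompt and zero or more image URLs. The name `build_user_message` is hypothetical and is not part of Transformers.

```python
from typing import Any, Dict, List, Optional


def build_user_message(text: str, image_urls: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build one user turn in the chat format used above (hypothetical helper)."""
    # Image parts come first and the text prompt last, matching the snippets above.
    content: List[Dict[str, str]] = [
        {"type": "image", "url": url} for url in (image_urls or [])
    ]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


CANDY_URL = (
    "https://huggingface.co/datasets/huggingface/documentation-images"
    "/resolve/main/p-blog/candy.JPG"
)
messages = [build_user_message("What animal is on the candy?", [CANDY_URL])]
```

The resulting list has the same shape as the hand-written one, so it should be usable unchanged with `pipe(text=messages)` or `processor.apply_chat_template(...)`. A text-only turn is simply `build_user_message("Hello")`.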

Notebooks
Google Colab
Kaggle

Local Apps

vLLM

How to use protoLabsAI/Ornith-1.0-35B-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "protoLabsAI/Ornith-1.0-35B-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "protoLabsAI/Ornith-1.0-35B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
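
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library (`urllib.request` and `json`) and assumes the vLLM server started above is listening on `http://localhost:8000`. The function names `build_chat_request`, `extract_reply`, and `chat` are hypothetical. The reply is read from `choices[0].message.content`, which is the standard OpenAI chat-completions response layout.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat-completions response body."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send one prompt and return the reply; needs a running server."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.loads(resp.read().decode("utf-8")))


# Example, once `vllm serve` from above is running on port 8000:
# print(chat("http://localhost:8000", "protoLabsAI/Ornith-1.0-35B-FP8",
#            "What is the capital of France?"))
```

Request construction is kept separate from sending, so the payload can be inspected without a server. Because SGLang exposes the same OpenAI-compatible route, pointing `base_url` at `http://localhost:30000` should work the same way.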

Use Docker

docker model run hf.co/protoLabsAI/Ornith-1.0-35B-FP8

SGLang

How to use protoLabsAI/Ornith-1.0-35B-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "protoLabsAI/Ornith-1.0-35B-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "protoLabsAI/Ornith-1.0-35B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
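
OpenAI-compatible servers such as SGLang and vLLM can also stream the reply when the request body includes `"stream": true`. The response then arrives as server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The following is a minimal sketch that reassembles the text from those lines. It assumes the standard chunk layout `choices[0].delta.content`. The name `collect_stream_text` is hypothetical, and the function accepts any iterable of lines, so it can be tried without a running server.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta content from OpenAI-style server-sent-event lines."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        # Skip blank separator lines and anything that is not a data field.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:  # e.g. a trailing chunk that carries no choices
            continue
        # The first chunk may carry only the role; later chunks carry text.
        content = choices[0].get("delta", {}).get("content")
        if content:
            pieces.append(content)
    return "".join(pieces)


# With a live server, the lines can come straight from urllib, e.g.:
#   with urllib.request.urlopen(request) as resp:
#       text = collect_stream_text(raw.decode("utf-8") for raw in resp)
# where `request` posts a chat body that includes "stream": true.
```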

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "protoLabsAI/Ornith-1.0-35B-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "protoLabsAI/Ornith-1.0-35B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
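
Downloading and loading a checkpoint of this size can take a while, and requests sent before loading finishes will fail. The following is a minimal sketch that polls until the server answers. It assumes the OpenAI-compatible `GET /v1/models` route is available, which both SGLang and vLLM expose. The names `models_endpoint_ok` and `wait_until_ready` are hypothetical. The probe, sleep, and clock functions are passed in, so the retry logic can be exercised without a server.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses:
        # treat any of them as "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 900.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example: block until the SGLang container above is serving on port 30000.
# ready = wait_until_ready(lambda: models_endpoint_ok("http://localhost:30000"))
```

Injecting the clock and sleep functions keeps the polling loop deterministic in tests. A real run just uses the defaults.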

Docker Model Runner

How to use protoLabsAI/Ornith-1.0-35B-FP8 with Docker Model Runner:
```
docker model run hf.co/protoLabsAI/Ornith-1.0-35B-FP8
```

Dual RTX 5090 test

by SamirGFX - opened 2 days ago

Discussion

SamirGFX

2 days ago

I have found that running this model on my dual RTX 5090 setup using vLLM delivers excellent generation speed. I asked it to generate a first-person shooter (FPS) game in HTML, and it produced approximately 5,400 lines of HTML, JavaScript, and CSS within five minutes. However, the resulting game did not work at all, and the model spent the next hour trying to fix the issues without success. It repeatedly got stuck in loops without making any progress.

I then asked it to rewrite the same game as a single HTML file. It generated approximately 2,800 lines of HTML, JavaScript, and CSS in about eight minutes. Unfortunately, the game contained numerous defects, and the model was unable to resolve them over the following 20 minutes, again cycling through the same unproductive patterns for the entire duration.

In comparison, Qwen3.6-27B remains the superior model for this type of task.

SamirGFX changed discussion title from Dual RTX 5090 to Dual RTX 5090 test 2 days ago
