Instructions to use StentorLabs/Portimbria-150M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use StentorLabs/Portimbria-150M with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="StentorLabs/Portimbria-150M")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Portimbria-150M")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use StentorLabs/Portimbria-150M with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "StentorLabs/Portimbria-150M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StentorLabs/Portimbria-150M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
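
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server from the command above is listening on localhost:8000. `build_chat_request` and `ask` are hypothetical helper names chosen for illustration, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: address used in the curl example
MODEL_ID = "StentorLabs/Portimbria-150M"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the model's reply as a string.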

Use Docker

docker model run hf.co/StentorLabs/Portimbria-150M

SGLang

How to use StentorLabs/Portimbria-150M with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "StentorLabs/Portimbria-150M" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StentorLabs/Portimbria-150M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "StentorLabs/Portimbria-150M" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StentorLabs/Portimbria-150M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use StentorLabs/Portimbria-150M with Docker Model Runner:
```
docker model run hf.co/StentorLabs/Portimbria-150M
```

CPU inference viable? + AutoTokenizer pad token question

by AILover713 - opened 27 days ago

Discussion

AILover713

27 days ago

hey quick question before I run this locally...
so I'm about to pull down Portimbria-150M and test it out, I'm on a CPU-only machine with 16GB RAM. card says FP16 weights are only ~302MB which is fine but like... is CPU inference actually going to be usable at this size or am I going to be sitting there waiting 3 minutes per token lol
also does AutoTokenizer just work out of the box here or do I need to manually set a pad token? I've been burned by that before on other models and generation just silently breaks in weird ways
anyone who's already run this able to chime in? 👍

StentorLabs

Owner 23 days ago

CPU inference at 151M: You won't be waiting 3 minutes per token at this size, but I don't have CPU throughput benchmarks yet, so exact speed will depend on your hardware. Best to just try it. What I do recommend for CPU is INT8 dynamic quantization. It's in the card and drops the weight footprint to ~151MB with better throughput:
import torch

model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
)
With 16GB RAM you're totally fine either way — total INT8 memory including KV cache is only ~231MB.
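
The figures quoted in this thread are consistent with simple bytes-per-parameter arithmetic. Here is a rough sketch, assuming a round 151 million parameters (taken from the "151M" above) and decimal megabytes. The real footprint may differ somewhat, since the quantization call only converts the `torch.nn.Linear` layers.

```python
PARAMS = 151_000_000       # assumed round parameter count ("151M")
BYTES_PER_MB = 1_000_000   # decimal megabytes, to match the quoted figures


def weight_mb(num_params, bytes_per_param):
    """Approximate weight storage in MB for a given precision."""
    return num_params * bytes_per_param / BYTES_PER_MB


fp16_mb = weight_mb(PARAMS, 2)  # FP16 stores 2 bytes per parameter
int8_mb = weight_mb(PARAMS, 1)  # INT8 stores 1 byte per parameter
other_mb = 231 - int8_mb        # what the quoted ~231MB total leaves for KV cache and overhead

print(f"FP16 weights: ~{fp16_mb:.0f} MB")
print(f"INT8 weights: ~{int8_mb:.0f} MB")
print(f"KV cache and overhead: ~{other_mb:.0f} MB")
```

Under these assumptions the FP16 and INT8 weight sizes come out at the ~302MB and ~151MB mentioned in the thread, leaving roughly 80MB of the quoted total for everything else.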
Pad token: Yes, handle it — you're right to flag this. Just pass pad_token_id=tokenizer.eos_token_id in your .generate() call, which is what all the example code in the card does. Silent breakage on generation is exactly the failure mode if you skip it.
Also don't forget repetition_penalty=1.1 — I called it non-negotiable in the card for a reason. Without it you'll get looping outputs on pattern-heavy prompts almost immediately.
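
Both settings can be kept in one place so neither gets forgotten. This is a minimal sketch: `generation_kwargs` is a hypothetical helper, not part of Transformers. It keeps an existing pad token if the tokenizer has one and otherwise falls back to EOS, which is a small variation on always passing `tokenizer.eos_token_id`. The `max_new_tokens=40` default is only borrowed from the usage snippet at the top of the page.

```python
from types import SimpleNamespace


def generation_kwargs(tokenizer, max_new_tokens=40):
    """Keyword arguments for model.generate() with the two settings from this thread."""
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        # No dedicated pad token: reuse EOS, as advised above.
        pad_id = tokenizer.eos_token_id
    return {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": pad_id,
        "repetition_penalty": 1.1,
    }


# Stand-in tokenizer with made-up ids so the sketch runs without downloading anything.
fake_tokenizer = SimpleNamespace(pad_token_id=None, eos_token_id=2)
print(generation_kwargs(fake_tokenizer))
```

With a real tokenizer and model, this would be used as `model.generate(**inputs, **generation_kwargs(tokenizer))`.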
Let me know how it runs, always interested in CPU perf reports!
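
For anyone who wants to send such a report, tokens per second is easy to measure. Here is a rough sketch using only the standard library. `tokens_per_second` is a hypothetical helper, and `generate_fn` stands for any zero-argument callable that runs generation once and returns the number of new tokens it produced.

```python
import time


def tokens_per_second(generate_fn, runs=3):
    """Average generation throughput over several runs."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate_fn()
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


def dummy_generate():
    """Stand-in for a real model call: pretends to produce 40 tokens."""
    time.sleep(0.01)
    return 40


print(f"{tokens_per_second(dummy_generate):.1f} tokens/s")
```

With a real model, `generate_fn` could wrap `model.generate(...)` and return `outputs.shape[-1] - inputs["input_ids"].shape[-1]`. Running one untimed warm-up call first usually gives steadier numbers.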

AILover713

22 days ago

Thanks a lot, this helped!
