CPU inference viable? + AutoTokenizer pad token question
hey quick question before I run this locally...
so I'm about to pull down Portimbria-150M and test it out, I'm on a CPU-only machine with 16GB RAM. card says FP16 weights are only ~302MB which is fine but like... is CPU inference actually going to be usable at this size or am I going to be sitting there waiting 3 minutes per token lol
also does AutoTokenizer just work out of the box here or do I need to manually set a pad token? I've been burned by that before on other models and generation just silently breaks in weird ways
anyone who's already run this able to chime in?
CPU inference at 151M: You won't be waiting 3 minutes per token at this size, but I don't have CPU throughput benchmarks yet, so exact speed will depend on your hardware; best to just try it. What I do recommend for CPU: use INT8 dynamic quantization. It's in the card and drops the weight footprint to ~151MB with better throughput:
    import torch

    # `model` is the already-loaded model; only its nn.Linear layers are quantized to INT8
    model_int8 = torch.quantization.quantize_dynamic(
        model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
    )
With 16GB RAM you're totally fine either way. Total INT8 memory including KV cache is only ~231MB.
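The sizes quoted in this thread follow from plain bytes-per-parameter arithmetic. Here is a quick sanity check, assuming roughly 151M parameters (inferred from the ~302MB FP16 figure, not read from the model config) and 1MB = 10^6 bytes. The remainder at the end is just the quoted ~231MB total minus the weights, not a separately measured KV cache size.

```python
# Assumption: ~151M parameters, inferred from the sizes quoted in the thread.
PARAMS = 151_000_000


def weight_megabytes(n_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in MB (10**6 bytes), ignoring any file or runtime overhead."""
    return n_params * bytes_per_param / 1_000_000


fp16_mb = weight_megabytes(PARAMS, 2)  # FP16 stores 2 bytes per parameter
int8_mb = weight_megabytes(PARAMS, 1)  # INT8 stores 1 byte per parameter

# Figure quoted in the thread for total INT8 memory including KV cache.
QUOTED_INT8_TOTAL_MB = 231
remainder_mb = QUOTED_INT8_TOTAL_MB - int8_mb  # what is left for KV cache and overhead

print(f"FP16 weights: ~{fp16_mb:.0f} MB")
print(f"INT8 weights: ~{int8_mb:.0f} MB")
print(f"Left for KV cache and overhead: ~{remainder_mb:.0f} MB")
```

One caveat: dynamic quantization as shown only converts nn.Linear layers, so the real INT8 footprint depends on how much of the model lives in other modules (embeddings, for example). Treat these as ballpark numbers.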
Pad token: Yes, handle it. You're right to flag this. Just pass pad_token_id=tokenizer.eos_token_id in your .generate() call, which is what all the example code in the card does. Silent breakage on generation is exactly the failure mode if you skip it.
Also don't forget repetition_penalty=1.1. I called it non-negotiable in the card for a reason. Without it you'll get looping outputs on pattern-heavy prompts almost immediately.
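Putting the two settings together, a generate call might look like the sketch below. This is not the card's example code: the helper names `build_generate_kwargs` and `generate_reply` are made up for illustration, `max_new_tokens=40` is an arbitrary choice, and the prompt is tokenized as plain text on the assumption that no chat template is needed. The transformers imports sit inside the function, so nothing is downloaded until you actually call it.

```python
def build_generate_kwargs(tokenizer, max_new_tokens=40):
    """Collect the generation settings recommended in the thread (hypothetical helper)."""
    return {
        "max_new_tokens": max_new_tokens,        # arbitrary length cap for this sketch
        "pad_token_id": tokenizer.eos_token_id,  # reuse EOS as the pad token
        "repetition_penalty": 1.1,               # value recommended in the thread
    }


def generate_reply(prompt, model_id="StentorLabs/Portimbria-150M"):
    """Load the model on CPU and generate a continuation for `prompt` (hypothetical helper)."""
    # Imported lazily so that defining this function needs neither the library nor a download.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, **build_generate_kwargs(tokenizer))

    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With transformers and torch installed, `print(generate_reply("Once upon a time"))` should exercise the whole path.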
Let me know how it runs, always interested in CPU perf reports!
Thanks a lot, this helped!