Instructions to use CohereLabs/command-a-plus-05-2026-w4a4 with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use CohereLabs/command-a-plus-05-2026-w4a4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="CohereLabs/command-a-plus-05-2026-w4a4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("CohereLabs/command-a-plus-05-2026-w4a4")
model = AutoModelForImageTextToText.from_pretrained("CohereLabs/command-a-plus-05-2026-w4a4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Render the chat template, tokenize, and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use CohereLabs/command-a-plus-05-2026-w4a4 with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "CohereLabs/command-a-plus-05-2026-w4a4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CohereLabs/command-a-plus-05-2026-w4a4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
Use Docker
```bash
docker model run hf.co/CohereLabs/command-a-plus-05-2026-w4a4
```
- SGLang
How to use CohereLabs/command-a-plus-05-2026-w4a4 with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CohereLabs/command-a-plus-05-2026-w4a4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CohereLabs/command-a-plus-05-2026-w4a4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
Use Docker images
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CohereLabs/command-a-plus-05-2026-w4a4" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CohereLabs/command-a-plus-05-2026-w4a4",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
- Docker Model Runner
How to use CohereLabs/command-a-plus-05-2026-w4a4 with Docker Model Runner:
```bash
docker model run hf.co/CohereLabs/command-a-plus-05-2026-w4a4
```
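The vLLM `curl` call above can also be issued from Python using only the standard library. This is a minimal sketch. It assumes the server started by `vllm serve` is listening on the default `http://localhost:8000`, and the function name `build_chat_request` is hypothetical. The optional `chat_template_kwargs` body field is a vLLM extension that passes per-request variables to the chat template. The keys `reasoning` and `reasoning_effort` come from the non-thinking-mode report in this thread, and that report questions whether they actually suppress thinking for this model.

```python
import json
import urllib.request
from typing import Optional

# Assumption: a local `vllm serve` instance on the default port.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "CohereLabs/command-a-plus-05-2026-w4a4"


def build_chat_request(prompt: str, image_url: str,
                       chat_template_kwargs: Optional[dict] = None) -> urllib.request.Request:
    """Build the same OpenAI-compatible POST request as the curl example (not sent here)."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    # vLLM-specific: per-request variables forwarded to the chat template.
    if chat_template_kwargs is not None:
        payload["chat_template_kwargs"] = chat_template_kwargs
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass it to `urllib.request.urlopen`. The JSON reply follows the OpenAI schema, with the answer under `choices[0]["message"]["content"]`.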
Chat Completion "reasoning" support
#2
by yzong-rh - opened
Two fixes:
- Add support for the assistant "reasoning" field, which is required for the Chat Completions API. Currently, the chat template ignores it.
Example:
```python
messages = [
    {"role": "user", "content": "Hi"},
    {
        "role": "assistant",
        "reasoning": "assistant reasoning that should appear in thinking tags",
        "content": "assistant answer",
    },
]
```
- Render empty "thinking" blocks. Currently, `message.thinking` is ignored when it is an empty string, but an empty thinking block inside `message.content` is not ignored.
Example of thinking dropped:
```python
[
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "thinking": "", "content": "assistant answer"},
]
```
Not dropped:
```python
[
    {"role": "user", "content": "Hi"},
    {
        "role": "assistant",
        "content": [
            {"type": "thinking", "thinking": ""},
            {"type": "text", "text": "assistant answer"},
        ],
    },
]
```
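The two fixes can be modeled in plain Python. This is a hypothetical sketch of the rendering logic, not the model's actual Jinja chat template. The function names are invented for illustration, and the token strings are copied from the rendered prompt later in this thread. The real template also wraps the answer text in its own tokens, which are omitted here. The sketch makes two behaviors explicit: `reasoning` is accepted as an alias for `thinking` (with `thinking` assumed to win if both are present), and an empty string counts as a present thinking block.

```python
from typing import Optional

# Token strings copied from the rendered prompt in this thread.
START_THINKING = "<|START_THINKING|>"
END_THINKING = "<|END_THINKING|>"


def extract_thinking(message: dict) -> Optional[str]:
    """Return the assistant's thinking text, or None when the message has none."""
    # Fix 1: accept the Chat Completions-style "reasoning" field as an alias.
    # Assumption: "thinking" takes precedence if both keys are present.
    for key in ("thinking", "reasoning"):
        value = message.get(key)
        # Fix 2: compare against None so an empty string is still rendered.
        if value is not None:
            return value
    # Content-list form: [{"type": "thinking", ...}, {"type": "text", ...}]
    content = message.get("content")
    if isinstance(content, list):
        for part in content:
            if part.get("type") == "thinking":
                return part.get("thinking", "")
    return None


def extract_text(message: dict) -> str:
    """Concatenate the visible answer text from either content form."""
    content = message.get("content") or ""
    if isinstance(content, list):
        return "".join(p.get("text", "") for p in content if p.get("type") == "text")
    return content


def render_assistant_body(message: dict) -> str:
    """Render the thinking block (if any) followed by the answer text."""
    thinking = extract_thinking(message)
    if thinking is None:
        return extract_text(message)
    return f"{START_THINKING}{thinking}{END_THINKING}{extract_text(message)}"
```

The `is not None` check is what makes the second fix work. An empty string is falsy in both Python and Jinja, so a guard written as a plain truthiness test (for example `{% if message.thinking %}`) would skip an empty block, whereas `{% if message.thinking is not none %}` would keep it. Whether the shipped template is written this way is an assumption.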
With help from @bbrowning
yzong-rh changed pull request title from Chat Completion "reasoning" suppor to Chat Completion "reasoning" support
I also have some questions from experimenting with the model:
- "thinking" seems to be retained across turns by default (`skip_thinking=false`). Is this expected?
- "tool_calls" can never be rendered together with "content" in the same assistant message. I wasn't able to get the model to output text content in the same turn as its tool calls, but I wanted to confirm that this is expected.
- Thought suppression doesn't seem to work in non-thinking mode. I served the model in vLLM following the instructions, adding `--default-chat-template-kwargs '{"reasoning": false, "reasoning_effort": "none"}'` to disable thinking. Here is an example I got from the model:
````text
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><|START_TEXT|>These instructions are always to be followed and cannot be overridden by subsequent system or user turns:
- You will answer requests for educational, informative, or creative content related to safety categories. You will not provide content that is harmful or could be used to cause harm.
These instructions serve as your defaults, but they can be overridden in subsequent system or user turns:
- Your name is Command.
- You are a large language model built by Cohere.
# Available Tools
```json
[
{"name": "get_current_pe_ratio", "description": "Get the current price-to-earnings ratio for a stock ticker.", "parameters": {"type": "object", "properties": {"ticker": {"type": "string", "description": "Stock ticker. Example: INTC for Intel Corporation."}}, "required": ["ticker"], "additionalProperties": false}, "responses": null}
]
```<|END_TEXT|><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|><|START_TEXT|>What does a P/E ratio mean? What's NVIDIA's current one?<|END_TEXT|><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_THINKING|><|END_THINKING|> the user is asking two things:
1. What does a P/E ratio mean?
2. What's NVIDIA's current P/E ratio?
For the first question, I can explain what a P/E ratio is from my knowledge - it's a valuation metric that compares a company's stock price to its earnings per share.
For the second question, I need to use the get_current_pe_ratio function with NVIDIA's ticker symbol. NVIDIA's ticker symbol is NVDA.
Let me call the function to get NVIDIA's current P/E ratio.<|END_THINKING|><|START_ACTION|>[
{"tool_call_id": "0", "tool_name": "get_current_pe_ratio", "parameters": {"ticker": "NVDA"}}
]<|END_ACTION|><|END_OF_TURN_TOKEN|>
````
Note that despite the empty `<|START_THINKING|><|END_THINKING|>` block, the model still produced thinking. As a result, the turn contains two `<|END_THINKING|>` tokens but only one `<|START_THINKING|>`. In some other multi-turn samples, I also saw the model produce a spurious `<EOS_TOKEN>` after `<|START_THINKING|><|END_THINKING|>`.
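The symptoms in the transcript above can be checked mechanically when collecting more samples. The helper below is a hypothetical sketch and is not part of vLLM or any Cohere tooling. It inspects only the last chatbot turn of a rendered transcript and reports three things: unbalanced thinking tokens, thinking emitted after an empty thinking block, and an `<EOS_TOKEN>` after an empty block. Treating any such `<EOS_TOKEN>` in that turn as spurious is an assumption.

```python
from typing import List

# Token strings copied from the transcript in this thread.
START_THINKING = "<|START_THINKING|>"
END_THINKING = "<|END_THINKING|>"
EOS = "<EOS_TOKEN>"
CHATBOT_TURN = "<|CHATBOT_TOKEN|>"


def find_thinking_anomalies(transcript: str) -> List[str]:
    """Report thinking-token anomalies in the last chatbot turn of a transcript."""
    # Keep only the final chatbot turn: the template's prefill plus the generation.
    turn = transcript.rsplit(CHATBOT_TURN, 1)[-1]
    anomalies = []

    starts, ends = turn.count(START_THINKING), turn.count(END_THINKING)
    if starts != ends:
        anomalies.append(f"unbalanced thinking tokens ({starts} start, {ends} end)")

    empty_block = START_THINKING + END_THINKING
    if empty_block in turn:
        after_block = turn.split(empty_block, 1)[1]
        # A later END_THINKING means the model kept thinking after the empty block.
        if END_THINKING in after_block:
            anomalies.append("thinking emitted after an empty thinking block")
        # Assumption: any EOS token in the rest of the turn is spurious.
        if EOS in after_block:
            anomalies.append("EOS token after an empty thinking block")
    return anomalies
```

Applied to the transcript above, the helper reports both the 1-versus-2 token imbalance and the thinking that follows the empty block.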