Qwen3.5-4B-pouw
A self-contained pouw model based on Qwen/Qwen3.5-4B. It bundles the full base weights (apache-2.0) together with the metadata that lets it mine MatMulToken Proof-of-Useful-Work while it serves. Pull this one repo and it runs; no second download is needed.
MatMulToken's mining is output-preserving: generation is bit-identical to the base model. The eligible transformer matmuls (in_features == common_dim = 2560) are reused as PoW lottery tickets, so you serve real text and mine on the same compute, with no second matmul.
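The eligibility rule can be illustrated with a minimal sketch. Only the condition in_features == common_dim = 2560 comes from this card; the `LinearShape` helper, the layer names, and every width other than 2560 are hypothetical values chosen for illustration and may not match the real Qwen3.5-4B architecture.

```python
from dataclasses import dataclass

COMMON_DIM = 2560  # common_dim from the "Mining shape" table


@dataclass(frozen=True)
class LinearShape:
    """Shape of one linear layer, using PyTorch-style field names."""
    name: str
    in_features: int
    out_features: int


def eligible_layers(layers, common_dim=COMMON_DIM):
    """Return the names of layers whose input width equals common_dim."""
    return [layer.name for layer in layers if layer.in_features == common_dim]


# Hypothetical projection shapes for illustration only; the real
# layer names and widths of the base model may differ.
example_layers = [
    LinearShape("attn.q_proj", 2560, 4096),
    LinearShape("attn.o_proj", 4096, 2560),
    LinearShape("mlp.up_proj", 2560, 9728),
    LinearShape("mlp.down_proj", 9728, 2560),
]

print(eligible_layers(example_layers))
# ['attn.q_proj', 'mlp.up_proj']
```

Under this reading, layers that consume the hidden state directly qualify, while layers whose input has a different width (such as an output or down projection in this made-up example) do not.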
It is GPU-agnostic (portable Triton/PyTorch kernels, no CUDA build): RTX 3090 (sm86) → 5090 → H100 → B200, same code.
Mining shape
| field | value |
|---|---|
| base model | Qwen/Qwen3.5-4B |
| modality | text |
| common_dim | 2560 |
| rank | 32 |
| mine_layers | 16 (overhead dial; layer count) |
| pipeline | vllm |
Mining regime (LLM)
Text LLMs mine during prefill, when many tokens are processed at once (rows = tokens is large). Single-token decode does not mine (rows ≈ 1), so interactive chat mines far less than long-prompt or batched-prefill serving. Diffusion models mine on every forward pass (their token count is always large), so for continuous mining a diffusion model (see Matmultoken/Z-Image-Turbo-pouw) is the stronger substrate; this LLM repo is intended for prefill-heavy or batch workloads.
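To make the prefill/decode contrast concrete, here is a rough sketch that counts matmul rows per forward pass for a single request. It assumes a simplified serving model (one prefill pass over the whole prompt, then one single-token decode pass per remaining new token) and ignores chunked prefill, cross-request batching, and prefix caching. The function names are hypothetical, and no mining threshold is assumed because this card does not state one.

```python
def rows_per_forward(prompt_tokens, new_tokens):
    """Rows (tokens) processed by each forward pass of one request.

    Simplified assumption: the prefill pass handles the whole prompt and
    yields the first new token; every further new token costs one decode
    pass with a single row.
    """
    if prompt_tokens < 1 or new_tokens < 1:
        raise ValueError("prompt_tokens and new_tokens must be >= 1")
    return [prompt_tokens] + [1] * (new_tokens - 1)


def prefill_fraction(rows):
    """Share of all processed rows that belong to the prefill pass."""
    return rows[0] / sum(rows)


# Interactive chat: short prompt, long answer (illustrative numbers).
chat = rows_per_forward(prompt_tokens=20, new_tokens=200)
# Prefill-heavy job: long prompt, a single generated token.
batch = rows_per_forward(prompt_tokens=4000, new_tokens=1)

print(len(chat), chat[0], round(prefill_fraction(chat), 3))    # 200 20 0.091
print(len(batch), batch[0], round(prefill_fraction(batch), 3)) # 1 4000 1.0
```

In the chat case almost every forward pass has a single row, whereas the prefill-heavy case does all of its work in one large-row pass, which is the regime the paragraph above describes as suited to mining.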
Use
# Serve via vLLM with quantization="pouw" (the vLLM-MatMulToken plugin auto-registers it).
from vllm import LLM
llm = LLM(model="Matmultoken/Qwen3.5-4B-pouw", quantization="pouw")  # mines on eligible matmuls while it serves
# generate() returns a list of RequestOutput objects; generation is bit-identical to the base model.
outputs = llm.generate("The history of money is")
print(outputs[0].outputs[0].text)  # text of the first completion
Notes
- The live PoW job and difficulty target always come from the chain at runtime; they are never baked into this repo. GPU kernels compile per architecture on first run (one-time, cached on disk).
- Published under the Matmultoken organization. The base weights (apache-2.0) are bundled in this repo at a pinned snapshot for a reproducible mining shape; the original model's LICENSE and attribution are preserved in-repo.
Generated by MatMulToken publish_pouw_models.py. License: MIT.