Cosmos-T-80M
Cosmos-T-80M is the first model in the Cosmos-T series: small, from-scratch, decoder-only Transformers pretrained on chain-of-thought data for research and demos. It is an instruct-style model trained with explicit <think>...</think> reasoning blocks.
⚠️ Research / demo model. 80M parameters trained on only ~215k tokens. It is intentionally small so you can run it on a free Kaggle T4 or in a Hugging Face Space demo. It is not a useful general assistant and will produce incoherent or hallucinated output on most prompts. The point of this release is the architecture and training recipe, not state-of-the-art quality.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-style, pre-norm, causal SDPA) |
| Parameters | ~79.7M |
| Layers (attention blocks) | 12 |
| d_model | 384 |
| Attention heads | 8 (head_dim = 48) |
| FFN hidden | 1536 (4 × d_model) |
| Activation | GELU |
| Normalization | LayerNorm, pre-norm |
| Positional encoding | Learned absolute |
| Embedding ↔ LM head | Tied |
| Context length (MAX_LEN) | 1028 |
| Training block size | 1028 tokens |
| Vocab size | 151,936 |
| Tokenizer | Qwen/Qwen2.5-0.5B (reused, not retrained) |
| License | Apache-2.0 |
Why these choices
- Tied embeddings: without tying, the 152k Qwen vocab alone would cost ~117M params (embed + head) and blow the <100M budget. Tying saves ~58M.
- 12 attention layers: informed by the prior ablation (1 vs 12 layers) showing that depth meaningfully improves the model's capacity to fit chain-of-thought reasoning patterns. See the research report for details.
- Qwen2.5 tokenizer: already understands <think>, has good multilingual coverage, and is well supported by transformers.
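The tying arithmetic in the first bullet can be checked directly. This is a minimal sketch that uses only the vocab size and d_model from the Model Details table; the variable names are illustrative.

```python
# Embedding-table cost with and without weight tying.
# Values come from the Model Details table; names are illustrative.
VOCAB_SIZE = 151_936
D_MODEL = 384

embedding_params = VOCAB_SIZE * D_MODEL   # token embedding matrix
untied_total = 2 * embedding_params       # separate embedding + LM head
tied_total = embedding_params             # LM head reuses the embedding
saved = untied_total - tied_total

print(f"untied: {untied_total / 1e6:.1f}M")  # 116.7M
print(f"tied:   {tied_total / 1e6:.1f}M")    # 58.3M
print(f"saved:  {saved / 1e6:.1f}M")         # 58.3M
```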
Architecture Diagram
Input tokens (Qwen2.5 vocab = 151,936)
        │
        ▼
┌──────────────────────────────────┐
│ Token Embedding (152k × 384)     │ ← tied with LM head
│ + Positional Embedding (1028×384)│
└──────────────────────────────────┘
        │
        ▼
┌───────────────────────────┐
│ Transformer Block × 12    │
│ ┌───────────────────────┐ │
│ │ LayerNorm             │ │
│ │ Causal Self-Attention │ │ ← 8 heads, fused SDPA
│ │ + residual            │ │
│ ├───────────────────────┤ │
│ │ LayerNorm             │ │
│ │ MLP: 384 → 1536 → 384 │ │ ← GELU
│ │ + residual            │ │
│ └───────────────────────┘ │
└───────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│ Final LayerNorm                  │
│ LM head = tok_emb.T (tied)       │
└──────────────────────────────────┘
        │
        ▼
Logits (B, T, 151936)
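As a rough cross-check of the diagram, the parameter count can be estimated from the listed dimensions. This sketch assumes GPT-2-style biases on every linear layer and a weight plus bias in every LayerNorm; the card does not state these details, so the actual `MiniGPT` count may differ slightly. The function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab=151_936, d_model=384, n_layers=12,
                    ffn_hidden=1536, max_len=1028):
    """Estimate parameters for a tied-embedding, pre-norm GPT-style model.

    Assumes biases on all linear layers and weight+bias LayerNorms
    (an assumption; the card does not specify this).
    """
    tok_emb = vocab * d_model                 # shared with the LM head
    pos_emb = max_len * d_model               # learned absolute positions
    attn = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    mlp = (d_model * ffn_hidden + ffn_hidden) + (ffn_hidden * d_model + d_model)
    norms = 2 * (2 * d_model)                 # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layers * block + final_norm

total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to about 80.0M, within 1% of the ~79.7M reported in Model Details, and roughly 73% of it is the tied token embedding.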
Training
| Setting | Value |
|---|---|
| Dataset | wop/XXXXXL-chain-of-thought (840 conversations, chain-of-thought format with <think> blocks) |
| Approx. tokens seen / epoch | ~215k |
| Epochs | 50 |
| Total optimizer steps | 1,650 |
| Batch size | 6 (split across 2 GPUs) |
| Optimizer | AdamW (β = (0.9, 0.95)), weight decay 0.1 |
| Peak LR | 3 × 10⁻⁴ |
| LR schedule | 50-step linear warmup → cosine decay to 10% of peak |
| Gradient clipping | 1.0 |
| Precision | FP16 autocast + GradScaler |
| Hardware | Kaggle Notebook, 2 × NVIDIA T4 (DataParallel) |
| Wall-clock time | 772 seconds (~13 minutes) |
| Final training loss | 0.4533 (perplexity ≈ 1.57) |
| Final validation loss | 7.0868 (perplexity ≈ 1196) |
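The learning-rate schedule in the table can be sketched as a plain function of the optimizer step. This is one common formulation of linear warmup followed by cosine decay, using the peak LR, warmup length, step count, and 10% floor listed above. The training script is not shown here, so details such as whether warmup starts at exactly zero are assumptions, and `lr_at` is a hypothetical name.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 0.1 * PEAK_LR   # cosine decays to 10% of peak
WARMUP_STEPS = 50
TOTAL_STEPS = 1650

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak (assumed starting point).
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR at the end of warmup to MIN_LR at TOTAL_STEPS.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for s in (0, 25, 50, 850, 1650):
    print(s, f"{lr_at(s):.2e}")
```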
Loss Curve
The training loss descends cleanly to ~0.45, but the validation loss bottoms out around step 300 (val ≈ 5.6) and then climbs to 7.09 by step 1650. This is heavy overfitting, and it is the expected behavior for an 80M-parameter model trained on only ~215k tokens (roughly 0.003 tokens per parameter, about 7,000× below the Chinchilla rule of thumb of ~20 tokens per parameter).
Evaluation Results
This model has not been evaluated on standard reasoning benchmarks (GSM8K, MMLU, etc.) because:
- It is far below the scale where those benchmarks produce meaningful signal.
- The pretraining corpus is 840 examples, orders of magnitude too small for general capability.
The numbers below are the only evaluation metrics that are meaningful at this scale:
| Metric | Split | Value |
|---|---|---|
| Cross-entropy loss | train | 0.4533 |
| Perplexity | train | 1.57 |
| Cross-entropy loss | validation (5% held-out) | 7.0868 |
| Perplexity | validation | 1196.1 |
Interpretation: the model has memorized the reasoning style and most of the surface patterns of the chain-of-thought corpus (a train perplexity of ~1.57 is extremely low for a from-scratch model and indicates near-memorization), but it does not generalize to held-out conversations.
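The perplexity rows follow from the cross-entropy rows by exponentiation. The sketch below assumes the losses are mean per-token cross-entropy in nats, which the reported pairs are consistent with; `perplexity` is an illustrative helper name.

```python
import math

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(cross_entropy_nats)

train_ppl = perplexity(0.4533)
val_ppl = perplexity(7.0868)
print(f"train: {train_ppl:.2f}")   # 1.57
print(f"val:   {val_ppl:.1f}")     # 1196.1
```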
How to Use
Quick start
import torch
from transformers import AutoTokenizer

# Load tokenizer (reused from Qwen2.5)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load weights
ckpt = torch.load("mini_cot_gpt.pt", map_location="cuda")
config = ckpt["config"]

# Rebuild model (see model.py for the MiniGPT class)
from model import MiniGPT
model = MiniGPT(**config).cuda()
model.load_state_dict(ckpt["model_state"])
model.eval()

# Generate
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))
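`model.generate` in the quick start is a method of the custom `MiniGPT` class, whose implementation is not shown here. As a generic illustration of what `temperature` and `top_k` arguments conventionally mean (an assumption about that method, not a copy of it), here is a dependency-free sketch that samples one token id from a list of logits. `sample_top_k` is a hypothetical helper and the logits are toy values.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from raw logits using temperature and top-k."""
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights over the kept ids
    # (max-subtracted for numerical stability).
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one id in proportion to its weight.
    return rng.choices(top_ids, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
print(sample_top_k(toy_logits, temperature=0.8, top_k=3))
```

Lower temperatures concentrate probability on the highest logits, and `top_k=1` reduces to greedy decoding.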
Prompt format
Cosmos-T uses the Qwen2.5 chat template. To activate chain-of-thought reasoning, use a system prompt like:
Enable thinking features: INTUITION, COLD START, HOT START
The model will then produce a <think>...</think> block followed by an answer (when it works at all; see Limitations).
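To separate the reasoning from the answer after generation, a small parser is enough. The sketch below assumes the text passed in is only the newly generated completion and that the tags appear literally as `<think>` and `</think>`; `split_reasoning` is a hypothetical helper, and the sample string is hand-written rather than real model output.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str):
    """Return (reasoning, answer); reasoning is None if no closed block is found."""
    match = THINK_RE.search(text)
    if match is None:
        # No complete <think>...</think> block: treat everything as the answer.
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

sample = "<think>12 * 7 = 84</think>The answer is 84."
print(split_reasoning(sample))
```

Since the model often fails to close the block, the `None` branch is worth handling explicitly in any demo code.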
Limitations
- Tiny pretraining corpus (840 conversations). The model is heavily overfit and will hallucinate confidently on anything outside its training distribution.
- No instruction tuning or RLHF beyond the original CoT-formatted pretraining data.
- English only in practice (although the Qwen tokenizer is multilingual).
- Not safety-aligned. No refusal training, no toxicity filtering. Do not deploy in user-facing applications.
- Short context. Training used 1028-token blocks, matching MAX_LEN=1028. Long-context behavior is untested.
- Single training seed. No error bars on the loss numbers.
Intended Use
- ✅ Research into small-scale pretraining, chain-of-thought formatting, and depth ablations
- ✅ Educational demos showing how a from-scratch Transformer is built and trained
- ✅ Hugging Face Space demos illustrating CoT-style generation
- ❌ Production use of any kind
- ❌ Generating factual content
- ❌ User-facing assistants
Cosmos-T Series
This is the first release in the Cosmos-T series. Planned future variants:
- A width-matched 1-layer baseline (for clean depth ablation)
- A longer-trained 12-layer variant with early stopping at best val loss
- Potentially larger CoT pretraining corpora
Citation
@misc{cosmos-t-80m,
author = {wop},
title = {Cosmos-T-80M: A small from-scratch chain-of-thought Transformer},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/wop/Cosmos-T-80M}
}
Acknowledgements
- Tokenizer from Qwen2.5 by Alibaba Cloud
- Training data from wop/XXXXXL-chain-of-thought
- Trained on free Kaggle T4 GPUs
