Cosmos-T-80M

Cosmos-T-80M is the first model in the Cosmos-T series — small, from-scratch, decoder-only Transformers pretrained on chain-of-thought data for research and demos. It is an instruct-style model trained with explicit <think>...</think> reasoning blocks.

⚠️ Research / demo model. 80M parameters trained on only ~215k tokens. It is intentionally small so you can run it on a free Kaggle T4 or in a HF Space demo. It is not a useful general assistant and will produce incoherent or hallucinated output on most prompts. The point of this release is the architecture + training recipe, not state-of-the-art quality.

Model Details


Architecture	Decoder-only Transformer (GPT-style, pre-norm, causal SDPA)
Parameters	~79.7 M
Layers (attention blocks)	12
d_model	384
Attention heads	8 (head_dim = 48)
FFN hidden	1536 (4 × d_model)
Activation	GELU
Normalization	LayerNorm, pre-norm
Positional encoding	Learned absolute
Embedding ↔ LM head	Tied
Context length (`MAX_LEN`)	1028
Training block size	1028 tokens
Vocab size	151,936
Tokenizer	`Qwen/Qwen2.5-0.5B` (reused, not retrained)
License	Apache-2.0

Why these choices

Tied embeddings — without tying, the 152k Qwen vocab alone would cost ~117M params (embed + head) and blow the <100M budget. Tying saves ~58M.
12 attention layers — informed by the prior ablation (1 vs 12 layers) showing depth meaningfully improves the model's capacity to fit chain-of-thought reasoning patterns. See the research report for details.
Qwen2.5 tokenizer — already understands <think>, has good multilingual coverage, and is well-supported by transformers.
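
The embedding arithmetic behind the tying decision can be checked directly. This is a small sketch that uses only the vocab size and d_model from the table above; the variable names are illustrative and not taken from the training code.

```python
# Parameter cost of the token embedding, with and without weight tying.
VOCAB_SIZE = 151_936  # Qwen2.5 vocab size from the Model Details table
D_MODEL = 384         # model width from the Model Details table

embed_params = VOCAB_SIZE * D_MODEL   # one (vocab x d_model) matrix
untied_params = 2 * embed_params      # separate input embedding and LM head
tied_params = embed_params            # one shared matrix for both roles

print(f"embedding matrix: {embed_params / 1e6:.1f}M params")
print(f"untied embed + head: {untied_params / 1e6:.1f}M params")
print(f"saved by tying: {(untied_params - tied_params) / 1e6:.1f}M params")
```

The untied total alone already exceeds a 100M budget before any Transformer block is counted, which is why tying is the only way to keep the full Qwen vocab at this width.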

Architecture Diagram

Input tokens  (Qwen2.5 vocab = 151,936)
        │
        ▼
┌──────────────────────────────────┐
│ Token Embedding  (152k × 384)    │ ← tied with LM head
│ + Positional Embedding (1028×384)│
└──────────────────────────────────┘
        │
        ▼
   ┌─────────────────────────────┐
   │  Transformer Block × 12     │
   │  ┌───────────────────────┐  │
   │  │ LayerNorm             │  │
   │  │ Causal Self-Attention │  │  8 heads, fused SDPA
   │  │ + residual            │  │
   │  ├───────────────────────┤  │
   │  │ LayerNorm             │  │
   │  │ MLP: 384 → 1536 → 384 │  │  GELU
   │  │ + residual            │  │
   │  └───────────────────────┘  │
   └─────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│ Final LayerNorm                  │
│ LM head = tok_emb.T  (tied)      │
└──────────────────────────────────┘
        │
        ▼
   Logits (B, T, 151936)
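
The diagram can be read as roughly the following PyTorch module. This is a minimal sketch, not the MiniGPT class from model.py: the names SketchBlock and SketchGPT are hypothetical, and details such as the fused QKV projection and the bias terms are assumptions. It needs PyTorch 2.0 or later for `F.scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchBlock(nn.Module):
    """Pre-norm block: LN -> causal attention -> residual, LN -> MLP -> residual."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)  # assumption: fused QKV projection
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        # Reshape (B, T, C) -> (B, heads, T, head_dim) for multi-head attention.
        q, k, v = [t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v)]
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(att.transpose(1, 2).reshape(B, T, C))
        return x + self.mlp(self.ln2(x))


class SketchGPT(nn.Module):
    def __init__(self, vocab_size=151_936, max_len=1028, d_model=384,
                 n_layers=12, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned absolute positions
        self.blocks = nn.ModuleList(
            [SketchBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)

    def forward(self, ids):
        B, T = ids.shape
        pos = torch.arange(T, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        # Tied LM head: reuse the token-embedding matrix as the output projection.
        return F.linear(self.ln_f(x), self.tok_emb.weight)


# Tiny configuration so the shape check runs quickly on CPU.
model = SketchGPT(vocab_size=100, max_len=16, d_model=32, n_layers=2, n_heads=4)
logits = model(torch.randint(0, 100, (2, 10)))
print(logits.shape)  # (batch, sequence, vocab)
```

With the default arguments the same class matches the sizes in the diagram, but the released checkpoint should be loaded with the actual MiniGPT class, since parameter names here are not guaranteed to match.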

Training


Dataset	`wop/XXXXXL-chain-of-thought` (840 conversations, chain-of-thought format with `<think>` blocks)
Approx. tokens seen / epoch	~215k
Epochs	50
Total optimizer steps	1,650
Batch size	6 (split across 2 GPUs)
Optimizer	AdamW (β = 0.9, 0.95), weight decay 0.1
Peak LR	3 × 10⁻⁴
LR schedule	50-step linear warmup → cosine decay to 10% of peak
Gradient clipping	1.0
Precision	FP16 autocast + GradScaler
Hardware	Kaggle Notebook, 2 × NVIDIA T4 (DataParallel)
Wall-clock time	772 seconds (~13 minutes)
Final training loss	0.4533 (perplexity ≈ 1.57)
Final validation loss	7.0868 (perplexity ≈ 1196)
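
The learning-rate schedule in the table (50-step linear warmup, then cosine decay to 10% of peak over 1,650 steps) can be sketched as a plain function. This is an illustration under stated assumptions rather than the training script's scheduler: the function name `lr_at` is hypothetical, and whether warmup starts at zero or at peak/50 is a guess.

```python
import math

PEAK_LR = 3e-4        # peak learning rate from the table
WARMUP_STEPS = 50     # linear warmup length from the table
TOTAL_STEPS = 1650    # total optimizer steps from the table
MIN_LR_RATIO = 0.10   # cosine decay floor: 10% of peak


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup ramps linearly and reaches peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


for step in (0, 49, 50, 850, 1650):
    print(step, f"{lr_at(step):.2e}")
```

In PyTorch a function like this would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base learning rate.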

Loss Curve

The training loss descends cleanly to ~0.45, but the validation loss bottoms out around step 300 (val ≈ 5.6) and then climbs to 7.09 by step 1650. This is heavy overfitting, and it is the expected behavior for an 80M-parameter model trained on only ~215k tokens: 215k / 80M ≈ 0.003 tokens per parameter, roughly 7,000× below the ~20 tokens per parameter suggested by the Chinchilla scaling heuristic.

Evaluation Results

This model has not been evaluated on standard reasoning benchmarks (GSM8K, MMLU, etc.) because:

It is far below the scale where those benchmarks produce meaningful signal.
The pretraining corpus is 840 examples — orders of magnitude too small for general capability.

The numbers below are the only evaluation metrics that are meaningful at this scale:

Metric	Split	Value
Cross-entropy loss	train	0.4533
Perplexity	train	1.57
Cross-entropy loss	validation (5% held-out)	7.0868
Perplexity	validation	1196.1

Interpretation: the model has memorized the reasoning style and most of the surface patterns of the chain-of-thought corpus (train perplexity ~1.57 is extremely low for a from-scratch model — close to memorization), but does not generalize to held-out conversations.
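
The perplexity figures follow from the cross-entropy losses via perplexity = exp(loss), assuming the loss is the mean per-token cross-entropy in nats (the PyTorch default). A quick check using the two reported loss values:

```python
import math

# Reported final losses (mean per-token cross-entropy, assumed to be in nats).
train_loss = 0.4533
val_loss = 7.0868

train_ppl = math.exp(train_loss)
val_ppl = math.exp(val_loss)

print(f"train perplexity: {train_ppl:.2f}")
print(f"validation perplexity: {val_ppl:.1f}")
print(f"validation / train perplexity ratio: {val_ppl / train_ppl:.0f}x")
```

The gap between the two perplexities is a compact way to see the overfitting described above.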

How to Use

Quick start

import torch
from transformers import AutoTokenizer

# Load tokenizer (reused from Qwen2.5)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load weights
ckpt = torch.load("mini_cot_gpt.pt", map_location="cuda")
config = ckpt["config"]

# Rebuild model (see model.py for the MiniGPT class)
from model import MiniGPT
model = MiniGPT(**config).cuda()
model.load_state_dict(ckpt["model_state"])
model.eval()

# Generate
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user",   "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))

Prompt format

Cosmos-T uses the Qwen2.5 chat template. To activate chain-of-thought reasoning, use a system prompt like:

Enable thinking features: INTUITION, COLD START, HOT START

The model will then produce a <think>...</think> block followed by an answer (when it works at all — see limitations).
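
Since the output is expected to contain a <think>...</think> block followed by an answer, a small helper can separate the two for display. This is a sketch: the function name `split_reasoning` is hypothetical, and it assumes at most one well-formed think block, which this model may not always produce.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Return (reasoning, answer); reasoning is None if no complete think block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


sample = "<think>12 * 7 = 84</think>The answer is 84."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

When decoding with `skip_special_tokens=False`, chat-template markers may remain in the answer part and can be stripped separately.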

Limitations

Tiny pretraining corpus (840 conversations). The model is heavily overfit and will hallucinate confidently on anything outside its training distribution.
No instruction tuning or RLHF beyond the original CoT-formatted pretraining data.
English only in practice (although the Qwen tokenizer is multilingual).
Not safety-aligned. No refusal training, no toxicity filtering. Do not deploy in user-facing applications.
Short context. Training used 1028-token blocks, which equals MAX_LEN = 1028, so the model has never seen longer sequences. Long-context behavior is untested.
Single training seed. No error bars on the loss numbers.

Intended Use

✅ Research into small-scale pretraining, chain-of-thought formatting, and depth ablations
✅ Educational demos showing how a from-scratch Transformer is built and trained
✅ HuggingFace Space demos illustrating CoT-style generation
❌ Production use of any kind
❌ Generating factual content
❌ User-facing assistants

Cosmos-T Series

This is the first release in the Cosmos-T series. Planned future variants:

A width-matched 1-layer baseline (for clean depth ablation)
A longer-trained 12-layer variant with early stopping at best val loss
Potentially larger CoT pretraining corpora

Citation

@misc{cosmos-t-80m,
  author       = {wop},
  title        = {Cosmos-T-80M: A small from-scratch chain-of-thought Transformer},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/wop/Cosmos-T-80M}
}

Acknowledgements

Tokenizer from Qwen2.5 by Alibaba Cloud
Training data from wop/XXXXXL-chain-of-thought
Trained on free Kaggle T4 GPUs
