Cosmos-T-80M

Cosmos-T-80M is the first model in the Cosmos-T series: small, from-scratch, decoder-only Transformers pretrained on chain-of-thought data for research and demos. It is an instruct-style model trained with explicit <think>...</think> reasoning blocks.

⚠️ Research / demo model. 80M parameters trained on only ~215k tokens. It is intentionally small so you can run it on a free Kaggle T4 or in a HF Space demo. It is not a useful general assistant and will produce incoherent or hallucinated output on most prompts. The point of this release is the architecture + training recipe, not state-of-the-art quality.


Model Details

Field | Value
Architecture | Decoder-only Transformer (GPT-style, pre-norm, causal SDPA)
Parameters | ~79.7M
Layers (attention blocks) | 12
d_model | 384
Attention heads | 8 (head_dim = 48)
FFN hidden | 1536 (4 × d_model)
Activation | GELU
Normalization | LayerNorm, pre-norm
Positional encoding | Learned absolute
Embedding ↔ LM head | Tied
Context length (MAX_LEN) | 1028 tokens
Training block size | 1028 tokens
Vocab size | 151,936
Tokenizer | Qwen/Qwen2.5-0.5B (reused, not retrained)
License | Apache-2.0

Why these choices

  • Tied embeddings: without tying, the 152k-entry Qwen vocabulary alone would cost ~117M parameters (embedding + LM head) and blow the <100M budget. Tying saves ~58M.
  • 12 attention layers: informed by the prior ablation (1 vs. 12 layers), which showed that depth meaningfully improves the model's capacity to fit chain-of-thought reasoning patterns. See the research report for details.
  • Qwen2.5 tokenizer: already understands <think>, has good multilingual coverage, and is well supported by transformers.
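
The embedding arithmetic behind the first bullet can be checked in a few lines. This is a minimal sketch that uses only the vocab size and d_model reported on this card; the helper name embedding_params is made up for illustration, and it assumes the LM head has no bias term.

```python
# Figures taken from the Model Details table on this card.
VOCAB_SIZE = 151_936
D_MODEL = 384


def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the token embedding plus the LM head (no bias assumed)."""
    table = vocab_size * d_model
    # A tied LM head reuses the embedding matrix, so it adds no parameters.
    return table if tied else 2 * table


untied = embedding_params(VOCAB_SIZE, D_MODEL, tied=False)
tied = embedding_params(VOCAB_SIZE, D_MODEL, tied=True)

print(f"untied (embedding + head): {untied / 1e6:.1f}M")
print(f"tied (shared matrix):      {tied / 1e6:.1f}M")
print(f"saved by tying:            {(untied - tied) / 1e6:.1f}M")
```

The untied figure rounds to ~117M and the saving to ~58M, matching the numbers quoted above.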

Architecture Diagram

Input tokens  (Qwen2.5 vocab = 151,936)
        │
        ▼
┌──────────────────────────────────┐
│ Token Embedding  (152k × 384)    │ ← tied with LM head
│ + Positional Embedding (1028×384)│
└──────────────────────────────────┘
        │
        ▼
   ┌─────────────────────────────┐
   │  Transformer Block × 12     │
   │  ┌───────────────────────┐  │
   │  │ LayerNorm             │  │
   │  │ Causal Self-Attention │  │  8 heads, fused SDPA
   │  │ + residual            │  │
   │  ├───────────────────────┤  │
   │  │ LayerNorm             │  │
   │  │ MLP: 384 → 1536 → 384 │  │  GELU
   │  │ + residual            │  │
   │  └───────────────────────┘  │
   └─────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│ Final LayerNorm                  │
│ LM head = tok_emb.T  (tied)      │
└──────────────────────────────────┘
        │
        ▼
   Logits (B, T, 151936)
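
As a rough cross-check of the diagram, the parameter count can be estimated component by component. The sketch below is an estimate, not the author's accounting: the function estimate_params is hypothetical, and it assumes bias terms on every linear layer and a weight plus bias on every LayerNorm. The real model.py may differ on those details, which is why the total lands near, but not exactly on, the ~79.7M reported above.

```python
def estimate_params(vocab=151_936, d_model=384, n_layers=12,
                    d_ff=1536, max_len=1028, bias=True) -> dict:
    """Estimate parameters per component for the architecture in the diagram."""
    b = 1 if bias else 0
    tok_emb = vocab * d_model            # shared with the LM head (tied)
    pos_emb = max_len * d_model          # learned absolute positions
    # Attention: Q, K, V and output projections, each d_model x d_model.
    attn = 4 * (d_model * d_model + b * d_model)
    # MLP: d_model -> d_ff -> d_model.
    mlp = (d_model * d_ff + b * d_ff) + (d_ff * d_model + b * d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * 2 * d_model
    block = attn + mlp + norms
    final_norm = 2 * d_model
    total = tok_emb + pos_emb + n_layers * block + final_norm
    return {
        "tok_emb": tok_emb,
        "pos_emb": pos_emb,
        "per_block": block,
        "all_blocks": n_layers * block,
        "final_norm": final_norm,
        "total": total,
    }


counts = estimate_params()
for name, value in counts.items():
    print(f"{name:>10}: {value / 1e6:7.2f}M")
```

Under these assumptions the token embedding accounts for roughly three quarters of all parameters, and the 12 Transformer blocks for only about 21M.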

Training

Setting | Value
Dataset | wop/XXXXXL-chain-of-thought (840 conversations, chain-of-thought format with <think> blocks)
Approx. tokens seen per epoch | ~215k
Epochs | 50
Total optimizer steps | 1,650
Batch size | 6 (split across 2 GPUs)
Optimizer | AdamW (β = 0.9, 0.95), weight decay 0.1
Peak LR | 3 × 10⁻⁴
LR schedule | 50-step linear warmup → cosine decay to 10% of peak
Gradient clipping | 1.0
Precision | FP16 autocast + GradScaler
Hardware | Kaggle Notebook, 2 × NVIDIA T4 (DataParallel)
Wall-clock time | 772 seconds (~13 minutes)
Final training loss | 0.4533 (perplexity ≈ 1.57)
Final validation loss | 7.0868 (perplexity ≈ 1196)
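
The learning-rate schedule listed above (50-step linear warmup, then cosine decay to 10% of the 3 × 10⁻⁴ peak over 1,650 steps) could look roughly like the function below. This is a sketch of one common way to implement such a schedule; the function name lr_at and the exact warmup convention (reaching the peak on the last warmup step) are assumptions, not taken from the training script.

```python
import math

# Values from the training table on this card.
PEAK_LR = 3e-4
WARMUP_STEPS = 50
TOTAL_STEPS = 1650
MIN_LR_RATIO = 0.10  # cosine decays to 10% of peak


def lr_at(step: int) -> float:
    """Learning rate for a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup; assumed to hit the peak on the final warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


for step in (0, 49, 50, 850, 1650):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

In a PyTorch training loop, a function like this would typically be wrapped in torch.optim.lr_scheduler.LambdaLR (dividing by the peak LR) or applied by setting each parameter group's lr before the optimizer step.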

Loss Curve

[Figure: loss curve]

The training loss descends cleanly to ~0.45, but the validation loss bottoms out around step 300 (val ≈ 5.6) and then climbs to 7.09 by step 1650. This is heavy overfitting, and it is the expected behavior for an 80M-parameter model trained on only ~215k tokens: roughly 0.003 tokens per parameter, about 7,000× below the ~20 tokens per parameter that the Chinchilla scaling results suggest.
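
The data-to-parameter ratio can be worked out directly from the figures reported on this card. The sketch below assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter; that constant is an approximation brought in for illustration, not a number from this card.

```python
# Figures reported on this card.
PARAMS = 79.7e6          # ~79.7M parameters
UNIQUE_TOKENS = 215e3    # ~215k tokens per epoch
EPOCHS = 50

# Assumption: the widely used ~20 tokens/parameter Chinchilla heuristic.
CHINCHILLA_TOKENS_PER_PARAM = 20

tokens_per_param = UNIQUE_TOKENS / PARAMS
shortfall = CHINCHILLA_TOKENS_PER_PARAM / tokens_per_param
tokens_seen = UNIQUE_TOKENS * EPOCHS  # repeats of the same data, not new tokens

print(f"unique tokens per parameter: {tokens_per_param:.4f}")
print(f"factor below the heuristic:  {shortfall:,.0f}x")
print(f"total tokens seen in 50 epochs: {tokens_seen / 1e6:.2f}M")
```

Even counting all 50 passes over the data, the model sees well under one token per parameter, and every pass repeats the same 840 conversations.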


Evaluation Results

This model has not been evaluated on standard reasoning benchmarks (GSM8K, MMLU, etc.) because:

  1. It is far below the scale where those benchmarks produce meaningful signal.
  2. The pretraining corpus is 840 examples, orders of magnitude too small for general capability.

The numbers below are the only evaluation metrics that are meaningful at this scale:

Metric | Split | Value
Cross-entropy loss | train | 0.4533
Perplexity | train | 1.57
Cross-entropy loss | validation (5% held-out) | 7.0868
Perplexity | validation | 1196.1

Interpretation: the model has memorized the reasoning style and most of the surface patterns of the chain-of-thought corpus (a train perplexity of ~1.57 is extremely low for a from-scratch model and indicates near-memorization), but it does not generalize to held-out conversations.
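
The perplexities in the table follow from the cross-entropy losses via perplexity = exp(loss), assuming the losses are mean per-token values in nats (the PyTorch default for cross-entropy). A short check, which also computes the loss of a uniform guess over the vocabulary as a reference point:

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(cross_entropy_nats)


train_ppl = perplexity(0.4533)
val_ppl = perplexity(7.0868)

# Reference point: a model that guesses uniformly over the 151,936-token
# vocabulary has loss ln(vocab) and perplexity equal to the vocab size.
uniform_loss = math.log(151_936)

print(f"train perplexity: {train_ppl:.2f}")
print(f"val perplexity:   {val_ppl:.1f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

The validation loss sits well below the uniform-guess loss, so the model has learned something about token statistics, but the gap between train and validation perplexity shows how little of it transfers.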


How to Use

Quick start

import torch
from transformers import AutoTokenizer

from model import MiniGPT  # the MiniGPT class is defined in model.py

# Use the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer (reused from Qwen2.5)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the checkpoint: a dict holding the model config and the weights
ckpt = torch.load("mini_cot_gpt.pt", map_location=device)
config = ckpt["config"]

# Rebuild the model from its config and restore the weights
model = MiniGPT(**config).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Build the prompt with the Qwen2.5 chat template
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user",   "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(device)

# Generate without tracking gradients
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))
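
The temperature and top_k arguments passed to generate control how each next token is sampled. The actual implementation lives in MiniGPT.generate in model.py and may differ; the dependency-free sketch below only illustrates the usual technique on a plain list of logits. The function name sample_top_k is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token index using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(top_k, len(logits)))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy logits over a 5-token vocabulary (illustrative values only).
toy_logits = [0.1, 2.0, -1.0, 1.5, 0.3]
greedy = sample_top_k(toy_logits, top_k=1)           # top_k=1 is greedy decoding
sampled = sample_top_k(toy_logits, temperature=0.8, top_k=2)
print(greedy, sampled)
```

With top_k=1 the call always returns the argmax, while larger k and higher temperature trade determinism for diversity. For a model this overfit, lower temperatures mostly reproduce memorized training text.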

Prompt format

Cosmos-T uses the Qwen2.5 chat template. To activate chain-of-thought reasoning, use a system prompt like:

Enable thinking features: INTUITION, COLD START, HOT START

The model will then produce a <think>...</think> block followed by an answer (when it works at all; see Limitations).
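
When post-processing generations, it can help to separate the reasoning block from the final answer. The helper below is a minimal sketch, not part of the released code: split_reasoning is a hypothetical name, and the sample string is hand-written for illustration rather than real model output. It falls back gracefully when no <think> block is present, which is likely to happen often with this model.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Return (reasoning, answer); reasoning is None if no block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Hand-written example in the expected format (not actual model output).
sample = "<think>12 * 7 = 84</think>\nThe answer is 84."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:   ", answer)
```

The parser is applied to the text generated after the prompt, so decode only the new tokens (or strip the prompt) before calling it.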


Limitations

  • Tiny pretraining corpus (840 conversations). The model is heavily overfit and will hallucinate confidently on anything outside its training distribution.
  • No instruction tuning or RLHF beyond the original CoT-formatted pretraining data.
  • English only in practice (although the Qwen tokenizer is multilingual).
  • Not safety-aligned. No refusal training, no toxicity filtering. Do not deploy in user-facing applications.
  • Short context: training blocks and MAX_LEN are both 1028 tokens. Behavior on longer inputs is untested.
  • Single training seed. No error bars on the loss numbers.

Intended Use

  • ✅ Research into small-scale pretraining, chain-of-thought formatting, and depth ablations
  • ✅ Educational demos showing how a from-scratch Transformer is built and trained
  • ✅ HuggingFace Space demos illustrating CoT-style generation
  • ❌ Production use of any kind
  • ❌ Generating factual content
  • ❌ User-facing assistants

Cosmos-T Series

This is the first release in the Cosmos-T series. Planned future variants:

  • A width-matched 1-layer baseline (for clean depth ablation)
  • A longer-trained 12-layer variant with early stopping at best val loss
  • Potentially larger CoT pretraining corpora
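
Early stopping at the best validation loss, as planned for the second variant, can be sketched with a small tracker. Everything here is an assumption for illustration: the class BestCheckpointTracker, the patience value, and the validation history, which is synthetic and only shaped like the curve described on this card (a minimum near step 300 followed by a rise).

```python
class BestCheckpointTracker:
    """Track the best validation loss and signal when to stop training."""

    def __init__(self, patience: int = 3):
        self.patience = patience          # evaluations allowed without improvement
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step: int, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss:
            # New best: this is where a checkpoint would be saved.
            self.best_loss = val_loss
            self.best_step = step
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Synthetic (step, val_loss) pairs, made up for this example.
history = [(100, 6.4), (200, 5.9), (300, 5.6), (400, 5.8), (500, 6.0), (600, 6.3)]

tracker = BestCheckpointTracker(patience=3)
stopped_at = None
for step, val_loss in history:
    if tracker.update(step, val_loss):
        stopped_at = step
        break

print(f"best step: {tracker.best_step}, stopped at: {stopped_at}")
```

In a real run the tracker would be updated after each validation pass, and the weights saved at best_step would be the ones released.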

Citation

@misc{cosmos-t-80m,
  author       = {wop},
  title        = {Cosmos-T-80M: A small from-scratch chain-of-thought Transformer},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/wop/Cosmos-T-80M}
}

Acknowledgements
