Cosmos-T2-Accelerate-Preview

A preview release of the Cosmos-T2-Accelerate series: a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

⚠️ Preview / research checkpoint. Tiny (≈10M params, d_model=64, 4 layers). It will hallucinate freely and locks into the <think>…</think> Answer: N GSM8K-style template. Use it to study the architecture and the training recipe, not for production.

Try it

🚀 Live demo: wop/Cosmos-T2-Accelerate-Preview-DEMO

Model Details

| Property | Value |
|---|---|
| Model class | CosmosT2_Accelerate_LLM |
| Architecture | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| Parameters | ~9.96 M |
| Layers | 4 |
| Attention heads | 4 |
| KV heads | 1 (GQA) |
| d_model | 64 |
| FFN hidden | 256 |
| Positional encoding | RoPE (rope_base=10000, NeoX-style interleaved) |
| Normalization | RMSNorm |
| MLP | SwiGLU |
| Memory | Engram (use_engram=True, every 2 blocks, 128 buckets, dim=16, order=3) |
| Context length | 1028 |
| Training block size | 1028 |
| Tokenizer | Qwen/Qwen2.5-0.5B |
| Vocab size | 151665 |
| Dataset | wop/XXXXXL-chain-of-thought |
| License | Apache-2.0 |
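
As a rough sanity check on the ~9.96 M figure, the count can be estimated from the values listed under Model Details. The sketch below assumes tied input/output embeddings, bias-free linear layers, and two RMSNorm weight vectors per block plus a final one. None of these assumptions are confirmed by this card, and the Engram path is left out because its layer shapes are not documented here.

```python
# Rough parameter estimate from the Model Details values.
# Assumptions (not confirmed by the model card): tied input/output embeddings,
# no biases, two RMSNorms per block plus one final norm, Engram excluded.
VOCAB, D_MODEL, N_LAYERS = 151_665, 64, 4
N_HEADS, N_KV_HEADS, D_FF = 4, 1, 256

head_dim = D_MODEL // N_HEADS                      # 16

embedding = VOCAB * D_MODEL                        # token embedding matrix
attn = (
    D_MODEL * D_MODEL                              # Q projection
    + 2 * D_MODEL * (N_KV_HEADS * head_dim)        # K and V projections (GQA)
    + D_MODEL * D_MODEL                            # output projection
)
mlp = 3 * D_MODEL * D_FF                           # SwiGLU: gate, up, down
norms = 2 * D_MODEL                                # two RMSNorm weights per block
per_block = attn + mlp + norms

estimate = embedding + N_LAYERS * per_block + D_MODEL  # + final RMSNorm

print(f"embedding:       {embedding:,}")
print(f"per block:       {per_block:,}")
print(f"estimate:        {estimate:,} (Engram excluded)")
print(f"embedding share: {embedding / estimate:.1%}")
```

Under these assumptions the embedding matrix alone accounts for about 9.7 M parameters, so nearly all of the model's capacity sits in the vocabulary table rather than in the 4 Transformer blocks.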

Why these choices

  • RoPE keeps positional handling compact and avoids learned absolute embeddings.
  • RMSNorm is cheaper and more stable than LayerNorm for this small decoder-only model.
  • SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
  • GQA reduces KV cost while keeping multi-head query capacity.
  • Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
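
To make the GQA point concrete, the following sketch shows how query heads map onto shared key/value heads for this configuration (4 query heads, 1 KV head, d_model=64). The helper name `kv_head_for` is hypothetical and is not taken from the model code.

```python
# Grouped-query attention: several query heads share one key/value head.
# Values come from Model Details; the helper name is hypothetical.
N_HEADS, N_KV_HEADS, D_MODEL = 4, 1, 64
head_dim = D_MODEL // N_HEADS


def kv_head_for(query_head: int) -> int:
    """Index of the K/V head that a given query head attends with."""
    group_size = N_HEADS // N_KV_HEADS
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(N_HEADS)]

# K and V values produced per token per layer, GQA vs. full multi-head attention.
kv_floats_gqa = 2 * N_KV_HEADS * head_dim
kv_floats_mha = 2 * N_HEADS * head_dim

print(mapping)
print(kv_floats_gqa, kv_floats_mha)
```

With a single KV head, every query head reads the same keys and values, and the K/V width per token is a quarter of what full multi-head attention would need.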

Training Summary

| Metric | Value |
|---|---|
| Rows used | 10,000 |
| Approx. packed tokens (after padding) | 461,150,000+ (50 epochs × 75,000 steps × 1,028 tokens/step ≈ 462.1M total trained tokens) |
| Epochs | 50 |
| Batch size | 6 |
| Peak LR | 3e-4 |
| Weight decay | 0.1 |
| Warmup steps | 50 |
| Gradient clipping | 1.0 |
| Wall-clock time | 4h 58m 00s on 2× T4 (Kaggle) |
| Final training loss | 2.2055 |
| Final training perplexity | 9.08 |
| Final validation loss | 2.3608 |
| Final validation perplexity | 10.60 |
| Best validation loss | 2.3585 |
| Best epoch | 47 |

history.json contains the full step-level and epoch-level training/validation curves.
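
The perplexity figures in the training summary are consistent with the usual definition, the exponential of the mean per-token cross-entropy loss in nats. A minimal check, assuming that definition is the one the notebook uses:

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(cross_entropy_loss)


train_ppl = perplexity(2.2055)   # final training loss from the summary
val_ppl = perplexity(2.3608)     # final validation loss from the summary
print(f"train ppl ~ {train_ppl:.2f}, val ppl ~ {val_ppl:.2f}")
```

The results match the reported 9.08 and 10.60 to within the rounding of the four-decimal loss values.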

Files in this repo

| File | Description |
|---|---|
| Cosmos-T2-Accelerate-Preview.pt | Final-epoch checkpoint (epoch 50). |
| Cosmos-T2-Accelerate-Preview.best.pt | Best-validation checkpoint (epoch 47). Recommended. |
| model_config.json | Full architecture + training config. |
| history.json | Step-level + epoch-level loss/ppl curves and final metrics. |
| README.md | This file. |

Both .pt files are PyTorch dicts with the following layout:

{
    "model_state":   state_dict,       # nn.Module state dict
    "config":        {...},            # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history":       {...},            # training curves
    "best_epoch":    47,
    "best_val_loss": 2.3584773325920105,
}
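
Before building the model, it can help to confirm that a loaded checkpoint has the documented keys. The sketch below is a hypothetical helper (`check_checkpoint` is not part of the repo) and uses a stand-in dict instead of a real `torch.load` result so that it runs without PyTorch.

```python
# Keys documented in the checkpoint layout; the helper itself is hypothetical.
EXPECTED_KEYS = {
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
}


def check_checkpoint(ckpt: dict) -> list:
    """Return the documented keys that are missing from a checkpoint dict."""
    return sorted(EXPECTED_KEYS - set(ckpt))


# Stand-in dict for illustration; in practice this would come from torch.load(...).
fake_ckpt = {
    "model_state": {},
    "config": {"d_model": 64},
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
missing = check_checkpoint(fake_ckpt)
print(missing)  # the stand-in deliberately omits "history"
```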

How to Use

Quick start

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM   # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO   = "wop/Cosmos-T2-Accelerate-Preview"
CKPT   = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg  = ckpt["config"]
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
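# strict=False skips missing or unexpected keys without raising; inspect the
# returned missing_keys / unexpected_keys if the outputs look wrong.
model.load_state_dict(ckpt["model_state"], strict=False)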
model.to(DEVICE).eval()

prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user",   "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
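
The quick start passes temperature=0.1 and top_k=40 to generate(). The bundled implementation is not reproduced here. The sketch below shows one common way these two parameters are applied to a vector of logits, in plain Python, with hypothetical helper names.

```python
import math
import random


def top_k_probs(logits, temperature=0.1, top_k=40):
    """Softmax over the top_k largest logits after dividing by temperature."""
    k = min(top_k, len(logits))
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in keep]
    m = max(scaled)                        # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [0.0] * len(logits)            # tokens outside the top k get zero mass
    for i, w in zip(keep, weights):
        probs[i] = w / total
    return probs


def sample(logits, temperature=0.1, top_k=40, rng=random):
    """Draw one token index from the filtered, temperature-scaled distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
low_t = top_k_probs(logits, temperature=0.1, top_k=2)
high_t = top_k_probs(logits, temperature=1.0, top_k=2)
print(low_t)
print(high_t)
```

At temperature 0.1 almost all probability mass lands on the highest logit, so sampling behaves close to greedy decoding. Raising the temperature spreads mass across the surviving top-k tokens.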

System prompt

The notebook uses a single fixed system prompt during training:

Enable thinking features: INTUITION

Using a different system prompt at inference time tends to degrade quality.
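
One way to avoid drifting from the training-time prompt is to build the message list through a small helper that always pins it. The helper name `build_messages` is hypothetical.

```python
# System prompt quoted from this model card; the helper name is hypothetical.
SYSTEM_PROMPT = "Enable thinking features: INTUITION"


def build_messages(user_text: str) -> list:
    """Prepend the fixed training-time system prompt to a single user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("What is 2 + 2?")
print(messages)
```

The resulting list can be passed to tokenizer.apply_chat_template with tokenize=False and add_generation_prompt=True, as in the quick start.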

Known limitations

  • Size. ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
  • Template lock-in. The model produces <think>...</think> Answer: N for nearly every prompt, regardless of whether the task is math.
  • No KV cache. The bundled generate() recomputes the full context each step, which is fine for a tiny model and short contexts but slow for long ones.
  • RoPE flavour. This checkpoint was trained with NeoX-style interleaved RoPE (cos/sin built with repeat_interleave(2, dim=-1)), not Llama-style concatenated RoPE. The reference app.py in the demo space uses the matching layout. If you port the code elsewhere, make sure build_rope and rotate_half are paired correctly.
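
The difference between the two RoPE layouts mentioned in the last item can be illustrated in plain Python. This is a sketch of the general technique, not the demo's build_rope or rotate_half. The function names are hypothetical, and head_dim=16 is assumed from d_model=64 with 4 heads.

```python
import math


def rope_angles(pos, head_dim, base=10000.0):
    """One angle per dimension pair: theta_i = pos * base^(-2i / head_dim)."""
    return [pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... by their angle."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def apply_rope_half_split(x, pos, base=10000.0):
    """Rotate pairs (x[i], x[i + d/2]) instead: the concatenated layout."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [float(i + 1) for i in range(16)]       # head_dim = 16 (assumed: 64 / 4)
k = [0.5 * (16 - i) for i in range(16)]

# Attention scores depend only on the relative offset (2 in both cases).
score_a = dot(apply_rope_interleaved(q, 5), apply_rope_interleaved(k, 3))
score_b = dot(apply_rope_interleaved(q, 12), apply_rope_interleaved(k, 10))
print(abs(score_a - score_b) < 1e-9)

# The two layouts pair different dimensions, so they are not interchangeable.
print(apply_rope_interleaved(q, 5) == apply_rope_half_split(q, 5))
```

Both layouts are valid rotary encodings on their own, but weights trained with one will see scrambled positions if the other is applied at inference time, which is why the cos/sin construction and the rotation helper have to match.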

Citation / Acknowledgements
