Cosmos T2-Accelerate-beta

Universal Kaggle-ready training notebook for the Cosmos T2-Accelerate-beta series.

Notebook-generated card. Final metrics are filled after the Kaggle training run. This notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant.

Model Details

Model class            CosmosT2_Accelerate_LLM
Architecture           Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path
Parameters             ~5.03 M
Layers                 6
Attention heads        2
KV heads               1
d_model                32
FFN hidden             256
Positional encoding    RoPE (rope_base=10000)
Normalization          RMSNorm
MLP                    SwiGLU
Memory                 Engram (use_engram=True, every 2 blocks)
Context length         1028
Training block size    1028
Tokenizer              Qwen/Qwen2.5-0.5B
Dataset                wop/XXXXXL-chain-of-thought
License                Apache-2.0
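
As a rough sanity check on the parameter count, the figures above can be plugged into a back-of-the-envelope estimate. This is only a sketch under stated assumptions, not the notebook's own accounting: it assumes an embedding table of 151,936 rows (the vocabulary size used by Qwen2.5 checkpoints; the notebook may size it differently), tied input/output embeddings, bias-free linear layers, a head dimension of d_model / heads, and it ignores the Engram path entirely. The function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab_size, d_model, n_layers, n_heads, n_kv_heads,
                    ffn_hidden, tied_embeddings=True):
    """Rough parameter count for a GQA + SwiGLU decoder (Engram not counted)."""
    head_dim = d_model // n_heads                   # assumed even split across heads
    attn = d_model * d_model                        # query projection
    attn += 2 * d_model * n_kv_heads * head_dim     # key and value projections (GQA)
    attn += d_model * d_model                       # output projection
    mlp = 3 * d_model * ffn_hidden                  # SwiGLU: gate, up, and down matrices
    norms = 2 * d_model                             # two RMSNorm weight vectors per block
    per_layer = attn + mlp + norms
    embed = vocab_size * d_model
    head = 0 if tied_embeddings else embed          # untied output head doubles the table
    return embed + head + n_layers * per_layer + d_model  # + final RMSNorm


# vocab_size=151_936 is an assumption, not a value stated in the card.
total = estimate_params(151_936, d_model=32, n_layers=6, n_heads=2,
                        n_kv_heads=1, ffn_hidden=256)
print(f"{total / 1e6:.2f} M parameters")
```

Under these assumptions the estimate is dominated by the embedding table (about 4.86 M of the total), with the six Transformer blocks contributing well under 0.2 M.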

Why these choices

  • RoPE keeps positional handling compact and avoids learned absolute embeddings.
  • RMSNorm drops LayerNorm's mean-centering and bias terms, so it is cheaper per token and is generally stable enough for a small decoder-only model like this one.
  • SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
  • GQA reduces KV cost while keeping multi-head query capacity.
  • Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
  • Dynamic isolated batching keeps conversations separate while padding and masking each batch on CPU.
  • KV-cache generation avoids recomputing the full prompt for every generated token in the app.
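
To make the GQA and KV-cache points concrete, the cache size can be worked out from the configuration above. The sketch below assumes a head dimension of d_model / heads = 16, fp16 storage (2 bytes per value), and a single sequence at the full 1028-token context; none of these are confirmed by the card, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


head_dim = 32 // 2  # assumed: d_model split evenly across the 2 attention heads

gqa = kv_cache_bytes(n_layers=6, n_kv_heads=1, head_dim=head_dim, seq_len=1028)
mha = kv_cache_bytes(n_layers=6, n_kv_heads=2, head_dim=head_dim, seq_len=1028)
print(f"GQA (1 KV head):  {gqa} bytes")
print(f"MHA (2 KV heads): {mha} bytes")
```

With one KV head shared by two query heads, the cache comes to 394,752 bytes instead of 789,504, so it is halved relative to full multi-head attention under the same assumptions.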

Training Summary

Metric                         Value
Rows used                      100,000
Loss tokens seen               22,307,381
Epochs                         2
Batch size                     2
Peak LR                        3.00e-04
Weight decay                   0.1
Gradient clipping              1.0
Wall-clock time                1h 16m 27s
Final training loss            3.6952
Final training perplexity      40.25
Final validation loss          6.3866
Final validation perplexity    593.83
Best validation loss           4.0744
Best epoch                     2
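
The perplexity rows are the exponential of the corresponding cross-entropy losses (measured in nats). A minimal check, using only the rounded values reported above:

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) when the loss is a mean cross-entropy in nats."""
    return math.exp(cross_entropy_loss)


train_ppl = perplexity(3.6952)  # final training loss from the table
val_ppl = perplexity(6.3866)    # final validation loss from the table
print(f"train perplexity: {train_ppl:.2f}")
print(f"validation perplexity: {val_ppl:.2f}")
```

Because the losses in the table are already rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.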

Loss and perplexity

The notebook shows live loss and perplexity plots every 20 epochs and does not save the graph to disk.

How to Use

Quick start

import torch
from transformers import AutoTokenizer

from app import CosmosT2_Accelerate_LLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# "$CHECKPOINT_NAME" is a placeholder: substitute the path to the checkpoint file.
ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu")
model = CosmosT2_Accelerate_LLM(**ckpt["config"])
model.load_state_dict(ckpt["model_state"])
model.eval()

prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
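ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
# Inference only, so skip gradient tracking.
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
# Keep special tokens such as <|im_end|> visible in the decoded text.
print(tokenizer.decode(out[0], skip_special_tokens=False))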
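
For readers unfamiliar with the `temperature` and `top_k` arguments, the sketch below shows how temperature scaling and top-k filtering are conventionally combined when sampling a single token. It is a plain-Python illustration of the standard technique, not the code inside `model.generate`, whose implementation is not shown in this card; the function name `sample_top_k` is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from a list of raw logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the ids of the top_k highest-scoring tokens.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights, subtracting the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]  # made-up scores for a 4-token vocabulary
print(sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

With `top_k=2` only token ids 1 and 3 can ever be drawn from the toy logits, and with `top_k=1` the function reduces to greedy decoding.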

Prompt format

Use the Qwen2.5 chat template. The default system prompt is:

Enable thinking features: INTUITION

The model will then emit a <think> block followed by an answer when it has enough signal.

The model is trained to end its turn with the <|im_end|> token (ChatML), so generation stops there. During data prep, any example longer than the 1028-token context either has its <think> reasoning replaced by a short placeholder or is dropped. As a result every training sequence ends cleanly, and the model is never trained on a mid-thought truncation.
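
The over-length rule can be sketched as follows. This is an illustration under assumptions rather than the notebook's actual data-prep code: the placeholder text, the function name `fit_example`, and the `count_tokens` callback are all hypothetical, and the sketch assumes the placeholder is tried first and the example is dropped only if it still does not fit.

```python
import re

MAX_TOKENS = 1028  # context length from the card
PLACEHOLDER = "<think>\n(reasoning omitted)\n</think>"  # hypothetical placeholder text
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def fit_example(text, count_tokens, max_tokens=MAX_TOKENS):
    """Return a version of text that fits the context, or None to drop it."""
    if count_tokens(text) <= max_tokens:
        return text  # already fits, keep the full reasoning
    # Swap the first <think>...</think> block for the short placeholder.
    shortened = THINK_RE.sub(lambda _m: PLACEHOLDER, text, count=1)
    if count_tokens(shortened) <= max_tokens:
        return shortened
    return None  # still too long: drop instead of truncating mid-thought


# Toy token counter for demonstration; real code would use the tokenizer.
count_words = lambda s: len(s.split())
long_example = "<think> " + "step " * 50 + "</think> 84 <|im_end|>"
print(fit_example(long_example, count_words, max_tokens=10))
```

In practice `count_tokens` would wrap the Qwen2.5 tokenizer, for example by returning the length of the encoded id list.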

Limitations

  • The model is intentionally small and is still a research/demo artifact.
  • Training on chain-of-thought data can overfit quickly if the corpus is tiny.
  • Long-context behavior is limited by the configured block size.
  • The model is not safety-aligned and should not be exposed as a public assistant without additional work.

Intended Use

  • Research into small-scale pretraining and reasoning-style formatting
  • Educational demos for decoder-only Transformer training
  • Hugging Face Spaces or local inference demos
  • Not for production use

Cosmos T2-Accelerate-beta Series

This notebook is designed to train future Cosmos T2-Accelerate-beta variants by changing only the config block at the top.

Citation

@misc{cosmos-t2,
  author    = {wop},
  title     = {Cosmos-T2: A small from-scratch chain-of-thought Transformer},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/wop/Cosmos-T2-Accelerate-beta}
}

Acknowledgements

  • Tokenizer from Qwen2.5 by Alibaba Cloud
  • Training data from wop/XXXXXL-chain-of-thought
  • Trained on Kaggle T4 GPUs

Evaluation results (self-reported, on wop/XXXXXL-chain-of-thought)

  • Final training loss (cross-entropy): 3.695
  • Final training perplexity: 40.250
  • Final validation loss (cross-entropy): 6.387
  • Final validation perplexity: 593.830