Alcyone v0

A ~4M-parameter, GPT-2-style language model, pretrained from scratch on TinyStories as an educational exercise.

Named after the brightest star in the Pleiades cluster — part of a star-named model family.

Why this exists

Most "trained a model" portfolio entries are fine-tunes of large pretrained checkpoints. This is different: every weight in this model started as a random number. The goal was to walk through the full pretraining loop end-to-end — custom BPE tokenizer, randomly initialized GPT-2 architecture, causal LM objective, training loop, evaluation — on a free Google Colab T4 GPU.

The output isn't competitive with anything. The understanding is the deliverable.

Model details

Architecture: GPT-2 (decoder-only Transformer)
Parameters: ~4.22 M
Layers: 4
Embedding dim: 256
Attention heads: 4
Context length: 128 tokens
Vocab size: 4,000
Tokenizer: Byte-level BPE, trained from scratch on TinyStories
Initialization: Random (no pretrained weights)
Training objective: Causal language modeling (next-token prediction)
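
The ~4.22 M figure can be reproduced by hand from the dimensions above. The sketch below is only arithmetic, and it assumes the standard GPT-2 layout: learned position embeddings, a 4x MLP expansion, biases on every linear layer, and an output head tied to the token embedding (so the head adds no parameters). The function name gpt2_param_count is made up for this illustration.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer, mlp_ratio=4):
    """Count parameters of a GPT-2 style decoder, assuming tied input/output embeddings."""
    d = n_embd
    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * d + n_positions * d
    # Attention: fused QKV projection (d -> 3d) and output projection (d -> d), with biases.
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: up-projection (d -> 4d) and down-projection (4d -> d), with biases.
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    # Two LayerNorms per block, each with a scale and a bias vector.
    layer_norms = 2 * (2 * d)
    per_layer = attn + mlp + layer_norms
    # One final LayerNorm after the last block.
    final_ln = 2 * d
    return embeddings + n_layer * per_layer + final_ln


total = gpt2_param_count(vocab_size=4000, n_positions=128, n_embd=256, n_layer=4)
print(f"{total:,} parameters (~{total / 1e6:.2f} M)")
```

The number of attention heads does not appear in the count: heads split the same 256-dimensional projections and add no weights of their own. Roughly a quarter of the total sits in the 4,000 x 256 token embedding table.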

Training data

roneneldan/TinyStories — short children's stories generated by GPT-3.5/GPT-4, designed specifically so small language models can learn coherent English. A 5,000-story subset was used for this v0 run.
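
The tokenizer listed under Model details was trained on this same corpus. As a rough illustration of what byte-level BPE training does, here is a toy, pure-Python sketch. It is not the code used for this model and not any library's implementation: the names train_bpe, merge_pair and encode are hypothetical, the two-sentence corpus is a stand-in, and it skips the whitespace pre-tokenization that real byte-level BPE tokenizers usually apply.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    # Every symbol starts as a single byte, so the base vocabulary has 256 entries.
    corpus = [[bytes([b]) for b in text.encode("utf-8")] for text in texts]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Merge the most frequent pair everywhere and record it.
        best, _count = pairs.most_common(1)[0]
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    return merges


def encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


# Stand-in corpus; the real run used TinyStories text.
stories = ["Once upon a time", "Once upon a hill"]
merges = train_bpe(stories, num_merges=5)
tokens = encode("Once upon a time", merges)
print(len("Once upon a time".encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Under this construction the vocabulary size is 256 byte symbols plus one entry per merge plus any special tokens, so a 4,000-token vocabulary corresponds to a few thousand learned merges.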

Training setup

Hardware: Google Colab Free Tier (NVIDIA Tesla T4, 15.6 GB VRAM)
Precision: fp16
Optimizer: AdamW (Hugging Face Trainer default)
Learning rate: 3e-4, cosine schedule, 30 warmup steps
Weight decay: 0.01
Batch size: 32
Steps: 300 (capped via max_steps)
Wall-clock time: ~14 seconds
Final train loss: 5.08 (down from ~9 at initialization; vocab size 4,000)
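
To put these numbers in context, the sketch below works through three pieces of arithmetic: the learning rate at a few steps, the loss of a model that guesses uniformly over the vocabulary, and an upper bound on tokens seen. It assumes the schedule is linear warmup followed by a single half-cosine decay to zero, that the batch size of 32 is the effective batch (no gradient accumulation), and that every sequence fills the 128-token context. None of these is confirmed above, and lr_at is a hypothetical helper, not a library function.

```python
import math


def lr_at(step, base_lr=3e-4, warmup_steps=30, total_steps=300):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 15, 30, 165, 300):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")

# Cross-entropy of a uniform guess over the vocabulary: ln(4000), about 8.29 nats.
uniform_loss = math.log(4000)
# Perplexity implied by the reported final train loss.
final_ppl = math.exp(5.08)
print(f"uniform-guess loss: {uniform_loss:.2f}, final perplexity: {final_ppl:.0f}")

# Upper bound on tokens processed, under the assumptions stated above.
tokens_per_step = 32 * 128
total_tokens = 300 * tokens_per_step
print(f"at most {total_tokens:,} tokens over 300 steps")
```

A final loss of 5.08 corresponds to a perplexity of roughly 161 per token, well below the uniform baseline of 4,000 but still high, which is consistent with the limitations described in the next section.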

Intended use & limitations

This is a base language model, not an instruction-tuned chat model. Given a short English prompt, it will continue the text in TinyStories style (children's stories with characters like "Lily", "Tom", "Ben", simple plots).

It will NOT:

  • follow instructions
  • answer questions reliably
  • produce text outside the children's-story domain
  • maintain long-range coherence (context window is only 128 tokens)

This is v0: explicitly an early, undercooked checkpoint. Output is occasionally repetitive and loses the thread across sentences. That's expected at this scale.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the checkpoint and its custom BPE tokenizer from the Hub.
model     = AutoModelForCausalLM.from_pretrained("laskar-ks/alcyone-v0")
tokenizer = AutoTokenizer.from_pretrained("laskar-ks/alcyone-v0")

# Sample a continuation of a TinyStories-style prompt.
gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(gen("Once upon a time, there was a little",
          max_new_tokens=80,
          do_sample=True,
          temperature=0.8)[0]["generated_text"])
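
Because the context length is 128 tokens, the prompt and the generated tokens have to share that budget. With max_new_tokens=80 as above, that leaves 48 tokens for the prompt. A minimal sketch of trimming an over-long prompt follows. The helper fit_prompt is hypothetical, it works on a plain list of token ids, and keeping the most recent tokens is just one reasonable choice.

```python
def fit_prompt(token_ids, max_new_tokens, context_len=128):
    """Keep only as many trailing prompt tokens as fit alongside the generated ones."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    # Keep the end of the prompt, since the continuation depends most on recent text.
    return token_ids[-budget:]


long_prompt = list(range(100))   # stand-in for 100 token ids
trimmed = fit_prompt(long_prompt, max_new_tokens=80)
print(len(trimmed))              # 128 - 80 = 48 tokens remain
```

Prompts already within the budget pass through unchanged.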

Roadmap

  • v0 — this checkpoint. Proof that the full pretraining loop works end-to-end.
  • v1 — same architecture, longer training (≥5,000 steps on the full TinyStories train split). Expected: noticeably more coherent stories.
  • v0-id — Bahasa Indonesia variant. Custom BPE tokenizer on an Indonesian corpus, same architecture.

About the name

Alcyone (η Tauri) is the brightest star in the Pleiades open star cluster. The model belongs to a star-named family of projects that also includes Parallax, Altair, and the Pleiades agents, among others.

Author

Trained by Laskar as part of an AI engineering portfolio exploring agentic systems, multi-agent architectures, and foundational ML.
