Alcyone v0

A ~4M parameter GPT-2 style language model, pretrained from scratch on TinyStories as an educational exercise.

Named after the brightest star in the Pleiades cluster — part of a star-named model family.

Why this exists

Most "trained a model" portfolio entries are fine-tunes of large pretrained checkpoints. This is different: every weight in this model started as a random number. The goal was to walk through the full pretraining loop end-to-end — custom BPE tokenizer, randomly initialized GPT-2 architecture, causal LM objective, training loop, evaluation — on a free Google Colab T4 GPU.

The output isn't competitive with anything. The understanding is the deliverable.

Model details


Architecture	GPT-2 (decoder-only Transformer)
Parameters	~4.22 M
Layers	4
Embedding dim	256
Attention heads	4
Context length	128 tokens
Vocab size	4,000
Tokenizer	Byte-level BPE, trained from scratch on TinyStories
Initialization	Random (no pretrained weights)
Training objective	Causal language modeling (next-token prediction)

Training data

roneneldan/TinyStories — short children's stories generated by GPT-3.5/GPT-4, designed specifically so small language models can learn coherent English. A 5,000-story subset was used for this v0 run.

Training setup


Hardware	Google Colab Free Tier (NVIDIA Tesla T4, 15.6 GB VRAM)
Precision	fp16
Optimizer	AdamW (Hugging Face `Trainer` default)
Learning rate	3e-4, cosine schedule, 30 warmup steps
Weight decay	0.01
Batch size	32
Steps	300 (capped via `max_steps`)
Wall-clock time	~14 seconds
Final train loss	5.08 (from initial ~9, vocab=4000)

Intended use & limitations

This is a base language model, not an instruction-tuned chat model. Given a short English prompt, it will continue the text in TinyStories style (children's stories with characters like "Lily", "Tom", "Ben", simple plots).

It will NOT:

follow instructions
answer questions reliably
produce text outside the children's-story domain
maintain long-range coherence (context window is only 128 tokens)

This is v0 — explicitly an early, undercooked checkpoint. Output is occasionally repetitive and loses thread across sentences. That's expected at this scale.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model     = AutoModelForCausalLM.from_pretrained("laskar-ks/alcyone-v0")
tokenizer = AutoTokenizer.from_pretrained("laskar-ks/alcyone-v0")

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(gen("Once upon a time, there was a little",
          max_new_tokens=80,
          do_sample=True,
          temperature=0.8)[0]["generated_text"])

Roadmap

v0 — this checkpoint. Proof that the full pretraining loop works end-to-end.
v1 — same architecture, longer training (≥5,000 steps on the full TinyStories train split). Expected: noticeably more coherent stories.
v0-id — Bahasa Indonesia variant. Custom BPE tokenizer on an Indonesian corpus, same architecture.

About the name

Alcyone (η Tauri) is the brightest star in the Pleiades open star cluster. Part of a star-named model family alongside other projects (Parallax, Altair, Pleiades agents, etc.).

Author

Trained by Laskar as part of an AI engineering portfolio exploring agentic systems, multi-agent architectures, and foundational ML.

Downloads last month: 34

Safetensors

Model size

4.22M params

Tensor type

F32

laskar-ks
/

alcyone-v0