🧠 AlterEgo-373M

A 373-million-parameter language model designed, trained, and served entirely from scratch.



Introduction

AlterEgo is a small, decoder-only language model built from the ground up - not a fine-tune of an existing model. Every part was written from zero: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pre-trained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat.

It is the model at the heart of LLME, a self-hosted, end-to-end-encrypted LLM platform (think LM Studio / Open WebUI / Ollama, also built from scratch). LLME can serve AlterEgo alongside llama.cpp GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around.

This repository contains the model. The training and architecture code lives in the AlterEgo repo; the serving platform lives in the LLME repo.

Two formats are published. This repo is the Hugging Face LlamaForCausalLM conversion, for drop-in use with transformers, vLLM, and GGUF tooling. The original checkpoint - in AlterEgo's own from-scratch architecture, exactly as trained - is published separately as jbomdev/alterego_raw. This version is a numerically-lossless conversion of it (verified: max logit difference ~1e-6).

What it is and isn't. AlterEgo is a research / learning artifact - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is not a production assistant and won't compete with billion-parameter models. See Limitations.

Architecture

AlterEgo is a modern Llama-style decoder, which is why it loads as a standard LlamaForCausalLM.

| Component | Choice |
| --- | --- |
| Type | Decoder-only transformer (autoregressive) |
| Parameters | ~373M (input/output embeddings tied) |
| Layers | 24 |
| Model dimension | 1024 |
| Attention | Grouped-Query Attention - 16 query heads / 4 KV heads (head dim 64) |
| Positional encoding | Rotary embeddings (RoPE), θ = 10,000 |
| Normalization | RMSNorm (pre-norm) |
| Feed-forward | SwiGLU, hidden dim 2816 |
| Context length | 2048 |
| Vocabulary | 100,352 |
| Tokenizer | cl100k_base (tiktoken) extended with ChatML special tokens |
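
The ~373M figure can be reproduced from the table with a little arithmetic. The sketch below assumes the standard Llama layout: no bias terms, separate gate/up/down projections in the SwiGLU block, one RMSNorm scale vector before attention and one before the feed-forward block, and a final norm. The variable names are illustrative and are not taken from the AlterEgo code.

```python
# Parameter count implied by the architecture table.
# Assumptions (not stated in the card): no biases, three SwiGLU projections,
# two RMSNorm scale vectors per layer, one final RMSNorm.
vocab, d_model, n_layers = 100_352, 1024, 24
n_q_heads, n_kv_heads, head_dim = 16, 4, 64
ffn_hidden = 2816

# Tied input/output embeddings are counted once.
embedding = vocab * d_model

# Attention: Q and O are full width; K and V only cover the 4 KV heads (GQA).
q_proj = d_model * n_q_heads * head_dim
kv_proj = 2 * d_model * n_kv_heads * head_dim
o_proj = n_q_heads * head_dim * d_model
attention = q_proj + kv_proj + o_proj

# SwiGLU feed-forward: gate, up, and down projections.
feed_forward = 3 * d_model * ffn_hidden

# Two RMSNorm scale vectors per layer.
norms = 2 * d_model

per_layer = attention + feed_forward + norms
total = embedding + n_layers * per_layer + d_model  # + final RMSNorm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")  # 373,343,232
```

Under these assumptions the tied embedding table accounts for roughly 103M of the total, and GQA shrinks the K/V projections to a quarter of their full multi-head size.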

Training

AlterEgo was trained in two stages on a single NVIDIA RTX 4090.

Stage 1 - Pretraining

Pre-trained on FineWeb-Edu (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl.

[Figure: Pretraining loss]

[Figure: Training dynamics]

The grad-norm settling to ~0.26 and the smooth cosine-shaped loss indicate stable training with no divergence.

Stage 2 - Supervised fine-tuning

Instruction-tuned on UltraChat-200K (HuggingFaceH4), formatted as multi-turn ChatML.

[Figure: SFT loss]

Hyperparameters

| Setting | Pretraining | SFT |
| --- | --- | --- |
| Dataset | FineWeb-Edu | UltraChat-200K |
| Tokens / steps | ~10B / 19,073 | ~64M / 244 |
| Global batch | 524,288 tokens (micro 2 × 2048 × 128 grad-accum) | same scheme |
| Optimizer | AdamW (β = 0.9, 0.95; ε = 1e-8; fused) | same |
| Weight decay | 0.1 (decoupled; excluded from norms/biases) | same |
| LR schedule | linear warmup (1,900 steps) → cosine decay | cosine |
| Peak / min LR | 3e-4 → 3e-5 | low (fine-tune range) |
| Grad clipping | global-norm 1.0 | 1.0 |
| Precision | bfloat16 autocast | bfloat16 |
| Throughput / wall-clock | ~32k tok/s · ~86 GPU-h (3.6 days) | ~39k tok/s · ~28 min |
| Other | torch.compile, gradient checkpointing, FlashAttention (SDPA) | same |
| Final loss (train / val) | 2.94 / 2.89 | 1.83 / 1.81 |
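
The pretraining column is enough to sketch the learning-rate schedule and check the token budget. The exact formula used in the AlterEgo training loop is not given here, so the following is the common warmup-plus-cosine form under stated assumptions: warmup is linear from near zero, and the cosine decay reaches the minimum LR at the final step. The function name `lr_at` is hypothetical.

```python
import math

# Values from the pretraining column of the hyperparameter table.
PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 1_900, 19_073
TOKENS_PER_STEP = 2 * 2048 * 128  # micro-batch x context length x grad-accum steps


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


total_tokens = TOKENS_PER_STEP * TOTAL_STEPS
print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"total tokens:    {total_tokens / 1e9:.2f}B")
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

The product of 19,073 steps and 524,288 tokens per step comes out just under 10B, which matches the token count in the table.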

Evaluation

Benchmarked with EleutherAI's lm-evaluation-harness (0-shot).

| Benchmark | Metric | AlterEgo-373M | Random |
| --- | --- | --- | --- |
| lambada_openai | acc | 31.6% | ~0% |
| hellaswag | acc_norm | 38.0% | 25% |
| arc_easy | acc_norm | 52.7% | 25% |
| arc_challenge | acc_norm | 27.3% | 25% |
| piqa | acc_norm | 65.7% | 50% |
| winogrande | acc | 51.3% | 50% |
| openbookqa | acc_norm | 32.2% | 25% |
| sciq | acc_norm | 72.2% | 25% |
| boolq | acc | 61.8% | 50% |

For a 373M model trained on ~10B tokens, these results are solid: clearly above chance on science and commonsense tasks (SciQ, PIQA, BoolQ, ARC-easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), with the expected near-chance results on the hardest reasoning sets (ARC-challenge, WinoGrande).
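
One way to read the table is as a margin over the random baseline. The short sketch below recomputes that margin in percentage points from the numbers above; treating the LAMBADA baseline as exactly 0 is an approximation of the "~0%" entry, and the dictionary layout is only illustrative.

```python
# (score %, random baseline %) copied from the evaluation table.
# The LAMBADA baseline of 0.0 approximates the table's "~0%".
results = {
    "lambada_openai": (31.6, 0.0),
    "hellaswag": (38.0, 25.0),
    "arc_easy": (52.7, 25.0),
    "arc_challenge": (27.3, 25.0),
    "piqa": (65.7, 50.0),
    "winogrande": (51.3, 50.0),
    "openbookqa": (32.2, 25.0),
    "sciq": (72.2, 25.0),
    "boolq": (61.8, 50.0),
}

# Margin over chance, in percentage points.
margins = {
    task: round(score - chance, 1) for task, (score, chance) in results.items()
}

# Print the tasks from largest to smallest margin.
for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<15} +{margin:.1f}")
```

ARC-challenge and WinoGrande land within a few points of chance, while SciQ has by far the largest margin.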

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and the Llama-format weights in bfloat16.
tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)

# The system prompt below is the model's default persona.
messages = [
    {"role": "system", "content":
     "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
     "You're not great with facts, math, or current events - when you don't know "
     "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]
# Render the conversation as ChatML and append the assistant header.
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

out = model.generate(
    ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

Recommended generation settings

These are the defaults AlterEgo was tuned and served with in LLME:

| Parameter | Value |
| --- | --- |
| temperature | 0.7 |
| top_k | 50 |
| top_p | 1.0 |
| repetition_penalty | 1.1 |
| max_new_tokens | 200 |

Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops on <|im_end|> or <|endoftext|>.
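
To make the effect of `temperature` and `top_k` concrete, here is a minimal pure-Python sketch of how the two settings reshape a next-token distribution. It is an illustration only, not the transformers or LLME implementation: the function name `sampling_distribution` and the toy logits are made up, `top_p=1.0` is treated as "no nucleus filtering", and the repetition penalty is left out.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=50):
    """Return next-token probabilities after top-k filtering and temperature scaling."""
    # Keep only the indices of the k highest logits.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Divide by the temperature: values below 1 sharpen the distribution.
    scaled = {i: logits[i] / temperature for i in kept}

    # Numerically stable softmax over the surviving tokens.
    peak = max(scaled.values())
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())

    # Tokens outside the top k get probability zero.
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical logits for a 4-token vocabulary
default = sampling_distribution(toy_logits, temperature=0.7, top_k=2)
cooler = sampling_distribution(toy_logits, temperature=0.3, top_k=2)
print([round(p, 3) for p in default])
print([round(p, 3) for p in cooler])
```

With the cooler temperature, more probability mass moves to the top-ranked token, which is why lower values give steadier output.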

Chat format

AlterEgo uses ChatML:

<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant

Run it locally (GGUF)

Pre-made GGUF files and quantizations are available on the GGUFs and quants page, or you can run the model with Ollama.

Because the model uses the standard Llama format, you can also convert it to GGUF yourself for Ollama / LM Studio / llama.cpp:

python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16

Limitations

AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one:

  • Capability - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions."
  • Language - English only.
  • Safety - it is not safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards.
  • Bias - it inherits biases from FineWeb-Edu (web text) and UltraChat.

License

Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K.

Citation

@misc{alterego2026,
  title  = {AlterEgo: A 373M language model trained from scratch},
  author = {J-bom},
  year   = {2026},
  url    = {https://github.com/J-bom/AlterEgo}
}

Credits - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author.
