🧠 AlterEgo-373M
A 373-million-parameter language model designed, trained, and served entirely from scratch.
Introduction
AlterEgo is a small, decoder-only language model built from the ground up - not a fine-tune of an existing model. Every part was written from zero: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pre-trained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat.
It is the model at the heart of LLME, a self-hosted, end-to-end encrypted LLM platform (comparable to LM Studio, Open WebUI, or Ollama) that was also built from scratch. LLME can serve AlterEgo alongside llama.cpp GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around.
This repository contains the model. The training and architecture code lives in the AlterEgo repo; the serving platform lives in the LLME repo.
Two formats are published. This repo is the Hugging Face LlamaForCausalLM conversion, for drop-in use with transformers, vLLM, and GGUF tooling. The original checkpoint, in AlterEgo's own from-scratch architecture and exactly as trained, is published separately as jbomdev/alterego_raw. This version is a numerically lossless conversion of it (verified: max logit difference ~1e-6).
What it is and isn't. AlterEgo is a research / learning artifact - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is not a production assistant and won't compete with billion-parameter models. See Limitations.
Architecture
AlterEgo is a modern Llama-style decoder, which is why the converted checkpoint loads as a standard LlamaForCausalLM.
| Component | Choice |
|---|---|
| Type | Decoder-only transformer (autoregressive) |
| Parameters | ~373M (input/output embeddings tied) |
| Layers | 24 |
| Model dimension | 1024 |
| Attention | Grouped-Query Attention - 16 query heads / 4 KV heads (head dim 64) |
| Positional encoding | Rotary embeddings (RoPE), θ = 10,000 |
| Normalization | RMSNorm (pre-norm) |
| Feed-forward | SwiGLU, hidden dim 2816 |
| Context length | 2048 |
| Vocabulary | 100,352 |
| Tokenizer | cl100k_base (tiktoken) extended with ChatML special tokens |
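The ~373M total can be recovered from the table. The sketch below is a back-of-the-envelope count that assumes the standard LlamaForCausalLM layout (no bias terms, one RMSNorm weight vector before attention and one before the MLP in each block, plus a final norm, with tied embeddings counted once). It also estimates the per-sequence KV-cache size to show what grouped-query attention saves. `llama_param_count` and `kv_cache_bytes` are hypothetical helpers, not part of the AlterEgo code.

```python
def llama_param_count(vocab=100_352, d_model=1024, n_layers=24, n_heads=16,
                      n_kv_heads=4, ffn_dim=2816, tied_embeddings=True):
    """Count weights in a bias-free Llama-style decoder (assumed layout)."""
    head_dim = d_model // n_heads               # 1024 / 16 = 64
    kv_dim = n_kv_heads * head_dim              # 4 KV heads * 64 = 256
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # q, o projections + k, v projections
    mlp = 3 * d_model * ffn_dim                 # SwiGLU: gate, up, and down projections
    norms = 2 * d_model                         # two RMSNorm weight vectors per block
    embeddings = vocab * d_model * (1 if tied_embeddings else 2)
    return n_layers * (attn + mlp + norms) + embeddings + d_model  # + final RMSNorm


def kv_cache_bytes(seq_len=2048, n_layers=24, n_kv_heads=4, head_dim=64, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (bf16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


print(f"{llama_param_count():,}")               # 373,343,232
print(kv_cache_bytes() / 2**20, "MiB")          # 48.0 MiB with 4 KV heads
print(kv_cache_bytes(n_kv_heads=16) / 2**20, "MiB")  # 192.0 MiB if every head kept its own K/V
```

Under these assumptions the count lands on roughly 373.3M, and about 103M of that is the embedding matrix, which is why tying input and output embeddings matters at this scale. Sharing 4 KV heads across 16 query heads shrinks the cache by 4× compared with full multi-head attention.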
Training
AlterEgo was trained in two stages on a single NVIDIA RTX 4090.
Stage 1 - Pretraining
Pre-trained on FineWeb-Edu (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl.
The gradient norm settling to ~0.26 and the smooth, cosine-shaped loss curve indicate stable training with no divergence.
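The grad norm referred to here is the global L2 norm over all parameter gradients, the same quantity that global-norm clipping at 1.0 (listed in the hyperparameter table) acts on. The pure-Python sketch below mirrors the semantics of PyTorch's `torch.nn.utils.clip_grad_norm_` (scale factor `max_norm / (total_norm + 1e-6)`, capped at 1.0); representing gradients as plain lists and the name `clip_grad_norm` are illustrative assumptions.

```python
import math


def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradients in place so their global L2 norm is at most max_norm.

    grads: list of flat lists of floats, one per parameter tensor.
    Returns the norm measured *before* clipping (what training logs report).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Only shrink, never grow: the coefficient is capped at 1.0.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for tensor in grads:
        for i in range(len(tensor)):
            tensor[i] *= clip_coef
    return total_norm


big = [[3.0], [4.0]]        # global norm 5.0 -> rescaled to norm ~1.0
print(clip_grad_norm(big), big)
small = [[0.2, 0.1], [0.1]]  # global norm ~0.24 -> left unchanged
print(clip_grad_norm(small), small)
```

With a clip threshold of 1.0, a steady norm around 0.26 means the coefficient stays at 1.0, so clipping acts only as a safety net against occasional spikes.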
Stage 2 - Supervised fine-tuning
Instruction-tuned on UltraChat-200K (HuggingFaceH4), formatted as multi-turn ChatML.
Hyperparameters
| Setting | Pretraining | SFT |
|---|---|---|
| Dataset | FineWeb-Edu | UltraChat-200K |
| Tokens / steps | ~10B / 19,073 | ~64M / 244 |
| Global batch | 524,288 tokens (micro 2 × 2048 × 128 grad-accum) | same scheme |
| Optimizer | AdamW (β = 0.9, 0.95; ε = 1e-8; fused) | same |
| Weight decay | 0.1 (decoupled; excluded from norms/biases) | same |
| LR schedule | linear warmup (1,900 steps) → cosine decay | cosine |
| Peak / min LR | 3e-4 → 3e-5 | low (fine-tune range) |
| Grad clipping | global-norm 1.0 | 1.0 |
| Precision | bfloat16 autocast | bfloat16 |
| Throughput / wall-clock | ~32k tok/s · ~86 GPU-h (3.6 days) | ~39k tok/s · ~28 min |
| Other | torch.compile, gradient checkpointing, FlashAttention (SDPA) | same |
| Final loss (train / val) | 2.94 / 2.89 | 1.83 / 1.81 |
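The pretraining schedule and token budget in the table can be reproduced with a few lines of arithmetic. This is a sketch, not the AlterEgo training code: it assumes a common warmup convention (`max_lr * (step + 1) / warmup_steps`) and that the cosine decay runs from the end of warmup to the last of the 19,073 steps; `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1900, total_steps=19_073):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed convention)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (total_steps - warmup_steps)  # 0 -> 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))              # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


# Global batch = micro-batch x sequence length x gradient-accumulation steps.
micro_batch, seq_len, grad_accum = 2, 2048, 128
tokens_per_step = micro_batch * seq_len * grad_accum   # 524,288 tokens
total_tokens = tokens_per_step * 19_073                # 9,999,745,024, i.e. ~10B
hours = total_tokens / 32_000 / 3600                   # ~86.8 h at ~32k tok/s

for s in (0, 950, 1900, 10_000, 19_073):
    print(s, f"{lr_at(s):.2e}")
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens, {hours:.1f} h")
```

The step count is simply the ~10B-token budget divided by the 524,288-token global batch, and the resulting wall-clock estimate agrees with the ~86 GPU-hours reported above.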
Evaluation
Benchmarked with EleutherAI's lm-evaluation-harness (0-shot).
| Benchmark | Metric | AlterEgo-373M | Random |
|---|---|---|---|
| lambada_openai | acc | 31.6% | ~0% |
| hellaswag | acc_norm | 38.0% | 25% |
| arc_easy | acc_norm | 52.7% | 25% |
| arc_challenge | acc_norm | 27.3% | 25% |
| piqa | acc_norm | 65.7% | 50% |
| winogrande | acc | 51.3% | 50% |
| openbookqa | acc_norm | 32.2% | 25% |
| sciq | acc_norm | 72.2% | 25% |
| boolq | acc | 61.8% | 50% |
For a 373M model trained on ~10B tokens, these results are solid: clearly above chance on science and commonsense benchmarks (SciQ, PIQA, BoolQ, ARC-Easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), and, as expected at this scale, near chance on the hardest reasoning sets (ARC-Challenge, WinoGrande).
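To make "above chance" concrete, the snippet below subtracts each benchmark's random baseline from AlterEgo's score (both taken from the evaluation table) and converts perplexities back to mean negative log-likelihood via perplexity = exp(mean NLL). It assumes the reported losses are cross-entropy in nats, and that lm-evaluation-harness's LAMBADA perplexity is computed over the final target word rather than per token; the variable names are illustrative.

```python
import math

# (AlterEgo score %, random-chance %) from the evaluation table.
results = {
    "hellaswag": (38.0, 25.0), "arc_easy": (52.7, 25.0),
    "arc_challenge": (27.3, 25.0), "piqa": (65.7, 50.0),
    "winogrande": (51.3, 50.0), "openbookqa": (32.2, 25.0),
    "sciq": (72.2, 25.0), "boolq": (61.8, 50.0),
}

# Percentage points gained over guessing, largest first.
lift = {task: round(score - chance, 1) for task, (score, chance) in results.items()}
for task, points in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:14s} +{points:.1f} pts over chance")

# Perplexity = exp(mean NLL in nats), so the conversion runs both ways.
lambada_nll = math.log(62.3)     # ~4.13 nats per LAMBADA target word
pretrain_val_ppl = math.exp(2.89)  # ~18.0 per token from the final pretraining val loss
print(f"LAMBADA NLL {lambada_nll:.2f} nats, pretraining val ppl {pretrain_val_ppl:.1f}")
```

The ranking makes the pattern in the paragraph above explicit: SciQ and ARC-Easy sit tens of points above chance, while ARC-Challenge and WinoGrande gain only a point or two.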
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)
messages = [
    {"role": "system", "content":
        "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
        "You're not great with facts, math, or current events - when you don't know "
        "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]
# Render the ChatML prompt (ending with the assistant header) and tokenize it.
# return_dict=True also returns the attention mask that generate() expects.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Recommended generation settings
These are the defaults AlterEgo was tuned and served with in LLME:
| Parameter | Value |
|---|---|
| temperature | 0.7 |
| top_k | 50 |
| top_p | 1.0 |
| repetition_penalty | 1.1 |
| max_new_tokens | 200 |
Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops when the model emits <|im_end|> or <|endoftext|>.
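To show what each recommended setting does, here is a pure-Python sketch of one decoding loop. It applies the knobs in roughly the order Hugging Face transformers does (CTRL-style repetition penalty over the whole sequence, then temperature, top-k, and top-p) and stops on designated stop-token IDs. It is not LLME's serving code; `sample_next_token`, `generate`, and the toy `next_logits` callback are hypothetical, and real stop IDs would come from the tokenizer.

```python
import math
import random


def sample_next_token(logits, context_ids, temperature=0.7, top_k=50, top_p=1.0,
                      repetition_penalty=1.1, rng=None):
    rng = rng or random.Random()
    scores = list(logits)
    # 1. Repetition penalty: make tokens already in the context less likely,
    #    dividing positive logits and multiplying negative ones.
    for t in set(context_ids):
        scores[t] = scores[t] / repetition_penalty if scores[t] > 0 else scores[t] * repetition_penalty
    # 2. Temperature: < 1 sharpens the distribution, > 1 flattens it.
    scores = [s / temperature for s in scores]
    # 3. Top-k: keep the k highest-scoring tokens (0 disables the filter).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Softmax over survivors, subtracting the max for numerical stability.
    peak = scores[ranked[0]]
    weights = [math.exp(scores[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative mass reaches top_p
    #    (top_p = 1.0, as recommended, effectively keeps everything).
    nucleus, cumulative = [], 0.0
    for i, p in zip(ranked, probs):
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample from the renormalized nucleus.
    r = rng.random() * sum(p for _, p in nucleus)
    for i, p in nucleus:
        r -= p
        if r < 0:
            return i
    return nucleus[-1][0]


def generate(next_logits, prompt_ids, stop_ids, max_new_tokens=200, **sampling):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The penalty sees the full sequence, prompt included.
        token = sample_next_token(next_logits(ids), ids, **sampling)
        if token in stop_ids:  # e.g. the IDs of <|im_end|> and <|endoftext|>
            break
        ids.append(token)
    return ids[len(prompt_ids):]
```

With `top_k=1` the loop reduces to greedy decoding, which is a convenient way to check the filtering logic before turning sampling back on.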
Chat format
AlterEgo uses ChatML:
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant
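The layout above can be rendered with a few lines of string formatting. This sketch assumes the standard ChatML placement of newlines (one after the role name and one after each <|im_end|>); the chat template shipped with the tokenizer is authoritative, and `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}] as a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model writes the reply next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are Alter Ego."},
    {"role": "user", "content": "Hi!"},
]))
```

To check this against the real template, compare it with `tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)`, which returns the rendered string instead of token IDs.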
Run it locally (GGUF)
Feel free to use my pre-made GGUFs and quants from the GGUF and quants page, or run the model with Ollama.
Because the model is in the standard Llama format, you can also convert it to GGUF yourself for Ollama, LM Studio, or llama.cpp:
python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16
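If Ollama does not pick up the chat template embedded in a self-converted GGUF, a Modelfile can set ChatML and the recommended sampling defaults explicitly. The sketch below generates one from Python; the Go-template ChatML block and the parameter names (`temperature`, `top_k`, `top_p`, `repeat_penalty`, `num_predict`, `num_ctx`, `stop`) follow Ollama's Modelfile format, while the file path and `build_modelfile` are assumptions for illustration.

```python
# ChatML expressed in Ollama's Go-template syntax (.System / .Prompt variables).
CHATML_TEMPLATE = (
    "{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}"
    "{{ if .Prompt }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n{{ end }}"
    "<|im_start|>assistant\n"
)

# Recommended generation settings, mapped to Ollama parameter names.
SETTINGS = {
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 1.0,
    "repeat_penalty": 1.1,
    "num_predict": 200,  # max_new_tokens
    "num_ctx": 2048,     # AlterEgo's context length
}


def build_modelfile(gguf_path="./alterego-f16.gguf", system_prompt=None):
    lines = [f"FROM {gguf_path}", f'TEMPLATE """{CHATML_TEMPLATE}"""']
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    for name, value in SETTINGS.items():
        lines.append(f"PARAMETER {name} {value}")
    # Stop on either end-of-turn token.
    for stop in ("<|im_end|>", "<|endoftext|>"):
        lines.append(f'PARAMETER stop "{stop}"')
    return "\n".join(lines) + "\n"


print(build_modelfile())
```

Save the output as `Modelfile`, then run `ollama create alterego -f Modelfile` followed by `ollama run alterego`. Note that llama.cpp-based runtimes apply the repeat penalty over a recent window of tokens (`repeat_last_n`) rather than the whole sequence, so outputs will not match transformers exactly.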
Limitations
AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one:
- Capability - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions."
- Language - English only.
- Safety - it is not safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards.
- Bias - it inherits biases from FineWeb-Edu (web text) and UltraChat.
License
Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K.
Citation
@misc{alterego2026,
title = {AlterEgo: A 373M language model trained from scratch},
author = {J-bom},
year = {2026},
url = {https://github.com/J-bom/AlterEgo}
}
Credits - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author.