Libraries

How to use jbomdev/AlterEgo with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="jbomdev/AlterEgo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# AlterEgo is a text-only LlamaForCausalLM checkpoint, so load it with the causal-LM auto class
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use jbomdev/AlterEgo with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jbomdev/AlterEgo"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jbomdev/AlterEgo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
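The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on `localhost:8000`, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(user_message, model="jbomdev/AlterEgo",
                       base_url="http://localhost:8000"):
    """Build the POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message, **kwargs):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should point the same helper at an SGLang server started with the port shown further down.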

Use Docker

docker model run hf.co/jbomdev/AlterEgo

SGLang

How to use jbomdev/AlterEgo with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "jbomdev/AlterEgo" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jbomdev/AlterEgo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "jbomdev/AlterEgo" \
        --host 0.0.0.0 \
        --port 30000


🧠 AlterEgo-373M

A 373-million-parameter language model designed, trained, and served entirely from scratch.

Introduction

AlterEgo is a small, decoder-only language model built from the ground up - not a fine-tune of an existing model. Every part was written from zero: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pre-trained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat.

It is the model at the heart of LLME, a self-hosted, end-to-end-encrypted LLM platform (think LM Studio / Open WebUI / Ollama, also built from scratch). LLME can serve AlterEgo alongside llama.cpp GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around.

This repository contains the model. The training and architecture code lives in the AlterEgo repo; the serving platform lives in the LLME repo.

Two formats are published. This repo is the Hugging Face LlamaForCausalLM conversion, for drop-in use with transformers, vLLM, and GGUF tooling. The original checkpoint, in AlterEgo's own from-scratch architecture exactly as trained, is published separately as jbomdev/alterego_raw. This version is a numerically lossless conversion of it (verified: max logit difference ~1e-6).

What it is and isn't. AlterEgo is a research / learning artifact - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is not a production assistant and won't compete with billion-parameter models. See Limitations.

Architecture

A modern Llama-style decoder (and, thanks to that, it loads as a standard LlamaForCausalLM).

Component	Choice
Type	Decoder-only transformer (autoregressive)
Parameters	~373M (input/output embeddings tied)
Layers	24
Model dimension	1024
Attention	Grouped-Query Attention - 16 query heads / 4 KV heads (head dim 64)
Positional encoding	Rotary embeddings (RoPE), θ = 10,000
Normalization	RMSNorm (pre-norm)
Feed-forward	SwiGLU, hidden dim 2816
Context length	2048
Vocabulary	100,352
Tokenizer	`cl100k_base` (tiktoken) extended with ChatML special tokens
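The ~373M figure can be reproduced from the table with a little arithmetic. The sketch below assumes the usual Llama-style layout, which the table does not spell out: no bias terms, one weight vector per RMSNorm, two norms per layer plus a final norm, and a three-matrix SwiGLU block. It also estimates the KV-cache size at full context to show what the 4 KV heads save over 16, assuming 2 bytes per cached value (bfloat16).

```python
# Values taken from the architecture table above.
VOCAB = 100_352
D_MODEL = 1024
N_LAYERS = 24
N_Q_HEADS = 16
N_KV_HEADS = 4
HEAD_DIM = 64
FFN_HIDDEN = 2816
CONTEXT = 2048


def count_parameters():
    """Parameter count under the assumed Llama-style layout (no biases)."""
    embed = VOCAB * D_MODEL                 # tied with the output head, counted once
    q_dim = N_Q_HEADS * HEAD_DIM            # 1024
    kv_dim = N_KV_HEADS * HEAD_DIM          # 256, shared across query-head groups
    attn = D_MODEL * q_dim + 2 * D_MODEL * kv_dim + q_dim * D_MODEL  # Q, K, V, O
    mlp = 3 * D_MODEL * FFN_HIDDEN          # gate, up, and down projections
    norms = 2 * D_MODEL                     # pre-attention and pre-MLP RMSNorm
    final_norm = D_MODEL
    return embed + N_LAYERS * (attn + mlp + norms) + final_norm


def kv_cache_bytes(n_kv_heads, seq_len=CONTEXT, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * seq_len * bytes_per_value


print(f"parameters: {count_parameters():,}")
print(f"KV cache, 4 KV heads:  {kv_cache_bytes(N_KV_HEADS) / 2**20:.0f} MiB")
print(f"KV cache, 16 KV heads: {kv_cache_bytes(N_Q_HEADS) / 2**20:.0f} MiB")
```

Under these assumptions the total comes to 373,343,232 parameters, of which about 103M are the tied embedding matrix, and the cache at 2048 tokens is 48 MiB with 4 KV heads versus 192 MiB with one KV head per query head.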

Training

AlterEgo was trained in two stages on a single NVIDIA RTX 4090.

Stage 1 - Pretraining

Pre-trained on FineWeb-Edu (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl.

The grad-norm settling to ~0.26 and the smooth cosine-shaped loss indicate stable training with no divergence.

Stage 2 - Supervised fine-tuning

Instruction-tuned on UltraChat-200K (HuggingFaceH4), formatted as multi-turn ChatML.

Hyperparameters

	Pretraining	SFT
Dataset	FineWeb-Edu	UltraChat-200K
Tokens / steps	~10B / 19,073	~64M / 244
Global batch	524,288 tokens (micro 2 × 2048 × 128 grad-accum)	same scheme
Optimizer	AdamW (β = 0.9, 0.95; ε = 1e-8; fused)	same
Weight decay	0.1 (decoupled; excluded from norms/biases)	same
LR schedule	linear warmup (1,900 steps) → cosine decay	cosine
Peak / min LR	3e-4 → 3e-5	low (fine-tune range)
Grad clipping	global-norm 1.0	1.0
Precision	bfloat16 autocast	bfloat16
Throughput / wall-clock	~32k tok/s · ~86 GPU-h (3.6 days)	~39k tok/s · ~28 min
Other	`torch.compile`, gradient checkpointing, FlashAttention (SDPA)	same
Final loss (train / val)	2.94 / 2.89	1.83 / 1.81
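The batch and schedule rows for pretraining can be checked and sketched in a few lines. The step count follows directly from the token budget and global batch; the exact shape of the warmup and decay is an assumption here (linear ramp reaching the peak at the last warmup step, then a half-cosine down to the minimum at the final step), since the table gives only the endpoints. The function name `lr_at` is illustrative.

```python
import math

MICRO_BATCH = 2
SEQ_LEN = 2048
GRAD_ACCUM = 128
TOKENS_PER_STEP = MICRO_BATCH * SEQ_LEN * GRAD_ACCUM   # 524,288
TOTAL_STEPS = 10_000_000_000 // TOKENS_PER_STEP        # 19,073 for a ~10B budget

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 1_900


def lr_at(step):
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup from near zero up to the peak.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    # Cosine decay from the peak to the minimum over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (1.0 + math.cos(math.pi * progress)) * (PEAK_LR - MIN_LR)


for step in (0, 1_899, 10_000, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```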

Evaluation

Benchmarked with EleutherAI's lm-evaluation-harness (0-shot).

Benchmark	Metric	AlterEgo-373M	Random
lambada_openai	acc	31.6%	~0%
hellaswag	acc_norm	38.0%	25%
arc_easy	acc_norm	52.7%	25%
arc_challenge	acc_norm	27.3%	25%
piqa	acc_norm	65.7%	50%
winogrande	acc	51.3%	50%
openbookqa	acc_norm	32.2%	25%
sciq	acc_norm	72.2%	25%
boolq	acc	61.8%	50%

For a 373M model trained on ~10B tokens, these results are solid: clearly above chance on science and commonsense tasks (SciQ, PIQA, BoolQ, ARC-easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), with the expected near-chance results on the hardest reasoning sets (ARC-challenge at 27.3% against a 25% baseline, WinoGrande at 51.3% against 50%).
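One way to read the table is as a margin over the random baseline. The short sketch below uses only the numbers reported above (LAMBADA's "~0%" baseline is treated as exactly 0) and ranks the benchmarks by that margin.

```python
# (accuracy %, random-baseline %) copied from the evaluation table.
RESULTS = {
    "lambada_openai": (31.6, 0.0),
    "hellaswag": (38.0, 25.0),
    "arc_easy": (52.7, 25.0),
    "arc_challenge": (27.3, 25.0),
    "piqa": (65.7, 50.0),
    "winogrande": (51.3, 50.0),
    "openbookqa": (32.2, 25.0),
    "sciq": (72.2, 25.0),
    "boolq": (61.8, 50.0),
}


def margins_over_chance(results):
    """Percentage points above the random baseline for each benchmark."""
    return {task: round(acc - chance, 1) for task, (acc, chance) in results.items()}


margins = margins_over_chance(RESULTS)
for task, margin in sorted(margins.items(), key=lambda item: item[1], reverse=True):
    print(f"{task:15s} +{margin:.1f} pts")
```

SciQ sits at the top of this ranking and WinoGrande and ARC-challenge at the bottom, which matches the summary above.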

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content":
     "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
     "You're not great with facts, math, or current events - when you don't know "
     "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

out = model.generate(
    ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

Recommended generation settings

These are the defaults AlterEgo was tuned and served with in LLME:

Parameter	Value
`temperature`	0.7
`top_k`	50
`top_p`	1.0
`repetition_penalty`	1.1
`max_new_tokens`	200

Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops on <|im_end|> or <|endoftext|>.
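To make the effect of these settings concrete, here is a small standard-library sketch of how they reshape one step's next-token distribution. It follows the ordering commonly used by Hugging Face-style samplers (repetition penalty, then temperature, then top-k); that ordering and the function names are assumptions for illustration, not a description of LLME's inference engine. With `top_p` at 1.0 no nucleus filtering happens, so it is omitted.

```python
import math
import random


def sampling_distribution(logits, generated_ids, temperature=0.7, top_k=50,
                          repetition_penalty=1.1):
    """Return the probabilities the next token would be sampled from."""
    adjusted = list(logits)
    # Repetition penalty: push down the logits of tokens already generated.
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # Temperature: values below 1 sharpen the distribution.
    adjusted = [x / temperature for x in adjusted]
    # Top-k: keep only the k highest-scoring tokens.
    k = min(top_k, len(adjusted))
    order = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    keep = set(order[:k])
    # Softmax over the kept tokens; everything else gets probability 0.
    peak = max(adjusted[i] for i in keep)
    weights = [math.exp(adjusted[i] - peak) if i in keep else 0.0
               for i in range(len(adjusted))]
    total = sum(weights)
    return [w / total for w in weights]


def sample_next_token(logits, generated_ids, rng=random, **settings):
    """Draw one token id from the adjusted distribution."""
    probs = sampling_distribution(logits, generated_ids, **settings)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sampling_distribution(toy_logits, generated_ids=[0], top_k=2))
```

In the toy example, token 0 has already been generated, so its probability is lower than it would be without the penalty, and tokens 2 and 3 are cut by top-k.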

Chat format

AlterEgo uses ChatML:

<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant

Run it locally (GGUF)

Feel free to use my pre-made GGUFs and quants from the GGUFs and quants page, or run the model with Ollama.

Because it's in standard Llama format, you can also convert it to GGUF for Ollama / LM Studio / llama.cpp yourself:

python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16

Limitations

AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one:

Capability - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions."
Language - English only.
Safety - it is not safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards.
Bias - it inherits biases from FineWeb-Edu (web text) and UltraChat.

License

Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K.

Citation

@misc{alterego2026,
  title  = {AlterEgo: A 373M language model trained from scratch},
  author = {J-bom},
  year   = {2026},
  url    = {https://github.com/J-bom/AlterEgo}
}

Credits - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author.

Safetensors - model size: 0.4B params; tensor type: F32.
