SoloLLM v3 123M Base

SoloLLM v3 123M Base is the smaller-than-GPT-2 ablation from the SoloLLM v3 project. It is a from-scratch GPT-style decoder-only base language model trained on one RTX 3090.

This is not the final best SoloLLM checkpoint. The final best model is bmax16634/sologpt-v3-150m-base. This 123M model is published because it is slightly smaller than GPT-2 small and documents the strict smaller-model test.

Bottom Line

The 123M model is slightly smaller than GPT-2 small (123.55M vs. 124.44M parameters) and beats it on WikiText-2 perplexity, LAMBADA perplexity, and average multiple-choice accuracy, but not on every metric. It loses the project held-out perplexity comparison (25.64 vs. 25.32), the PIQA benchmark (63.40% vs. 63.60%), and some fixed-prompt generation diversity/repetition diagnostics.

| Model | Params | Train tokens | Held-out PPL | WikiText-2 PPL | LAMBADA PPL | MC avg acc norm |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 small | 124.44M | public | 25.32 | 45.32 | 40.62 | 41.05% |
| SoloLLM v3 123M | 123.55M | 9.80B | 25.64 | 41.87 | 36.28 | 42.46% |
| SoloLLM v3 150M | 151.87M | 10.00B | 24.90 | 41.18 | 35.35 | 42.71% |

The honest read:

The 123M model is a strong smaller-than-GPT-2 ablation, but it does not prove that a smaller model beats GPT-2 small across the board.
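
The win/loss picture above can be tallied mechanically. The sketch below copies the four headline numbers from the comparison table and marks each metric as lower-is-better or higher-is-better; the dictionary layout and the `solo_wins` helper are illustrative names, not part of the project code.

```python
# Headline numbers copied from the comparison table above.
# "lower_is_better" is True for perplexities and False for accuracy.
METRICS = {
    "Held-out PPL": {"gpt2": 25.32, "solo_123m": 25.64, "lower_is_better": True},
    "WikiText-2 PPL": {"gpt2": 45.32, "solo_123m": 41.87, "lower_is_better": True},
    "LAMBADA PPL": {"gpt2": 40.62, "solo_123m": 36.28, "lower_is_better": True},
    "MC avg acc norm": {"gpt2": 41.05, "solo_123m": 42.46, "lower_is_better": False},
}


def solo_wins(row):
    """Return True if the 123M model is strictly better than GPT-2 small on this metric."""
    if row["lower_is_better"]:
        return row["solo_123m"] < row["gpt2"]
    return row["solo_123m"] > row["gpt2"]


wins = [name for name, row in METRICS.items() if solo_wins(row)]
losses = [name for name, row in METRICS.items() if not solo_wins(row)]

print("123M wins:  ", wins)
print("123M losses:", losses)
```

Only the headline table is covered here; the fixed-prompt generation diagnostics mentioned above are not reported as numbers in this card and are therefore left out.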

Model Details

| Item | Value |
| --- | --- |
| Architecture | Decoder-only GPT-style transformer |
| Parameters | 123,551,232 |
| Context length | 1024 |
| Tokenizer | GPT-2 tokenizer |
| Embedding width | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Positional method | RoPE |
| Normalization | RMSNorm |
| MLP | SwiGLU |
| Weight tying | Input/output embeddings tied |
| Training hardware | Single RTX 3090 |
| Training tokens | 9,800,728,576 |
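
As a sanity check, the stated parameter count can be reproduced from the details listed above under a few assumptions that this card does not spell out: bias-free linear layers, a SwiGLU hidden width of 2048, the unpadded GPT-2 vocabulary of 50,257 tokens, two RMSNorm weight vectors per block plus one final norm, and no learned parameters for RoPE. The sketch below is only an accounting exercise; the actual layer layout should be confirmed against model.py and config.json.

```python
# Values taken from the model details above.
D_MODEL = 768
N_LAYERS = 12
STATED_PARAMS = 123_551_232

# ASSUMPTIONS (not stated in the model card):
VOCAB_SIZE = 50257   # standard GPT-2 BPE vocabulary, no padding to a multiple
MLP_HIDDEN = 2048    # hypothetical SwiGLU hidden width


def count_params(vocab_size, d_model, n_layers, mlp_hidden):
    """Count weights for a bias-free, weight-tied decoder with RMSNorm and SwiGLU."""
    embedding = vocab_size * d_model     # tied input/output embedding, counted once
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    mlp = 3 * d_model * mlp_hidden       # SwiGLU gate, up and down projections
    block_norms = 2 * d_model            # one RMSNorm before attention, one before the MLP
    final_norm = d_model                 # RMSNorm before the output head
    return embedding + n_layers * (attention + mlp + block_norms) + final_norm


total = count_params(VOCAB_SIZE, D_MODEL, N_LAYERS, MLP_HIDDEN)
print(f"computed: {total:,}")
print(f"stated:   {STATED_PARAMS:,}")
print("match:", total == STATED_PARAMS)
```

Under these assumptions the total comes out to 123,551,232, which matches the stated figure. That agreement makes the assumed layout plausible but does not prove it.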

Training Data

The model was trained on the same curated 10B-token SoloLLM v3 dataset as the 150M final model:

| Source | Accepted tokens | Share |
| --- | --- | --- |
| FineWeb-Edu sample-10BT | 4,000,001,532 | 40% |
| DCLM baseline | 2,500,001,319 | 25% |
| FineWeb sample-10BT | 1,499,997,774 | 15% |
| English Wikipedia | 999,998,937 | 10% |
| OpenWebText | 1,000,000,972 | 10% |
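
The shares in the table follow directly from the accepted-token counts. The short sketch below recomputes them and also reports what fraction of the dataset the 123M run consumed, using the training-token figure from the model details. Variable names are illustrative.

```python
# Accepted-token counts copied from the table above.
ACCEPTED_TOKENS = {
    "FineWeb-Edu sample-10BT": 4_000_001_532,
    "DCLM baseline": 2_500_001_319,
    "FineWeb sample-10BT": 1_499_997_774,
    "English Wikipedia": 999_998_937,
    "OpenWebText": 1_000_000_972,
}
TRAINING_TOKENS = 9_800_728_576  # tokens seen by the 123M model (from Model Details)

dataset_total = sum(ACCEPTED_TOKENS.values())
shares = {name: 100 * count / dataset_total for name, count in ACCEPTED_TOKENS.items()}

for name, share in shares.items():
    print(f"{name:<26} {share:6.2f}%")

# Fraction of the curated dataset consumed during training.
coverage = TRAINING_TOKENS / dataset_total
print(f"dataset total: {dataset_total:,} tokens")
print(f"coverage:      {coverage:.1%}")
```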

Multiple-Choice Detail

Length-normalized accuracy:

| Benchmark | GPT-2 small | SoloLLM v3 123M |
| --- | --- | --- |
| HellaSwag | 29.53% | 29.85% |
| PIQA | 63.60% | 63.40% |
| ARC-Easy | 40.35% | 44.04% |
| ARC-Challenge | 22.07% | 24.08% |
| WinoGrande | 49.72% | 50.91% |
| Average | 41.05% | 42.46% |
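
For readers unfamiliar with length-normalized scoring, the sketch below shows the general idea: each answer choice gets a summed log-likelihood from the model, and that sum is divided by the choice's length before taking the argmax, so longer answers are not penalized merely for having more tokens. This card does not say which length unit the project used; character length is assumed here, and the log-likelihood values are hypothetical numbers chosen for illustration.

```python
def pick_choice(log_likelihoods, choices, normalize=True):
    """Return the index of the best-scoring choice.

    log_likelihoods: summed log-probability the model assigns to each choice.
    choices: the answer strings, used only for their lengths.
    normalize: if True, divide each score by the choice's character length
               (ASSUMPTION: the project's exact normalization unit is not stated).
    """
    scores = []
    for log_likelihood, choice in zip(log_likelihoods, choices):
        length = max(len(choice), 1) if normalize else 1
        scores.append(log_likelihood / length)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical example: the longer answer has a lower total log-likelihood
# but a higher per-character log-likelihood.
choices = ["a cat", "a very large domesticated feline"]
log_likelihoods = [-6.0, -16.0]

raw_pick = pick_choice(log_likelihoods, choices, normalize=False)
norm_pick = pick_choice(log_likelihoods, choices, normalize=True)
print("raw pick:       ", choices[raw_pick])
print("normalized pick:", choices[norm_pick])
```

In this toy case the raw score prefers the short answer while the normalized score prefers the long one, which is exactly the difference between plain and length-normalized accuracy.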

Files

| File | Purpose |
| --- | --- |
| model.safetensors | Final model state dict |
| config.json | Model/training config used to instantiate SoloGPT_v2 |
| config_resolved.json | Resolved run config from training |
| metrics_summary.json | Training summary for the final checkpoint |
| model.py | Minimal SoloGPT model implementation used by this checkpoint |
| configuration_sologpt.py | Hugging Face AutoConfig remote-code wrapper |
| modeling_sologpt.py | Hugging Face AutoModelForCausalLM remote-code wrapper |
| tokenizer.json | GPT-2 tokenizer used for training and inference |
| tokenizer_config.json | Tokenizer metadata with 1024-token context and EOS-as-pad |
| load_example.py | Example loading and sampling script |
| docs/v3_final_gpt2_comparison.md | Full final result writeup |
| docs/project_page.md | Short portfolio-style project page |

Usage

This repo supports Hugging Face AutoModelForCausalLM loading through custom remote code. Pass trust_remote_code=True when loading the model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bmax16634/sologpt-v3-123m-base"

# Use the GPU when available; the model also runs on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# trust_remote_code=True is required because the repo ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a 40-token continuation with temperature and top-k filtering.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        temperature=0.8,
        top_k=40,
        use_cache=False,
        remove_invalid_values=True,
        renormalize_logits=True,
        pad_token_id=tokenizer.eos_token_id,  # the tokenizer uses EOS as the pad token
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For a runnable example, see load_example.py. For low-level state-dict loading, the raw PyTorch implementation is still included as model.py.

Intended Use

This model is intended for:

  • educational inspection of a smaller GPT-2-class base LM,
  • ablation comparison against sologpt-v3-150m-base,
  • text-completion experiments,
  • reproducing the SoloLLM v3 evaluation story.

It is not intended for production use, high-stakes decisions, factual QA, or chat/instruction-following use without additional tuning and safety evaluation.

Limitations

  • This is a small base model, not an assistant.
  • It can generate incorrect, biased, repetitive, or unsafe text.
  • It has no retrieval, tool use, or instruction tuning.
  • It does not beat GPT-2 small across every metric.
  • Training data came from broad public web/text sources and may contain undesirable content despite filtering.

Related Artifacts

License

The SoloLLM code and released weights are provided under the MIT License by the author. Training data sources retain their own licenses and terms.

Evaluation results

  • Held-out perplexity (SoloLLM project held-out OpenWebText-style shards, self-reported): 25.637
  • WikiText-2 perplexity (WikiText-2 test, self-reported): 41.874
  • LAMBADA perplexity (LAMBADA, self-reported): 36.278
  • LAMBADA last-word accuracy (LAMBADA, self-reported): 0.328
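
For reference, perplexity is conventionally the exponential of the mean negative log-likelihood per token, and last-word accuracy is the fraction of examples whose predicted final word exactly matches the target. The sketch below illustrates both definitions on toy inputs. It assumes token-level perplexity with natural-log probabilities; the project's exact evaluation protocol is not described in this card, and the helper names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def last_word_accuracy(predicted_words, target_words):
    """Fraction of examples where the predicted final word equals the target."""
    correct = sum(p == t for p, t in zip(predicted_words, target_words))
    return correct / len(target_words)


# Toy check: a model that gives every token probability 1/4 has perplexity 4.
uniform_log_probs = [math.log(0.25)] * 8
print(f"toy perplexity: {perplexity(uniform_log_probs):.3f}")

# Conversely, a reported perplexity maps back to a mean loss in nats per token.
reported_ppl = 25.637  # held-out perplexity listed above
print(f"mean NLL for PPL {reported_ppl}: {math.log(reported_ppl):.3f} nats/token")

# Toy accuracy: one of three predictions matches its target.
print(last_word_accuracy(["door", "cat", "sea"], ["door", "dog", "sky"]))
```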