Argonne 3.0-base

Argonne 3.0-base is a 2.88B-parameter decoder-only transformer language model from the Argonne 3.x family. It is a base (foundation) checkpoint trained from scratch on FineWeb-derived web text and is intended as a starting point for continued pretraining, supervised fine-tuning, or preference optimization.

The architecture combines grouped-query attention with several stability-oriented additions (QK-norm, V-norm, sandwich norms, interleaved local/global attention, and a final logit softcap). Weights are stored in bf16 and split across 5 safetensor shards so the model can be loaded with transformers on commodity hardware.
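
To make two of these additions concrete, here is a minimal sketch of a logit softcap and of a local/global causal attention rule, using the cap (15.0) and window (256) listed in the specification table. The tanh form of the softcap, the convention that the window counts the current token, and the choice of which layers are local are assumptions for illustration; the function names are hypothetical and not taken from model.py.

```python
import math
from typing import Optional


def softcap(logit: float, cap: float = 15.0) -> float:
    """Squash a logit smoothly into [-cap, cap] (assumed tanh formulation)."""
    return cap * math.tanh(logit / cap)


def attention_allowed(query_pos: int, key_pos: int, window: Optional[int]) -> bool:
    """Return True if the query position may attend to the key position.

    window=None means global causal attention. An integer means local causal
    attention over the most recent `window` positions, current token included
    (an assumed convention).
    """
    if key_pos > query_pos:  # causal: never attend to future tokens
        return False
    if window is None:
        return True
    return query_pos - key_pos < window


def layer_window(layer_idx: int, local_window: int = 256) -> Optional[int]:
    """Hypothetical layout: even layers are local, odd layers are global."""
    return local_window if layer_idx % 2 == 0 else None


# Small logits pass through almost unchanged; large ones saturate at the cap.
print(softcap(1.0), softcap(100.0))
# Position 300 can see position 0 in a global layer but not in a local one.
print(attention_allowed(300, 0, layer_window(1)), attention_allowed(300, 0, layer_window(0)))
```

The softcap bounds the final logits without a hard clip, so gradients stay non-zero everywhere, while local layers cap the attention span regardless of sequence length.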

Model architecture

| Component | Specification |
|---|---|
| Parameters | 2,882,162,688 (~2.88B) |
| Layers | 24 transformer blocks |
| Hidden size | 3,072 |
| Attention heads | 12 query / 4 key-value (GQA) |
| Head dimension | 256 |
| Feed-forward | SwiGLU MLP, 8,192 intermediate dim |
| Attention pattern | Interleaved local/global causal attention |
| Local attention window | 256 tokens (every other layer) |
| Normalization | RMSNorm with QK / V / sandwich norms |
| Position encoding | RoPE (θ = 1,000,000) |
| Logit stabilization | Final logit softcap = 15.0 |
| Context length | 1,024 tokens |
| Vocabulary size | 151,669 |
| Tied embeddings | Yes (input ↔ output) |
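
As a sanity check, the parameter total can be rebuilt from the other rows of the table. The sketch below assumes bias-free linear layers and a specific norm layout (four hidden-size RMSNorm weights per block for the sandwich norms, one head-dimension weight each for the Q-, K-, and V-norms, and one final norm). That layout is an assumption, not something the card states, although under it the arithmetic reproduces the listed total exactly.

```python
# Values from the specification table.
vocab_size = 151_669
hidden = 3_072
layers = 24
q_heads, kv_heads, head_dim = 12, 4, 256
mlp_dim = 8_192

# Attention projections (assumed bias-free). Q and O use all 12 query heads;
# K and V use only the 4 key-value heads because of grouped-query attention.
attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim

# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * hidden * mlp_dim

# Assumed norm layout: 4 hidden-size norms per block (sandwich norms)
# plus head_dim-sized weights for Q-norm, K-norm, and V-norm.
norms = 4 * hidden + 3 * head_dim

embedding = vocab_size * hidden  # counted once because embeddings are tied
final_norm = hidden

total = layers * (attn + mlp + norms) + embedding + final_norm
print(f"{total:,}")  # 2,882,162,688

# bf16 stores 2 bytes per parameter, which gives the weight footprint.
bf16_gib = total * 2 / 2**30
print(f"{bf16_gib:.2f} GiB of weights in bf16")
```

Grouped-query attention is what keeps the K and V projections at one third the size of Q and O here, since 4 key-value heads are shared across 12 query heads.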

Training details

| Item | Value |
|---|---|
| Stages | Two-stage causal language modeling (pretrain → continued pretrain) |
| Total optimizer steps | 329,148 |
| Tokens processed (cumulative) | 76,050,702,336 (~76.05B) |
| Stage 1 tokens (pretrain) | 20,839,021,454 (~20.84B, single epoch) |
| Stage 2 tokens (continued pretrain) | 55,211,688,156 (~55.21B, single epoch) |
| Sequence length | 1,024 tokens |
| Batch size per GPU | 38 |
| Gradient accumulation steps | 2 |
| Data-parallel world size | 3 GPUs |
| Effective batch | 233,472 tokens / step |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| Peak learning rate | 3.0e-4 |
| Min LR ratio | 0.1 |
| Schedule | Warmup-Stable-Decay (WSD); 1,000 warmup steps, 0 cooldown (stable phase only) |
| Gradient clipping | 1.0 |
| Precision | bf16 autocast (weights in fp32, optimizer states in fp32) |
| torch.compile | Enabled (default mode) |
| Gradient checkpointing | Enabled |
| Flash attention | Enabled (kernels fall back gracefully if unavailable) |
| Final-slice average train loss | 2.5168 |
| Checkpoint dtype on Hub | bfloat16 |
| Weight format on Hub | 5 sharded safetensors + index |
| Hardware | 3× NVIDIA H200 GPUs (DDP) |
| Random seed | 444 |
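
The effective batch row follows directly from the per-GPU batch size, gradient accumulation, world size, and sequence length. A short worked calculation, using only values from the table above (the variable names are illustrative):

```python
batch_per_gpu = 38
grad_accum_steps = 2
world_size = 3
seq_len = 1_024
warmup_steps = 1_000

# Sequences contributing to one optimizer step across all GPUs.
sequences_per_step = batch_per_gpu * grad_accum_steps * world_size
# Tokens consumed per optimizer step.
tokens_per_step = sequences_per_step * seq_len
# Tokens seen during the linear warmup.
warmup_tokens = warmup_steps * tokens_per_step

print(sequences_per_step)        # 228
print(f"{tokens_per_step:,}")    # 233,472
print(f"{warmup_tokens:,}")      # 233,472,000
```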

Stage 1: pretrain (pretrain.py)

  • Cold start from randomly initialized weights.
  • One full epoch over the FineWeb pretraining shard (20.84B tokens).
  • 1,000-step linear warmup followed by the WSD stable phase at LR 3.0e-4.

Stage 2: continued pretrain (continue_pretrain.py)

  • Resumed from the stage-1 checkpoint with a fresh optimizer / scheduler (data cursor reset to the new shard).
  • One full epoch over the FineWeb CC-MAIN-2025-21 shard (55.21B tokens).
  • Same hyperparameters as stage 1, no additional warmup.

Training data

| Item | Value |
|---|---|
| Pretrain corpus | FineWeb (tokenized with the Qwen3 tokenizer); see HuggingFaceFW/fineweb |
| Continued-pretrain corpus | FineWeb CC-MAIN-2025-21 dump (Qwen3 tokenizer); see HuggingFaceFW/fineweb |
| Tokenizer source | Qwen/Qwen3-0.6B-Base (151,669-token vocab) |

Tokenizer

This model reuses the Qwen3 tokenizer (vocabulary size 151,669) through the Qwen2Tokenizer compatibility class. The tokenizer files are bundled with the checkpoint so no extra download is required.

Source code

Built from the GitHub main branch: https://github.com/PursuitOfDataScience/ArgonneAI/tree/main

Key scripts used to produce this checkpoint:

  • model.py: the ArgonneModel / ArgonneConfig architecture (bundled here as model.py)
  • pretrain.py: stage 1 DDP pretraining loop
  • continue_pretrain.py: stage 2 continued-pretraining loop

Training loss curve

The figure below tracks loss, perplexity, and learning rate against cumulative training tokens across both stages.

[Figure: training loss, perplexity, and learning rate versus cumulative training tokens]

The warmup-stable-decay schedule is visible in the LR panel: 1,000 linear warmup steps to 3.0e-4 followed by a flat stable phase (cooldown was set to 0 for this run).
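
The schedule can be sketched as a plain function of the optimizer step. This is a minimal illustration and not the code from pretrain.py: the function name, the `(step + 1)` warmup convention, and the linear shape of the optional decay phase are all assumptions. With `cooldown_steps=0`, as in this run, the decay branch is never taken and the min LR ratio has no effect.

```python
from typing import Optional


def wsd_lr(
    step: int,
    peak_lr: float = 3.0e-4,
    warmup_steps: int = 1_000,
    total_steps: Optional[int] = None,
    cooldown_steps: int = 0,
    min_lr_ratio: float = 0.1,
) -> float:
    """Learning rate at a given optimizer step under a warmup-stable-decay schedule."""
    # Warmup: ramp linearly so the last warmup step reaches peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Decay (assumed linear): only active when a cooldown is configured.
    if cooldown_steps > 0 and total_steps is not None:
        decay_start = total_steps - cooldown_steps
        if step >= decay_start:
            progress = min(1.0, (step - decay_start + 1) / cooldown_steps)
            return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)
    # Stable phase: hold the peak learning rate.
    return peak_lr


# This run: 1,000 warmup steps, then a flat 3.0e-4 with no cooldown.
for s in (0, 499, 999, 1_000, 300_000):
    print(s, wsd_lr(s))
```

Passing `warmup_steps=0` gives a flat schedule from the first step, which corresponds to the "no additional warmup" setting described for stage 2.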

Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/argonne-3.0-base"

# trust_remote_code=True lets transformers load the custom ArgonneModel / ArgonneConfig classes.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,  # matches the bf16 checkpoint on the Hub
)

prompt = "Write a short paragraph about scientific computing at Argonne National Laboratory."
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

# The custom generate method takes max_length (prompt + new tokens), not max_new_tokens.
output_ids = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 128,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Usage notes

  • Load with trust_remote_code=True so the custom ArgonneModel / ArgonneConfig classes (model.py) are registered.
  • The custom generate method on ArgonneModel uses max_length (total sequence length) rather than max_new_tokens; see the snippet above for the recommended pattern.
  • This is a base model: no instruction tuning, alignment, or safety filtering has been applied. Outputs can include factually incorrect, biased, or unsafe text.
  • Weights are published as 5 bf16 safetensor shards with a model.safetensors.index.json weight map for sharded loading.
  • The published context length is 1,024 tokens. RoPE uses θ = 1,000,000 so the same checkpoint can be extended to longer contexts in follow-on stages.
  • Switch to greedy decoding (do_sample=False) if you want deterministic output.
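
Because max_length counts the prompt as well as the generated tokens, it can help to compute it explicitly and keep it within the published context length. The helper below is a hypothetical convenience, not part of model.py. Clamping to 1,024 is a conservative assumption, since the card does not say how the custom generate method behaves past that length.

```python
CONTEXT_LENGTH = 1_024  # published context length from this card


def total_max_length(prompt_len: int, new_tokens: int,
                     context_length: int = CONTEXT_LENGTH) -> int:
    """Return a max_length (prompt + generated tokens) capped at the context length."""
    if prompt_len >= context_length:
        raise ValueError(
            f"prompt has {prompt_len} tokens, leaving no room within {context_length}"
        )
    return min(prompt_len + new_tokens, context_length)


print(total_max_length(16, 128))     # 144: short prompt, full 128 new tokens
print(total_max_length(1_000, 128))  # 1024: clamped, only 24 new tokens fit
```

In the inference snippet, the result would be passed as `max_length`, with `prompt_len` taken from `input_ids.shape[1]`.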

Limitations

  • Trained on web data only; no instruction following, dialogue, or tool use.
  • 1,024-token context limits multi-document or long-form tasks without further long-context training.
  • Loss plateaued at ≈2.5 (~12 PPL) on FineWeb, which is typical for a 2.88B model trained on ~76B tokens but well above frontier-scale models.
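
The perplexity figure is the exponential of the cross-entropy loss. The conversion below uses the final-slice average train loss of 2.5168 from the training details and assumes the loss is measured in nats per token (the usual convention for cross-entropy computed with natural logarithms).

```python
import math

final_loss = 2.5168  # final-slice average train loss, assumed to be in nats per token

perplexity = math.exp(final_loss)
bits_per_token = final_loss / math.log(2)

print(f"perplexity ≈ {perplexity:.2f}")        # roughly 12.4
print(f"bits per token ≈ {bits_per_token:.2f}")
```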

Citation

@misc{argonne30base,
  author = {PursuitOfDataScience},
  title = {Argonne 3.0-base},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/PursuitOfDataScience/argonne-3.0-base}
}