Argonne 3.0-base
Argonne 3.0-base is a 2.88B-parameter decoder-only transformer language model from the Argonne 3.x family. It is a base (foundation) checkpoint trained from scratch on FineWeb-derived web text and is intended as a starting point for continued pretraining, supervised fine-tuning, or preference optimization.
The architecture combines grouped-query attention with several stability-oriented additions (QK-norm, V-norm, sandwich norms, interleaved local/global attention, and a final logit softcap). Weights are stored in bf16 and split across 5 safetensor shards so the model can be loaded with transformers on commodity hardware.
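As a rough illustration of the final logit softcap, the sketch below assumes the common tanh formulation, cap * tanh(x / cap), with cap = 15.0. This card does not state the exact formula, so `softcap` is a hypothetical helper and not the code in model.py.

```python
import math

SOFTCAP = 15.0  # final logit softcap value from this model card


def softcap(logit: float, cap: float = SOFTCAP) -> float:
    """Squash a logit smoothly into the open interval (-cap, cap).

    Assumption: tanh-based capping; the formula used in model.py may differ.
    """
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    # Small logits pass through almost unchanged; large ones saturate near +/-cap.
    for raw in (-100.0, -1.0, 0.0, 1.0, 100.0):
        print(f"{raw:8.1f} -> {softcap(raw):8.4f}")
```

With this formulation, logits much smaller than the cap are nearly unchanged, while extreme values are bounded, which is the stabilizing effect a softcap is meant to provide.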
Model architecture
| Component | Specification |
|---|---|
| Parameters | 2,882,162,688 (~2.88B) |
| Layers | 24 transformer blocks |
| Hidden size | 3,072 |
| Attention heads | 12 query / 4 key-value (GQA) |
| Head dimension | 256 |
| Feed-forward | SwiGLU MLP, 8,192 intermediate dim |
| Attention pattern | Interleaved local/global causal attention |
| Local attention window | 256 tokens (every other layer) |
| Normalization | RMSNorm with QK / V / sandwich norms |
| Position encoding | RoPE (θ = 1,000,000) |
| Logit stabilization | Final logit softcap = 15.0 |
| Context length | 1,024 tokens |
| Vocabulary size | 151,669 |
| Tied embeddings | Yes (input ↔ output) |
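The parameter count can be approximately reconstructed from the table. The sketch below is an estimate under stated assumptions, not a readout of model.py: it assumes bias-free linear layers, four hidden-size RMSNorm weights per block (the sandwich norms), one head-dimension weight each for the Q, K, and V norms, and a single final RMSNorm. All names are illustrative.

```python
# Values copied from the architecture table.
VOCAB = 151_669
HIDDEN = 3_072
LAYERS = 24
Q_HEADS, KV_HEADS, HEAD_DIM = 12, 4, 256
FFN = 8_192


def count_parameters() -> int:
    """Estimate the total parameter count under the assumptions in the lead-in."""
    embed = VOCAB * HIDDEN  # tied embeddings, so the matrix is counted once
    q_dim = Q_HEADS * HEAD_DIM
    kv_dim = KV_HEADS * HEAD_DIM
    # Q and output projections use q_dim; K and V use the smaller GQA kv_dim.
    attn = HIDDEN * q_dim + 2 * HIDDEN * kv_dim + q_dim * HIDDEN
    mlp = 3 * HIDDEN * FFN  # SwiGLU: gate, up, and down projections
    # Assumed norm layout: 4 hidden-size norms plus Q/K/V norms over head_dim.
    norms = 4 * HIDDEN + 3 * HEAD_DIM
    per_layer = attn + mlp + norms
    final_norm = HIDDEN  # assumed single final RMSNorm
    return embed + LAYERS * per_layer + final_norm


if __name__ == "__main__":
    print(f"{count_parameters():,}")
```

Under these assumptions the estimate agrees with the 2,882,162,688 figure in the table, with the tied embedding matrix accounting for roughly 466M of the total.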
Training details
| Item | Value |
|---|---|
| Stages | Two-stage causal language modeling (pretrain → continued pretrain) |
| Total optimizer steps | 329,148 |
| Tokens processed (cumulative) | 76,050,702,336 (~76.05B) |
| Stage 1 tokens (pretrain) | 20,839,021,454 (~20.84B, single epoch) |
| Stage 2 tokens (continued pretrain) | 55,211,688,156 (~55.21B, single epoch) |
| Sequence length | 1,024 tokens |
| Batch size per GPU | 38 |
| Gradient accumulation steps | 2 |
| Data-parallel world size | 3 GPUs |
| Effective batch | 233,472 tokens / step |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| Peak learning rate | 3.0e-4 |
| Min LR ratio | 0.1 |
| Schedule | Warmup-Stable-Decay (WSD); 1,000 warmup steps, 0 cooldown (stable phase only) |
| Gradient clipping | 1.0 |
| Precision | bf16 autocast (weights in fp32, optimizer states in fp32) |
| torch.compile | Enabled (default mode) |
| Gradient checkpointing | Enabled |
| Flash attention | Enabled (kernels fall back gracefully if unavailable) |
| Final-slice average train loss | 2.5168 |
| Checkpoint dtype on Hub | bfloat16 |
| Weight format on Hub | 5 sharded safetensors + index |
| Hardware | 3× NVIDIA H200 GPUs (DDP) |
| Random seed | 444 |
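The effective batch figure follows directly from the per-GPU batch size, gradient accumulation, world size, and sequence length in the table. A minimal sketch of that arithmetic, with illustrative names:

```python
# Values copied from the training details table.
SEQ_LEN = 1_024
BATCH_PER_GPU = 38
GRAD_ACCUM = 2
WORLD_SIZE = 3


def tokens_per_step(
    batch_per_gpu: int = BATCH_PER_GPU,
    grad_accum: int = GRAD_ACCUM,
    world_size: int = WORLD_SIZE,
    seq_len: int = SEQ_LEN,
) -> int:
    """Tokens consumed per optimizer step across all data-parallel ranks."""
    sequences_per_step = batch_per_gpu * grad_accum * world_size
    return sequences_per_step * seq_len


if __name__ == "__main__":
    print(f"{tokens_per_step():,} tokens / step")
```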
Stage 1: pretrain (pretrain.py)
- Cold start from randomly initialized weights.
- One full epoch over the FineWeb pretraining shard (20.84B tokens).
- 1,000-step linear warmup followed by the WSD stable phase at LR 3.0e-4.
Stage 2: continued pretrain (continue_pretrain.py)
- Resumed from the stage-1 checkpoint with a fresh optimizer / scheduler (data cursor reset to the new shard).
- One full epoch over the FineWeb CC-MAIN-2025-21 shard (55.21B tokens).
- Same hyperparameters as stage 1, no additional warmup.
Training data
| Item | Value |
|---|---|
| Pretrain corpus | FineWeb (tokenized with the Qwen3 tokenizer); see HuggingFaceFW/fineweb |
| Continued-pretrain corpus | FineWeb CC-MAIN-2025-21 dump (Qwen3 tokenizer); see HuggingFaceFW/fineweb |
| Tokenizer source | Qwen/Qwen3-0.6B-Base (151,669-token vocab) |
Tokenizer
This model reuses the Qwen3 tokenizer (vocabulary size 151,669) through the Qwen2Tokenizer compatibility class. The tokenizer files are bundled with the checkpoint so no extra download is required.
Source code
Built from the GitHub main branch: https://github.com/PursuitOfDataScience/ArgonneAI/tree/main
Key scripts used to produce this checkpoint:
- model.py: the ArgonneModel / ArgonneConfig architecture (bundled here as model.py)
- pretrain.py: stage 1 DDP pretraining loop
- continue_pretrain.py: stage 2 continued-pretraining loop
Training loss curve
The figure below tracks loss, perplexity, and learning rate against cumulative training tokens across both stages.
The warmup-stable-decay schedule is visible in the LR panel: 1,000 linear warmup steps to 3.0e-4 followed by a flat stable phase (cooldown was set to 0 for this run).
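The schedule described above can be sketched as a small function. This is an illustration and not the scheduler in pretrain.py: it assumes the warmup ramps linearly from 0 at step 0 to the peak at step 1,000, and it omits the decay phase because cooldown was 0 for this run. `lr_at` is a hypothetical name.

```python
PEAK_LR = 3.0e-4
WARMUP_STEPS = 1_000


def lr_at(step: int, peak_lr: float = PEAK_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at a given optimizer step for warmup + stable phases.

    Assumption: linear warmup starting from 0; the real script may start
    from a small non-zero value or offset the step count by one.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr  # stable phase: flat at the peak, no cooldown


if __name__ == "__main__":
    for step in (0, 250, 500, 1_000, 100_000):
        print(f"step {step:>7,}: lr = {lr_at(step):.2e}")
```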
Inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/argonne-3.0-base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
)

prompt = "Write a short paragraph about scientific computing at Argonne National Laboratory."
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

output_ids = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 128,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Usage notes
- Load with trust_remote_code=True so the custom ArgonneModel / ArgonneConfig classes (model.py) are registered.
- The custom generate method on ArgonneModel uses max_length (total sequence length) rather than max_new_tokens; see the snippet above for the recommended pattern.
- This is a base model: no instruction tuning, alignment, or safety filtering has been applied. Outputs can include factually incorrect, biased, or unsafe text.
- Weights are published as 5 bf16 safetensor shards with a model.safetensors.index.json weight map for sharded loading.
- The published context length is 1,024 tokens. RoPE uses θ = 1,000,000 so the same checkpoint can be extended to longer contexts in follow-on stages.
- Switch to greedy decoding (do_sample=False) if you want deterministic output.
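Because max_length counts the prompt as well as the generated tokens, it can help to compute it explicitly. The helper below is a hypothetical convenience, not part of the model's API; capping the total at the 1,024-token context length is an assumption made here as a precaution, not a requirement stated by this card.

```python
CONTEXT_LENGTH = 1_024  # published context length


def max_length_for(
    prompt_tokens: int,
    new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> int:
    """Total sequence length to pass as max_length for a desired number of new tokens.

    Assumption: generation should stay within the trained context window,
    so the result is capped at context_length.
    """
    if prompt_tokens >= context_length:
        raise ValueError("prompt already fills the context window")
    return min(prompt_tokens + new_tokens, context_length)


if __name__ == "__main__":
    print(max_length_for(prompt_tokens=20, new_tokens=128))
    print(max_length_for(prompt_tokens=1_000, new_tokens=128))
```

In the inference snippet, the equivalent call would pass input_ids.shape[1] as prompt_tokens.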
Limitations
- Trained on web data only; no instruction following, dialogue, or tool use.
- 1,024-token context limits multi-document or long-form tasks without further long-context training.
- Loss plateaued at ≈2.5 (~12 PPL) on FineWeb; this is typical for a 2.88B model trained on ~76B tokens, but well above frontier-scale models.
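The perplexity quoted alongside the loss is the exponential of the loss, assuming the reported value is the mean per-token cross-entropy in nats (the usual convention for PyTorch cross-entropy, but an assumption here). A minimal check using the final-slice average train loss from the training table:

```python
import math

FINAL_TRAIN_LOSS = 2.5168  # final-slice average train loss from the training table


def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss_nats)


if __name__ == "__main__":
    print(f"perplexity = {perplexity(FINAL_TRAIN_LOSS):.2f}")
```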
Citation
@misc{argonne30base,
author = {PursuitOfDataScience},
title = {Argonne 3.0-base},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/PursuitOfDataScience/argonne-3.0-base}
}
