Argonne 3.0-base

Argonne 3.0-base is a 2.88B-parameter decoder-only transformer language model from the Argonne 3.x family. It is a base (foundation) checkpoint trained from scratch on FineWeb-derived web text, intended as a starting point for continued pretraining, supervised fine-tuning, or preference optimization.

The architecture combines grouped-query attention with several stability-oriented additions (QK-norm, V-norm, sandwich norms, interleaved local/global attention, and a final logit softcap). Weights are stored in bf16 and split across 5 safetensor shards so the model can be loaded with transformers on commodity hardware.
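
A logit softcap is commonly implemented as a scaled tanh that squashes logits into the interval (-cap, cap). This card gives only the cap value (15.0), not the formula used in model.py, so the sketch below assumes the tanh form. The name `softcap` is illustrative and is not taken from the repository.

```python
import math


def softcap(logit: float, cap: float = 15.0) -> float:
    """Squash a logit smoothly into (-cap, cap).

    Assumption: tanh-style capping, cap * tanh(logit / cap). The card states
    the cap value but not the exact formula.
    """
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
for raw in (0.5, 5.0, 15.0, 100.0):
    print(f"{raw:6.1f} -> {softcap(raw):.3f}")
```

With this form, logits well below the cap are nearly untouched, while extreme values can never exceed 15 in magnitude, which is the stabilizing effect the cap is meant to provide.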

Model architecture

| Component | Specification |
| --- | --- |
| Parameters | 2,882,162,688 (~2.88B) |
| Layers | 24 transformer blocks |
| Hidden size | 3,072 |
| Attention heads | 12 query / 4 key-value (GQA) |
| Head dimension | 256 |
| Feed-forward | SwiGLU MLP, 8,192 intermediate dim |
| Attention pattern | Interleaved local/global causal attention |
| Local attention window | 256 tokens (every other layer) |
| Normalization | RMSNorm with QK / V / sandwich norms |
| Position encoding | RoPE (θ = 1,000,000) |
| Logit stabilization | Final logit softcap = 15.0 |
| Context length | 1,024 tokens |
| Vocabulary size | 151,669 |
| Tied embeddings | Yes (input ↔ output) |
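
The parameter count can be roughly reconstructed from the dimensions in the table. The breakdown below is a sketch under stated assumptions rather than a readout of model.py: it assumes no bias terms, four RMSNorm weight vectors per block for the sandwich norms, one head-dimension-sized weight each for the Q, K, and V norms, and a single final RMSNorm.

```python
# Architecture values taken from the table above.
vocab, hidden, layers = 151_669, 3_072, 24
q_heads, kv_heads, head_dim, ffn = 12, 4, 256, 8_192

# Assumptions (not stated in the card): no biases, 4 RMSNorm vectors per block,
# head_dim-sized Q/K/V norm weights, and one final RMSNorm.
embedding = vocab * hidden                       # counted once: tied with the output head
attention = (
    hidden * head_dim * (q_heads + 2 * kv_heads)  # Q, K and V projections (GQA)
    + q_heads * head_dim * hidden                 # output projection
)
mlp = 3 * hidden * ffn                           # SwiGLU: gate, up and down projections
norms = 4 * hidden + 3 * head_dim                # sandwich norms + Q/K/V norms

total = embedding + layers * (attention + mlp + norms) + hidden  # + final norm
print(f"{total:,}")
```

Under these assumptions the sum comes to 2,882,162,688, which matches the published count. About 84% of the parameters sit in the 24 transformer blocks and the rest in the tied embedding matrix.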

Training details

| Item | Value |
| --- | --- |
| Stages | Two-stage causal language modeling (pretrain → continued pretrain) |
| Total optimizer steps | 329,148 |
| Tokens processed (cumulative) | 76,050,702,336 (~76.05B) |
| Stage 1 tokens (pretrain) | 20,839,021,454 (~20.84B, single epoch) |
| Stage 2 tokens (continued pretrain) | 55,211,688,156 (~55.21B, single epoch) |
| Sequence length | 1,024 tokens |
| Batch size per GPU | 38 |
| Gradient accumulation steps | 2 |
| Data-parallel world size | 3 GPUs |
| Effective batch | 233,472 tokens / step |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| Peak learning rate | 3.0e-4 |
| Min LR ratio | 0.1 |
| Schedule | Warmup-Stable-Decay (WSD); 1,000 warmup steps, 0 cooldown (stable phase only) |
| Gradient clipping | 1.0 |
| Precision | bf16 autocast (weights in fp32, optimizer states in fp32) |
| `torch.compile` | Enabled (default mode) |
| Gradient checkpointing | Enabled |
| Flash attention | Enabled (kernels fall back gracefully if unavailable) |
| Final-slice average train loss | 2.5168 |
| Checkpoint dtype on Hub | bfloat16 |
| Weight format on Hub | 5 sharded safetensors + index |
| Hardware | 3× NVIDIA H200 GPUs (DDP) |
| Random seed | 444 |
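
The effective batch figure follows directly from the per-GPU batch size, gradient accumulation, world size, and sequence length listed above. A short arithmetic check (variable names are illustrative):

```python
seq_len = 1_024        # tokens per sequence
per_gpu_batch = 38     # sequences per GPU per micro-step
grad_accum = 2         # micro-steps accumulated per optimizer step
world_size = 3         # data-parallel GPUs

sequences_per_step = per_gpu_batch * grad_accum * world_size
tokens_per_step = sequences_per_step * seq_len
print(f"{sequences_per_step} sequences, {tokens_per_step:,} tokens per optimizer step")
```

Each optimizer step therefore sees 228 sequences, or 233,472 tokens.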

Stage 1 — pretrain (`pretrain.py`)

Cold start from randomly initialized weights.
One full epoch over the FineWeb pretraining shard (20.84B tokens).
1,000-step linear warmup followed by the WSD stable phase at LR 3.0e-4.

Stage 2 — continued pretrain (`continue_pretrain.py`)

Resumed from the stage-1 checkpoint with a fresh optimizer / scheduler (data cursor reset to the new shard).
One full epoch over the FineWeb CC-MAIN-2025-21 shard (55.21B tokens).
Same hyperparameters as stage 1, no additional warmup.
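
The learning-rate schedule described for both stages can be sketched as a small function. This is an illustration, not the code in pretrain.py: the function name, the exact warmup convention (linear, reaching the peak on the last warmup step), and the linear cooldown shape are assumptions, since the card states only the step counts, the peak LR, and the min LR ratio.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 3.0e-4,
           warmup_steps: int = 1_000, cooldown_steps: int = 0,
           min_lr_ratio: float = 0.1) -> float:
    """Learning rate at a 0-indexed optimizer step under warmup-stable-decay.

    Assumptions: linear warmup up to peak_lr, and (if cooldown_steps > 0) a
    linear decay to peak_lr * min_lr_ratio over the final cooldown_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - cooldown_steps
    if cooldown_steps > 0 and step >= decay_start:
        progress = (step - decay_start + 1) / cooldown_steps
        return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)
    return peak_lr  # stable phase


# total_steps=10_000 is a placeholder, not a per-stage step count from the card.
print(wsd_lr(0, 10_000), wsd_lr(999, 10_000), wsd_lr(9_999, 10_000))
# Stage 2 as described: same peak LR, no additional warmup.
print(wsd_lr(0, 10_000, warmup_steps=0))
```

With `cooldown_steps=0`, as in this run, the decay branch is never taken, so the min LR ratio has no effect and the rate stays at 3.0e-4 once warmup ends.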

Training data

| Item | Value |
| --- | --- |
| Pretrain corpus | FineWeb (tokenized with the Qwen3 tokenizer); see HuggingFaceFW/fineweb |
| Continued-pretrain corpus | FineWeb CC-MAIN-2025-21 dump (Qwen3 tokenizer); see HuggingFaceFW/fineweb |
| Tokenizer source | Qwen/Qwen3-0.6B-Base (151,669-token vocab) |

Tokenizer

This model reuses the Qwen3 tokenizer (vocabulary size 151,669) through the Qwen2Tokenizer compatibility class. The tokenizer files are bundled with the checkpoint so no extra download is required.

Source code

Built from the GitHub main branch: https://github.com/PursuitOfDataScience/ArgonneAI/tree/main

Key scripts used to produce this checkpoint:

model.py — the ArgonneModel / ArgonneConfig architecture (bundled here as model.py)
pretrain.py — stage 1 DDP pretraining loop
continue_pretrain.py — stage 2 continued-pretraining loop

Training loss curve

The figure below tracks loss, perplexity, and learning rate against cumulative training tokens across both stages.

The warmup-stable-decay schedule is visible in the LR panel: 1,000 linear warmup steps to 3.0e-4 followed by a flat stable phase (cooldown was set to 0 for this run).

Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PursuitOfDataScience/argonne-3.0-base"

# trust_remote_code=True registers the custom ArgonneModel / ArgonneConfig classes.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
)

prompt = "Write a short paragraph about scientific computing at Argonne National Laboratory."
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

# max_length is the total sequence length (prompt + continuation),
# so add the prompt length to the number of new tokens wanted.
output_ids = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 128,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Usage notes

Load with trust_remote_code=True so the custom ArgonneModel / ArgonneConfig classes (model.py) are registered.
The custom generate method on ArgonneModel uses max_length (total sequence length) rather than max_new_tokens; see the snippet above for the recommended pattern.
This is a base model: no instruction tuning, alignment, or safety filtering has been applied. Outputs can include factually incorrect, biased, or unsafe text.
Weights are published as 5 bf16 safetensor shards with a model.safetensors.index.json weight map for sharded loading.
The published context length is 1,024 tokens. RoPE uses θ = 1,000,000 so the same checkpoint can be extended to longer contexts in follow-on stages.
Switch to greedy decoding (do_sample=False) if you want deterministic output.
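
Because max_length counts the prompt as well as the continuation, it can help to compute it from the prompt length and keep it within the 1,024-token context. The helper below is a hypothetical convenience, not part of model.py. Whether the custom generate method enforces the context limit itself is not stated here, so the clamp is a conservative assumption.

```python
CONTEXT_LENGTH = 1_024  # published context length from this card


def total_max_length(prompt_tokens: int, new_tokens: int,
                     context_length: int = CONTEXT_LENGTH) -> int:
    """Return a max_length value (prompt + continuation), capped at the context length."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt already fills the context window")
    return min(prompt_tokens + new_tokens, context_length)


print(total_max_length(20, 128))     # short prompt: room for all 128 new tokens
print(total_max_length(1_000, 128))  # long prompt: clamped to the context length
```

In the inference snippet this would be passed as model.generate(input_ids, max_length=total_max_length(input_ids.shape[1], 128)).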

Limitations

Trained on web data only; no instruction following, dialogue, or tool use.
1,024-token context limits multi-document or long-form tasks without further long-context training.
Loss plateaued at ≈2.5 (~12 PPL) on FineWeb, which is typical for a 2.88B model trained on ~76B tokens but well above frontier-scale models.
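
The perplexity figure follows from the training loss by exponentiation, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention, though not stated explicitly here):

```python
import math

final_train_loss = 2.5168  # final-slice average train loss from the training table
perplexity = math.exp(final_train_loss)  # assumes cross-entropy measured in nats
print(f"perplexity = {perplexity:.2f}")
```

This gives a perplexity a little above 12, consistent with the rounded figure quoted in the limitations.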

Citation

@misc{argonne30base,
  author = {PursuitOfDataScience},
  title = {Argonne 3.0-base},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/PursuitOfDataScience/argonne-3.0-base}
}
