Instructions to use r0b0tlab/Agents-A1-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use r0b0tlab/Agents-A1-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="r0b0tlab/Agents-A1-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("r0b0tlab/Agents-A1-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("r0b0tlab/Agents-A1-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use r0b0tlab/Agents-A1-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "r0b0tlab/Agents-A1-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "r0b0tlab/Agents-A1-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker Model Runner

docker model run hf.co/r0b0tlab/Agents-A1-NVFP4

SGLang

How to use r0b0tlab/Agents-A1-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "r0b0tlab/Agents-A1-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "r0b0tlab/Agents-A1-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "r0b0tlab/Agents-A1-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "r0b0tlab/Agents-A1-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Agents-A1 NVFP4

This repository contains an NVIDIA ModelOpt NVFP4 quantization of InternScience/Agents-A1, a 35B Qwen3.5 MoE agentic model.

Credits and Attribution

This NVFP4 checkpoint is derived from InternScience/Agents-A1.

- Base model: InternScience, for the Agents-A1 model, training recipe, technical report, and original BF16 Hugging Face release.
- Quantization tooling: NVIDIA, for NVIDIA TensorRT Model Optimizer / NVIDIA ModelOpt, used to produce the NVFP4 ModelOpt checkpoint.
- Model architecture and runtime ecosystem: Hugging Face Transformers, Safetensors, Accelerate, and the Hugging Face Hub.
- Calibration data: CNN/DailyMail via Hugging Face Datasets, used for text-path post-training calibration.
- Inference ecosystem: vLLM/SGLang compatibility is inherited from the Qwen3.5 MoE / ModelOpt NVFP4 ecosystem, subject to runtime support and validation.

Quantization Summary

| Field | Value |
| --- | --- |
| Base model | `InternScience/Agents-A1` |
| Quantization tool | NVIDIA ModelOpt `0.44.0` |
| Quantization format | NVFP4 / ModelOpt FP4 |
| ModelOpt config | `mtq.NVFP4_MLP_ONLY_CFG` |
| Calibration data | `abisee/cnn_dailymail`, text-only calibration |
| Calibration sequence length | 1024 |
| Architecture | `Qwen3_5MoeForConditionalGeneration` |
| License | Apache-2.0, following the base model |

Quantization Policy

Agents-A1 is a hybrid Qwen3.5 MoE model with 30 linear-attention layers, 10 full-attention layers, 256 experts per layer, and a vision tower. For this first release, the checkpoint applies NVFP4 only to the MLP/MoE layers.

The following module families were explicitly excluded from NVFP4 quantization and preserved in BF16:

[
  "*visual*",
  "*vision*",
  "*patch_embed*",
  "*pos_embed*",
  "*merger*",
  "*linear_attn*",
  "*linear_attention*",
  "*self_attn*",
  "*attn*",
  "*embed_tokens*",
  "*lm_head*",
  "*mtp*"
]
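
As a rough illustration of how wildcard patterns like these select modules, the sketch below matches the list against a few module names with Python's `fnmatch`. The module names are hypothetical and chosen only for illustration, and ModelOpt's own matching logic may differ in detail.

```python
from fnmatch import fnmatchcase

# Exclusion patterns copied from the list above.
EXCLUSION_PATTERNS = [
    "*visual*", "*vision*", "*patch_embed*", "*pos_embed*", "*merger*",
    "*linear_attn*", "*linear_attention*", "*self_attn*", "*attn*",
    "*embed_tokens*", "*lm_head*", "*mtp*",
]


def is_excluded(module_name: str, patterns=EXCLUSION_PATTERNS) -> bool:
    """Return True if any wildcard pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
examples = [
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.0.mlp.experts.7.down_proj",
    "model.visual.blocks.0.mlp.fc1",
    "lm_head",
]

for name in examples:
    status = "kept in BF16" if is_excluded(name) else "NVFP4 candidate"
    print(f"{name}: {status}")
```

Under this kind of matching, `*attn*` already covers every name matched by `*linear_attn*` and `*self_attn*`, so the narrower patterns act as explicit documentation rather than additional filters.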

Rationale:

- The MoE/MLP expert layers are the largest parameter family, so they are the primary target for NVFP4 compression.
- The Gated DeltaNet (GDN) `linear_attn` path is not standard dense transformer attention and is excluded for compatibility.
- Vision modules are preserved to avoid multimodal degradation from text-only calibration.
- Embeddings, `lm_head`, and MTP-sensitive modules are preserved in BF16.

Files

- `hf_quant_config.json`: ModelOpt quantization metadata used by compatible inference engines.
- `modelopt_exclusions.json`: exact exclusion list used during quantization.
- `config.json`, tokenizer, and processor files: copied from the base model and patched only as required for export consistency.
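
A minimal sketch for inspecting the two metadata files in a local copy of the repository follows. The key names `quantization` and `quant_algo`, and the assumption that `modelopt_exclusions.json` holds a plain JSON list, are guesses about the file layout rather than something this card documents, so the code falls back to `unknown` when they are absent.

```python
import json
from pathlib import Path

METADATA_FILES = ("hf_quant_config.json", "modelopt_exclusions.json")


def load_quant_metadata(model_dir: str) -> dict:
    """Load each metadata file if it exists; missing files map to None."""
    root = Path(model_dir)
    loaded = {}
    for filename in METADATA_FILES:
        path = root / filename
        if path.is_file():
            loaded[filename] = json.loads(path.read_text(encoding="utf-8"))
        else:
            loaded[filename] = None
    return loaded


def summarize(metadata: dict) -> str:
    """Build a one-line summary, tolerating unexpected file layouts."""
    quant = metadata.get("hf_quant_config.json")
    # ASSUMPTION: the algorithm is stored under quantization.quant_algo.
    section = quant.get("quantization", {}) if isinstance(quant, dict) else {}
    algo = section.get("quant_algo", "unknown") if isinstance(section, dict) else "unknown"
    # ASSUMPTION: the exclusions file is a JSON list of wildcard patterns.
    exclusions = metadata.get("modelopt_exclusions.json")
    count = len(exclusions) if isinstance(exclusions, list) else "unknown"
    return f"quant_algo={algo}, exclusion_patterns={count}"


# Placeholder path; point this at a local download of the repository.
print(summarize(load_quant_metadata("/path/to/Agents-A1-NVFP4")))
```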

Validation Status

This release is a quantized checkpoint, not a new fine-tune. It does not claim quality improvement over BF16.

Runtime smoke testing on NVIDIA GB10 / SM121 completed with the companion container recipe at r0b0tlab/agents-a1-nvfp4-sm121-vllm.

Validated evidence includes:

- Container audit on NVIDIA GB10 with CUDA capability [12, 1].
- vLLM extension imports: `vllm._C`, `vllm._C_stable_libtorch`, `vllm._moe_C`.
- Native FP4 support checks: `cutlass_scaled_mm_supports_fp4(121)` and `cutlass_scaled_mm_supports_fp4(120)` both return true.
- Runtime log selection of `FlashInferCutlassNvFp4LinearKernel` and `FLASHINFER_CUTLASS` for NVFP4/MoE.
- OpenAI-compatible `/v1/models` and `/v1/chat/completions` probes against the running container.
- Lightweight live-container benchmark evidence in the companion repo: a 50-question GSM8K lm-eval run at 98.00% exact match, a direct 50-question HumanEval run at 48/50 (96.00%), a c1/c2/c4/c8 concurrency sweep with 100% request success, and GPU telemetry including power draw.
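
The capability pair [12, 1] and the integer 121 passed to the FP4 support check appear to be two encodings of the same value. The sketch below shows that conversion and, assuming PyTorch with CUDA is installed, how the capability of the local GPU could be read. The helper names are illustrative and are not part of vLLM.

```python
def capability_to_sm(major: int, minor: int) -> int:
    """Encode a CUDA compute capability as one integer, e.g. (12, 1) -> 121."""
    return major * 10 + minor


def local_sm():
    """Return the SM integer of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the sketch runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return capability_to_sm(major, minor)


print("GB10 capability [12, 1] as SM integer:", capability_to_sm(12, 1))
print("Local GPU SM integer:", local_sm())
```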

Benchmark snapshot

The benchmark run agents-a1-nvfp4-gsm8k50-humaneval50-20260701T194211Z used the live OpenAI-compatible endpoint at http://127.0.0.1:18080/v1 with chat_template_kwargs.enable_thinking=false for scored requests.
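
A request with thinking disabled, as used for the scored runs, might be built as in the sketch below. It assumes the server accepts `chat_template_kwargs` as a top-level field of the chat completions body and that the served model name equals the repository ID. Neither detail is confirmed by this card.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18080/v1"
MODEL_ID = "r0b0tlab/Agents-A1-NVFP4"  # ASSUMPTION: served model name


def build_chat_request(prompt, enable_thinking=False, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for /chat/completions with thinking toggled."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # ASSUMPTION: the server reads this field and forwards it to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the container running:
#   reply = send(build_chat_request("What is the capital of France?"))
#   print(reply["choices"][0]["message"]["content"])
```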

| Suite | Harness | Samples | Result | Notes |
| --- | --- | --- | --- | --- |
| GSM8K | lm-eval `gsm8k` | 50 | strict 98.00%, flexible 98.00% | `num_concurrent=2` |
| HumanEval | direct OpenAI-compatible evaluator | 50 | 48/50 (96.00%) | code extracted/evaluated locally |
| HumanEval | stock lm-eval `humaneval` | 50 | 0.00% | preserved as harness-interference evidence; stock stop rules truncate chat-model output |
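
With only 50 samples per suite, each score carries noticeable sampling uncertainty. The sketch below recomputes the two headline percentages and attaches a simple binomial standard error. The GSM8K count of 49/50 is inferred from the reported 98.00%, and the binomial model is an illustrative assumption rather than the method used by the harness.

```python
import math


def accuracy_with_stderr(correct: int, total: int):
    """Return (accuracy, binomial standard error) for correct/total."""
    p = correct / total
    stderr = math.sqrt(p * (1.0 - p) / total)
    return p, stderr


runs = [
    ("GSM8K (lm-eval)", 49, 50),      # inferred from 98.00% of 50
    ("HumanEval (direct)", 48, 50),   # reported as 48/50
]

for label, correct, total in runs:
    p, se = accuracy_with_stderr(correct, total)
    print(f"{label}: {p:.2%} +/- {se:.2%} (1 standard error, n={total})")
```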

Combined telemetry across the GSM8K, stock lm-eval HumanEval, and direct HumanEval runs averaged 27.88 W GPU power draw, 70.75% GPU utilization, and a temperature of 58.83°C, with maxima of 36.00 W, 96.00%, and 65.00°C over 166 telemetry samples. The c8 concurrency sweep completed 24/24 requests successfully. See the companion repo's benchmarks/agents-a1-nvfp4-gsm8k50-humaneval50-20260701T194211Z/ directory for raw logs, samples, summaries, telemetry CSVs, and MANIFEST.sha256.
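
One way to check downloaded benchmark artifacts against `MANIFEST.sha256` is sketched below. It assumes the manifest uses the common `sha256sum` layout (a hex digest, whitespace, then a path relative to the manifest), which this card does not confirm.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: str) -> list:
    """Return the listed paths that are missing or whose hash does not match."""
    manifest = Path(manifest_path)
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # ASSUMPTION: "<hex digest>  <relative path>" as written by sha256sum.
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures


# Usage, from a local clone of the companion repo:
#   bad = verify_manifest("benchmarks/<run-directory>/MANIFEST.sha256")
#   print("all files verified" if not bad else f"mismatches: {bad}")
```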

SM121 container quick start

docker run --rm --gpus all --ipc=host \
  --name agents-a1-nvfp4-vllm \
  -p 18080:8000 \
  -e MODEL_ID=r0b0tlab/Agents-A1-NVFP4 \
  ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest
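
Loading the checkpoint can take a while after the container starts, so a client may want to poll `/v1/models` before sending requests. The following is a minimal sketch under the assumption that the endpoint returns the usual OpenAI-style `{"data": [{"id": ...}]}` body. The retry counts and delays are arbitrary illustrative values.

```python
import json
import time
import urllib.request


def fetch_models(base_url: str, timeout: float = 5.0) -> list:
    """Return the model IDs reported by the OpenAI-compatible /models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return [entry["id"] for entry in body.get("data", [])]


def wait_until_ready(base_url, attempts=60, delay=5.0, fetch=fetch_models):
    """Poll until the server lists at least one model, then return the IDs."""
    for attempt in range(attempts):
        try:
            models = fetch(base_url)
            if models:
                return models
        except OSError:
            # Connection refused, timeouts, and URL errors all derive from OSError.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    raise TimeoutError(f"server at {base_url} not ready after {attempts} attempts")


# Usage, matching the port mapping in the command above:
#   print(wait_until_ready("http://127.0.0.1:18080/v1"))
```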

For fully pinned local reproduction, clone/download this model and mount it read-only:

docker run --rm --gpus all --ipc=host \
  --name agents-a1-nvfp4-vllm \
  -p 18080:8000 \
  -v /path/to/Agents-A1-NVFP4:/models/Agents-A1-NVFP4:ro \
  ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest
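
The download step could be scripted with `huggingface_hub`, as in the sketch below, which also rebuilds the mount command above for a given local directory. The helper names are illustrative, and passing a commit hash as `revision` is one possible way to pin the checkpoint, not a procedure this card prescribes.

```python
import os


def download_checkpoint(local_dir: str, revision=None) -> str:
    """Download the repository into local_dir and return the resulting path."""
    from huggingface_hub import snapshot_download  # requires: pip install huggingface_hub

    return snapshot_download(
        repo_id="r0b0tlab/Agents-A1-NVFP4",
        revision=revision,  # optional commit hash or tag for pinning
        local_dir=local_dir,
    )


def docker_run_args(model_dir: str) -> list:
    """Rebuild the read-only mount command shown above as an argument list."""
    host_path = os.path.abspath(model_dir)  # docker bind mounts need an absolute path
    return [
        "docker", "run", "--rm", "--gpus", "all", "--ipc=host",
        "--name", "agents-a1-nvfp4-vllm",
        "-p", "18080:8000",
        "-v", f"{host_path}:/models/Agents-A1-NVFP4:ro",
        "ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest",
    ]


# Usage:
#   path = download_checkpoint("/path/to/Agents-A1-NVFP4")
#   import subprocess; subprocess.run(docker_run_args(path), check=True)
print(" ".join(docker_run_args("/path/to/Agents-A1-NVFP4")))
```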

Limitations

- Calibration is text-only; vision components are preserved in BF16 rather than calibrated.
- This card does not claim benchmark parity until BF16-vs-NVFP4 evaluations are published.
- Runtime support depends on the inference engine's ModelOpt/NVFP4 implementation.

Citation

@misc{internscience_agents_a1_2026,
  title = {Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
  author = {InternScience},
  year = {2026},
  url = {https://huggingface.co/InternScience/Agents-A1}
}

License

This quantized checkpoint follows the base model license, Apache-2.0. Users must also comply with the licenses and terms for the base model, calibration data, NVIDIA ModelOpt, Hugging Face libraries, and any inference runtime used.

Safetensors

Model size: 19B params
Tensor type: BF16, F8_E4M3

Model tree for r0b0tlab/Agents-A1-NVFP4

Base model: InternScience/Agents-A1
Quantized (41): this model