Agents-A1 NVFP4
This repository contains an NVIDIA ModelOpt NVFP4 quantization of InternScience/Agents-A1, a 35B Qwen3.5 MoE agentic model.
Credits and Attribution
This NVFP4 checkpoint is derived from InternScience/Agents-A1.
- Base model: InternScience, for the Agents-A1 model, training recipe, technical report, and original BF16 Hugging Face release.
- Quantization tooling: NVIDIA, for NVIDIA TensorRT Model Optimizer / NVIDIA ModelOpt, used to produce the NVFP4 ModelOpt checkpoint.
- Model architecture and runtime ecosystem: Hugging Face Transformers, Safetensors, Accelerate, and the Hugging Face Hub.
- Calibration data: CNN/DailyMail via Hugging Face Datasets, used for text-path post-training calibration.
- Inference ecosystem: vLLM/SGLang compatibility is inherited from the Qwen3.5 MoE / ModelOpt NVFP4 ecosystem, subject to runtime support and validation.
Quantization Summary
| Field | Value |
|---|---|
| Base model | InternScience/Agents-A1 |
| Quantization tool | NVIDIA ModelOpt 0.44.0 |
| Quantization format | NVFP4 / ModelOpt FP4 |
| ModelOpt config | mtq.NVFP4_MLP_ONLY_CFG |
| Calibration data | abisee/cnn_dailymail, text-only calibration |
| Calibration sequence length | 1024 |
| Architecture | Qwen3_5MoeForConditionalGeneration |
| License | Apache-2.0, following the base model |
Quantization Policy
Agents-A1 is a hybrid Qwen3.5 MoE model with 30 linear_attention layers, 10 full-attention layers, 256 experts per layer, and a vision tower. This checkpoint uses an MLP/MoE-only NVFP4 policy for the first release.
The following module families were explicitly excluded from NVFP4 quantization and preserved in BF16:
[
"*visual*",
"*vision*",
"*patch_embed*",
"*pos_embed*",
"*merger*",
"*linear_attn*",
"*linear_attention*",
"*self_attn*",
"*attn*",
"*embed_tokens*",
"*lm_head*",
"*mtp*"
]
Rationale:
- The MoE/MLP expert layers are the largest parameter family and are the correct target for NVFP4 compression.
- The GDN/linear_attn path is not standard dense transformer attention and is excluded for compatibility.
- Vision modules are preserved to avoid multimodal degradation from text-only calibration.
- Embeddings, lm_head, and MTP-sensitive modules are preserved in BF16.
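To make the effect of the exclusion list concrete, here is a minimal sketch that applies the same wildcard patterns to a few module names using Python's standard `fnmatch` module. The module names are hypothetical stand-ins rather than names read from the checkpoint, and `fnmatch` is only an assumed approximation of how the quantization tool resolves wildcards.

```python
from fnmatch import fnmatchcase

# Exclusion patterns copied from the list in this card.
EXCLUDE_PATTERNS = [
    "*visual*", "*vision*", "*patch_embed*", "*pos_embed*", "*merger*",
    "*linear_attn*", "*linear_attention*", "*self_attn*", "*attn*",
    "*embed_tokens*", "*lm_head*", "*mtp*",
]


def is_excluded(module_name: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if any wildcard pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "model.language_model.layers.0.mlp.experts.gate_up_proj",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.visual.blocks.0.mlp.linear_fc1",
    "lm_head",
]

for name in EXAMPLES:
    status = "kept in BF16" if is_excluded(name) else "NVFP4 candidate"
    print(f"{name}: {status}")
```

Under glob semantics, `*attn*` already covers `*linear_attn*` and `*self_attn*`, so the narrower entries serve as documentation of intent. The vision-tower MLP in the example is excluded because `*visual*` matches first, even though it is an MLP.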
Files
- hf_quant_config.json: ModelOpt quantization metadata used by compatible inference engines.
- modelopt_exclusions.json: exact exclusion list used during quantization.
- config.json, tokenizer, and processor files are copied from the base model and patched only as required for export consistency.
Validation Status
This release is a quantized checkpoint, not a new fine-tune. It does not claim quality improvement over BF16.
Runtime smoke testing on NVIDIA GB10 / SM121 completed with the companion container recipe at r0b0tlab/agents-a1-nvfp4-sm121-vllm.
Validated evidence includes:
- Container audit on NVIDIA GB10 with CUDA capability [12, 1].
- vLLM extension imports: vllm._C, vllm._C_stable_libtorch, vllm._moe_C.
- Native FP4 support checks: cutlass_scaled_mm_supports_fp4(121) and (120) return true.
- Runtime log selection of FlashInferCutlassNvFp4LinearKernel and FLASHINFER_CUTLASS for NVFP4/MoE.
- OpenAI-compatible /v1/models and /v1/chat/completions probes against the running container.
- Lightweight live-container benchmark evidence in the companion repo: GSM8K 50-question lm-eval run at 98.00% exact match, direct HumanEval 50-question run at 48/50 (96.00%), c1/c2/c4/c8 concurrency sweep with 100% request success, and GPU telemetry including power draw.
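The evidence above refers to the same device in three forms: the capability pair [12, 1], the integer 121 passed to the FP4 support check, and the label SM121. The sketch below shows how these relate, assuming the usual `major * 10 + minor` encoding. The helper names are hypothetical and are not part of vLLM or CUDA.

```python
def capability_to_int(major: int, minor: int) -> int:
    """Encode a CUDA compute capability pair as a single integer, e.g. (12, 1) -> 121."""
    return major * 10 + minor


def sm_label(major: int, minor: int) -> str:
    """Format the capability as an SM label, e.g. (12, 1) -> 'SM121'."""
    return f"SM{capability_to_int(major, minor)}"


# On a real device the pair would come from torch.cuda.get_device_capability();
# it is hard-coded here so the sketch runs without a GPU.
gb10_capability = (12, 1)
print(capability_to_int(*gb10_capability), sm_label(*gb10_capability))
```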
Benchmark snapshot
The benchmark run agents-a1-nvfp4-gsm8k50-humaneval50-20260701T194211Z used the live OpenAI-compatible endpoint at http://127.0.0.1:18080/v1 with chat_template_kwargs.enable_thinking=false for scored requests.
| Suite | Harness | Samples | Result | Notes |
|---|---|---|---|---|
| GSM8K | lm-eval gsm8k | 50 | strict 98.00%, flexible 98.00% | num_concurrent=2 |
| HumanEval | direct OpenAI-compatible evaluator | 50 | 48/50 (96.00%) | code extracted/evaluated locally |
| HumanEval | stock lm-eval humaneval | 50 | 0.00% | preserved as harness-interference evidence; stock stop rules truncate chat-model output |
Combined telemetry across the GSM8K, stock HumanEval, and direct HumanEval runs averaged 27.88 W GPU power draw, 70.75% GPU utilization, and 58.83°C, with maxima of 36.00 W, 96.00%, and 65.00°C over 166 telemetry samples. The c8 concurrency sweep completed 24/24 requests successfully. See the companion repo's benchmarks/agents-a1-nvfp4-gsm8k50-humaneval50-20260701T194211Z/ directory for raw logs, samples, summaries, telemetry CSVs, and MANIFEST.sha256.
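The gap between the two HumanEval rows comes from how the model's reply is post-processed. The sketch below illustrates the general idea under stated assumptions: a completion-style stop sequence can cut a chat reply before any code survives, while a local extractor can recover the fenced code. The stop sequence, the sample reply, and both helper functions are illustrative and are not taken from lm-eval or from the companion repo's evaluator.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown

# Matches a fenced block with an optional language tag and captures its body.
_BLOCK = re.compile(FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none exists."""
    match = _BLOCK.search(response)
    return match.group(1).strip() if match else response.strip()


def truncate_at_stop(response: str, stops: list[str]) -> str:
    """Cut the text at the earliest stop sequence, as a completion-style harness might."""
    cut = len(response)
    for stop in stops:
        index = response.find(stop)
        if index != -1:
            cut = min(cut, index)
    return response[:cut]


# Hypothetical chat-style reply: prose first, then a fenced solution.
chat_reply = (
    "Here is the solution:\n\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\n"
)

# An assumed stop sequence of "\ndef" fires at the start of the solution itself.
print(repr(truncate_at_stop(chat_reply, ["\ndef"])))
print(repr(extract_code(chat_reply)))
```

With this reply, truncation leaves only the prose and the opening fence, so nothing executable remains to score, whereas extraction returns the function body intact.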
SM121 container quick start
docker run --rm --gpus all --ipc=host \
--name agents-a1-nvfp4-vllm \
-p 18080:8000 \
-e MODEL_ID=r0b0tlab/Agents-A1-NVFP4 \
ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest
For fully pinned local reproduction, clone/download this model and mount it read-only:
docker run --rm --gpus all --ipc=host \
--name agents-a1-nvfp4-vllm \
-p 18080:8000 \
-v /path/to/Agents-A1-NVFP4:/models/Agents-A1-NVFP4:ro \
ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest
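Once the container is running, the endpoint can be probed in the same way as the validation section describes. The following is a minimal sketch using only the Python standard library. It assumes the host port 18080 from the commands above and assumes the served model name equals the repository ID, which may differ when the model is mounted from a local path, so check the `/v1/models` output first. The `chat_template_kwargs` field mirrors the benchmark setting `enable_thinking=false`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18080/v1"  # host port from the docker run mapping
MODEL_ID = "r0b0tlab/Agents-A1-NVFP4"   # assumed served name; confirm via list_models()


def build_chat_payload(prompt: str, enable_thinking: bool = False, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat request body with thinking toggled via the chat template."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def list_models(base_url: str = BASE_URL) -> list[str]:
    """Return the model IDs reported by the /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=30) as response:
        body = json.load(response)
    return [entry["id"] for entry in body["data"]]


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one chat completion request and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `list_models()` and then `chat("What is the capital of France?")` against a live container exercises both probed endpoints. Neither call is made at import time, so the block can be loaded without a running server.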
Limitations
- Calibration is text-only; vision components are preserved in BF16 rather than calibrated.
- This card does not claim benchmark parity until BF16-vs-NVFP4 evaluations are published.
- Runtime support depends on the inference engine's ModelOpt/NVFP4 implementation.
Citation
@misc{internscience_agents_a1_2026,
title = {Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
author = {InternScience},
year = {2026},
url = {https://huggingface.co/InternScience/Agents-A1}
}
License
This quantized checkpoint follows the base model license, Apache-2.0. Users must also comply with the licenses and terms for the base model, calibration data, NVIDIA ModelOpt, Hugging Face libraries, and any inference runtime used.