Agents-A1 GGUF

GGUF quantizations of InternScience/Agents-A1 — a 35B Mixture-of-Experts agentic model (Qwen3.5-MoE architecture) built for long-horizon search, engineering, scientific research, instruction-following, and tool-calling.

Files were produced from the BF16 Hugging Face checkpoint with a patched llama.cpp build that supports the qwen35moe architecture. Each quant uses an importance matrix (imatrix) built from coding/instruction-chat calibration data, and every file was benchmarked against the BF16 GGUF reference (PPL, KL-divergence, top-1 agreement).

These are text-only GGUFs. The base model is multimodal (vision + video), but no mmproj projector is shipped here, so image/video input is not available with these files. Use them for text and agentic/tool-calling workloads.

Model summary

| Field | Value |
| --- | --- |
| Base model | InternScience/Agents-A1 (paper · homepage · GitHub) |
| Architecture | Qwen3.5-MoE, hybrid linear/full attention (full attention every 4th layer) |
| Parameters | ~35B total, ~3B active per token (A3B-class) |
| Experts | 256 experts, 8 active + 1 shared per token |
| Layers | 40 transformer layers + 1 MTP layer |
| Context length | 262,144 (256K) native |
| Language | English |
| License | Apache-2.0 (inherited from base) |
| Quantized by | LordNeel |

Which file should I pick?

| Goal | File | Notes |
| --- | --- | --- |
| Best small general-purpose quant | agents-a1-IQ4_XS.gguf | Strong quality for size, broad llama.cpp compatibility. |
| Best single-user MTP throughput | agents-a1-IQ4_XS-MTP-graft-headQ6.gguf | IQ4_XS body + Q6_K MTP block; 1.22× over target-only at n_max=2. |
| Highest MTP draft acceptance | agents-a1-Q4_K_M-MTP-graft-headQ6.gguf (SPEC_DRAFT_N_MAX=1) | 91.46% acceptance, still 1.15× over target-only. |
| Fast Blackwell FP4 path | agents-a1-NVFP4.gguf | Tested on RTX PRO 6000 Blackwell. Needs runtime support for GGML_TYPE_NVFP4. |
| Safer quality step up | agents-a1-Q5_K_M.gguf | Lower KLD than IQ4_XS, larger size. |
| Closest to BF16 by KLD | agents-a1-Q6_K.gguf | Lowest mean KLD in this eval set (Q8_0 has the lowest p95). |
| High-precision archival | agents-a1-Q8_0.gguf | Largest quant. |

Sizing: for full GPU offload, budget VRAM of roughly the file size plus the KV cache, with some headroom for compute buffers. K-quants (Q4_K_M, Q5_K_M, Q6_K) are the most portable. IQ4_XS is an I-quant and benefits from the imatrix used to produce it. NVFP4 had the fastest prefill in the benchmarks on this card, but it needs a Blackwell-class GPU and a recent FP4-capable llama.cpp build.
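
As a rough illustration of the sizing rule, the sketch below adds a KV-cache estimate to the file size. It assumes that only the full-attention layers hold a growing KV cache (every 4th of 40 layers, so 10). The `n_kv_heads` and `head_dim` defaults are placeholders, not figures from this card, so read the real values from the GGUF metadata before relying on the result. Compute buffers and the linear-attention state are ignored, and file sizes are treated as 10^9-byte GB.

```python
def kv_cache_bytes(n_ctx, n_full_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: one K and one V tensor per full-attention layer."""
    return 2 * n_full_attn_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def estimate_vram_gb(file_size_gb, n_ctx, n_full_attn_layers=10,
                     n_kv_heads=2, head_dim=256, bytes_per_elem=2):
    """File size plus KV cache, in GB (10^9 bytes).

    n_full_attn_layers=10 follows from "full attention every 4th layer" over 40 layers.
    n_kv_heads and head_dim are ASSUMED placeholder values, not taken from the model card.
    """
    kv_gb = kv_cache_bytes(n_ctx, n_full_attn_layers, n_kv_heads, head_dim,
                           bytes_per_elem) / 1e9
    return file_size_gb + kv_gb


# IQ4_XS (18.73 GB) at an 8192-token context
print(f"{estimate_vram_gb(18.73, 8192):.2f} GB")
```

Because the cache term grows linearly with `n_ctx`, rerunning the estimate with the context you actually plan to use is more informative than the 8192-token figure.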

Files

| Quant | File size | Notes |
| --- | --- | --- |
| Q3_K_M | 16.76 GB | Smallest included quant. |
| IQ4_XS | 18.73 GB | Recommended compact quant. |
| IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body + integrated Q6_K/F32 MTP block. |
| NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF; output head kept at Q6_K by quality rule. |
| Q4_K_M | 21.17 GB | Standard K-quant. |
| Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body + integrated Q6_K/F32 MTP block. |
| Q5_K_M | 24.73 GB | Strong quality/size tradeoff. |
| Q6_K | 28.51 GB | Lowest mean KLD in this run. |
| Q8_0 | 36.90 GB | Highest-precision quant. |

Download

pip install -U "huggingface_hub[cli]"

# download a single quant into ./agents-a1
hf download LordNeel/Agents-A1-GGUF agents-a1-IQ4_XS.gguf --local-dir ./agents-a1

You need a recent llama.cpp build with qwen35moe support; the NVFP4 and MTP files require newer builds still (see the NVFP4 and MTP notes under Usage).

Usage

Standard inference with the recommended compact quant:

llama-server \
  -m agents-a1-IQ4_XS.gguf \
  -ngl 99 \
  -c 8192 \
  -b 4096 \
  -ub 512 \
  --flash-attn on

-c 8192 is just a starting point — the model's native context is 256K, so raise -c as your VRAM allows.
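
Once the server is running, it can be queried through llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. The following is a minimal client sketch using only the Python standard library. The address (`localhost:8080`, the llama-server default) and the helper names `build_payload` and `chat` are assumptions for illustration.

```python
import json
import urllib.request


def build_payload(user_message, system_prompt=None, max_tokens=128, temperature=0.0):
    """Build an OpenAI-style chat completion request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {"messages": messages, "max_tokens": max_tokens, "temperature": temperature}


def chat(user_message, base_url="http://localhost:8080", **kwargs):
    """Send one non-streaming request and return the assistant's reply text."""
    body = json.dumps(build_payload(user_message, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server up, `chat("Summarize what a GGUF file is.")` returns the reply as a string. The chat template embedded in the GGUF is applied server-side, so the client only sends role/content messages.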

NVFP4 (Blackwell):

llama-server \
  -m agents-a1-NVFP4.gguf \
  -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on

The NVFP4 artifact is a standard GGUF using the NVFP4 tensor type, but runtime support is newer and less universal than K-quants or IQ4_XS. It was tested on a Blackwell GPU with a llama.cpp build reporting BLACKWELL_NATIVE_FP4 = 1.

MTP / speculative decoding (single-user throughput):

LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \
LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \
LLAMA_MTP_DRAFT_TOP_K=1 \
LLAMA_MTP_DRAFT_TOP_P=1 \
LLAMA_MTP_DRAFT_TEMP=1 \
llama-server \
  -m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \
  -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on \
  --reasoning off \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-n-min 0 \
  --spec-draft-backend-sampling

For the high-acceptance profile, change --spec-draft-n-max 2 to --spec-draft-n-max 1.

Python with llama-cpp-python:

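from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="LordNeel/Agents-A1-GGUF",
    filename="agents-a1-IQ4_XS.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU (like -ngl 99)
    n_ctx=8192,       # same context size as the llama-server examples
)
out = llm.create_chat_completion(messages=[{"role": "user", "content": "Hello"}])
print(out["choices"][0]["message"]["content"])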

Prompt format

Agents-A1 uses a Qwen-style ChatML template (embedded in the GGUF, so llama-server/llama-cli chat endpoints apply it automatically):

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant

The model natively supports function calling / tool use — see the base model card for agentic and tool-calling details.
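
For raw completion endpoints that do not apply the embedded template, the ChatML layout shown above can be reproduced by hand. This is a minimal sketch of the plain system/user/assistant layout only. The function name `render_chatml` and the example messages are hypothetical, and the template embedded in the GGUF may treat tool definitions or reasoning blocks differently.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the ChatML layout."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "List the files in the current directory."},
])
print(prompt)
```

For tool calling, prefer the chat endpoints so the embedded template formats the tool schema itself.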

Metrics

Hardware and runtime profile:

  • GPU: single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload
  • llama-bench flags: -ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3
  • PPL: llama-perplexity, context 2048, 64 rendered eval conversations, 3 chunks
  • KLD: approximate KL(P_BF16 || P_quant) over top-64 next-token distributions on 32 prompts

The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are the more useful quant-to-BF16 quality signals here.
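
To make the KLD metric concrete, here is a minimal sketch of an approximate KL(P_BF16 || P_quant) for one next-token position, restricted to the reference model's top-k tokens. The exact procedure behind the reported numbers is not described on this card, so the renormalization over the top-k set and the floor probability for missing tokens are assumptions.

```python
import math


def topk_kl(ref_logprobs, quant_logprobs):
    """Approximate KL(P_ref || P_quant) over the reference model's top-k tokens.

    Both arguments map token -> log-probability. Only tokens in the reference
    top-k are used, and both distributions are renormalized over that set.
    """
    tokens = list(ref_logprobs)
    p = [math.exp(ref_logprobs[t]) for t in tokens]
    # Tokens missing from the quantized top-k get a small floor probability (an assumption).
    q = [math.exp(quant_logprobs.get(t, math.log(1e-10))) for t in tokens]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy two-token distributions, for illustration only
ref = {"a": math.log(0.5), "b": math.log(0.5)}
quant = {"a": math.log(0.25), "b": math.log(0.75)}
print(round(topk_kl(ref, quant), 4))
```

Averaging this value across evaluated positions would give a mean figure, and taking the 95th percentile would give a tail figure of the kind reported in the KLD mean and KLD p95 columns.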

| Model | Size (GB) | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BF16 reference | 69.38 | 3418.9 | 161.8 | 1.3031 | 0.0000 | 0.0000 | 0.0000 | 32/32 |
| Q3_K_M | 16.76 | 6779.5 | 269.0 | 1.3101 | +0.0070 | 0.0655 | 0.2155 | 28/32 |
| IQ4_XS | 18.73 | 7719.5 | 258.1 | 1.3038 | +0.0007 | 0.0151 | 0.0654 | 29/32 |
| NVFP4 | 19.72 | 9064.0 | 265.1 | 1.3063 | +0.0032 | 0.0420 | 0.1473 | 31/32 |
| Q4_K_M | 21.17 | 7230.8 | 262.6 | 1.3016 | -0.0015 | 0.1225 | 0.3349 | 27/32 |
| Q5_K_M | 24.73 | 7021.4 | 257.9 | 1.3041 | +0.0010 | 0.0091 | 0.0335 | 30/32 |
| Q6_K | 28.51 | 6294.0 | 244.6 | 1.3040 | +0.0009 | 0.0049 | 0.0178 | 32/32 |
| Q8_0 | 36.90 | 7431.3 | 222.7 | 1.3036 | +0.0005 | 0.0053 | 0.0063 | 30/32 |
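
One way to read the metrics is to pick the lowest-mean-KLD quant that fits a given size budget. The sketch below does that with the size and KLD mean values copied from the metrics above. The helper name `best_quant_under` is hypothetical, and the budget covers the file only, not the KV cache.

```python
# (quant, size_gb, mean_kld) copied from the metrics on this card
QUANTS = [
    ("Q3_K_M", 16.76, 0.0655),
    ("IQ4_XS", 18.73, 0.0151),
    ("NVFP4", 19.72, 0.0420),
    ("Q4_K_M", 21.17, 0.1225),
    ("Q5_K_M", 24.73, 0.0091),
    ("Q6_K", 28.51, 0.0049),
    ("Q8_0", 36.90, 0.0053),
]


def best_quant_under(budget_gb, quants=QUANTS):
    """Return the name of the lowest-mean-KLD quant whose file fits the budget, or None."""
    fitting = [q for q in quants if q[1] <= budget_gb]
    if not fitting:
        return None
    return min(fitting, key=lambda q: q[2])[0]


for budget in (20, 26, 40):
    print(budget, best_quant_under(budget))
```

Ranking by KLD p95 or top-1 agreement instead only requires adding that column to the tuples and changing the `key` function.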

Charts

[Charts: Throughput by quant · Quality vs size · Mean KLD · PPL delta]

Raw metric files are in metrics/; KLD reports, checksums, and the MTP audit are in reports/.
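
A downloaded file can be checked against a published checksum with a short script. This sketch assumes the checksums are SHA-256 hex digests, which the card does not state, so confirm the algorithm and file layout in reports/ first. The function names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large GGUFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `verify("agents-a1/agents-a1-IQ4_XS.gguf", expected)` with the digest copied from the reports returns `True` on a match.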

MTP (Multi-Token Prediction) Q4 variants

The upstream Agents-A1 checkpoint used for the first GGUF release advertises MTP in config but does not ship mtp.* / blk.40.* tensors. The two MTP Q4 variants here graft in the Agents-A1 MTPLX MTP sidecar from wang-yang/Agents-A1-MTPLX-Q4, then convert it with llama.cpp's Qwen3.5-MoE MTP path. The dense MTP block is preserved at Q6_K while the model body is quantized to IQ4_XS or Q4_K_M.

Structural checks for both MTP GGUFs:

| Check | Value |
| --- | --- |
| GGUF tensors | 753 |
| qwen35moe.block_count | 41 |
| qwen35moe.nextn_predict_layers | 1 |
| blk.40.* MTP tensors | 20 |
| blk.40.nextn.* tensors | 4 |
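
The tensor counts above can be re-derived from a list of tensor names. The sketch below shows only the counting step. The names in the example are hypothetical, and real names would come from the file itself, for example via `GGUFReader(path).tensors` in the `gguf` Python package. Whether the card's count of 20 for `blk.40.*` includes the 4 `nextn` tensors is not stated; this helper counts them as part of the block.

```python
def mtp_tensor_summary(tensor_names, mtp_block=40):
    """Count tensors in the MTP block and its nextn.* subset from a list of tensor names."""
    prefix = f"blk.{mtp_block}."
    block = [n for n in tensor_names if n.startswith(prefix)]
    nextn = [n for n in block if n.startswith(prefix + "nextn.")]
    return {"total": len(tensor_names), "mtp_block": len(block), "nextn": len(nextn)}


# Hypothetical tensor names for illustration only; real names come from the GGUF file.
names = ["token_embd.weight", "blk.0.attn_norm.weight",
         "blk.40.attn_norm.weight", "blk.40.nextn.eh_proj.weight"]
print(mtp_tensor_summary(names))
```

A target-only file without the grafted block would report zero for both `mtp_block` and `nextn`, which is a quick way to tell the two kinds of file apart.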

Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU, PARALLEL=1, CTX_SIZE=8192, streaming chat completions, 12 requests, 128 max tokens, temperature=0, top_p=1.

| Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position |
| --- | --- | --- | --- | --- | --- | --- |
| IQ4_XS-MTP | target-only | 224.59 | 1.00× | n/a | n/a | n/a |
| IQ4_XS-MTP | draft-mtp, n_max=2 | 275.03 | 1.22× | 76.51% | 2.52 | (0.830, 0.692) |
| IQ4_XS-MTP | draft-mtp, n_max=1 | 259.58 | 1.16× | 86.47% | 1.86 | (0.865) |
| Q4_K_M-MTP | target-only | 230.48 | 1.00× | n/a | n/a | n/a |
| Q4_K_M-MTP | draft-mtp, n_max=2 | 273.80 | 1.19× | 77.18% | 2.53 | (0.847, 0.687) |
| Q4_K_M-MTP | draft-mtp, n_max=1 | 264.88 | 1.15× | 91.46% | 1.91 | (0.915) |

[Chart: MTP speedup and acceptance]

Recommended low-latency / single-user throughput profile: SPEC_DRAFT_N_MAX=2. Recommended high-acceptance fallback: SPEC_DRAFT_N_MAX=1.
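
The speedup and accepted-length figures can be sanity-checked with a few lines of arithmetic. The card does not define "mean accepted length", so the second function is one reading of it: one token from the target model plus the sum of the per-position acceptance rates. The reported values agree with that reading to within rounding, but treat it as an assumption.

```python
def speedup(draft_tok_s, target_only_tok_s):
    """Throughput ratio of a speculative run to the target-only baseline."""
    return draft_tok_s / target_only_tok_s


def mean_accepted_length(acceptance_by_position):
    """One token from the target model plus the expected number of accepted draft tokens.

    ASSUMPTION: each entry is the unconditional acceptance rate at that draft position.
    """
    return 1.0 + sum(acceptance_by_position)


# IQ4_XS-MTP, draft-mtp with n_max=2, against its target-only baseline
print(round(speedup(275.03, 224.59), 2))
print(round(mean_accepted_length([0.830, 0.692]), 2))
```

The second draft position is accepted noticeably less often than the first, which is why n_max=2 gains only modestly over n_max=1 despite drafting twice as many tokens.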

Detailed MTP evidence:

  • reports/agents-a1-mtp-q4-profile-summary.md
  • reports/agents-a1-mtp-q4-profile-summary.json
  • reports/mtp-weights-audit.json (audit of the config-only upstream snapshot)
  • configs/mtp_profiles.yaml

Provenance & credits

  • Base model: InternScience/Agents-A1, "Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent" (arXiv:2606.30616)
  • MTP source: wang-yang/Agents-A1-MTPLX-Q4 sidecar, grafted onto the base checkpoint
  • Quantization source: BF16 GGUF converted from the Hugging Face checkpoint
  • Calibration: coding/instruction-chat data rendered with the model chat template (imatrix)
  • Quantizer: patched llama.cpp with Qwen3.5-MoE and NVFP4 support
  • License: Apache-2.0, inherited from the base model

Citation

If you use these quantizations, please cite the base model:

@article{agentsa1_2026,
  title   = {Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
  author  = {InternScience},
  journal = {arXiv preprint arXiv:2606.30616},
  year    = {2026}
}