Agents-A1 GGUF

GGUF quantizations of InternScience/Agents-A1 — a 35B Mixture-of-Experts agentic model (Qwen3.5-MoE architecture) built for long-horizon search, engineering, scientific research, instruction-following, and tool-calling.

Files were produced from the BF16 Hugging Face checkpoint with a patched llama.cpp build that supports the qwen35moe architecture. Each quant uses an importance matrix (imatrix) built from coding/instruction-chat calibration data, and every file was benchmarked against the BF16 GGUF reference (PPL, KL-divergence, top-1 agreement).

These are text-only GGUFs. The base model is multimodal (vision + video), but no mmproj projector is shipped here, so image/video input is not available with these files. Use them for text and agentic/tool-calling workloads.

Model summary


| Property | Value |
| --- | --- |
| Base model | InternScience/Agents-A1 (paper · homepage · GitHub) |
| Architecture | Qwen3.5-MoE, hybrid linear/full attention (full attention every 4th layer) |
| Parameters | ~35B total, ~3B active per token (A3B-class) |
| Experts | 256 experts, 8 active + 1 shared per token |
| Layers | 40 transformer layers + 1 MTP layer |
| Context length | 262,144 (256K) native |
| Language | English |
| License | Apache-2.0 (inherited from base) |
| Quantized by | LordNeel |

Which file should I pick?

| Goal | File | Notes |
| --- | --- | --- |
| Best small general-purpose quant | `agents-a1-IQ4_XS.gguf` | Strong quality for size, broad `llama.cpp` compatibility. |
| Best single-user MTP throughput | `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf` | IQ4_XS body + Q6_K MTP block; 1.22× over target-only at `n_max=2`. |
| Highest MTP draft acceptance | `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf` (`SPEC_DRAFT_N_MAX=1`) | 91.46% acceptance, still 1.15× over target-only. |
| Fast Blackwell FP4 path | `agents-a1-NVFP4.gguf` | Tested on RTX PRO 6000 Blackwell. Needs runtime support for `GGML_TYPE_NVFP4`. |
| Safer quality step up | `agents-a1-Q5_K_M.gguf` | Lower KLD than IQ4_XS, larger size. |
| Closest to BF16 by KLD | `agents-a1-Q6_K.gguf` | Lowest mean KLD in this eval set (Q8_0 has the lower p95). |
| High-precision archival | `agents-a1-Q8_0.gguf` | Largest quant. |

Sizing: for full GPU offload, budget VRAM of roughly the file size plus the KV cache. K-quants (Q4_K_M, Q5_K_M, Q6_K) are the most portable. IQ4_XS is an I-quant and benefits from the bundled imatrix. NVFP4 has the fastest prefill in the metrics below, but it needs a Blackwell-class GPU and a recent FP4-capable llama.cpp build.
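The sizing rule above can be turned into a rough estimate. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes that only the full-attention layers (every 4th of the 40 layers, so 10) hold a cache that grows with context length, and that the listed file sizes are decimal GB. The `n_kv_heads` and `head_dim` values are hypothetical placeholders, since this card does not list them. The linear-attention state and compute buffers are ignored.

```python
def kv_cache_gb(n_ctx, n_full_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in GB (1e9 bytes) for the full-attention layers only."""
    # The leading factor 2 covers keys and values; bytes_per_elem=2 assumes an f16 cache.
    total_bytes = 2 * n_full_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 1e9


def vram_estimate_gb(file_size_gb, **kv_kwargs):
    """File size plus KV cache, following the sizing rule on this card."""
    return file_size_gb + kv_cache_gb(**kv_kwargs)


# 40 layers with full attention every 4th layer gives 10 full-attention layers.
N_FULL_ATTN_LAYERS = 40 // 4

# n_kv_heads and head_dim are HYPOTHETICAL; read the real values from the GGUF metadata.
estimate = vram_estimate_gb(
    18.73,  # IQ4_XS file size from the Files table
    n_ctx=8192,
    n_full_attn_layers=N_FULL_ATTN_LAYERS,
    n_kv_heads=2,
    head_dim=256,
)
print(f"IQ4_XS at 8K context: ~{estimate:.2f} GB plus runtime buffers")
```

The cache term scales linearly with `n_ctx`, so rerunning the estimate with a larger context shows how much headroom a longer window needs.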

Files

| Quant | File size | Notes |
| --- | --- | --- |
| Q3_K_M | 16.76 GB | Smallest included quant. |
| IQ4_XS | 18.73 GB | Recommended compact quant. |
| IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body + integrated Q6_K/F32 MTP block. |
| NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF; output head kept at Q6_K by quality rule. |
| Q4_K_M | 21.17 GB | Standard K-quant. |
| Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body + integrated Q6_K/F32 MTP block. |
| Q5_K_M | 24.73 GB | Strong quality/size tradeoff. |
| Q6_K | 28.51 GB | Lowest mean KLD in this run. |
| Q8_0 | 36.90 GB | Highest-precision quant. |

Download

pip install -U "huggingface_hub[cli]"

# download a single quant into ./agents-a1
hf download LordNeel/Agents-A1-GGUF agents-a1-IQ4_XS.gguf --local-dir ./agents-a1

You generally want a recent llama.cpp build with qwen35moe support; the NVFP4 and MTP files need newer builds still (see the relevant sections below).

Usage

Standard inference with the recommended compact quant:

llama-server \
  -m agents-a1-IQ4_XS.gguf \
  -ngl 99 \
  -c 8192 \
  -b 4096 \
  -ub 512 \
  --flash-attn on

-c 8192 is just a starting point — the model's native context is 256K, so raise -c as your VRAM allows.
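Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the standard library. It assumes the server listens on llama-server's default `http://localhost:8080`; `build_chat_request` and `chat` are hypothetical helper names introduced for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=128):
    """Build a POST request for the server's /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the llama-server command above already running:
# print(chat("What is the capital of France?"))
```

Building the request separately from sending it keeps the payload easy to inspect or extend, for example with a system message.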

NVFP4 (Blackwell):

llama-server \
  -m agents-a1-NVFP4.gguf \
  -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on

The NVFP4 artifact is a standard GGUF using the NVFP4 tensor type, but runtime support is newer and less universal than K-quants or IQ4_XS. It was tested on a Blackwell GPU with a llama.cpp build reporting BLACKWELL_NATIVE_FP4 = 1.

MTP / speculative decoding (single-user throughput):

LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \
LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \
LLAMA_MTP_DRAFT_TOP_K=1 \
LLAMA_MTP_DRAFT_TOP_P=1 \
LLAMA_MTP_DRAFT_TEMP=1 \
llama-server \
  -m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \
  -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on \
  --reasoning off \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-n-min 0 \
  --spec-draft-backend-sampling

For the high-acceptance profile, change --spec-draft-n-max 2 to --spec-draft-n-max 1.
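When switching between the two profiles from a script, it may help to build the command programmatically. The sketch below copies the environment variables and flags from the command above without adding any. `mtp_server_argv` and `launch_mtp_server` are hypothetical helper names, and the flags are only expected to work with a llama.cpp build that supports this MTP path.

```python
import os
import subprocess

# Environment variables copied verbatim from the command above.
MTP_ENV = {
    "LLAMA_SPEC_MAX_DRAFTING_SLOTS": "1",
    "LLAMA_MTP_FAST_BACKEND_SAMPLE": "1",
    "LLAMA_MTP_DRAFT_TOP_K": "1",
    "LLAMA_MTP_DRAFT_TOP_P": "1",
    "LLAMA_MTP_DRAFT_TEMP": "1",
}


def mtp_server_argv(model_path, n_max=2, ctx=8192):
    """Return the llama-server argument list; n_max=2 is throughput, n_max=1 is high acceptance."""
    return [
        "llama-server", "-m", model_path,
        "-ngl", "99", "-c", str(ctx), "-b", "4096", "-ub", "512",
        "--flash-attn", "on",
        "--reasoning", "off",
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(n_max),
        "--spec-draft-n-min", "0",
        "--spec-draft-backend-sampling",
    ]


def launch_mtp_server(model_path, n_max=2):
    """Start the server as a child process with the MTP environment applied."""
    env = {**os.environ, **MTP_ENV}
    return subprocess.Popen(mtp_server_argv(model_path, n_max), env=env)


# Usage (requires llama-server on PATH and the model file on disk):
# proc = launch_mtp_server("agents-a1-Q4_K_M-MTP-graft-headQ6.gguf", n_max=1)
```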

Python with llama-cpp-python:

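from llama_cpp import Llama

# Downloads the file from the Hub on first use, then loads it.
llm = Llama.from_pretrained(
    repo_id="LordNeel/Agents-A1-GGUF",
    filename="agents-a1-IQ4_XS.gguf",
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response["choices"][0]["message"]["content"])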

Prompt format

Agents-A1 uses a Qwen-style ChatML template (embedded in the GGUF, so llama-server/llama-cli chat endpoints apply it automatically):

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
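
For raw completion endpoints that do not apply the template, the layout above can be rendered by hand. This is a minimal sketch covering only plain system/user/assistant turns. The template embedded in the GGUF may add tool-call or reasoning sections that this helper does not reproduce, so the chat endpoints are the safer choice when available. `render_chatml` is a hypothetical helper name.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the ChatML layout shown above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```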

The model natively supports function calling / tool use — see the base model card for agentic and tool-calling details.

Metrics

Hardware and runtime profile:

GPU: single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload
llama.cpp flags: -ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3
PPL: llama-perplexity, context 2048, 64 rendered eval conversations, 3 chunks
KLD: approximate KL(P_BF16 || P_quant) over top-64 next-token distributions on 32 prompts

The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are the more useful quant-to-BF16 quality signals here.
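
To make the two quality signals concrete, here is a small sketch of a top-k KL estimate and a top-1 agreement check. It is one plausible reading of "approximate KL over top-64 next-token distributions", not the evaluation script used for this card. It assumes both models report log-probabilities for the same token ids and that each truncated distribution is renormalized before comparison. The numbers in the example are made up.

```python
import math


def renormalize(logprobs):
    """Turn log-probabilities over a truncated token set into a proper distribution."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return [p / total for p in probs]


def topk_kl(ref_logprobs, quant_logprobs):
    """KL(P_ref || P_quant) over the same top-k token ids, after renormalizing both."""
    p = renormalize(ref_logprobs)
    q = renormalize(quant_logprobs)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top1_match(ref_logprobs, quant_logprobs):
    """True when both models rank the same token first."""
    ref_best = max(range(len(ref_logprobs)), key=ref_logprobs.__getitem__)
    quant_best = max(range(len(quant_logprobs)), key=quant_logprobs.__getitem__)
    return ref_best == quant_best


# Toy 3-token example with made-up probabilities (not taken from the eval).
ref = [math.log(0.7), math.log(0.2), math.log(0.1)]
quant = [math.log(0.6), math.log(0.3), math.log(0.1)]
print(f"KL = {topk_kl(ref, quant):.4f}, top-1 match = {top1_match(ref, quant)}")
```

A quant can keep top-1 agreement while still shifting probability mass among lower-ranked tokens, which is why the table reports both columns.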

| Model | Size (GB) | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BF16 reference | 69.38 | 3418.9 | 161.8 | 1.3031 | 0.0000 | 0.0000 | 0.0000 | 32/32 |
| Q3_K_M | 16.76 | 6779.5 | 269.0 | 1.3101 | +0.0070 | 0.0655 | 0.2155 | 28/32 |
| IQ4_XS | 18.73 | 7719.5 | 258.1 | 1.3038 | +0.0007 | 0.0151 | 0.0654 | 29/32 |
| NVFP4 | 19.72 | 9064.0 | 265.1 | 1.3063 | +0.0032 | 0.0420 | 0.1473 | 31/32 |
| Q4_K_M | 21.17 | 7230.8 | 262.6 | 1.3016 | -0.0015 | 0.1225 | 0.3349 | 27/32 |
| Q5_K_M | 24.73 | 7021.4 | 257.9 | 1.3041 | +0.0010 | 0.0091 | 0.0335 | 30/32 |
| Q6_K | 28.51 | 6294.0 | 244.6 | 1.3040 | +0.0009 | 0.0049 | 0.0178 | 32/32 |
| Q8_0 | 36.90 | 7431.3 | 222.7 | 1.3036 | +0.0005 | 0.0053 | 0.0063 | 30/32 |
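
The table can also drive a simple selection rule: among the quants that fit a VRAM budget, take the one with the lowest mean KLD. The sketch below hard-codes the size and KLD columns from the table. `best_quant_for` and its `headroom_gb` default are hypothetical, and NVFP4 is excluded by default on the assumption that not every runtime can load it.

```python
# (size in GB, mean KLD vs BF16), copied from the metrics table above.
QUANTS = {
    "Q3_K_M": (16.76, 0.0655),
    "IQ4_XS": (18.73, 0.0151),
    "NVFP4": (19.72, 0.0420),
    "Q4_K_M": (21.17, 0.1225),
    "Q5_K_M": (24.73, 0.0091),
    "Q6_K": (28.51, 0.0049),
    "Q8_0": (36.90, 0.0053),
}


def best_quant_for(vram_gb, headroom_gb=2.0, exclude=("NVFP4",)):
    """Return the fitting quant with the lowest mean KLD, or None if nothing fits."""
    budget = vram_gb - headroom_gb  # headroom stands in for KV cache and buffers
    fitting = {
        name: kld
        for name, (size, kld) in QUANTS.items()
        if size <= budget and name not in exclude
    }
    return min(fitting, key=fitting.get) if fitting else None


for vram in (16, 24, 32, 48):
    print(f"{vram} GB -> {best_quant_for(vram)}")
```

With these numbers, a 24 GB card lands on IQ4_XS rather than Q4_K_M, because Q4_K_M shows a higher mean KLD in this run despite its larger size.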

Raw metric files are in metrics/; KLD reports, checksums, and the MTP audit are in reports/.

MTP (Multi-Token Prediction) Q4 variants

The upstream Agents-A1 checkpoint used for the first GGUF release advertises MTP in config but does not ship mtp.* / blk.40.* tensors. The two MTP Q4 variants here graft in the Agents-A1 MTPLX MTP sidecar from wang-yang/Agents-A1-MTPLX-Q4, then convert it with llama.cpp's Qwen3.5-MoE MTP path. The dense MTP block is preserved at Q6_K while the model body is quantized to IQ4_XS or Q4_K_M.

Structural checks for both MTP GGUFs:

| Check | Value |
| --- | --- |
| GGUF tensors | 753 |
| `qwen35moe.block_count` | 41 |
| `qwen35moe.nextn_predict_layers` | 1 |
| `blk.40.*` MTP tensors | 20 |
| `blk.40.nextn.*` tensors | 4 |
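
Tensor counts like these can be checked against a downloaded file. The sketch below counts tensor names by prefix and reads the names with `GGUFReader` from the `gguf` Python package. Whether the figures in the table were produced this way is an assumption, and `count_by_prefix` and `gguf_tensor_names` are hypothetical helper names. The demo list uses made-up tensor names.

```python
def count_by_prefix(tensor_names, prefixes=("blk.40.", "blk.40.nextn.")):
    """Count how many tensor names start with each prefix."""
    return {p: sum(name.startswith(p) for name in tensor_names) for p in prefixes}


def gguf_tensor_names(path):
    """Read tensor names from a GGUF file (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the counting helper has no dependency
    return [tensor.name for tensor in GGUFReader(path).tensors]


# Usage against a downloaded file:
# names = gguf_tensor_names("agents-a1-IQ4_XS-MTP-graft-headQ6.gguf")
# print(len(names), count_by_prefix(names))

# Toy check with made-up tensor names:
demo = ["blk.0.attn_q.weight", "blk.40.attn_q.weight", "blk.40.nextn.eh_proj.weight"]
print(count_by_prefix(demo))
```

Note that the `blk.40.` prefix also matches the `blk.40.nextn.` tensors, so the first count includes the second.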

Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU, PARALLEL=1, CTX_SIZE=8192, streaming chat completions, 12 requests, 128 max tokens, temperature=0, top_p=1.

| Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position |
| --- | --- | --- | --- | --- | --- | --- |
| IQ4_XS-MTP | target-only | 224.59 | 1.00× | n/a | n/a | n/a |
| IQ4_XS-MTP | `draft-mtp`, `n_max=2` | 275.03 | 1.22× | 76.51% | 2.52 | `(0.830, 0.692)` |
| IQ4_XS-MTP | `draft-mtp`, `n_max=1` | 259.58 | 1.16× | 86.47% | 1.86 | `(0.865)` |
| Q4_K_M-MTP | target-only | 230.48 | 1.00× | n/a | n/a | n/a |
| Q4_K_M-MTP | `draft-mtp`, `n_max=2` | 273.80 | 1.19× | 77.18% | 2.53 | `(0.847, 0.687)` |
| Q4_K_M-MTP | `draft-mtp`, `n_max=1` | 264.88 | 1.15× | 91.46% | 1.91 | `(0.915)` |

Recommended low-latency / single-user throughput profile: SPEC_DRAFT_N_MAX=2. Recommended high-acceptance fallback: SPEC_DRAFT_N_MAX=1.
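
The derived columns in the table can be reproduced from the raw ones. Speedup is the ratio of aggregate throughput to the target-only run. The mean accepted length appears to equal one target token plus the sum of the per-position acceptance rates; that relation is inferred from the table's numbers and is not stated by the report, so treat it as an assumption.

```python
def speedup(draft_tok_s, target_only_tok_s):
    """Throughput ratio of a draft-mtp run over the target-only run."""
    return draft_tok_s / target_only_tok_s


def mean_accepted_length(acceptance_by_position):
    """One token from the target model plus the expected number of accepted draft tokens."""
    return 1.0 + sum(acceptance_by_position)


# IQ4_XS-MTP at n_max=2, using the rows from the table above.
print(round(speedup(275.03, 224.59), 2))                 # 1.22
print(round(mean_accepted_length((0.830, 0.692)), 2))    # 2.52
```

This also shows why `n_max=2` wins on throughput despite lower acceptance: the second draft position is accepted less often, but it still adds accepted tokens per step.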

Detailed MTP evidence:

reports/agents-a1-mtp-q4-profile-summary.md
reports/agents-a1-mtp-q4-profile-summary.json
reports/mtp-weights-audit.json (audit of the config-only upstream snapshot)
configs/mtp_profiles.yaml

Provenance & credits

Base model: InternScience/Agents-A1 — Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent (arXiv:2606.30616)
MTP source: wang-yang/Agents-A1-MTPLX-Q4 sidecar, grafted onto the base checkpoint
Quantization source: BF16 GGUF converted from the Hugging Face checkpoint
Calibration: coding/instruction-chat data rendered with the model chat template (imatrix)
Quantizer: patched llama.cpp with Qwen3.5-MoE and NVFP4 support
License: Apache-2.0, inherited from the base model

Citation

If you use these quantizations, please cite the base model:

@article{agentsa1_2026,
  title   = {Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
  author  = {InternScience},
  journal = {arXiv preprint arXiv:2606.30616},
  year    = {2026}
}
