Qwen2.5 72b Instruct (GGUF, Q4_K_M)

Production-ready GGUF quantization of Qwen/Qwen2.5-72B-Instruct for distributed text generation and conversation — powered by the Aether edge inference runtime on Edgework.ai.

Model Details

Property Value
Base model Qwen/Qwen2.5-72B-Instruct
Parameters 72B
Architecture Qwen2
Quantization Q4_K_M
Format GGUF
Size ~43 GB
License apache-2.0

Usage

With llama.cpp

./llama-cli -m Qwen2.5-72B-Instruct-Q4_K_M.gguf -p "Your prompt here" -n 256

With Aether (Distributed Inference)

This model is deployed across the Aether distributed inference network. Weights are layer-sharded and distributed across multiple edge nodes for parallel inference.

Also available: .knot (sovereign format)

This repo ships qwen2.5-72b-instruct.knot — the model weights in the KNOT container that the Aether distributed-inference runtime loads natively (the GGUF, when present, sits right beside it). A KNOT is a single self-describing file with a JSON table-of-contents, so any single tensor is one HTTP Range request — ideal for streaming weights to edge nodes.

GGUF KNOT
Container format-specific header single file, JSON table-of-contents
Per-tensor fetch whole-file oriented one tensor = one Range request
Ecosystem broad (llama.cpp, …) Aether / Gnosis runtime
huggingface-cli download forkjoin-ai/qwen2.5-72b-instruct-gguf qwen2.5-72b-instruct.knot --local-dir ./knots

Full format spec: KNOT_FORMAT.md. Inspect the header with bun run open-source/bitwise/scripts/dump-knot.ts qwen2.5-72b-instruct.knot.

Deployment Architecture

This model runs on the Aether distributed inference runtime — a custom engine that shards model layers across multiple nodes for parallel execution:

  1. Coordinator receives requests and manages token generation
  2. Layer nodes each hold a subset of model layers (6 nodes for this model)
  3. Hidden states flow between nodes via gRPC
  4. Zero cold start via warm pool scheduling

Deployed via Edgework.ai — bringing fast, cheap, and private inference as close to the user as possible.

About

Published by AFFECTIVELY · Managed by @buley

We quantize and publish production-ready models for distributed edge inference via the Aether runtime. Every release is tested for correctness and stability before publication.

Downloads last month
184
GGUF
Model size
73B params
Architecture
qwen2
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for forkjoin-ai/qwen2.5-72b-instruct-gguf

Base model

Qwen/Qwen2.5-72B
Quantized
(85)
this model