Soofi-S-Isar-Preview-EntQuant-3bit Overview

⚠️ Preview / internal checkpoint. Weights and metadata may still change.

Quantized derivative of Soofi-Project/Soofi-S-Isar-Preview. See Quantization details for the recipe; the base model card has the underlying model's full description, training data, and evaluation.

Description

EntQuant-compressed serving variant of Soofi-S-Isar-Preview, one of two chain-of-thought reasoning ("thinking") variants of SOOFI-S (the other is Soofi-S-Rhine-Preview). SOOFI-S is a sovereign, open-source language model developed by a German research consortium. SOOFI (Sovereign Open Source Foundation Models) is designed to provide a secure, European open-source alternative to US and Chinese AI models for industrial use, with strong reasoning and AI-agent capabilities.

This checkpoint compresses the quantized weights to an effective size of ~3 bits per parameter via EntQuant, a lossless entropy-coding pass on entropy-optimized FP8 codes. Following standard practice, only the Transformer linear weights (attention projections and MoE expert projections) are compressed; the Mamba-2 state-space layers, the embedding table, and the LM head are kept at the base model's precision.

For a non-thinking variant (instruction following without an explicit reasoning trace), see Soofi-S-Instruct-Preview and its EntQuant derivatives. For the other thinking variant, see Soofi-S-Rhine-Preview.

This model is for research and development only (Preview).

License/Terms of Use

Released under a custom license ("Other"). TODO: add the full license text / link — inherits from the base model.

Deployment Geography

Global (open release on the Hugging Face Hub). Development and training infrastructure are located in Europe (see Computational Load on the base model card).

Use Case

Enterprise developers and researchers seeking a sovereign, European open-source LLM for tasks that benefit from explicit step-by-step reasoning (math, logic, planning, complex analysis) and AI-agent / tool-use workflows. English and German are the primary languages. This quantized variant targets cost-effective inference on a single GPU.

Quick start

This repository is self-contained: it ships the model weights, the EntQuant plugin source, a Dockerfile, and a Compose file. Three lines and you have an OpenAI-compatible server:

hf download Soofi-Project/Soofi-S-Isar-Preview-EntQuant-3bit --local-dir ./Soofi-S-Isar-Preview-EntQuant-3bit
cd Soofi-S-Isar-Preview-EntQuant-3bit
docker compose up -d

(hf is the Hugging Face CLI; run pip install huggingface_hub if you don't have it. Alternatively, git clone works only if git-lfs is installed first; otherwise you get tiny pointer files instead of the 13 GB of weights, a common gotcha.)

The server is then live on http://localhost:8000/v1. The model name to send in API requests is Soofi-S-Isar-Preview-EntQuant-3bit.

Behind a corporate proxy? export HTTP_PROXY=http://your-proxy:port HTTPS_PROXY=http://your-proxy:port before docker compose up — the build picks them up via build.args.

Pin a specific GPU on a multi-GPU host? In docker-compose.yml, replace count: 1 with device_ids: ["3"] (GPU index) or device_ids: ["GPU-<uuid>"], where the UUID is the one printed by nvidia-smi -L.

Smoke test:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Soofi-S-Isar-Preview-EntQuant-3bit",
    "messages": [{"role": "user", "content": "Briefly explain the concept of AI sovereignty."}]
  }'

Override settings via .env (see .env.example in the repo): port, GPU index, max context length (the model supports up to 262144 / 256K), GPU memory fraction. The defaults serve at 32K context and 90% of GPU memory on a single GPU.

Requirements

NVIDIA GPU with compute capability ≥ 9.0 (Hopper / Blackwell) for the production fp8 W8A8 path. Older GPUs work but fall back to a slower bf16 linear path.
~13 GB of GPU memory for the weights, plus KV cache (highly dependent on max-model-len and concurrency).
NVIDIA Container Toolkit; Docker Engine 24+ / Compose v2.

Quantization details

EntQuant (ICML 2026 paper and open-source compressor; links under Reference(s)) is a weight-only, scheme-agnostic post-training quantization method that optimizes one scale per output channel via L-BFGS to minimize

  L = reconstruction_error(x, q(x)) + λ · L1(q(x))

The L1 term concentrates the quantized weight distribution toward low Shannon entropy. The weights stay in their target format (FP8 here) but become highly compressible: at λ ≈ 14.5, the FP8 codes entropy-code to an effective size of ~3 bits per parameter, while each weight can still take far more distinct values than the 8 levels of fixed 3-bit integer quantization.
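The effect of the L1 term can be illustrated with a toy sketch. This is not the EntQuant implementation: it assumes a clipped integer grid in place of FP8, a coarse grid search in place of L-BFGS, and synthetic Gaussian weights, and its λ values are not comparable to the λ ≈ 14.5 quoted above. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def quantize(weights, scale):
    """Stand-in quantizer (NOT FP8): round w/scale onto an integer grid clipped to [-127, 127]."""
    return [max(-127, min(127, round(w / scale))) for w in weights]

def entropy_bits(codes):
    """Empirical Shannon entropy of the code distribution, in bits per code."""
    n = len(codes)
    return -sum(c / n * math.log2(c / n) for c in Counter(codes).values())

def objective(weights, scale, lam):
    """Mean squared reconstruction error plus lam times the mean |code| (the L1 term)."""
    codes = quantize(weights, scale)
    recon = sum((w - scale * q) ** 2 for w, q in zip(weights, codes)) / len(weights)
    l1 = sum(abs(q) for q in codes) / len(weights)
    return recon + lam * l1

def best_scale(weights, lam, grid=(0.05, 0.1, 0.2, 0.4, 0.8)):
    """Grid search standing in for the L-BFGS optimization of one channel's scale."""
    return min(grid, key=lambda s: objective(weights, s, lam))

rng = random.Random(0)
channel = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # one synthetic output channel

for lam in (0.0, 0.01):
    s = best_scale(channel, lam)
    bits = entropy_bits(quantize(channel, s))
    print(f"lambda={lam}: scale={s}, entropy={bits:.2f} bits/code")
```

With λ = 0 the search prefers the finest scale (lowest reconstruction error); a positive λ pushes it toward a coarser scale whose codes cluster near zero and therefore have lower entropy.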

Scope of compression — following standard practice for hybrid Mamba/Transformer architectures, only the Transformer linear weights are quantized:

Component	Status
Attention projections (q/k/v/o_proj)	✅ EntQuant FP8 → 3-bit-effective entropy-coded
MoE expert projections (w1/w2 per expert)	✅ EntQuant FP8 → 3-bit-effective entropy-coded
Mamba-2 state-space layers (in/out projections, conv1d, A/B/C/D parameters)	❌ Kept at base precision
Token embedding table	❌ Kept at base precision
LM head	❌ Kept at base precision
LayerNorm / RMSNorm weights	❌ Kept at base precision

This checkpoint specifically:

Property	Value
Base model	`Soofi-Project/Soofi-S-Isar-Preview` (bf16)
Storage format	`float-quantized` (compressed-tensors), per-channel FP8 (`e4m3fn`) codes + entropy-coded payload
Quant method	`entquant_coding` (auto-discovered by vLLM via the plugin entry point)
Effective bit-size (Transformer linear weights)	~3 bits/parameter
Resident model size on disk	~15 GB
Decode	nvCOMP ANS GPU decompressor on every forward, into a static scratch reused across MoE layers
Reference numerics	W8A16 (weight-only) by default; W8A8 with `ENTQUANT_LINEAR_COMPUTE=fp8` (on by default in this image)

Important: the 3bit in the name refers to the effective compressed size (storage cost) of the quantized Transformer linear weights, not to 3-bit integer quantization in the conventional sense. The weights themselves are FP8 codes; entropy coding reduces their storage cost to ~3 bits each. At inference time the entropy-coded payload is decoded back to the original FP8 codes (the decoding step is lossless) and used directly by vLLM's fused W8A8 kernels.
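The difference between code width and storage cost can be shown with a small round trip. The sketch below is only an analogy: it assumes zlib from the Python standard library as the lossless coder (the image uses nvCOMP ANS on the GPU) and a made-up skewed byte distribution in place of real FP8 codes.

```python
import random
import zlib

rng = random.Random(0)

# Toy stand-in for entropy-optimized 8-bit codes: a few byte values dominate.
values = [0, 1, 255, 2, 254, 3, 253, 4]
weights = [40, 20, 20, 8, 8, 2, 1, 1]
codes = bytes(rng.choices(values, weights=weights, k=100_000))

payload = zlib.compress(codes, 9)    # lossless compression (zlib here, not ANS)
restored = zlib.decompress(payload)  # decoding recovers every byte exactly

effective_bits = 8 * len(payload) / len(codes)
print(f"stored: {effective_bits:.2f} bits per 8-bit code")
print("bit-exact round trip:", restored == codes)
```

Each code is still a full 8-bit value after decoding; only the stored payload is smaller, which is the sense in which this card uses "effective bits".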

What's in this image

Layer	What
`vllm/vllm-openai:v0.21.0`	vLLM, CUDA 13, torch 2.11, OpenAI server, tokenizer libs
`nvidia-nvcomp-cu12==5.2.0.13`	GPU ANS / zstd decompressor
`entquant-coding` (bundled in this repo)	EntQuant codec + chunked container + decoder
`entquant-vllm` (bundled in this repo)	vLLM plugin: registers `quant_method: entquant_coding`, FULL-graph capturable decode, fp8 W8A8 linear, selective MoE decode

Performance

Measured on B200 (NVIDIA Blackwell), single GPU, full CUDA-graph capture, vLLM 0.21.0:

Batch size	Tokens/s
B=1	TODO — measure on the @3-bit checkpoint specifically
B=16	TODO
B=64	TODO

Reference numbers from the closely-related 3-bit-effective checkpoint are in the project's THROUGHPUT_LOG. We will fill these in here after end-to-end benchmarking of the @3-bit weights.

Release Date

Hugging Face Hub — Preview at https://huggingface.co/Soofi-Project/Soofi-S-Isar-Preview-EntQuant-3bit. TODO: final release date (MM/DD/YYYY).

Reference(s)

Project: https://soofi.info
Base model: Soofi-Project/Soofi-S-Isar-Preview
EntQuant paper (ICML 2026): https://icml.cc/virtual/2026/poster/66714
EntQuant compressor source: https://github.com/merantix-momentum/entquant
Related models: see Related models.

Model Architecture

Inherits the architecture of the base model unchanged.

Architecture Type: Transformer-based hybrid Mixture-of-Experts (MoE) with Mamba-2 state-space (SSM) layers and attention layers.
Network Architecture: Custom Hybrid Mamba-2/MoE (Nemotron-style), designed from scratch — 23 Mamba-2/MoE layers + 6 attention layers; 128 routing experts + 1 shared expert per MoE layer; 6 experts activated per token.
Number of model parameters: 3.0×10^10 total (30B), with ~3.5B active parameters during inference.

The base model was developed from scratch (no parent model); quantization is applied post-training to the bf16 base.

Computational Load

See the base model card for training compute, energy and emissions. Inference on this quantized variant runs comfortably on a single B200 / H100 (and works on smaller GPUs with reduced KV cache).
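As a rough guide to what "reduced KV cache" means on a smaller GPU, the sketch below subtracts the weight footprint from the memory vLLM is allowed to use. It assumes the ~13 GB weight figure and the 90% default memory fraction quoted in this card, treats GB and GiB as interchangeable, and ignores CUDA-graph, decode-scratch, and activation overheads, so real headroom will be lower. The function name and the example card sizes are hypothetical.

```python
def kv_cache_budget_gib(total_gib, gpu_mem_fraction=0.90, weights_gib=13.0):
    """Rough memory left for KV cache after loading the weights (upper bound)."""
    return max(0.0, total_gib * gpu_mem_fraction - weights_gib)

# Example card sizes; 183 GiB is the B200 figure listed under Inference.
for name, total in [("24 GiB card", 24), ("80 GiB card", 80), ("183 GiB B200", 183)]:
    print(f"{name}: ~{kv_cache_budget_gib(total):.1f} GiB for KV cache")
```

If the budget is tight, lowering MAX_MODEL_LEN or the number of concurrent requests reduces the KV cache the server has to reserve.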

Input

Input Type(s): Text
Input Format(s): String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Chat/ChatML-style messages via the embedded chat template. No system prompt is required (none is injected by default). Context length up to 262144 (256K); this image caps it at 32768 by default, so raise MAX_MODEL_LEN in .env to go higher.

Output

Output Type(s): Text
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Emits explicit <think>...</think> reasoning traces before the final answer; vLLM's qwen3 reasoning parser (wired up in the image) splits this into a separate reasoning_content field, leaving content clean. Supports the model's native tool-calling format (<tool_call><function=...><parameter=...>...</parameter></function></tool_call>) — vLLM's qwen3_xml tool-call parser is also wired up so OpenAI-style tool_choice: "auto" works out of the box. Allow a generous max_tokens budget — reasoning traces can be verbose.
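When the model is served without the reasoning parser, the raw completion text still contains the <think> block. A minimal sketch of separating it client-side, assuming at most one well-formed <think>...</think> pair per response (the helper name is hypothetical):

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw_text):
    """Return (reasoning, answer) from raw text with an optional <think>...</think> block."""
    match = _THINK_RE.search(raw_text)
    if match is None:
        # No reasoning trace present: everything is the answer.
        return "", raw_text.strip()
    reasoning = match.group(1).strip()
    answer = (raw_text[:match.start()] + raw_text[match.end():]).strip()
    return reasoning, answer

raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

A response truncated by max_tokens before the closing </think> tag will not match this pattern and is returned unchanged as the answer, which is one more reason to allow a generous token budget.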

Software Integration

Runtime Engine(s):

vLLM 0.21.0 with the entquant_coding plugin (bundled in this repo; auto-discovered).
Other engines (HF transformers, llama.cpp/Ollama) do not load this checkpoint; the on-disk format is vLLM-specific. For HF transformers use the bf16 base model; for llama.cpp use the GGUF variant.

Supported Hardware Microarchitecture Compatibility:

NVIDIA Hopper (H100, H200) or Blackwell (B200, RTX PRO 6000) — for the production W8A8 fp8 path.
Ampere and newer work via the bf16 linear fallback (set ENTQUANT_LINEAR_COMPUTE=bf16); some throughput is lost.
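The hardware rule above can be written as a small helper for launch scripts. This is a sketch under the thresholds stated in this card (fp8 for compute capability 9.0 and above, bf16 otherwise); the function name is hypothetical and the plugin may apply its own checks.

```python
def pick_linear_compute(major, minor):
    """Map a CUDA compute capability to a value for ENTQUANT_LINEAR_COMPUTE."""
    return "fp8" if (major, minor) >= (9, 0) else "bf16"

# Example capabilities: Ampere A100 = 8.0, Hopper H100 = 9.0, Blackwell B200 = 10.0.
for name, cap in [("A100", (8, 0)), ("H100", (9, 0)), ("B200", (10, 0))]:
    print(f"{name}: ENTQUANT_LINEAR_COMPUTE={pick_linear_compute(*cap)}")
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() returns such a (major, minor) tuple for the current device.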

Preferred/Supported Operating System(s): Linux.

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment.

Model Version(s)

This repo: Soofi-S-Isar-Preview-EntQuant-3bit — EntQuant, ~3 effective bits, vLLM-only.
Base: Soofi-S-Isar-Preview (bf16, unquantized).
Other quantized derivatives: see Related models.

Installation & Usage

Docker / Compose (recommended)

See Quick start above.

Direct vLLM (if you've already installed the plugin)

pip install "entquant-coding[gpu]" entquant-vllm   # once the packages are on PyPI; quotes keep zsh from expanding the brackets
# or from this repo:
pip install ./entquant-coding ./entquant-vllm --no-deps

ENTQUANT_ANS_GRAPH=1 ENTQUANT_LINEAR_COMPUTE=fp8 ENTQUANT_MOE_SELECTIVE=1 \
vllm serve ./ \
  --trust-remote-code \
  --max-model-len 32768 \
  --served-model-name Soofi-S-Isar-Preview-EntQuant-3bit \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

OpenAI client (Python)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Soofi-S-Isar-Preview-EntQuant-3bit",
    messages=[{"role": "user", "content": "Briefly explain the concept of AI sovereignty."}],
)
print(resp.choices[0].message.content)
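A tool-calling request follows the same pattern. The sketch below assumes the server from Quick start is running with the tool-call and reasoning parsers enabled; get_weather is a hypothetical tool, the max_tokens value is an arbitrary illustrative budget, and reading reasoning_content as an attribute relies on the client passing through extra response fields.

```python
def build_tool_request(question):
    """Keyword arguments for client.chat.completions.create with one example tool."""
    get_weather = {  # hypothetical tool, for illustration only
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "Soofi-S-Isar-Preview-EntQuant-3bit",
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "tool_choice": "auto",
        "max_tokens": 4096,  # assumed budget; reasoning traces can be long
    }

def summarize(message):
    """Collect the answer, the reasoning trace, and any tool calls from a reply message."""
    return {
        "content": message.content,
        "reasoning": getattr(message, "reasoning_content", None),
        "tool_calls": [
            (call.function.name, call.function.arguments)
            for call in (message.tool_calls or [])
        ],
    }

def ask(question, base_url="http://localhost:8000/v1"):
    """Send the request to a running server (requires the openai package)."""
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(**build_tool_request(question))
    return summarize(resp.choices[0].message)
```

Calling ask("What is the weather in Munich?") returns a dictionary; when the model decides to call the tool, content may be empty and tool_calls holds the function name with its JSON-encoded arguments.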

Training, Testing, and Evaluation Datasets

See the base model card — quantization is post-training and does not change training data. The quantization process itself is data-free.

Dataset Overview (from the base model)

Total Size: ~2.5×10^13 tokens (25 trillion).
Languages: English, German (primary); French, Italian, Spanish (limited). English acts as the pivot language.
Knowledge Cutoff: End of 2025.
Training Start: April 2026.

Quantization data

This is a data-free post-training quantization — no calibration set is used. EntQuant optimizes scales purely from the weights, with no forward pass through any data. No personal data is used in the quantization process.

Evaluation Dataset

TODO: add accuracy delta vs. the bf16 base on standard benchmarks (MMLU, etc.) once measured. Expected: within EntQuant's published accuracy band at 3-bit effective rate (≤ a few % loss on most benchmarks; see the EntQuant paper for the reference rates on Llama-2-7B and Llama-3-8B).

Inference

Acceleration Engine: vLLM 0.21.0 with the bundled entquant_coding plugin (FULL CUDA-graph capture; on-the-fly nvCOMP ANS decode; fp8 W8A8 cutlass matmul; selective expert decode at B=1).
Specific Test Hardware: Validated on NVIDIA B200 (DGX, 183 GiB HBM3e). Also tested on RTX PRO 6000 Blackwell (97 GiB).

Ethical Considerations

The SOOFI consortium believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information, see the Model Card++ subcards below. Please report model quality, risk, security vulnerabilities, or concerns to contact@soofi.info.

The quantization step (EntQuant) is a deterministic, weight-only transformation; it does not introduce new training data, new behaviors, or new alignment properties beyond what the base model already exhibits. Any biases, capability ceilings, and safety considerations from the base model apply unchanged here.

Bias Subcard

Field	Response
Participation considerations from adversely impacted groups in model design and testing	Inherited from the base model — see its card. Quantization does not affect bias.
Measures taken to mitigate against unwanted bias	See the base model card.
Bias Metric (if measured)	See the base model card.

Explainability Subcard

Field	Response
Intended Task/Domain	Reasoning-heavy tasks (math, logic, planning, analysis), AI-agent/tool use
Model Type	Hybrid Mixture-of-Experts (MoE) autoregressive reasoning ("thinking") language model, EntQuant-quantized to ~3 effective bits
Intended Users	Enterprise developers and researchers
Output	Text (String), with an explicit `<think>` reasoning trace before the answer
Describe how the model works	Generates text autoregressively; emits chain-of-thought (`<think>` block) before the final answer; a router activates 6 of 128 experts per token across hybrid Mamba-2/MoE and attention layers; quantized weights are decoded on the fly via nvCOMP ANS into a fixed GPU scratch and consumed by fused fp8 W8A8 matmuls
Technical Limitations	Preview checkpoint; non-primary languages (FR/IT/ES) are limited; ~3-bit-effective quantization may introduce a small accuracy delta vs. the bf16 reference (TODO: quantify); requires vLLM with the bundled plugin (cannot be loaded by stock HF `transformers`)
Verified to have met prescribed quality standards	TODO
Performance Metrics	TODO — accuracy delta vs. bf16 base on standard benchmarks
Potential Known Risks and Mitigation	May generate incorrect, biased, or unsafe content; apply use-case-specific testing and guardrails before deployment
Terms of Use/Licensing	Other (see License/Terms of Use)

Privacy Subcard

Field	Response
Generatable or reverse engineerable personal data?	TODO — see base model card
Personal data used to create this model?	Quantization is data-free; for training-data privacy see the base model card
Was consent obtained for any personal data used?	See the base model card
How often is dataset reviewed?	See the base model card
Was data from user interactions with the AI model used to train the model?	No
Is there provenance for all datasets used in training?	See the base model card
Applicable Privacy Policy	TODO

Safety & Security Subcard

Field	Response
Model Application Field(s)	Industrial use; customer service; general-purpose assistant and agent applications
Describe the life critical impact (if present)	None intended. Not for use in life-critical or safety-critical decision-making without independent validation
Use Case Restrictions	Abide by the applicable license agreement (see License/Terms of Use)
Model and dataset restrictions	TODO

Related models

Base model

Soofi-Project/Soofi-S-Isar-Preview — bf16 reference; use with HF transformers.

Variants of this checkpoint

Soofi-Project/Soofi-S-Isar-Preview-EntQuant-2bit — ~2 effective bits (highest compression, largest accuracy delta from bf16).
Soofi-Project/Soofi-S-Isar-Preview-EntQuant-4bit — ~4 effective bits (least compression, closest to bf16 numerics).
Soofi-Project/Soofi-S-Isar-Preview-FP8 — uncompressed FP8 (no EntQuant).
Soofi-Project/Soofi-S-Isar-Preview-GGUF — for llama.cpp / Ollama.

Sibling variants of the base

Soofi-Project/Soofi-S-Instruct-Preview — non-thinking (direct-answer) variant and its EntQuant derivatives.
Soofi-Project/Soofi-S-Rhine-Preview — the other thinking variant and its EntQuant derivatives.

Citation

If you use this model, please cite both the base model and the EntQuant paper:

@misc{soofi_s_isar_preview,
  title  = {Soofi-S-Isar-Preview},
  author = {SOOFI Consortium},
  year   = {2026},
  url    = {https://huggingface.co/Soofi-Project/Soofi-S-Isar-Preview}
}

@misc{soofi_s_isar_preview_entquant_3bit,
  title  = {Soofi-S-Isar-Preview-EntQuant-3bit (EntQuant-quantized)},
  author = {SOOFI Consortium},
  year   = {2026},
  url    = {https://huggingface.co/Soofi-Project/Soofi-S-Isar-Preview-EntQuant-3bit}
}

@inproceedings{entquant_icml2026,
  title     = {EntQuant: Entropy-Optimized Post-Training Quantization},
  author    = {Merantix Momentum},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  url       = {https://icml.cc/virtual/2026/poster/66714}
}