- Soofi-S-Instruct-Preview-EntQuant-2bit Overview
- Description
- License/Terms of Use
- Deployment Geography
- Use Case
- Quick start
- Quantization details
- Release Date
- Reference(s)
- Model Architecture
- Computational Load
- Input
- Output
- Software Integration
- Model Version(s)
- Installation & Usage
- Training, Testing, and Evaluation Datasets
- Inference
- Ethical Considerations
- Related models
- Citation
Soofi-S-Instruct-Preview-EntQuant-2bit Overview
⚠️ Preview / internal checkpoint. Weights and metadata may still change.
Quantized derivative of Soofi-Project/Soofi-S-Instruct-Preview. See Quantization details for the recipe; the base model card has the underlying model's full description, training data, and evaluation.
Description
EntQuant-compressed serving variant of Soofi-S-Instruct-Preview — the instruction-tuned, non-thinking variant of SOOFI-S, a sovereign, open-source language model developed by a German research consortium. SOOFI (Sovereign Open Source Foundation Models) is designed to provide a secure, European open-source alternative to US and Chinese AI models for industrial use, featuring strong reasoning and AI-agent capabilities.
This checkpoint compresses to an effective bit size of 2 bits per parameter via EntQuant — a lossless entropy-coding pass on entropy-optimized FP8 codes. Following standard practice, only the Transformer linear weights (attention projections + MoE expert projections) are compressed; the Mamba-2 state-space layers, the embedding table, and the LM head are kept at the base model's precision.
For explicit chain-of-thought reasoning, use the thinking variants Soofi-S-Isar-Preview and Soofi-S-Rhine-Preview (and their quantized derivatives).
This model is for research and development only (Preview).
License/Terms of Use
Released under a custom license ("Other"). TODO: add the full license text / link — inherits from the base model.
Deployment Geography
Global (open release on the Hugging Face Hub). Development and training infrastructure are located in Europe (see Computational Load on the base model card).
Use Case
Enterprise developers and researchers seeking a sovereign, European open-source LLM for industrial use: general assistant tasks, instruction following, and AI-agent / tool-use workflows. English and German are the primary languages. This quantized variant targets cost-effective inference on a single GPU.
Quick start
This repository is self-contained: it ships the model weights, the EntQuant plugin source, a Dockerfile, and a Compose file. Three lines and you have an OpenAI-compatible server:
hf download Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-2bit --local-dir ./Soofi-S-Instruct-Preview-EntQuant-2bit
cd Soofi-S-Instruct-Preview-EntQuant-2bit
docker compose up -d
(hf is Hugging Face's CLI; run pip install huggingface_hub if you don't have it. Alternatively, git clone works, but only if git-lfs is installed first. Otherwise you get tiny pointer files instead of the 13 GB of weights, a common gotcha.)
The server is then live on http://localhost:8000/v1. The model name to send in API requests is Soofi-S-Instruct-Preview-EntQuant-2bit.
Behind a corporate proxy? export HTTP_PROXY=http://your-proxy:port HTTPS_PROXY=http://your-proxy:port before docker compose up — the build picks them up via build.args.
Pin a specific GPU on a multi-GPU host? In docker-compose.yml, replace count: 1 with device_ids: ["3"] (GPU index) or device_ids: ["GPU-<uuid>"] (the UUID as printed by nvidia-smi -L).
Smoke test:
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Soofi-S-Instruct-Preview-EntQuant-2bit",
"messages": [{"role": "user", "content": "Briefly explain the concept of AI sovereignty."}]
}'
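If the request above fails with a connection error, the model may still be loading. The following is a minimal polling sketch in Python, assuming the standard OpenAI-compatible GET /v1/models route served by vLLM; the helper names `served_model_ids` and `wait_until_ready` are illustrative and not part of this repo.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default port from the Quick start
MODEL = "Soofi-S-Instruct-Preview-EntQuant-2bit"


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll /v1/models until MODEL is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/models", timeout=10) as resp:
                if MODEL in served_model_ids(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not reachable yet, or the body was not valid JSON
        time.sleep(poll_s)
    return False
```

Calling `wait_until_ready()` before the smoke test returns True once the model name appears in the list, and False if the timeout passes first.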
Override settings via .env (see .env.example in the repo): port, GPU index, max context length (the model supports up to 262144 / 256K), GPU memory fraction. The defaults serve at 32K context and 90% of GPU memory on a single GPU.
Requirements
- NVIDIA GPU with compute capability ≥ 9.0 (Hopper / Blackwell) for the production fp8 W8A8 path. Older GPUs work but fall back to a slower bf16 linear path.
- ~13 GB of GPU memory for the weights, plus KV cache (highly dependent on max-model-len and concurrency).
- NVIDIA Container Toolkit; Docker Engine 24+ / Compose v2.
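As a rough guide to how much memory remains for the KV cache, the arithmetic below subtracts the ~13 GB of weights from the default 90% memory fraction. It is only an upper-bound sketch: it ignores activations, CUDA graphs, and the decode scratch buffer, and the two GPU sizes are hypothetical examples rather than tested configurations.

```python
def kv_cache_budget_gb(gpu_memory_gb: float,
                       gpu_memory_fraction: float = 0.90,
                       weights_gb: float = 13.0) -> float:
    """Memory left for KV cache once the weights are resident.

    Ignores activations, CUDA graphs, and scratch buffers (assumption),
    so the real budget will be somewhat smaller.
    """
    budget = gpu_memory_gb * gpu_memory_fraction - weights_gb
    return max(budget, 0.0)


# Hypothetical GPU sizes, chosen only to illustrate the calculation.
for name, mem in [("80 GB GPU", 80.0), ("24 GB GPU", 24.0)]:
    print(f"{name}: ~{kv_cache_budget_gb(mem):.1f} GB for KV cache")
```

A small result here is a signal to lower the maximum context length or the concurrency rather than a hard limit.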
Quantization details
EntQuant (ICML 2026 paper · source) is a weight-only, scheme-agnostic post-training quantization method that optimizes one scale per output channel with L-BFGS to minimize
L = reconstruction_error(x, q(x)) + λ · L1(q(x))
The L1 term concentrates the quantized weight distribution toward low Shannon entropy. The weights stay in their target format (FP8 here) but become highly compressible: at λ ≈ 58, the FP8 codes retain ≈ 35 distinct values (vs. 256 unconstrained) and entropy-code to an effective bit size of ~2 bits per parameter.
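The effect of λ can be illustrated with a toy version of the objective. This sketch is not the EntQuant implementation: it rounds to an integer grid as a stand-in for FP8 (e4m3), picks the scale by grid search instead of L-BFGS, and assumes the L1 penalty acts on the integer codes. The λ values are therefore not comparable to the λ ≈ 58 quoted above.

```python
import math
import random
from collections import Counter


def entropy_bits(codes):
    """Empirical Shannon entropy of a code sequence, in bits per symbol."""
    n = len(codes)
    return -sum(c / n * math.log2(c / n) for c in Counter(codes).values())


def quantize(weights, scale):
    """Toy quantizer: round to an integer grid (stand-in for FP8 e4m3)."""
    return [round(w / scale) for w in weights]


def loss(weights, scale, lam):
    """Squared reconstruction error plus lam times the L1 norm of the codes."""
    codes = quantize(weights, scale)
    recon = sum((w - scale * q) ** 2 for w, q in zip(weights, codes))
    return recon + lam * sum(abs(q) for q in codes)


def best_scale(weights, lam, candidates):
    """Grid search over candidate scales (the real method uses L-BFGS)."""
    return min(candidates, key=lambda s: loss(weights, s, lam))


rng = random.Random(0)
channel = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # one synthetic output channel
scales = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]

for lam in (0.0, 0.01, 10.0):
    s = best_scale(channel, lam, scales)
    h = entropy_bits(quantize(channel, s))
    print(f"lambda={lam:<5} scale={s:<5} entropy={h:.2f} bits/weight")
```

With λ = 0 the search favors the finest scale and the codes spread over many values. As λ grows, a coarser scale wins, the codes cluster near zero, and the entropy drops, which is what makes the later entropy-coding pass effective.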
Scope of compression — following standard practice for hybrid Mamba/Transformer architectures, only the Transformer linear weights are quantized:
| Component | Status |
|---|---|
| Attention projections (q/k/v/o_proj) | ✅ EntQuant FP8 → 2-bit-effective entropy-coded |
| MoE expert projections (w1/w2 per expert) | ✅ EntQuant FP8 → 2-bit-effective entropy-coded |
| Mamba-2 state-space layers (in/out projections, conv1d, A/B/C/D parameters) | ❌ Kept at base precision |
| Token embedding table | ❌ Kept at base precision |
| LM head | ❌ Kept at base precision |
| LayerNorm / RMSNorm weights | ❌ Kept at base precision |
This checkpoint specifically:
| Property | Value |
|---|---|
| Base model | Soofi-Project/Soofi-S-Instruct-Preview (bf16) |
| Storage format | float-quantized (compressed-tensors), per-channel FP8 (e4m3fn) codes + entropy-coded payload |
| Quant method | entquant_coding (auto-discovered by vLLM via the plugin entry point) |
| Effective bit-size (Transformer linear weights) | ~2 bits/parameter |
| Resident model size on disk | ~13 GB |
| Decode | nvCOMP ANS GPU decompressor on every forward, into a static scratch reused across MoE layers |
| Reference numerics | W8A16 (weight-only) by default; W8A8 with ENTQUANT_LINEAR_COMPUTE=fp8 (on by default in this image) |
Important: the 2bit notation refers to the effective compressed bit size (storage cost) of the quantized Transformer linear weights, not 2-bit integer quantization in the conventional sense. The weights themselves are FP8 codes; entropy coding reduces the storage cost to ~2 bits each. At inference time the FP8 codes are decoded back to FP8 (no information loss in the decoding step) and used directly by vLLM's fused W8A8 kernels.
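The distinction between storage cost and numeric format can be shown with any lossless compressor. The sketch below uses Python's zlib purely as a stand-in for the ANS coder (the actual pipeline uses nvCOMP on the GPU), and the code distribution is a hypothetical peaked one, not measured from this checkpoint.

```python
import random
import zlib

rng = random.Random(0)

# Hypothetical peaked distribution over a handful of 8-bit codes.
symbols = [0, 1, 255, 2, 254]
weights = [0.60, 0.15, 0.15, 0.05, 0.05]
codes = bytes(rng.choices(symbols, weights=weights, k=100_000))

payload = zlib.compress(codes, level=9)  # stand-in for the ANS encoder
decoded = zlib.decompress(payload)       # stand-in for the GPU decoder

assert decoded == codes                  # lossless: every 8-bit code is recovered
bits_per_code = 8 * len(payload) / len(codes)
print(f"stored at {bits_per_code:.2f} bits per 8-bit code")
```

The decoded bytes are identical to the originals, so the only approximation in the pipeline is the FP8 quantization itself; the entropy-coding step changes how many bits are stored, not which values are used.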
What's in this image
| Layer | What |
|---|---|
| vllm/vllm-openai:v0.21.0 | vLLM, CUDA 13, torch 2.11, OpenAI server, tokenizer libs |
| nvidia-nvcomp-cu12==5.2.0.13 | GPU ANS / zstd decompressor |
| entquant-coding (bundled in this repo) | EntQuant codec + chunked container + decoder |
| entquant-vllm (bundled in this repo) | vLLM plugin: registers quant_method: entquant_coding, FULL-graph capturable decode, fp8 W8A8 linear, selective MoE decode |
Performance
Measured on B200 (NVIDIA Blackwell), single GPU, full CUDA-graph capture, vLLM 0.21.0:
| Batch size | Tokens/s |
|---|---|
| B=1 | TODO — measure on the @2-bit checkpoint specifically |
| B=16 | TODO |
| B=64 | TODO |
Reference numbers from the closely-related 3-bit-effective checkpoint are in the project's THROUGHPUT_LOG. We will fill these in here after end-to-end benchmarking of the @2-bit weights.
Release Date
Hugging Face Hub — Preview at https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-2bit. TODO: final release date (MM/DD/YYYY).
Reference(s)
- Project: https://soofi.info
- Base model: Soofi-Project/Soofi-S-Instruct-Preview
- EntQuant paper (ICML 2026): https://icml.cc/virtual/2026/poster/66714
- EntQuant compressor source: https://github.com/merantix-momentum/entquant
- Related models: see Related models.
Model Architecture
Inherits the architecture of the base model unchanged.
- Architecture Type: Transformer-based hybrid Mixture-of-Experts (MoE) with Mamba-2 state-space (SSM) layers and attention layers.
- Network Architecture: Custom Hybrid Mamba-2/MoE (Nemotron-style), designed from scratch — 23 Mamba-2/MoE layers + 6 attention layers; 128 routing experts + 1 shared expert per MoE layer; 6 experts activated per token.
- Number of model parameters: 3.0×10^10 total (30B), with ~3.5B active parameters during inference.
This model was developed from scratch (no parent model); the quantization is applied post-training to the bf16 base.
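Because only part of the network is compressed, the checkpoint is larger than 30B parameters at 2 bits each would suggest. The back-of-envelope sketch below shows how the size depends on the compressed fraction; it assumes 16 bits for every uncompressed parameter and ignores scales and container metadata. The 0.9 split in the last line is a guess for illustration, not a published figure.

```python
def checkpoint_size_gb(total_params: float,
                       compressed_fraction: float,
                       compressed_bits: float = 2.0,
                       base_bits: float = 16.0) -> float:
    """Approximate size when a fraction of parameters is entropy-coded
    and the rest stays at base precision (scales and metadata ignored)."""
    bits = total_params * (compressed_fraction * compressed_bits
                           + (1.0 - compressed_fraction) * base_bits)
    return bits / 8 / 1e9


TOTAL = 3.0e10  # total parameter count from this card

print(checkpoint_size_gb(TOTAL, 0.0))  # everything at bf16
print(checkpoint_size_gb(TOTAL, 1.0))  # everything at 2 effective bits
print(checkpoint_size_gb(TOTAL, 0.9))  # hypothetical: 90% of parameters compressed
```

The uncompressed components dominate the result: even a small share of parameters kept at 16 bits accounts for a large part of the final size.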
Computational Load
See the base model card for training compute, energy and emissions. Inference on this quantized variant runs comfortably on a single B200 / H100 (and works on smaller GPUs with reduced KV cache).
Input
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Chat/ChatML-style messages via the embedded chat template. No system prompt is required (none is injected by default). Context length up to 262144 (256K), capped at 32768 by default in this image; raise MAX_MODEL_LEN in .env to go higher.
Output
- Output Type(s): Text
- Output Format(s): String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: Non-thinking by default (no explicit reasoning trace). Supports the model's native tool-calling format (<tool_call><function=...><parameter=...>...</parameter></function></tool_call>). vLLM's qwen3_xml tool-call parser is wired up in the image, so OpenAI-style tool_choice: "auto" works out of the box.
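A tool-calling request can be sketched as follows. The `get_weather` tool and the `build_tool_request` helper are hypothetical and exist only for this example; the request shape follows the standard OpenAI chat-completions schema.

```python
import json

MODEL = "Soofi-S-Instruct-Preview-EntQuant-2bit"


def build_tool_request(question: str) -> dict:
    """Build chat-completions kwargs with one illustrative tool attached."""
    get_weather = {  # hypothetical tool, defined only for this example
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


request = build_tool_request("What is the weather in Munich right now?")
print(json.dumps(request, indent=2))
```

With the OpenAI Python client, the dictionary can be passed as `client.chat.completions.create(**request)`. If the model decides to call the tool, the parsed call should appear in `resp.choices[0].message.tool_calls` with JSON-encoded arguments.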
Software Integration
Runtime Engine(s):
- vLLM 0.21.0 with the entquant_coding plugin (bundled in this repo; auto-discovered).
- Other engines (HF transformers, llama.cpp / Ollama) do not load this checkpoint; the on-disk format is vLLM-specific. For HF transformers use the bf16 base model; for llama.cpp use the GGUF variant.
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Hopper (H100, H200) or Blackwell (B200, RTX PRO 6000) — for the production W8A8 fp8 path.
- Ampere and newer work via the bf16 linear fallback (set ENTQUANT_LINEAR_COMPUTE=bf16); some throughput is lost.
Preferred/Supported Operating System(s): Linux.
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment.
Model Version(s)
- This repo: Soofi-S-Instruct-Preview-EntQuant-2bit (EntQuant, ~2 effective bits, vLLM-only).
- Base: Soofi-S-Instruct-Preview (bf16, unquantized).
- Other quantized derivatives: see Related models.
Installation & Usage
Docker / Compose (recommended)
See Quick start above.
Direct vLLM (if you've already installed the plugin)
pip install entquant-coding[gpu] entquant-vllm # once the packages are on PyPI
# or from this repo:
pip install ./entquant-coding ./entquant-vllm --no-deps
ENTQUANT_ANS_GRAPH=1 ENTQUANT_LINEAR_COMPUTE=fp8 ENTQUANT_MOE_SELECTIVE=1 \
vllm serve ./ \
--trust-remote-code \
--max-model-len 32768 \
--served-model-name Soofi-S-Instruct-Preview-EntQuant-2bit \
--enable-auto-tool-choice --tool-call-parser qwen3_xml
OpenAI client (Python)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
model="Soofi-S-Instruct-Preview-EntQuant-2bit",
messages=[{"role": "user", "content": "Briefly explain the concept of AI sovereignty."}],
)
print(resp.choices[0].message.content)
Training, Testing, and Evaluation Datasets
See the base model card — quantization is post-training and does not change training data. The quantization process itself is data-free.
Dataset Overview (from the base model)
- Total Size: ~2.5×10^13 tokens (25 trillion).
- Languages: English, German (primary); French, Italian, Spanish (limited). English acts as the pivot language.
- Knowledge Cutoff: End of 2025.
- Training Start: April 2026.
Quantization data
This is a data-free post-training quantization — no calibration set is used. EntQuant optimizes scales purely from the weights, with no forward pass through any data. No personal data is used in the quantization process.
Evaluation Dataset
TODO: add accuracy delta vs. the bf16 base on standard benchmarks (MMLU, etc.) once measured. Expected: within EntQuant's published accuracy band at 2-bit effective rate (≤ a few % loss on most benchmarks; see the EntQuant paper for the reference rates on Llama-2-7B and Llama-3-8B).
Inference
- Acceleration Engine: vLLM 0.21.0 with the bundled entquant_coding plugin (FULL CUDA-graph capture; on-the-fly nvCOMP ANS decode; fp8 W8A8 cutlass matmul; selective expert decode at B=1).
- Specific Test Hardware: Validated on NVIDIA B200 (DGX, 183 GiB HBM3e). Also tested on RTX PRO 6000 Blackwell (97 GiB).
Ethical Considerations
The SOOFI consortium believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information, see the Model Card++ subcards below. Please report model quality, risk, security vulnerabilities, or concerns to contact@soofi.info.
The quantization step (EntQuant) is a deterministic, weight-only transformation; it does not introduce new training data, new behaviors, or new alignment properties beyond what the base model already exhibits. Any biases, capability ceilings, and safety considerations from the base model apply unchanged here.
Bias Subcard
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups in model design and testing | Inherited from the base model — see its card. Quantization does not affect bias. |
| Measures taken to mitigate against unwanted bias | See the base model card. |
| Bias Metric (if measured) | See the base model card. |
Explainability Subcard
| Field | Response |
|---|---|
| Intended Task/Domain | General assistant, instruction following, AI-agent/tool use |
| Model Type | Hybrid Mixture-of-Experts (MoE) autoregressive language model, EntQuant-quantized to ~2 effective bits |
| Intended Users | Enterprise developers and researchers |
| Output | Text (String) |
| Describe how the model works | Generates text autoregressively; a router activates 6 of 128 experts per token across hybrid Mamba-2/MoE and attention layers; quantized weights are decoded on the fly via nvCOMP ANS into a fixed GPU scratch and consumed by fused fp8 W8A8 matmuls |
| Technical Limitations | Preview checkpoint; non-primary languages (FR/IT/ES) are limited; ~2-bit-effective quantization may introduce a small accuracy delta vs. the bf16 reference (TODO: quantify); requires vLLM with the bundled plugin (cannot be loaded by stock HF transformers) |
| Verified to have met prescribed quality standards | TODO |
| Performance Metrics | TODO — accuracy delta vs. bf16 base on standard benchmarks |
| Potential Known Risks and Mitigation | May generate incorrect, biased, or unsafe content; apply use-case-specific testing and guardrails before deployment |
| Terms of Use/Licensing | Other (see License/Terms of Use) |
Privacy Subcard
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | TODO — see base model card |
| Personal data used to create this model? | Quantization is data-free; for training-data privacy see the base model card |
| Was consent obtained for any personal data used? | See the base model card |
| How often is dataset reviewed? | See the base model card |
| Was data from user interactions with the AI model used to train the model? | No |
| Is there provenance for all datasets used in training? | See the base model card |
| Applicable Privacy Policy | TODO |
Safety & Security Subcard
| Field | Response |
|---|---|
| Model Application Field(s) | Industrial use; customer service; general-purpose assistant and agent applications |
| Describe the life critical impact (if present) | None intended. Not for use in life-critical or safety-critical decision-making without independent validation |
| Use Case Restrictions | Abide by the applicable license agreement (see License/Terms of Use) |
| Model and dataset restrictions | TODO |
Related models
Base model
- Soofi-Project/Soofi-S-Instruct-Preview: bf16 reference; use with HF transformers.
Variants of this checkpoint
- Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-3bit: ~3 effective bits (lower compression, smaller accuracy delta).
- Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-4bit: ~4 effective bits (least compression, closest to bf16 numerics).
- Soofi-Project/Soofi-S-Instruct-Preview-FP8: uncompressed FP8 (no EntQuant).
- Soofi-Project/Soofi-S-Instruct-Preview-GGUF: for llama.cpp / Ollama.
Reasoning variants of the base
- Soofi-Project/Soofi-S-Isar-Preview and its EntQuant derivatives.
- Soofi-Project/Soofi-S-Rhine-Preview and its EntQuant derivatives.
Citation
If you use this model, please cite both the base model and the EntQuant paper:
@misc{soofi_s_instruct_preview,
title = {Soofi-S-Instruct-Preview},
author = {SOOFI Consortium},
year = {2026},
url = {https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview}
}
@misc{soofi_s_instruct_preview_fp8_2bit,
title = {Soofi-S-Instruct-Preview-EntQuant-2bit (EntQuant-quantized)},
author = {SOOFI Consortium},
year = {2026},
url = {https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-2bit}
}
@inproceedings{entquant_icml2026,
title = {EntQuant: Entropy-Optimized Post-Training Quantization},
author = {Merantix Momentum},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026},
url = {https://icml.cc/virtual/2026/poster/66714}
}