LFM2.5-8B-A1B-ONNX

ONNX export of LFM2.5-8B-A1B for cross-platform inference.

LFM2.5-8B-A1B is a Mixture of Experts (MoE) model with 8B total parameters, of which about 1B are active per token. It has 32 experts and activates 4 of them per token, combining the efficiency of sparse models with the quality of larger dense models.
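
To make "4 of 32 experts per token" concrete, here is a minimal NumPy sketch of top-k expert routing. It is an illustration only: the softmax-over-selected-experts gating, the `route_tokens` helper, and the random router scores are assumptions, and the actual LFM2.5 router may differ.

```python
import numpy as np

NUM_EXPERTS = 32  # experts, from the description above
TOP_K = 4         # experts activated per token


def route_tokens(router_logits, top_k=TOP_K):
    """Pick the top_k experts per token and normalize their weights.

    router_logits: (num_tokens, num_experts) scores from a hypothetical router.
    Returns (expert_ids, weights), each of shape (num_tokens, top_k).
    """
    # Indices of the top_k largest scores per token (unordered within the top_k).
    expert_ids = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    selected = np.take_along_axis(router_logits, expert_ids, axis=-1)
    # Softmax over the selected experts only, so weights sum to 1 per token.
    selected = selected - selected.max(axis=-1, keepdims=True)
    weights = np.exp(selected)
    weights /= weights.sum(axis=-1, keepdims=True)
    return expert_ids, weights


rng = np.random.default_rng(0)
logits = rng.normal(size=(3, NUM_EXPERTS))  # 3 tokens with made-up router scores
expert_ids, weights = route_tokens(logits)
print(expert_ids.shape, weights.sum(axis=-1))
```

Only the selected experts run for a given token, which is why the active parameter count is much smaller than the total.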

Recommended Variants

Precision	Size	Use Case
Q4F16	~4.7GB	Recommended (Q4 MoE + FP16 dense)
FP16	~15.8GB	Higher quality
Q4	~5.2GB	Smallest size
Q8	~30.4GB	Highest-fidelity quantized variant

Note: This model is too large for WebGPU browser inference.

Validation

This export was validated against the local PyTorch reference for LiquidAI/LFM2.5-8B-A1B.

FP32 padded-batch parity passed for both left and right padding, with cosine similarity 1.0000 and top-5 overlap 5/5 at the last valid token for each row.
Q4 decoder and coherence checks passed the repository thresholds. Average coherence similarity: 0.7144.
Q4F16 was runtime-validated on CPUExecutionProvider and matched the same decoder/coherence thresholds as Q4. Average coherence similarity: 0.7145.
Q8 decoder and coherence checks passed, and stayed very close to the PyTorch reference. Average coherence similarity: 0.9975.
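
The padded-batch parity metrics above (cosine similarity and top-5 overlap at the last valid token) could be computed along these lines. This is a sketch under stated assumptions, not the repository's validation script: the helper names and the toy logits are hypothetical.

```python
import numpy as np


def last_valid_index(attention_mask):
    """Index of the last non-padding token per row (left or right padding)."""
    seq_len = attention_mask.shape[1]
    # Position of the last 1 = seq_len - 1 - (position of the first 1 in the reversed row).
    return seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)


def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k_overlap(a, b, k=5):
    """Number of token ids shared by the top-k entries of two logit vectors."""
    top_a = set(np.argsort(a)[-k:].tolist())
    top_b = set(np.argsort(b)[-k:].tolist())
    return len(top_a & top_b)


def compare_last_token(ref_logits, test_logits, attention_mask, k=5):
    """ref_logits and test_logits have shape (batch, seq_len, vocab)."""
    results = []
    for row, idx in enumerate(last_valid_index(attention_mask)):
        ref, test = ref_logits[row, idx], test_logits[row, idx]
        results.append((cosine_similarity(ref, test), top_k_overlap(ref, test, k)))
    return results


# Toy data: batch of 2, sequence length 4, vocabulary of 50 (all made up).
rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(2, 4, 50))
test_logits = ref_logits.copy()  # stands in for a perfectly matching export
left_padded = np.array([[0, 0, 1, 1], [1, 1, 1, 1]])
right_padded = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])

print(last_valid_index(left_padded))   # [3 3]
print(last_valid_index(right_padded))  # [1 3]
print(compare_last_token(ref_logits, test_logits, right_padded))
```

Comparing at the last valid token rather than the last position matters for right-padded rows, where the final positions hold padding.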

Model Files

onnx/
├── model.onnx              # FP32 model graph
├── model.onnx_data*        # FP32 weights
├── model_fp16.onnx         # FP16 model graph
├── model_fp16.onnx_data*   # FP16 weights
├── model_q4.onnx           # Q4 model graph
├── model_q4.onnx_data*     # Q4 weights
├── model_q4f16.onnx        # Q4 MoE experts + FP16 dense (recommended)
├── model_q4f16.onnx_data*  # Q4F16 weights
├── model_q8.onnx           # Q8 model graph
└── model_q8.onnx_data*     # Q8 weights

* Large models split weights across multiple files:
  model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
  All data files must be in the same directory as the .onnx file.
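
Because a missing shard typically only surfaces when the session is created, it can help to list the weight files next to a graph before loading it. The sketch below uses only the standard library; `find_external_data` is a hypothetical helper, and the temporary directory stands in for a downloaded `onnx/` folder.

```python
import tempfile
from pathlib import Path


def find_external_data(onnx_path):
    """Return the weight files that sit next to an .onnx graph, sorted by name."""
    onnx_path = Path(onnx_path)
    # Matches <name>_data, <name>_data_1, <name>_data_2, ... in the same directory.
    return sorted(onnx_path.parent.glob(onnx_path.name + "_data*"))


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    onnx_dir = Path(tmp) / "onnx"
    onnx_dir.mkdir()
    for name in ["model_q4.onnx", "model_q4.onnx_data",
                 "model_q4.onnx_data_1", "model_q8.onnx_data"]:
        (onnx_dir / name).touch()
    shards = [p.name for p in find_external_data(onnx_dir / "model_q4.onnx")]

print(shards)  # ['model_q4.onnx_data', 'model_q4.onnx_data_1']
```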

Python

Installation

pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
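
With the GPU package installed, the execution provider still has to be requested when the session is created. The sketch below shows one way to choose providers with a CPU fallback; `pick_providers` is a hypothetical helper, and GPU execution is an assumption here, since the validation notes above only mention CPUExecutionProvider.

```python
def pick_providers(available,
                   preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    # Fall back to CPU if none of the preferred providers is present.
    return chosen or ["CPUExecutionProvider"]


cpu_only = pick_providers(["CPUExecutionProvider"])
with_gpu = pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
print(cpu_only)  # ['CPUExecutionProvider']
print(with_gpu)  # ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

In practice, `available` would come from `onnxruntime.get_available_providers()`, and the result would be passed as `providers=` to `onnxruntime.InferenceSession`.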

Inference

from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoTokenizer
import numpy as np
import onnxruntime

# 1. Load config, tokenizer, and model
model_id = "LiquidAI/LFM2.5-8B-A1B-ONNX"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eos_token_id = config.eos_token_id

# Options: model.onnx, model_fp16.onnx, model_q4.onnx, model_q4f16.onnx, model_q8.onnx
filename = "model_q4f16.onnx"
# The trailing wildcard also downloads the external weight files (e.g. model_q4f16.onnx_data*).
model_path = snapshot_download(repo_id=model_id, allow_patterns=f"onnx/{filename}*")
session = onnxruntime.InferenceSession(f"{model_path}/onnx/{filename}")
input_names = {inp.name for inp in session.get_inputs()}

# 2. Prepare inputs
prompt = "What is C. elegans?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="np",
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
batch_size = input_ids.shape[0]

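# Start with empty caches: attention KV caches have sequence length 0,
# and conv caches are zero-filled at their fixed shape.
past_cache_values = {}
for inp in session.get_inputs():
    name = inp.name
    shape = inp.shape
    # Match the cache dtype to the graph input (FP32 or FP16).
    dtype = np.float32 if inp.type == "tensor(float)" else np.float16
    if name.startswith("past_key_values"):
        past_cache_values[name] = np.zeros([batch_size, shape[1], 0, shape[3]], dtype=dtype)
    elif name.startswith("past_conv"):
        past_cache_values[name] = np.zeros([batch_size, shape[1], shape[2]], dtype=dtype)

# Positions 0..prompt_len-1 for the prefill step (batch size 1).
position_ids = np.arange(input_ids.shape[1], dtype=np.int64).reshape(1, -1)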

# 3. Generation loop
max_new_tokens = 256
generated_tokens = np.array([[]], dtype=np.int64)
cur_len = input_ids.shape[1]
for i in range(max_new_tokens):
    if i == 0:
        ids = input_ids
        pos = position_ids
    else:
        ids = generated_tokens[:, -1:]
        pos = np.array([[cur_len - 1]], dtype=np.int64)

    feed = {
        "input_ids": ids,
        "attention_mask": attention_mask,
        **past_cache_values,
    }
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    logits = outputs[0]
    next_token = logits[:, -1].argmax(-1, keepdims=True)

    generated_tokens = (
        next_token if generated_tokens.shape[1] == 0
        else np.concatenate([generated_tokens, next_token], axis=-1)
    )
    attention_mask = np.concatenate(
        [attention_mask, np.ones_like(next_token, dtype=np.int64)],
        axis=-1,
    )

    # Feed each "present*" output back in as the matching "past*" input for the next step.
    output_names = [out.name for out in session.get_outputs()]
    cache_outputs = dict(zip(output_names[1:], outputs[1:]))  # outputs[0] holds the logits
    for key in past_cache_values:
        present_key = key.replace("past_key_values", "present").replace("past_conv", "present_conv")
        past_cache_values[key] = cache_outputs[present_key]

    cur_len += 1
    if np.isin(next_token, eos_token_id).any():
        break

    print(tokenizer.decode(next_token[0]), end="", flush=True)
print()

# 4. Decode the full response (repeats the text streamed above)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
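
The cache hand-off in the generation loop depends on a naming convention: each `past_key_values*` input is refreshed from a `present*` output, and each `past_conv*` input from a `present_conv*` output. The sketch below isolates that step. The tensor names are hypothetical examples, and strings stand in for the real arrays.

```python
def present_name(past_name):
    """Map a cache input name to the output name that refreshes it."""
    return (past_name.replace("past_key_values", "present")
                     .replace("past_conv", "present_conv"))


def update_cache(past_cache, output_names, outputs):
    """Return a new cache dict built from one step's session outputs."""
    # outputs[0] holds the logits; every later output is a cache tensor.
    present = dict(zip(output_names[1:], outputs[1:]))
    return {name: present[present_name(name)] for name in past_cache}


# Hypothetical tensor names; strings stand in for NumPy arrays.
past_cache = {"past_key_values.0.key": "old_k", "past_conv.1": "old_conv"}
output_names = ["logits", "present.0.key", "present_conv.1"]
outputs = ["logits_array", "new_k", "new_conv"]

new_cache = update_cache(past_cache, output_names, outputs)
print(new_cache)  # {'past_key_values.0.key': 'new_k', 'past_conv.1': 'new_conv'}
```

Looking outputs up by name instead of by position keeps the update correct even if the graph lists its cache outputs in a different order than its cache inputs.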

License

This model is released under the LFM 1.0 License.
