LFM2.5-8B-A1B-ONNX
ONNX export of LFM2.5-8B-A1B for cross-platform inference.
LFM2.5-8B-A1B is a Mixture of Experts model with 8B total parameters and about 1B active parameters per token. It uses 32 experts with 4 experts activated per token, combining the efficiency of sparse models with the quality of larger dense models.
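To make the "32 experts, 4 active per token" idea concrete, here is a minimal sketch of generic top-k routing in NumPy. It is an illustration only: the function name `route_tokens` and the softmax-over-selected-experts normalization are assumptions, and the actual LFM2.5 router may score or normalize experts differently.

```python
import numpy as np

NUM_EXPERTS = 32  # total experts, from the model description
TOP_K = 4         # experts activated per token, from the model description


def route_tokens(router_logits, top_k=TOP_K):
    """Pick top_k experts per token and return (indices, weights).

    router_logits: array of shape (num_tokens, num_experts).
    Hypothetical helper: a generic top-k softmax router, not the model's own code.
    """
    # Indices of the top_k largest logits per token (order within the k is arbitrary).
    top_idx = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    # Softmax over the selected experts only, so the weights sum to 1 per token.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights


# Synthetic router scores for 3 tokens (random data, for illustration only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, NUM_EXPERTS))
idx, w = route_tokens(logits)
print(idx.shape, w.shape)
```

Only the 4 selected experts run for each token, which is why the active parameter count stays near 1B even though all 32 experts contribute to the 8B total.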
Recommended Variants
| Precision | Size | Use Case |
|---|---|---|
| Q4F16 | ~4.7GB | Recommended (Q4 MoE + FP16 dense) |
| FP16 | ~15.8GB | Higher quality |
| Q4 | ~5.2GB | Smallest size |
| Q8 | ~30.4GB | Highest-fidelity quantized variant |
Note: This model is too large for WebGPU browser inference.
Validation
This export was validated against the local PyTorch reference for LiquidAI/LFM2.5-8B-A1B.
- FP32 padded-batch parity passed for both left and right padding, with cosine similarity 1.0000 and top-5 overlap 5/5 at the last valid token for each row.
- Q4 decoder and coherence checks passed the repository thresholds. Average coherence similarity: 0.7144.
- Q4F16 was runtime-validated on CPUExecutionProvider and matched the same decoder/coherence thresholds as Q4. Average coherence similarity: 0.7145.
- Q8 decoder and coherence checks passed, and stayed very close to the PyTorch reference. Average coherence similarity: 0.9975.
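The parity metrics named above (cosine similarity and top-5 overlap at the last valid token) can be sketched as follows. This is not the repository's validation script: the helper names are hypothetical, the logits are synthetic, and the "coherence similarity" metric is left out because its definition is not given here.

```python
import numpy as np


def last_valid_index(attention_mask_row):
    """Index of the last non-padding token; works for left or right padding."""
    return int(np.flatnonzero(attention_mask_row)[-1])


def cosine_similarity(a, b):
    """Cosine similarity between two logit vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k_overlap(a, b, k=5):
    """Number of token ids shared by the top-k of both logit vectors."""
    top_a = set(np.argsort(a)[-k:].tolist())
    top_b = set(np.argsort(b)[-k:].tolist())
    return len(top_a & top_b)


# Synthetic stand-ins for reference and exported logits at one position.
rng = np.random.default_rng(0)
reference = rng.normal(size=1000)
exported = reference + rng.normal(scale=1e-4, size=1000)  # tiny numerical noise

right_padded = np.array([1, 1, 1, 0, 0])
left_padded = np.array([0, 0, 1, 1, 1])
print(last_valid_index(right_padded), last_valid_index(left_padded))
print(round(cosine_similarity(reference, exported), 4))
print(top_k_overlap(reference, reference))
```

In a real check, the two vectors would be the PyTorch and ONNX logits for the same row, both taken at `last_valid_index` of that row's attention mask.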
Model Files
onnx/
├── model.onnx # FP32 model graph
├── model.onnx_data* # FP32 weights
├── model_fp16.onnx # FP16 model graph
├── model_fp16.onnx_data* # FP16 weights
├── model_q4.onnx # Q4 model graph
├── model_q4.onnx_data* # Q4 weights
├── model_q4f16.onnx # Q4 MoE experts + FP16 dense (recommended)
├── model_q4f16.onnx_data* # Q4F16 weights
├── model_q8.onnx # Q8 model graph
└── model_q8.onnx_data* # Q8 weights
* Large models split weights across multiple files:
model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
All data files must be in the same directory as the .onnx file.
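Because every weight shard has to sit next to its .onnx graph, a quick listing before loading can catch an incomplete download. The sketch below uses only the standard library; `external_data_files` is a hypothetical helper, and the demo uses empty placeholder files that follow the naming pattern above.

```python
import tempfile
from pathlib import Path


def external_data_files(onnx_path):
    """List weight shards stored next to an .onnx graph file.

    Matches names such as model_q4.onnx_data, model_q4.onnx_data_1, ...
    """
    onnx_path = Path(onnx_path)
    prefix = onnx_path.name + "_data"
    return sorted(p for p in onnx_path.parent.iterdir() if p.name.startswith(prefix))


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["model_q4.onnx", "model_q4.onnx_data",
                 "model_q4.onnx_data_1", "model_q8.onnx_data"]:
        (folder / name).touch()
    shards = [p.name for p in external_data_files(folder / "model_q4.onnx")]

print(shards)
```

An empty result for a graph that is expected to have external weights suggests the data files were not downloaded or were moved to another directory.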
Python
Installation
pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
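After installing, it can help to choose execution providers explicitly instead of relying on defaults. The following is a small sketch: `pick_providers` is a hypothetical helper, and the preference order (CUDA first, then CPU) is an assumption rather than a recommendation from this model card.

```python
def pick_providers(available,
                   preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {preferred} found in {available}")
    return chosen


# With ONNX Runtime installed, the real list comes from:
#   import onnxruntime
#   available = onnxruntime.get_available_providers()
#   session = onnxruntime.InferenceSession(path, providers=pick_providers(available))
cpu_only = pick_providers(["CPUExecutionProvider"])
with_gpu = pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
print(cpu_only, with_gpu)
```

With the plain `onnxruntime` package, the CUDA provider is typically absent, so the helper falls back to CPU.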
Inference
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoTokenizer
import numpy as np
import onnxruntime
# 1. Load config, tokenizer, and model
model_id = "LiquidAI/LFM2.5-8B-A1B-ONNX"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eos_token_id = config.eos_token_id
filename = "model_q4f16.onnx" # Options: model.onnx, model_fp16.onnx, model_q4.onnx, model_q4f16.onnx, model_q8.onnx
model_path = snapshot_download(repo_id=model_id, allow_patterns=f"onnx/{filename}*")
session = onnxruntime.InferenceSession(f"{model_path}/onnx/{filename}")
input_names = {inp.name for inp in session.get_inputs()}
# 2. Prepare inputs
prompt = "What is C. elegans?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="np",
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
batch_size = input_ids.shape[0]
past_cache_values = {}
for inp in session.get_inputs():
    name = inp.name
    shape = inp.shape
    dtype = np.float32 if inp.type == "tensor(float)" else np.float16
    if name.startswith("past_key_values"):
        # Attention KV cache starts empty (sequence length 0).
        past_cache_values[name] = np.zeros([batch_size, shape[1], 0, shape[3]], dtype=dtype)
    elif name.startswith("past_conv"):
        # Conv state cache: sized from the dimensions declared by the graph.
        past_cache_values[name] = np.zeros([batch_size, shape[1], shape[2]], dtype=dtype)
position_ids = np.arange(input_ids.shape[1], dtype=np.int64).reshape(1, -1)
# 3. Generation loop
max_new_tokens = 256
generated_tokens = np.array([[]], dtype=np.int64)
cur_len = input_ids.shape[1]
# Output names are fixed for the session, so look them up once.
output_names = [out.name for out in session.get_outputs()]
for i in range(max_new_tokens):
    if i == 0:
        # Prefill: feed the whole prompt.
        ids = input_ids
        pos = position_ids
    else:
        # Decode: feed only the most recent token.
        ids = generated_tokens[:, -1:]
        pos = np.array([[cur_len - 1]], dtype=np.int64)
    feed = {
        "input_ids": ids,
        "attention_mask": attention_mask,
        **past_cache_values,
    }
    if "position_ids" in input_names:
        feed["position_ids"] = pos
    outputs = session.run(None, feed)
    logits = outputs[0]
    # Greedy decoding: take the highest-scoring token.
    next_token = logits[:, -1].argmax(-1, keepdims=True)
    generated_tokens = (
        next_token if generated_tokens.shape[1] == 0
        else np.concatenate([generated_tokens, next_token], axis=-1)
    )
    attention_mask = np.concatenate(
        [attention_mask, np.ones_like(next_token, dtype=np.int64)],
        axis=-1,
    )
    # Carry each "present*" output over as the matching "past*" input.
    cache_outputs = dict(zip(output_names[1:], outputs[1:]))
    for key in past_cache_values:
        present_key = key.replace("past_key_values", "present").replace("past_conv", "present_conv")
        past_cache_values[key] = cache_outputs[present_key]
    cur_len += 1
    if np.isin(next_token, eos_token_id).any():
        break
    print(tokenizer.decode(next_token[0]), end="", flush=True)
print()
# 4. Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
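The loop above decodes greedily with `argmax`. For more varied output, temperature and top-k sampling are a common alternative; the sketch below is one way to do that in NumPy. The function name `sample_next_token` is hypothetical, and the default `temperature` and `top_k` values are illustrative assumptions, not settings recommended for this model.

```python
import numpy as np


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Sample one token id per row. logits: (batch, vocab) -> (batch, 1) int64."""
    if rng is None:
        rng = np.random.default_rng()
    # Cast up (the model may emit float16) and apply temperature scaling.
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    k = min(top_k, scaled.shape[-1])
    # k-th largest logit per row; everything below it is masked out.
    kth = np.partition(scaled, -k, axis=-1)[:, -k][:, None]
    scaled = np.where(scaled < kth, -np.inf, scaled)
    # Numerically stable softmax over the remaining candidates.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    tokens = [rng.choice(probs.shape[-1], p=row) for row in probs]
    return np.array(tokens, dtype=np.int64)[:, None]


# With top_k=1 only the best token survives, so sampling reduces to greedy decoding.
demo_logits = np.array([[0.1, 2.0, -1.0], [3.0, 0.0, 1.0]])
picked = sample_next_token(demo_logits, top_k=1, rng=np.random.default_rng(0))
print(picked.tolist())
```

To try it in the generation loop, the greedy line could be swapped for `next_token = sample_next_token(logits[:, -1])`; the rest of the loop stays the same because the result keeps the `(batch, 1)` int64 shape.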
License
This model is released under the LFM 1.0 License.