lateon-onnx
This repository hosts the optimized, single-file ONNX representation of the ModernBERT-backed lightonai/LateOn model.
It is designed to run without PyTorch, using the intextus-embed runtime library.
Model Metadata
- Backbone: ModernBERT-base (140M parameters)
- Output Dimensions: 128-dimensional late-interaction embeddings
- ONNX File Size: 580 MB (fully self-contained, merged weights)
- Case Sensitivity: Case-sensitive (requires do_lower_case=False)
Usage
Install the intextus-embed runtime:
pip install intextus-embed
Load the model automatically and run inference (set do_lower_case=False because LateOn is case-sensitive):
from intextus import IntextusEncoder, compute_maxsim
# Automatically downloads and caches the model from Hugging Face
model = IntextusEncoder("lateon", do_lower_case=False)
# Encode queries and documents
query_embeddings = model.encode_queries("What is ultra-low latency?")
doc_embeddings = model.encode_docs("ONNX runtime bypasses the PyTorch layer completely.")
# Compute the MaxSim similarity score via NumPy
score = compute_maxsim(query_embeddings[0], doc_embeddings[0])
print(f"Relevance Score (MaxSim): {score:.4f}")
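For readers unfamiliar with late-interaction scoring, the following is a minimal NumPy sketch of what a MaxSim score computes in general: each query token is matched to its most similar document token, and those maxima are summed. It is not the intextus-embed implementation. The function name maxsim_reference is hypothetical, and the sketch assumes each embedding is a 2-D array of shape (num_tokens, 128) with L2-normalized rows, which this README does not confirm.

```python
import numpy as np


def maxsim_reference(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Hypothetical reference MaxSim (illustration only, not the library's code).

    query_emb: (num_query_tokens, dim)
    doc_emb:   (num_doc_tokens, dim)
    """
    # Pairwise dot products between every query token and every document token.
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    # Keep the best-matching document token per query token, then sum.
    return float(sim.max(axis=1).sum())


def unit_rows(rng: np.random.Generator, n: int, dim: int = 128) -> np.ndarray:
    """Random stand-in embeddings with L2-normalized rows (assumed layout)."""
    x = rng.standard_normal((n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
query = unit_rows(rng, 4)  # 4 query tokens
doc = unit_rows(rng, 9)    # 9 document tokens
print(f"Reference MaxSim: {maxsim_reference(query, doc):.4f}")
```

Under the normalization assumption, the score is bounded above by the number of query tokens, so scores for queries of different lengths are not directly comparable.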