lettuce-emb-768d-v4
A roleplay-first embedding model for LettuceAI's on-device memory layer.
lettuce-emb-768d-v4 is built for one job: retrieve the right memory from a long, messy roleplay conversation history. v3 failed at this (recall@1 = 0.020). v4 hits recall@1 = 0.924 while keeping general semantic quality intact (STSBenchmark = 0.819).
It also works fine as a general retrieval embedder. It just was not optimized for that as the first priority.
- Backbone: nomic-ai/nomic-embed-text-v1.5
- Output: 768d native (no Dense projection)
- Matryoshka dims: 64/128/256/512/768
- Context length: 4096 tokens
- Pooling: mean over tokens, L2 normalized
- License: Apache 2.0
Headline numbers
| Metric | v3 | v4 | Change |
|---|---|---|---|
| RP recall@1 | 0.020 | 0.924 | 46.2x |
| RP recall@5 | 0.109 | 0.982 | 9.0x |
| STSBenchmark Spearman | 0.809 | 0.819 | +0.010 |
| Output dim | 512d | 768d native | no bottleneck |
| Matryoshka | no | 5 tiers from one file | yes |
| ONNX | not released | FP32 + INT8 | shipped |
The full release write-up is on the LettuceAI blog.
Files
.
├── config.json
├── configuration_hf_nomic_bert.py
├── model.safetensors # 547 MB, FP32 weights
├── tokenizer.json
├── tokenizer_config.json
├── metrics.json # release checkpoint metrics, full
├── best_release_metrics.json # release checkpoint metrics, summary
└── onnx/
    ├── model.fp32.onnx # 547.7 MB, server / GPU
    └── model.int8.onnx # 138.0 MB, on-device CPU
Both ONNX files return L2-normalized 768d vectors directly. To use a smaller Matryoshka dim, the caller slices the vector and re-normalizes it.
Usage
transformers + torch
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Zeolit/lettuce-emb-768d-v4"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def embed(texts, dim=768):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    mask = enc.attention_mask.unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    pooled = F.normalize(pooled, p=2, dim=1)
    sliced = pooled[:, :dim]
    return F.normalize(sliced, p=2, dim=1)  # re-normalize after slice

vecs = embed(["hello world", "I remember that day"], dim=256)
sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Zeolit/lettuce-emb-768d-v4", trust_remote_code=True)
vecs = model.encode(["hello world", "I remember that day"], normalize_embeddings=True)
ONNX (recommended for production)
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Zeolit/lettuce-emb-768d-v4")
sess = ort.InferenceSession("onnx/model.fp32.onnx", providers=["CPUExecutionProvider"])
enc = tok(["hello world"], padding=True, truncation=True, max_length=4096, return_tensors="np")
inputs = {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
}
vec = sess.run(None, inputs)[0] # already L2-normalized, shape (1, 768)
For a roughly 4x smaller deployment (138.0 MB instead of 547.7 MB), swap to model.int8.onnx. The output API is identical.
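Before swapping files, it can be worth checking how closely the INT8 vectors track the FP32 ones on your own text. A minimal sketch, assuming both outputs have already been collected as arrays of shape (n, 768). `mean_cosine` is a hypothetical helper, not part of the release, and the random arrays are stand-ins for real model outputs:

```python
import numpy as np

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=-1)))

# Stand-in data (assumption): replace with the outputs of the FP32 and INT8 sessions.
rng = np.random.default_rng(0)
fp32_vecs = rng.normal(size=(4, 768)).astype(np.float32)
int8_vecs = fp32_vecs + 0.01 * rng.normal(size=(4, 768)).astype(np.float32)

agreement = mean_cosine(fp32_vecs, int8_vecs)
```

What counts as close enough depends on the application; this card does not state a threshold.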
Matryoshka slicing
def slice_dim(vec, dim):
    sliced = vec[..., :dim]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)
v_64 = slice_dim(vec, 64) # 256 bytes per vector (FP32)
v_128 = slice_dim(vec, 128)
v_256 = slice_dim(vec, 256)
v_512 = slice_dim(vec, 512)
v_768 = vec # already normalized
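Because sliced vectors are re-normalized, cosine similarity reduces to a dot product, so a small memory store can be searched with one matrix multiply. A minimal sketch, assuming memories are held as an (n, dim) array of unit vectors. `top_k` and the random stand-in vectors are illustrative and not part of the release:

```python
import numpy as np

def slice_dim(vec, dim):
    # Truncate to the first `dim` components, then re-normalize.
    sliced = vec[..., :dim]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

def top_k(query, memory, k=5):
    """Indices and scores of the k rows of `memory` most similar to `query`."""
    scores = memory @ query            # cosine similarity for unit vectors
    k = min(k, len(scores))
    idx = np.argsort(-scores)[:k]      # highest score first
    return idx, scores[idx]

# Stand-in embeddings (assumption): replace with real model outputs.
rng = np.random.default_rng(0)
full = rng.normal(size=(100, 768))
full /= np.linalg.norm(full, axis=-1, keepdims=True)

memory_256 = slice_dim(full, 256)
query_256 = memory_256[42]             # a stored vector should retrieve itself first
idx, scores = top_k(query_256, memory_256, k=3)
```

Queries and stored memories must be sliced to the same dim before comparison.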
Matryoshka tradeoff
| Dim | Bytes (FP32) | recall@1 | recall@5 | recall@10 | MRR@10 |
|---|---|---|---|---|---|
| 64d | 256 | 0.424 | 0.648 | 0.698 | 0.523 |
| 128d | 512 | 0.488 | 0.723 | 0.768 | 0.591 |
| 256d | 1,024 | 0.504 | 0.752 | 0.796 | 0.614 |
| 512d | 2,048 | 0.509 | 0.767 | 0.808 | 0.622 |
| 768d | 3,072 | 0.512 | 0.769 | 0.815 | 0.628 |
(Numbers from the 144k-passage extreme retrieval benchmark. Full benchmark in the release post.)
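For readers less familiar with the column headers, here is a sketch of how recall@k and MRR@k are conventionally computed when each query has exactly one relevant passage. The single-gold assumption and the function names are illustrative; the benchmark's actual evaluation code is not shown in this card and may differ.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

def mrr_at_k(ranked_ids, gold_ids, k):
    """Mean of 1/rank of the gold passage, counting 0 when it is outside the top-k."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        top = ranked[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)

# Toy example: three queries, a ranked list of passage ids per query, one gold id each.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
gold = ["a", "z", "missing"]

r1 = recall_at_k(ranked, gold, 1)   # only the first query hits at rank 1
r3 = recall_at_k(ranked, gold, 3)   # first and second queries hit within the top 3
mrr = mrr_at_k(ranked, gold, 3)     # (1/1 + 1/3 + 0) / 3
```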
Going from 768d to 64d costs ~17% of recall@1 in relative terms (0.512 to 0.424) in exchange for 12x smaller vectors. Even at 64d, v4 is well above v3's full 512d performance.
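The size and recall figures quoted above follow directly from the table. A small worked check, using the table's recall@1 values and assuming 4 bytes per FP32 component:

```python
BYTES_PER_FLOAT32 = 4

# recall@1 by dimension, copied from the Matryoshka tradeoff table
recall_at_1 = {64: 0.424, 128: 0.488, 256: 0.504, 512: 0.509, 768: 0.512}

def bytes_per_vector(dim):
    return dim * BYTES_PER_FLOAT32

def relative_recall_loss(dim, full_dim=768):
    """Relative drop in recall@1 versus the full-size vector."""
    return 1.0 - recall_at_1[dim] / recall_at_1[full_dim]

size_ratio = bytes_per_vector(768) / bytes_per_vector(64)   # 3072 / 256
loss_64 = relative_recall_loss(64)                          # 1 - 0.424 / 0.512
```

The same two functions can be applied to any other row to pick a dim for a given storage budget.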
Training
Three-stage curriculum, ~285k pairs/triplets across roleplay/persona, long-form narrative, and general retrieval data, with BGE-M3 hard negatives refreshed per epoch.
| Stage | Seq len | Batch | Negatives | Losses |
|---|---|---|---|---|
| 1 warmup | 512 | 128 pairs | in-batch | MNR |
| 2 main | 2048 | 16 triplets | hard negatives | MNR + Cosine distillation |
| 3 refinement | 4096 | 8 triplets | refreshed hard negatives | MNR + Cosine + MarginMSE + STS replay |
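As a rough illustration of the in-batch-negatives idea in stage 1: assuming MNR refers to a multiple-negatives-ranking style loss, each query is scored against every passage in the batch and trained with cross-entropy to pick its own pair. The scale of 20 and the NumPy implementation are assumptions for illustration; the actual training code and hyperparameters are not given in this card.

```python
import numpy as np

def mnr_loss(queries, passages, scale=20.0):
    """Cross-entropy over in-batch cosine similarities; row i's positive is passage i."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=-1, keepdims=True)
    logits = scale * (q @ p.T)                              # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Synthetic stand-in embeddings (assumption), not real training data.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 32))
aligned = anchors + 0.05 * rng.normal(size=(8, 32))   # positives close to their anchors
shuffled = aligned[::-1]                               # positives paired with the wrong anchors

loss_aligned = mnr_loss(anchors, aligned)
loss_shuffled = mnr_loss(anchors, shuffled)
```

Correctly paired batches give a much lower loss than mispaired ones, which is the signal the stage relies on.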
Released checkpoint is best_release (step 34400): the highest-recall checkpoint that still passes the STSBenchmark release floor. It is not the final training step. See the engineering postmortem on the LettuceAI blog for why.
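The selection rule described above can be sketched as a filter followed by an argmax. The checkpoint records and the floor value below are made up for illustration; this card does not state the actual STSBenchmark floor or per-checkpoint metrics.

```python
def pick_release(checkpoints, sts_floor):
    """Highest-recall checkpoint whose STS score is at or above the floor, else None."""
    eligible = [c for c in checkpoints if c["sts"] >= sts_floor]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["recall_at_1"])

# Hypothetical values, not real training logs.
checkpoints = [
    {"step": 100, "recall_at_1": 0.80, "sts": 0.83},
    {"step": 200, "recall_at_1": 0.90, "sts": 0.81},
    {"step": 300, "recall_at_1": 0.95, "sts": 0.70},  # best recall, fails the floor
]
best = pick_release(checkpoints, sts_floor=0.80)
```

Under a rule like this, a later step with higher recall is skipped if its STS score has dropped below the floor.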
Intended use
- Memory retrieval over multi-turn roleplay / persona conversations (primary).
- General sentence similarity and retrieval over short and long documents.
- On-device embedding via INT8 ONNX for resource-constrained hardware.
Out of scope
- Cross-lingual retrieval. Trained on English data.
- Code retrieval. Not in the training mix.
- Reranking. Use a dedicated cross-encoder for that.
Limitations
- Benchmarks reported here are in-distribution for v4 (sources it saw during training). v3 was tested on the same set so the relative comparison is fair, but absolute generalization on completely held-out corpora may differ.
- Tuned for roleplay-style memory retrieval. On clean-QA benchmarks like MS MARCO, dedicated retrieval models will likely outperform it.
- 4096-token context is real but works best when the embedded passage is genuinely long. Short passages do not need it.
Citation
@misc{lettuceemb_v4_2026,
author = {Zeolit and LettuceAI},
title = {lettuce-emb-768d-v4: a roleplay-first embedding model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Zeolit/lettuce-emb-768d-v4}}
}
Acknowledgments
- nomic-ai/nomic-embed-text-v1.5 for the backbone.
- BAAI/bge-m3 for hard-negative mining and teacher cosine scores.
- cross-encoder/ms-marco-MiniLM-L-6-v2 for false-negative filtering during data prep.
- google/gemma-4-26b-a4b-it for synthetic query generation.
- Training data sources: google/Synthetic-Persona-Chat, nazlicanto/persona-based-chat, kmfoda/booksum, deepmind/narrativeqa, and the sentence-transformers mirrors of HotpotQA, GooAQ, NQ, AllNLI.