Byrne-Embed

Byrne-Embed is a compact 85M-parameter sentence-embedding model. It maps text to 768-dimensional unit-norm vectors suitable for semantic similarity, retrieval, clustering, and reranking.

The backbone is a custom SpikeWhale decoder (the "Byrne" line). A mean-pooled representation of its last hidden state is projected to 768 dimensions by a learned head and unit-normalized, so cosine similarity between two embeddings is just a dot product.
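The pooling, projection, and normalization steps described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the head is treated as a single linear layer, padded tokens are assumed to be masked out of the mean, and the helper names (`mean_pool`, `project`, `l2_normalize`) are hypothetical, not taken from byrne_embedder.py.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / max(count, 1) for t in total]

def project(vec, weight, bias):
    """Assumed linear head: one weight row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy sizes: 3 tokens (the last is padding), hidden size 4, output size 2.
hidden = [[1.0, 0.0, 2.0, 1.0], [3.0, 2.0, 0.0, 1.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
W = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 1.0]]
b = [0.0, 0.0]

pooled = mean_pool(hidden, mask)   # [2.0, 1.0, 1.0, 1.0]
raw = project(pooled, W, b)        # [1.5, 2.0]
emb = l2_normalize(raw)            # approximately [0.6, 0.8]

# For unit-norm vectors the dot product equals the cosine of the raw vectors.
other_raw = [2.0, 1.0]
other = l2_normalize(other_raw)
print(dot(emb, other), cosine(raw, other_raw))
```

Because normalization happens once at encoding time, downstream similarity search only needs dot products.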

Benchmark vs. EmbeddingGemma-300M

We benchmarked Byrne-Embed against Google's EmbeddingGemma-300M on 4,000 held-out sentences spanning educational web text, encyclopedic text, and instruction/chat text. Byrne-Embed's embedding geometry closely tracks EmbeddingGemma's with roughly 3.5 times fewer parameters (85M vs. 300M):

| Metric (Byrne-Embed vs. EmbeddingGemma) | Result |
| --- | --- |
| Mean per-sentence cosine | 0.9415 (median 0.945, p10 0.912) |
| Sentences with per-sentence cosine of at least 0.90 | 94.7% |
| Similarity-structure agreement (Pearson) | 0.9702 |
| Similarity-structure agreement (Spearman) | 0.9599 |
| Per-anchor neighbour-ranking correlation | 0.9494 |
| Retrieval top-1 nearest-neighbour agreement | 72.8% |
| Retrieval Recall@10 overlap | 78.2% |

Reading the numbers. The two most important measures capture how closely the two models agree on which sentences are similar, and they come in at Pearson 0.97 and Spearman 0.96: when EmbeddingGemma judges two sentences similar, Byrne-Embed almost always agrees. 94.7% of sentences have a per-sentence cosine of at least 0.90. The lower top-1 retrieval agreement is expected and does not indicate a quality gap: in a dense pool of real sentences many neighbours are near-ties (for example 0.88 vs. 0.87), so the rank-1 slot flips easily between near-duplicates. This is consistent with Recall@10 overlap holding at about 78% and the neighbour-ranking correlation at 0.95. Both models find largely the same neighbourhood; they just occasionally swap rank 1 and rank 2 among near-identical candidates.
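The near-tie effect is easy to reproduce on made-up numbers. The sketch below assumes that top-1 agreement means "both models pick the same nearest neighbour" and that Recall@k overlap means "fraction of shared items in the two top-k lists"; the exact definitions used for the table are not spelled out here, and the function names are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def top1_agreement(sims_a, sims_b):
    """Fraction of queries where both models choose the same rank-1 item."""
    hits = sum(top_k(a, 1) == top_k(b, 1) for a, b in zip(sims_a, sims_b))
    return hits / len(sims_a)

def recall_at_k_overlap(sims_a, sims_b, k):
    """Average fraction of items shared by the two models' top-k lists."""
    shared = sum(len(set(top_k(a, k)) & set(top_k(b, k)))
                 for a, b in zip(sims_a, sims_b))
    return shared / (k * len(sims_a))

# One row per query: similarity to 5 candidates under each model (invented).
model_a = [[0.88, 0.87, 0.40, 0.35, 0.10],
           [0.20, 0.91, 0.30, 0.89, 0.15]]
model_b = [[0.87, 0.88, 0.41, 0.33, 0.12],   # ranks 1 and 2 swapped: a near-tie
           [0.22, 0.92, 0.28, 0.88, 0.14]]   # same ordering as model_a

print(top1_agreement(model_a, model_b))          # 0.5
print(recall_at_k_overlap(model_a, model_b, 2))  # 1.0
```

A swap of 0.01 between two near-duplicates halves top-1 agreement in this toy case while leaving the top-2 sets identical.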

Reproduce these numbers with the bundled run_tests.py (it loads both models and prints the full table).
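For readers who want to see what a similarity-structure agreement score could look like in code, here is a minimal sketch. It assumes the metric correlates the pairwise similarities (upper triangle only) computed by each model over the same sentences; this is an illustrative reading, not a copy of run_tests.py, and every function name is hypothetical.

```python
import math

def pairwise_upper(vectors):
    """Dot products for every unordered pair (i < j) of unit-norm vectors."""
    n = len(vectors)
    return [sum(x * y for x, y in zip(vectors[i], vectors[j]))
            for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy data: model B is a rotated copy of model A, so individual coordinates
# differ but every pairwise similarity is preserved.
angles = [0.0, 0.3, 1.2, 2.0]
emb_a = [[math.cos(t), math.sin(t)] for t in angles]
emb_b = [[math.cos(t + 0.7), math.sin(t + 0.7)] for t in angles]

sims_a, sims_b = pairwise_upper(emb_a), pairwise_upper(emb_b)
structure_pearson = pearson(sims_a, sims_b)
structure_spearman = spearman(sims_a, sims_b)
print(structure_pearson, structure_spearman)  # both 1.0 up to float error
```

The rotated copy scores perfect agreement, which shows that this kind of metric compares the geometry of the two spaces rather than their raw coordinates.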

MTEB English Benchmark — MTEB(eng, v2)

Evaluated with the official mteb library on the full MTEB(eng, v2) suite (41/41 tasks). Raw results are in mteb_results/; machine-readable scores are in the model-index metadata above.

Overall MTEB(eng, v2) mean (averaged over all 41 tasks): 50.79

| Category | Mean | Tasks |
| --- | --- | --- |
| STS | 71.93 | 9 |
| Classification | 70.57 | 8 |
| PairClassification | 74.07 | 3 |
| Clustering | 37.32 | 8 |
| Reranking | 40.48 | 2 |
| Retrieval | 24.64 | 10 |
| Summarization | 22.39 | 1 |
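As a quick sanity check, the overall figure can be recovered from the category table by weighting each category mean by its task count. The snippet below uses the rounded category means from the table, so it is an approximation of the per-task average rather than an exact recomputation; the variable names are illustrative.

```python
# (category mean, number of tasks), copied from the table above.
categories = {
    "STS": (71.93, 9),
    "Classification": (70.57, 8),
    "PairClassification": (74.07, 3),
    "Clustering": (37.32, 8),
    "Reranking": (40.48, 2),
    "Retrieval": (24.64, 10),
    "Summarization": (22.39, 1),
}

total_tasks = sum(n for _, n in categories.values())

# Mean over tasks: every task counts once, so large categories weigh more.
task_mean = sum(m * n for m, n in categories.values()) / total_tasks

# Mean over categories: every category counts once, regardless of size.
category_mean = sum(m for m, _ in categories.values()) / len(categories)

print(total_tasks)              # 41
print(round(task_mean, 2))      # 50.79
print(round(category_mean, 2))  # 48.77
```

The task-weighted result matches the reported 50.79, while averaging the seven categories equally would give a lower figure because the small, low-scoring Summarization category would count as much as the nine STS tasks.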

STS

| Task | Score |
| --- | --- |
| BIOSSES | 75.56 |
| SICK-R | 69.08 |
| STS12 | 64.88 |
| STS13 | 72.08 |
| STS14 | 67.76 |
| STS15 | 77.13 |
| STS17 | 83.23 |
| STS22.v2 | 60.53 |
| STSBenchmark | 77.08 |

Classification

| Task | Score |
| --- | --- |
| AmazonCounterfactualClassification | 80.12 |
| Banking77Classification | 74.64 |
| ImdbClassification | 60.97 |
| MTOPDomainClassification | 92.29 |
| MassiveIntentClassification | 63.23 |
| MassiveScenarioClassification | 73.05 |
| ToxicConversationsClassification | 62.94 |
| TweetSentimentExtractionClassification | 57.29 |

PairClassification

| Task | Score |
| --- | --- |
| SprintDuplicateQuestions | 86.47 |
| TwitterSemEval2015 | 53.19 |
| TwitterURLCorpus | 82.55 |

Clustering

| Task | Score |
| --- | --- |
| ArXivHierarchicalClusteringP2P | 53.15 |
| ArXivHierarchicalClusteringS2S | 50.39 |
| BiorxivClusteringP2P.v2 | 33.73 |
| MedrxivClusteringP2P.v2 | 32.70 |
| MedrxivClusteringS2S.v2 | 29.04 |
| StackExchangeClustering.v2 | 41.93 |
| StackExchangeClusteringP2P.v2 | 35.22 |
| TwentyNewsgroupsClustering.v2 | 22.39 |

Reranking

| Task | Score |
| --- | --- |
| AskUbuntuDupQuestions | 52.88 |
| MindSmallReranking | 28.07 |

Retrieval

| Task | Score |
| --- | --- |
| ArguAna | 37.67 |
| CQADupstackGamingRetrieval | 37.14 |
| CQADupstackUnixRetrieval | 23.48 |
| ClimateFEVERHardNegatives | 13.60 |
| FEVERHardNegatives | 28.70 |
| FiQA2018 | 11.38 |
| HotpotQAHardNegatives | 30.47 |
| SCIDOCS | 10.15 |
| TRECCOVID | 29.30 |
| Touche2020Retrieval.v3 | 24.50 |

Summarization

| Task | Score |
| --- | --- |
| SummEvalSummarization.v2 | 22.39 |

Usage

from byrne_embedder import ByrneEmbedder

enc = ByrneEmbedder(".")                 # load from the model dir
vecs = enc.encode(["The cat sat on the windowsill.",
                   "A feline rested by the window."])   # (2, 768), unit-norm
print(float(vecs[0] @ vecs[1]))          # cosine similarity ~ 0.83

print(enc.similarity("How do I bake bread?",
                     "Photosynthesis converts sunlight to energy."))  # ~ 0.28

encode() returns L2-normalized torch.Tensor rows, so cosine similarity is just a dot product.
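Since the rows are unit-norm, a small semantic-search loop reduces to sorting dot products. The sketch below uses plain Python lists as stand-ins for rows returned by encode() so that it runs without the model; `rank_by_dot` is a hypothetical helper, not part of byrne_embedder.py.

```python
def rank_by_dot(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy 2-dimensional unit vectors standing in for 768-dimensional embeddings.
query = [0.6, 0.8]
docs = [[1.0, 0.0],   # score 0.60
        [0.8, 0.6],   # score 0.96
        [0.0, 1.0]]   # score 0.80

ranking = rank_by_dot(query, docs)
for index, score in ranking:
    print(index, round(score, 2))
```

With the tensors returned by encode(), the same scores could presumably be computed in one step as a matrix-vector product (for example `doc_vecs @ query_vec`) instead of a Python loop.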

Files

| File | Purpose |
| --- | --- |
| model.safetensors, config.json | SpikeWhale backbone weights + config |
| embed_head.pt | learned projection head to 768-dim |
| tokenizer.json, tokenizer_config.json | byte-level SpikeTokenizer |
| byrne_embedder.py | self-contained loader / encode() API |
| model_v2.py, config.py, spike_tokenizer.py | SpikeWhale architecture + tokenizer code |
| run_tests.py | reproduces the benchmark table |

Limitations

  • English-centric evaluation; non-English performance is untested.
  • The one weak spot observed during evaluation is finance/economics paraphrase retrieval; general semantic similarity is strong (STS mean 71.93).
  • Custom architecture: load the model through the bundled byrne_embedder.py, which uses local modeling code and requires no remote code execution.

Citation

If you use Byrne-Embed, please cite:

@misc{byrne2026byrneembed,
  title        = {Byrne-Embed: A Compact 85M Sentence-Embedding Model},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Quazim0t0/Byrne-Embed}},
}

License

Apache-2.0.
