Byrne-Embed

Byrne-Embed is a compact 85M-parameter sentence-embedding model. It maps text to 768-dimensional unit-norm vectors suitable for semantic similarity, retrieval, clustering, and reranking.

The backbone is a custom SpikeWhale decoder (the "Byrne" line). A mean-pooled representation of its last hidden state is projected to 768 dimensions by a learned head and unit-normalized, so cosine similarity between two embeddings is just a dot product.
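
As a rough illustration of the pipeline described above, the sketch below mean-pools token vectors, applies a linear projection, and L2-normalizes the result. It then checks that the dot product of two normalized vectors equals the cosine of the unnormalized ones. It is a toy in plain Python, not the code in byrne_embedder.py: the attention-mask handling, the bias-free head, and all sizes and weights are assumptions chosen for illustration.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (masking is an assumption)."""
    total = [0.0] * len(hidden_states[0])
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / max(count, 1) for t in total]

def project(vec, weight):
    """Linear head without bias: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy sizes: 3 tokens (the last one is padding), hidden size 4, output size 2.
hidden_a = [[1.0, 0.0, 2.0, 1.0], [3.0, 2.0, 0.0, 1.0], [9.0, 9.0, 9.0, 9.0]]
hidden_b = [[0.0, 1.0, 1.0, 2.0], [2.0, 1.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
head = [[0.5, -0.5, 1.0, 0.0], [0.0, 1.0, 0.5, -1.0]]  # hypothetical weights

raw_a = project(mean_pool(hidden_a, mask), head)
raw_b = project(mean_pool(hidden_b, mask), head)
emb_a, emb_b = l2_normalize(raw_a), l2_normalize(raw_b)

# After normalization, the plain dot product is the cosine similarity.
print(abs(dot(emb_a, emb_b) - cosine(raw_a, raw_b)) < 1e-12)
```

Normalizing once at encode time means downstream similarity search only needs dot products, which is why the model returns unit-norm vectors.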

Benchmark vs. EmbeddingGemma-300M

We benchmarked Byrne-Embed against Google's EmbeddingGemma-300M on 4,000 held-out sentences spanning educational web text, encyclopedic text, and instruction/chat text. Byrne-Embed's embedding geometry closely tracks EmbeddingGemma's at roughly 28% of the parameter count (85M vs. 300M):

Metric (Byrne-Embed vs EmbeddingGemma)	Result
Mean per-sentence cosine	0.9415 (median 0.945, p10 0.912)
Sentences within 0.90 cosine	94.7%
Similarity-structure agreement (Pearson)	0.9702
Similarity-structure agreement (Spearman)	0.9599
Per-anchor neighbour-ranking correlation	0.9494
Retrieval top-1 nearest-neighbour agreement	72.8%
Retrieval Recall@10 overlap	78.2%

Reading the numbers. The two most important measures capture how closely the two models agree on which sentence pairs are similar, and they land at Pearson 0.97 and Spearman 0.96: when EmbeddingGemma scores two sentences as similar, Byrne-Embed almost always does too. For 94.7% of sentences, the Byrne-Embed vector has a cosine of at least 0.90 with the EmbeddingGemma vector for the same sentence. The lower top-1 retrieval agreement (72.8%) is expected and does not by itself indicate a quality gap: in a dense pool of real sentences many neighbours are near-ties (for example 0.88 vs. 0.87), so the single top slot flips easily between near-duplicates. Recall@10 overlap is higher at about 78%, and the per-anchor neighbour-ranking correlation is 0.95, which suggests that both models find largely the same neighbourhood and differ mostly in how they order near-identical candidates.

Reproduce these numbers with the bundled run_tests.py (it loads both models and prints the full table).
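
For readers who want a feel for what these agreement metrics measure, here is a minimal sketch in plain Python. It is not taken from run_tests.py, which remains the authoritative implementation. The function names, the toy 2-D embeddings, and the exact metric definitions (Pearson over the upper triangle of the similarity matrix, nearest neighbour excluding the anchor itself, top-k set overlap) are assumptions made for illustration.

```python
import math

def unit(deg):
    """2-D unit vector at the given angle, standing in for a real embedding."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def pairwise_sims(embs):
    """Upper triangle of the similarity matrix as a flat list."""
    n = len(embs)
    return [dot(embs[i], embs[j]) for i in range(n) for j in range(i + 1, n)]

def neighbours(embs, anchor, k):
    """Indices of the k most similar other items, best first."""
    others = [j for j in range(len(embs)) if j != anchor]
    others.sort(key=lambda j: dot(embs[anchor], embs[j]), reverse=True)
    return others[:k]

def agreement(embs_a, embs_b, k):
    """Return (top-1 agreement, mean top-k overlap) between two models."""
    n = len(embs_a)
    top1 = sum(
        neighbours(embs_a, i, 1) == neighbours(embs_b, i, 1) for i in range(n)
    ) / n
    overlap = sum(
        len(set(neighbours(embs_a, i, k)) & set(neighbours(embs_b, i, k))) / k
        for i in range(n)
    ) / n
    return top1, overlap

# Two hypothetical models embedding the same five sentences.
# Only sentence 1 moves (10 degrees vs. 14 degrees), creating a near-tie.
model_a = [unit(d) for d in (0, 10, 25, 90, 100)]
model_b = [unit(d) for d in (0, 14, 25, 90, 100)]

per_sentence = [dot(a, b) for a, b in zip(model_a, model_b)]
structure = pearson(pairwise_sims(model_a), pairwise_sims(model_b))
top1, overlap = agreement(model_a, model_b, k=2)
print(top1, overlap)  # prints: 0.8 1.0
```

In this toy case a small shift in one vector flips one anchor's nearest neighbour, so top-1 agreement drops to 0.8 while the top-2 sets stay identical. That is the same pattern the paragraph above describes for near-tied neighbours.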

MTEB English Benchmark — MTEB(eng, v2)

Evaluated with the official mteb library on the full MTEB(eng, v2) suite (41/41 tasks). Raw results are in mteb_results/; machine-readable scores are in the model-index metadata above.

Overall MTEB(eng, v2) mean: 50.79 (unweighted mean over all 41 tasks)

Category	Mean	Tasks
STS	71.93	9
Classification	70.57	8
PairClassification	74.07	3
Clustering	37.32	8
Reranking	40.48	2
Retrieval	24.64	10
Summarization	22.39	1
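
The overall figure can be checked from the table above. The short sketch below recomputes it two ways, assuming the category means and task counts exactly as listed: weighting each category by its task count reproduces 50.79, while a plain average of the seven category means gives a lower number. The variable names are illustrative only.

```python
# (category, mean score, task count) copied from the table above
categories = [
    ("STS", 71.93, 9),
    ("Classification", 70.57, 8),
    ("PairClassification", 74.07, 3),
    ("Clustering", 37.32, 8),
    ("Reranking", 40.48, 2),
    ("Retrieval", 24.64, 10),
    ("Summarization", 22.39, 1),
]

total_tasks = sum(n for _, _, n in categories)

# Mean over tasks: each category contributes in proportion to its task count.
task_mean = sum(mean * n for _, mean, n in categories) / total_tasks

# Mean over categories: each category counts once, regardless of size.
category_mean = sum(mean for _, mean, _ in categories) / len(categories)

print(total_tasks, round(task_mean, 2), round(category_mean, 2))
# prints: 41 50.79 48.77
```

The gap between the two averages comes mostly from Retrieval, which has the most tasks and one of the lowest category means.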

STS

Task	Score
BIOSSES	75.56
SICK-R	69.08
STS12	64.88
STS13	72.08
STS14	67.76
STS15	77.13
STS17	83.23
STS22.v2	60.53
STSBenchmark	77.08

Classification

Task	Score
AmazonCounterfactualClassification	80.12
Banking77Classification	74.64
ImdbClassification	60.97
MTOPDomainClassification	92.29
MassiveIntentClassification	63.23
MassiveScenarioClassification	73.05
ToxicConversationsClassification	62.94
TweetSentimentExtractionClassification	57.29

PairClassification

Task	Score
SprintDuplicateQuestions	86.47
TwitterSemEval2015	53.19
TwitterURLCorpus	82.55

Clustering

Task	Score
ArXivHierarchicalClusteringP2P	53.15
ArXivHierarchicalClusteringS2S	50.39
BiorxivClusteringP2P.v2	33.73
MedrxivClusteringP2P.v2	32.70
MedrxivClusteringS2S.v2	29.04
StackExchangeClustering.v2	41.93
StackExchangeClusteringP2P.v2	35.22
TwentyNewsgroupsClustering.v2	22.39

Reranking

Task	Score
AskUbuntuDupQuestions	52.88
MindSmallReranking	28.07

Retrieval

Task	Score
ArguAna	37.67
CQADupstackGamingRetrieval	37.14
CQADupstackUnixRetrieval	23.48
ClimateFEVERHardNegatives	13.60
FEVERHardNegatives	28.70
FiQA2018	11.38
HotpotQAHardNegatives	30.47
SCIDOCS	10.15
TRECCOVID	29.30
Touche2020Retrieval.v3	24.50

Summarization

Task	Score
SummEvalSummarization.v2	22.39

Usage

from byrne_embedder import ByrneEmbedder

enc = ByrneEmbedder(".")                 # load from the model dir
vecs = enc.encode(["The cat sat on the windowsill.",
                   "A feline rested by the window."])   # (2, 768), unit-norm
print(float(vecs[0] @ vecs[1]))          # cosine similarity ~ 0.83

print(enc.similarity("How do I bake bread?",
                     "Photosynthesis converts sunlight to energy."))  # ~ 0.28

encode() returns L2-normalized torch.Tensor rows, so cosine similarity is just a dot product.
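
Because the rows are unit-norm, a small nearest-neighbour search needs nothing more than dot products. The sketch below uses toy 2-D vectors in place of real encode() output so it runs without the model. The search helper and its signature are hypothetical and not part of byrne_embedder.py.

```python
import heapq

def search(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k rows with the highest dot product."""
    scored = (
        (i, sum(q * c for q, c in zip(query_vec, row)))
        for i, row in enumerate(corpus_vecs)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy unit-norm vectors standing in for rows returned by encode().
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query = [0.8, 0.6]

hits = search(query, corpus, k=2)
print([index for index, _ in hits])  # prints: [1, 0]
```

With the real model, the same function should work on lists obtained by calling .tolist() on the tensors returned by encode(). For large corpora a single matrix product or a vector index would be the more practical choice.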

Files

File	Purpose
`model.safetensors`, `config.json`	SpikeWhale backbone weights + config
`embed_head.pt`	learned projection head to 768-dim
`tokenizer.json`, `tokenizer_config.json`	byte-level SpikeTokenizer
`byrne_embedder.py`	self-contained loader / `encode()` API
`model_v2.py`, `config.py`, `spike_tokenizer.py`	SpikeWhale architecture + tokenizer code
`run_tests.py`	reproduces the benchmark table

Limitations

English-centric evaluation; non-English performance is untested.
The main weak spot observed during evaluation is finance/economics paraphrase retrieval. More broadly, MTEB retrieval (mean 24.64) trails STS (mean 71.93) by a wide margin, so general semantic similarity is the model's strength.
Custom architecture: load via the bundled byrne_embedder.py (local modeling code — no remote code execution).

Citation

If you use Byrne-Embed, please cite:

@misc{byrne2026byrneembed,
  title        = {Byrne-Embed: A Compact 85M Sentence-Embedding Model},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Quazim0t0/Byrne-Embed}},
}

License

Apache-2.0.


Safetensors: 96.9M params, F32 tensors.
