bne-binary-1024

Native 1024-bit binary embedding model. Trained end-to-end with a binary head and tanh contrastive loss, rather than binarized post hoc.

  • Backbone: prajjwal1/bert-mini (4L × 256d, ~11M params)
  • Output: 1024-dim {-1,+1} binary via Linear(256→1024) + LayerNorm + STE
  • Training: tanh contrastive loss on NLI 550k pairs, 3 epochs
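
STE in the output bullet stands for the straight-through estimator, which lets gradients pass through the non-differentiable sign function. The exact variant used by this model is not stated here. A common formulation, given as an assumption and using illustrative notation (z for the activation after LayerNorm, b for the binary code), is:

```latex
% Forward pass: hard binarization of the LayerNorm output z (assumed notation).
b = \operatorname{sign}(z) \in \{-1, +1\}^{1024}

% Backward pass: sign is treated as the identity, so the gradient passes through unchanged.
\frac{\partial \mathcal{L}}{\partial z} \approx \frac{\partial \mathcal{L}}{\partial b}
```

Some implementations additionally zero the gradient where |z| > 1; whether this model does so is not specified.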

| STS-B (mean ± std, 5 seeds) | SciFact Recall@10 (mean ± std, 5 seeds) | Memory / 1k vecs | Retrieval speedup |
|---|---|---|---|
| 0.7264 ± 0.0018 | 0.2762 ± 0.0119 | 125 KB | 37–49× faster than float INT8 at 1M vecs (exact search, FAISS AVX2+POPCNT) |

Native binary beats post-hoc binarization by +24% Recall@10, validated across 5 random seeds (p<0.001 bootstrap).

Per-seed breakdown (SciFact Recall@10)

| Seed | 1024 R@10 | 2048 R@10 |
|---|---|---|
| 42 | 0.2925 ← best 1024 | 0.2761 ← worst 2048 |
| 123 | 0.2875 | 0.3047 |
| 456 | 0.2728 | 0.2894 |
| 789 | 0.2619 | 0.2936 |
| 1337 | 0.2664 | 0.2992 |
| mean ± std | 0.2762 ± 0.012 | 0.2926 ± 0.010 |

Seed 42 is a structural outlier (best 1024, worst 2048) that compresses the apparent gap. Excluding it, the 4-seed means are 0.272 vs 0.297, a larger and likely significant difference.
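
The summary figures can be re-derived from the per-seed values. The sketch below assumes the reported spread is the population standard deviation (ddof=0), because that choice reproduces ±0.0119; the sample standard deviation would be about 0.013. The helper name `summarize` is illustrative.

```python
from statistics import mean, pstdev

# Per-seed SciFact Recall@10 values copied from the per-seed breakdown.
r10_1024 = {42: 0.2925, 123: 0.2875, 456: 0.2728, 789: 0.2619, 1337: 0.2664}
r10_2048 = {42: 0.2761, 123: 0.3047, 456: 0.2894, 789: 0.2936, 1337: 0.2992}


def summarize(scores, exclude=()):
    """Return (mean, population std) over the seeds not listed in `exclude`."""
    kept = [value for seed, value in scores.items() if seed not in exclude]
    return mean(kept), pstdev(kept)


for label, scores in (("1024", r10_1024), ("2048", r10_2048)):
    m_all, s_all = summarize(scores)
    m_without_42, _ = summarize(scores, exclude=(42,))
    print(f"{label}: all seeds {m_all:.4f} +/- {s_all:.3f}, "
          f"without seed 42 {m_without_42:.3f}")
```

This only checks the arithmetic of the means and spreads. It does not reproduce the significance tests quoted in this card.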

Part of binary-native-embeddings-for-CPU-Retrieval · Discussion

Why binary?

All methods are exact search: no approximation, no recall loss.

| Scale | Float32 (ms) | Float INT8 (ms) | Bin-1024 (ms) | Bin-2048 (ms) | 1024 vs f32 | 1024 vs INT8 |
|---|---|---|---|---|---|---|
| 10k | 16–50 | 29–58 | 0.7–1.5 | 1.3–2.4 | 23–33× | 19–40× |
| 100k | 200–270 | 290–430 | 7–10 | 14–26 | 24–30× | 29–46× |
| 1M | 1 800–4 500 | 2 700–4 700 | 73–102 | 145–202 | 24–47× | 37–49× |

FAISS AVX2+POPCNT · Intel Core Ultra 7 155H · 4 benchmark runs · 16 queries · top-10.

Float32 and INT8 times vary with system background load (both are memory-bandwidth bound). Binary stays stable because its index fits in L3 cache, so it is compute-bound via POPCNT. The vs-INT8 ratio (37–49×) is the most stable reference.

Float INT8 is consistently slower than float32: the IndexScalarQuantizer QT_8bit dequantization overhead exceeds the benefit of reduced bandwidth. Binary POPCNT is the only method that is simultaneously smaller and faster.

IVF-PQ is not included because it is approximate search (it trades recall for speed). Comparing approximate to exact search is not meaningful here.

Float uses IndexFlatIP (inner product, which equals cosine on L2-normalized vectors) and binary uses IndexBinaryFlat (Hamming). The metrics differ, but the two are comparable for ranking latency at scale.

POPCNT counts all set bits in a 64-bit word in a single instruction. A 1024-bit Hamming distance takes 16 XOR + POPCNT pairs versus 384 multiply-accumulates for the float vector, plus 12× better cache utilization (128 bytes/vector vs 1 536 bytes).
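
The word-level arithmetic can be illustrated in pure Python. This is a minimal sketch, not the FAISS kernel: `bin(x).count("1")` stands in for the hardware POPCNT instruction, the helper names are illustrative, and the vectors are random stand-ins for real embeddings. It also checks the identity dot = D − 2·Hamming for {-1,+1} vectors, which is why ranking by Hamming distance matches ranking by inner product on the binary codes.

```python
import random

DIM = 1024   # bits per vector
WORD = 64    # bits covered by one POPCNT instruction


def pack_words(signs):
    """Pack a {-1,+1} vector into 64-bit integers (+1 -> bit 1, -1 -> bit 0)."""
    words = []
    for start in range(0, len(signs), WORD):
        word = 0
        for offset, sign in enumerate(signs[start:start + WORD]):
            if sign > 0:
                word |= 1 << offset
        words.append(word)
    return words


def hamming(a_words, b_words):
    """XOR each pair of words, then count set bits (software stand-in for POPCNT)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a_words, b_words))


rng = random.Random(0)
a = [rng.choice((-1, 1)) for _ in range(DIM)]
b = [rng.choice((-1, 1)) for _ in range(DIM)]

a_words, b_words = pack_words(a), pack_words(b)
dist = hamming(a_words, b_words)
dot = sum(x * y for x, y in zip(a, b))

print(len(a_words))           # 16 words per 1024-bit vector
print(dot == DIM - 2 * dist)  # True: agreeing bits add 1, differing bits subtract 1
```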

Usage

import torch
from transformers import BertTokenizer
from huggingface_hub import hf_hub_download
# BinaryEmbedder is project code (models/binary_embedder.py), not part of transformers.
from models.binary_embedder import BinaryEmbedder

# The tokenizer comes from the backbone; the trained weights come from this model repo.
tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-mini")
model = BinaryEmbedder(binary_dim=1024)
weights = hf_hub_download("korben99/bne-binary-1024", "binary_embedder_1024.pt")
model.load_state_dict(torch.load(weights, map_location="cpu"))
model.eval()  # inference mode: disables dropout

vecs = model.encode(["hello world"], tokenizer)  # (1, 1024), values in {-1, +1}
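
To store or index the output at 1 bit per dimension, the {-1,+1} values have to be packed into bytes. The sketch below is one way to do that with the standard library only. It assumes a row of the output has been converted to a plain Python list (the return type of `encode` is not stated here), and `pack_bits` / `unpack_bits` are hypothetical helpers, not part of the project. Bits are packed most-significant first, the same order `numpy.packbits` uses by default.

```python
def pack_bits(signs):
    """Pack a sequence of -1/+1 values into bytes, 8 signs per byte, MSB first."""
    if len(signs) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(signs), 8):
        byte = 0
        for sign in signs[start:start + 8]:
            byte = (byte << 1) | (1 if sign > 0 else 0)
        out.append(byte)
    return bytes(out)


def unpack_bits(packed):
    """Inverse of pack_bits: recover the -1/+1 sequence."""
    return [1 if (byte >> shift) & 1 else -1
            for byte in packed
            for shift in range(7, -1, -1)]


# Stand-in for one row of the model output (made-up data, not a real embedding).
row = [1, -1, -1, 1, 1, 1, -1, 1] * 128   # 1024 values

packed = pack_bits(row)
print(len(packed))                 # 128 bytes per 1024-bit vector
print(unpack_bits(packed) == row)  # True: the packing is lossless
```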

Model selection

| Model | R@10 (5 seeds) | Memory/1k | FAISS @ 1M |
|---|---|---|---|
| bne-binary-1024 | 0.2762 ± 0.012 | 125 KB | 73–102 ms (37–49× vs INT8) |
| bne-binary-2048 | 0.2926 ± 0.010 | 250 KB | 145–202 ms |

The quality difference between 1024 and 2048 is not statistically significant (p=0.159). Pick 1024 for maximum throughput, 2048 for best average quality.
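
The memory column follows directly from the code width. A minimal sketch of that arithmetic, assuming a flat index with no per-vector overhead and reading "KB" as 1024 bytes (the interpretation that reproduces 125 and 250); `index_bytes` is an illustrative helper:

```python
def index_bytes(n_vectors, bits_per_vector):
    """Raw storage for a flat binary index: one bit per dimension, no overhead."""
    return n_vectors * bits_per_vector // 8


for name, bits in (("bne-binary-1024", 1024), ("bne-binary-2048", 2048)):
    per_1k_kib = index_bytes(1_000, bits) / 1024
    per_1m_mib = index_bytes(1_000_000, bits) / 1024 ** 2
    print(f"{name}: {per_1k_kib:.0f} KiB per 1k vectors, "
          f"{per_1m_mib:.0f} MiB per 1M vectors")
```

Doubling the code width doubles both the footprint and the number of words scanned per comparison, which is consistent with the roughly 2× latency gap between the two models.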
