triBne-e5-large — Banglish / Bangla / English

A sentence-embedding model fine-tuned from intfloat/multilingual-e5-large for robust semantic retrieval across Bangla (Bengali script), English, and Banglish (romanized Bengali). It is built to tolerate the heavy spelling variation of romanized Bengali (e.g. bhalobashi ↔ valobashi).

The base model collapses spelling variants and cross-script pairs to near-uniform similarity: on the Banglish spelling and cross-script tasks its positive/negative cosine margin is negative, meaning hard negatives score on average slightly higher than true positives. This fine-tune restores a large positive margin (+0.42 to +0.70 depending on the task) and raises mean MRR@10 from 0.685 to 0.959. Its 1024-dim embeddings give it more capacity than the e5-small sibling.

Base model: intfloat/multilingual-e5-large (XLM-RoBERTa-large backbone, ~560M params, 1024-dim)
Method: LoRA contrastive fine-tuning, adapter merged into the backbone
Embedding dim: 1024 · Max sequence length: 128 · Similarity: cosine
Pooling: mean pooling + L2 normalization

Usage

The model was fine-tuned on raw text pairs without the e5 query: / passage: prefixes — so you do not need them. Just encode raw strings.

sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("istiaqfuad/triBne-e5-large")

sentences = [
    "ami tomake bhalobashi",      # Banglish
    "ami tomake valobashi",       # Banglish spelling variant
    "আমি তোমাকে ভালোবাসি",         # Bangla
    "I love you",                 # English
]
emb = model.encode(sentences, normalize_embeddings=True)
print(model.similarity(emb, emb))

transformers (manual mean pooling)

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("istiaqfuad/triBne-e5-large")
model = AutoModel.from_pretrained("istiaqfuad/triBne-e5-large")

def encode(texts):
    batch = tok(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)   # mean pooling over non-padding tokens
    return F.normalize(emb, p=2, dim=1)

emb = encode(["ami tomake bhalobashi", "আমি তোমাকে ভালোবাসি"])
print((emb[0] @ emb[1]).item())
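
For retrieval, the usual pattern is to embed a corpus once and rank it against each query by cosine similarity. The sketch below shows only that ranking step in plain Python. The toy 3-dimensional vectors and the helper names l2_normalize and top_k are illustrative assumptions, not part of this model's API; in practice the vectors would come from model.encode or the encode function above.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, doc in enumerate(corpus_vecs):
        d = l2_normalize(doc)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for 1024-dimensional embeddings (hypothetical values).
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.2, 0.1]]
query = [1.0, 0.0, 0.0]
hits = top_k(query, corpus, k=2)
print(hits)
```

Because the model's embeddings are already L2-normalized, the normalization inside top_k is redundant for them; it is kept so the sketch also works on raw vectors.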

Evaluation

Evaluated on a held-out test set across three retrieval tasks — banglish_spelling (romanized spelling-variant matching), cross_script (Bangla ↔ Banglish), and en_bn (English ↔ Bangla) — 1,000 anchor/positive pairs per task, ranked over the in-task pool. Compared against the base model and the e5-small sibling fine-tune.

Overall retrieval (mean over the three tasks):

Model	MRR@10	Recall@1	Recall@5	Recall@10
This model (e5-large, fine-tuned)	0.959	0.944	0.975	0.982
triBne-e5-small (fine-tuned)	0.917	0.889	0.950	0.964
Multilingual E5-large (base)	0.685	0.630	0.759	0.797
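
As a reading aid for the metric columns above, here is a minimal sketch of how MRR@10 and Recall@k can be computed from per-anchor similarity scores. It assumes each anchor has exactly one positive in the pool and breaks ties in the positive's favor. The function names and toy scores are hypothetical; this is not the evaluation script behind the reported numbers.

```python
def rank_of_positive(scores, positive_idx):
    """1-based rank of the positive candidate when candidates are sorted by descending score."""
    target = scores[positive_idx]
    # Count candidates that score strictly higher than the positive (ties favor the positive).
    return 1 + sum(1 for s in scores if s > target)

def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting positives ranked below k as zero."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of anchors whose positive appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy similarity matrix: row i holds anchor i's scores against a 3-item pool,
# and candidate i is the positive for anchor i (hypothetical values).
score_rows = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.5, 0.6, 0.4],
]
ranks = [rank_of_positive(row, i) for i, row in enumerate(score_rows)]
print(ranks, mrr_at_k(ranks), recall_at_k(ranks, 1), recall_at_k(ranks, 5))
```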

Per-task MRR@10:

Task	This model	triBne-e5-small	Base e5-large
banglish_spelling	0.948	0.878	0.647
cross_script (bn↔banglish)	0.996	0.989	0.607
en_bn	0.931	0.883	0.800

This model has the highest MRR@10 on every task, ahead of both the e5-small sibling and the base model.

Positive/negative cosine margin, defined as mean cos(anchor, positive) − mean cos(anchor, hard negative); higher means better separation. The base model has a negative margin on the two Banglish tasks (banglish_spelling and cross_script), so on average it scores hard negatives above true positives; the fine-tune reverses this:

Task	This model	Base e5-large
banglish_spelling	+0.422	−0.022
cross_script	+0.475	−0.058
en_bn	+0.704	+0.127
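
The margin defined above can be sketched in a few lines of plain Python. The triplet layout (anchor, positive, hard negative) and the toy 2-dimensional vectors are assumptions made for illustration; real inputs would be the model's 1024-dimensional embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_margin(triplets):
    """Mean cos(anchor, positive) minus mean cos(anchor, hard negative)."""
    pos = sum(cosine(a, p) for a, p, _ in triplets) / len(triplets)
    neg = sum(cosine(a, n) for a, _, n in triplets) / len(triplets)
    return pos - neg

# Well-separated toy case: positives align with anchors, negatives are orthogonal.
good = [([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),
        ([0.0, 1.0], [0.0, 2.0], [1.0, 0.0])]
# Inverted toy case: the hard negative is closer to the anchor than the positive.
bad = [([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])]

print(cosine_margin(good), cosine_margin(bad))
```

A negative value, as in the second toy case, corresponds to the base model's behaviour on the Banglish tasks.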

Training

Objective: MultipleNegativesRankingLoss (in-batch negatives), scale 20.0
PEFT: LoRA — r=32, alpha=64, dropout=0.1, targets query,key,value,dense, task_type=FEATURE_EXTRACTION (adapter merged into the backbone for this release; 14.2M trainable of 574M total params with the adapter attached ≈ 2.5%)
Epochs: 3 (~15.3k steps) · LR: 3e-5 · Warmup: 500 steps · Optimizer: adamw_torch
Batch: 128 · fp16 · Max sequence length: 128 · final train loss ≈ 0.16
Hardware: single NVIDIA RTX PRO 6000 (Blackwell)
Data: istiaqfuad/bangla-english-banglish-pairs — ~2.4M contrastive pairs (LLM-generated Banglish spelling variants + cross-script pairs, plus OPUS-100 English↔Bangla), deduplicated and interleaved 80% Banglish / 20% English–Bangla.
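
To make the objective concrete, the sketch below computes an in-batch-negatives ranking loss in plain Python: for anchor i, positive i is the target and every other positive in the batch acts as a negative, with cosine similarities multiplied by the scale (20.0) before a softmax cross-entropy. It is an illustrative reimplementation under those assumptions, not the sentence-transformers code used in training; the function name mnr_loss and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities with in-batch negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching positive.
        total += log_z - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each positive matches the wrong anchor

print(mnr_loss(anchors, aligned), mnr_loss(anchors, swapped))
```

With a scale of 20, a perfectly aligned batch gives a loss near zero, while a fully mismatched one gives a loss near 20, which shows how strongly the scale sharpens the softmax.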

Same recipe as istiaqfuad/triBne-e5-small, scaled to the e5-large backbone.
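
As a rough sanity check on the trainable-parameter figure, the arithmetic below estimates the LoRA parameter count for r=32. It assumes the standard XLM-RoBERTa-large shape (24 encoder layers, hidden size 1024, feed-forward size 4096) and assumes that the "dense" target matches the attention-output, intermediate, and output dense layers in each encoder layer. The pooler is left out, and the estimate is an illustration rather than a figure taken from the training logs.

```python
HIDDEN = 1024   # hidden size (assumed XLM-RoBERTa-large shape)
FFN = 4096      # feed-forward inner size (assumed)
LAYERS = 24     # encoder layers (assumed)
RANK = 32       # LoRA rank r from the training recipe

def lora_params(d_in, d_out, r=RANK):
    """LoRA adds two low-rank matrices per adapted linear layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

per_layer = (
    3 * lora_params(HIDDEN, HIDDEN)   # query, key, value projections
    + lora_params(HIDDEN, HIDDEN)     # attention output dense
    + lora_params(HIDDEN, FFN)        # intermediate dense
    + lora_params(FFN, HIDDEN)        # output dense
)
total = LAYERS * per_layer
fraction = total / 574e6              # share of the reported 574M total

print(per_layer, total, round(fraction, 4))
```

Under these assumptions the estimate comes to about 14.16M parameters, in line with the reported 14.2M and roughly 2.5% of the total.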

Intended use & limitations

Use for: semantic search / retrieval, clustering, and similarity over mixed Bangla / English / romanized-Bengali text — especially noisy, user-generated romanized Bengali with inconsistent spelling. Choose this over the e5-small sibling when retrieval quality matters more than footprint (1024-dim, ~560M params).
Limitations: trained primarily on short text (≤128 tokens); longer inputs are truncated. Banglish training pairs are partly LLM-generated and may carry their biases. Not built for classification or generation.

Citation

If you use this model, please cite the base model (Wang et al., Multilingual E5) and this fine-tune:

@misc{tribne-e5-large,
  title  = {triBne-e5-large: multilingual-e5-large fine-tuned for Banglish/Bangla/English retrieval},
  author = {Istiaqur Rahman Fuad},
  year   = {2026},
  url    = {https://huggingface.co/istiaqfuad/triBne-e5-large}
}

This model is released under the MIT license, following the base model.

Paper: Multilingual E5 Text Embeddings: A Technical Report (arXiv:2402.05672, published Feb 8, 2024)