+---
+language:
+- az
+license: apache-2.0
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- retrieval
+- azerbaijani
+- embedding
+library_name: sentence-transformers
+pipeline_tag: sentence-similarity
+datasets:
+- LocalDoc/msmarco-az-reranked
+- LocalDoc/azerbaijani_retriever_corpus-reranked
+- LocalDoc/ldquad_v2_retrieval-reranked
+- LocalDoc/azerbaijani_books_retriever_corpus-reranked
+base_model: intfloat/multilingual-e5-small
+model-index:
+- name: LocRet-small
+  results:
+  - task:
+      type: retrieval
+    dataset:
+      name: AZ-MIRAGE
+      type: custom
+    metrics:
+    - type: mrr@10
+      value: 0.5250
+    - type: ndcg@10
+      value: 0.6162
+    - type: recall@10
+      value: 0.8948
+---
+# LocRet-small — Azerbaijani Retrieval Embedding Model
+**LocRet-small** is a compact retrieval embedding model specialized for the Azerbaijani language. Despite being **4.8× smaller** than BGE-m3 (118M vs. 568M parameters), it outperforms BGE-m3 on the AZ-MIRAGE benchmark (MRR@10 of 0.5250 vs. 0.4204).
+## Key Results
+### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval)
+| Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 |
+|:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:|
+| **#1** | **LocRet-small** | **118M** | **0.5250** | **0.3132** | **0.8267** | **0.8948** | **0.5938** | **0.6162** |
+| #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 |
+| #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 |
+| #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 |
+| #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 |
+| #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 |
+| #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 |
+| #8 | intfloat/multilingual-e5-small (base) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 |
+| #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 |
+## Usage
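+```python
+from sentence_transformers import SentenceTransformer
+
+# Load the model from the Hugging Face Hub.
+model = SentenceTransformer("LocalDoc/LocRet-small")
+
+# E5-style inputs: queries start with "query: ", documents with "passage: ".
+queries = ["query: Azərbaycanın paytaxtı hansı şəhərdir?"]
+passages = [
+    "passage: Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
+    "passage: Gəncə Azərbaycanın ikinci böyük şəhəridir.",
+]
+
+# Encode each side separately; every embedding has 384 dimensions.
+query_embeddings = model.encode(queries)
+passage_embeddings = model.encode(passages)
+
+# Similarity matrix of shape (num_queries, num_passages); higher means more relevant.
+similarities = model.similarity(query_embeddings, passage_embeddings)
+print(similarities)
+```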
+> **Important:** Always use the `"query: "` prefix for queries and the `"passage: "` prefix for documents.
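One way to avoid forgetting the prefixes is to add them in a small helper before encoding. The sketch below is illustrative only: `add_prefix` is a hypothetical helper name, not part of sentence-transformers or of this model's API.

```python
def add_prefix(texts, kind):
    """Prepend the E5-style prefix unless the text already carries it.

    `kind` must be "query" or "passage". Hypothetical helper for illustration.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = f"{kind}: "
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["Azərbaycanın paytaxtı hansı şəhərdir?"], "query")
passages = add_prefix(
    [
        "Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
        "passage: Gəncə Azərbaycanın ikinci böyük şəhəridir.",  # already prefixed
    ],
    "passage",
)
```

The resulting lists can be passed to `model.encode` as in the usage example above.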
+## Training
+### Method
+LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using **listwise KL distillation** combined with a contrastive loss:
+$$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$
+- **Listwise KL divergence**: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05).
+- **In-batch contrastive loss (InfoNCE)**: Adds diversity through in-batch negatives: each query's positive passage is contrasted against the positive passages of the other queries in the batch.
+This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical for training on top of already strong pre-trained retrievers.
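The combined objective can be sketched for a single query in plain Python. This is a minimal illustration, not the training code. It makes several assumptions that the text does not state: the KL term is taken as KL(teacher ‖ student), the InfoNCE term reuses the student temperature, the teacher scores are already on a comparable scale, and averaging over the batch is left out. All function names are hypothetical.

```python
import math

TAU_TEACHER = 0.3     # teacher temperature from the text
TAU_STUDENT = 0.05    # student temperature from the text
INFONCE_WEIGHT = 0.1  # weight of the contrastive term from the formula


def log_softmax(scores, temperature):
    """Numerically stable log-softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def listwise_kl(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate list (direction assumed)."""
    log_p = log_softmax(teacher_scores, TAU_TEACHER)
    log_q = log_softmax(student_scores, TAU_STUDENT)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def info_nce(sim_row, positive_index, temperature=TAU_STUDENT):
    """Cross-entropy of one query against all in-batch positives (temperature assumed)."""
    return -log_softmax(sim_row, temperature)[positive_index]


def combined_loss(teacher_scores, student_scores, sim_row, positive_index):
    """L = L_KL + 0.1 * L_InfoNCE for a single query."""
    kl = listwise_kl(teacher_scores, student_scores)
    return kl + INFONCE_WEIGHT * info_nce(sim_row, positive_index)


# Toy example: 1 positive (index 0) + 3 hard negatives for a single query.
teacher = [0.92, 0.40, 0.15, 0.05]  # reranker scores (made-up values)
student = [0.81, 0.74, 0.62, 0.60]  # student cosine similarities (made-up values)
in_batch = [0.81, 0.55, 0.48]       # query vs. positives of 3 queries in the batch
loss = combined_loss(teacher, student, in_batch, positive_index=0)
```

Because the student temperature is much lower than the teacher's, small gaps in student similarity become large gaps in probability, so the student can match the teacher's ranking distribution without needing widely spread cosine scores.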
+### Data
+The model was trained on approximately **3.5 million** Azerbaijani query-passage pairs from four datasets:
+| Dataset | Pairs | Domain | Type |
+|:--------|------:|:-------|:-----|
+| [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ |
+| [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ |
+| [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ |
+| [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ |
+All datasets include hard negatives scored by a cross-encoder reranker, which serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds.
+### Hyperparameters
+| Parameter | Value |
+|:----------|:------|
+| Base model | intfloat/multilingual-e5-small |
+| Max sequence length | 512 |
+| Effective batch size | 256 |
+| Learning rate | 5e-5 |
+| Schedule | Linear warmup (5%) + cosine decay |
+| Precision | FP16 |
+| Epochs | 1 |
+| Training time | ~25 hours |
+| Hardware | 4× NVIDIA RTX 5090 (32GB) |
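The schedule in the table can be written as a function of the training step. The sketch below assumes the cosine phase decays to zero and estimates the step count from roughly 3.5M pairs at an effective batch size of 256. Neither detail is stated above, and `learning_rate` is a hypothetical name.

```python
import math

PEAK_LR = 5e-5          # learning rate from the table
WARMUP_FRACTION = 0.05  # 5% linear warmup from the table


def learning_rate(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to peak_lr, then cosine decay to zero (decay floor is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Rough step count for one epoch: ~3.5M pairs at an effective batch size of 256.
total_steps = math.ceil(3_500_000 / 256)
for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step, total_steps):.2e}")
```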
+### Training Insights
+- **Listwise KL distillation outperforms standard contrastive training** (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274).
+- **Retrieval pre-training matters more than language-specific pre-training** for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model.
+- **A mix of translated and native data** prevents catastrophic forgetting while enabling language specialization.
+## Benchmark
+### AZ-MIRAGE
+[AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) is a native Azerbaijani retrieval benchmark with 7,373 queries and 40,448 document chunks covering diverse topics. It evaluates retrieval quality on naturally written Azerbaijani text.
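For readers unfamiliar with the reported metrics, the sketch below computes MRR@10, Recall@10, and NDCG@10 for a single query. It assumes binary relevance labels and a ranking without duplicate IDs. The benchmark's own evaluation script may differ, and the function names and chunk IDs are hypothetical.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# One query whose only relevant chunk is retrieved at rank 3.
ranked = ["c7", "c2", "c9", "c4"]
relevant = {"c9"}
scores = (
    mrr_at_k(ranked, relevant),
    recall_at_k(ranked, relevant),
    ndcg_at_k(ranked, relevant),
)
```

Benchmark-level numbers such as those in the results table are the mean of these per-query values over all queries.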
+## Model Details
+| Property | Value |
+|:---------|:------|
+| Architecture | BERT encoder (XLM-RoBERTa tokenizer) |
+| Parameters | 118M |
+| Embedding dimension | 384 |
+| Max tokens | 512 |
+| Vocabulary | SentencePiece (250K) |
+| Similarity function | Cosine similarity |
+| Language | Azerbaijani (az) |
+| License | Apache 2.0 |
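To make the similarity function concrete, the sketch below ranks passages by cosine similarity in plain Python. The vectors are toy 3-dimensional values, not real model outputs, and `cosine` and `top_k` are hypothetical helper names. In practice `model.similarity` or a vector index would do this work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, passage_vecs, k=10):
    """Return (index, score) pairs for the k most similar passages, best first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors; real LocRet-small embeddings have 384 dimensions.
query_vec = [0.2, 0.9, 0.1]
passage_vecs = [[0.1, 0.8, 0.2], [0.9, 0.1, 0.0], [0.2, 0.9, 0.1]]
ranking = top_k(query_vec, passage_vecs, k=2)
```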
+## Limitations
+- Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model.
+- Requires `"query: "` and `"passage: "` prefixes for optimal performance.
+- Maximum input length is 512 tokens. Longer documents should be chunked.
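The chunking advice can be sketched as a sliding window over a token sequence. The example below is a simplification: whitespace-separated words stand in for tokenizer tokens, the window size of 200 words and overlap of 20 are arbitrary illustrative values chosen to leave headroom below 512 subword tokens, and `chunk_tokens` is a hypothetical helper. A production pipeline would count tokens with the model's own tokenizer.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token sequence into windows of at most max_tokens with some overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    stride = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the sequence
    return chunks


# Whitespace words stand in for tokenizer tokens (an assumption for illustration).
words = ("Bakı Azərbaycanın paytaxtıdır. " * 400).split()
passages = [
    "passage: " + " ".join(chunk)
    for chunk in chunk_tokens(words, max_tokens=200, overlap=20)
]
```

Each chunk gets its own `"passage: "` prefix, since every chunk is encoded as a separate document.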
+## Citation
+```bibtex
+@misc{locret-small-2026,
+  title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model},
+  author={LocalDoc},
+  year={2026},
+  url={https://huggingface.co/LocalDoc/LocRet-small}
+}
+```
+## Acknowledgments
+- Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
+- Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
+- Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research.