vrashad committed on
Commit
10fa932
·
verified ·
1 Parent(s): 2ee5191

Create README.md

Files changed (1)
  1. README.md +166 -0
README.md ADDED
@@ -0,0 +1,166 @@
+ ---
+ language:
+ - az
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - retrieval
+ - azerbaijani
+ - embedding
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ datasets:
+ - LocalDoc/msmarco-az-reranked
+ - LocalDoc/azerbaijani_retriever_corpus-reranked
+ - LocalDoc/ldquad_v2_retrieval-reranked
+ - LocalDoc/azerbaijani_books_retriever_corpus-reranked
+ base_model: intfloat/multilingual-e5-small
+ model-index:
+ - name: LocRet-small
+   results:
+   - task:
+       type: retrieval
+     dataset:
+       name: AZ-MIRAGE
+       type: custom
+     metrics:
+     - type: mrr@10
+       value: 0.5250
+     - type: ndcg@10
+       value: 0.6162
+     - type: recall@10
+       value: 0.8948
+ ---
+
+ # LocRet-small: Azerbaijani Retrieval Embedding Model
+
+ **LocRet-small** is a compact retrieval embedding model specialized for the Azerbaijani language. Despite being **4.8× smaller** than BGE-m3 (118M vs. 568M parameters), it outperforms it on the AZ-MIRAGE Azerbaijani retrieval benchmark (MRR@10 of 0.5250 vs. 0.4204).
+
+ ## Key Results
+
+ ### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval)
+
+ | Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 |
+ |:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:|
+ | **#1** | **LocRet-small** | **118M** | **0.5250** | **0.3132** | **0.8267** | **0.8948** | **0.5938** | **0.6162** |
+ | #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 |
+ | #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 |
+ | #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 |
+ | #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 |
+ | #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 |
+ | #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 |
+ | #8 | intfloat/multilingual-e5-small (base) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 |
+ | #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 |
+
+ ## Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("LocalDoc/LocRet-small")
+
+ # Queries and passages each need their own prefix.
+ queries = ["query: Azərbaycanın paytaxtı hansı şəhərdir?"]
+ passages = [
+     "passage: Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
+     "passage: Gəncə Azərbaycanın ikinci böyük şəhəridir.",
+ ]
+
+ query_embeddings = model.encode(queries)
+ passage_embeddings = model.encode(passages)
+
+ # Similarity matrix of shape (num_queries, num_passages).
+ similarities = model.similarity(query_embeddings, passage_embeddings)
+ print(similarities)
+ ```
+
+ > **Important:** Always use the `"query: "` prefix for queries and the `"passage: "` prefix for documents.
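
One way to avoid forgetting the prefixes is a small helper applied just before encoding. The sketch below is illustrative only: `add_prefix` is a hypothetical name and is not part of sentence-transformers or of this model's repository.

```python
def add_prefix(texts, kind):
    """Prepend the 'query: ' or 'passage: ' prefix to each text.

    `kind` must be "query" or "passage". Texts that already start with the
    prefix are left unchanged, so calling the helper twice is harmless.
    """
    if kind not in ("query", "passage"):
        raise ValueError(f"kind must be 'query' or 'passage', got {kind!r}")
    prefix = f"{kind}: "
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["Azərbaycanın paytaxtı hansı şəhərdir?"], "query")
passages = add_prefix(
    ["Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir."],
    "passage",
)
print(queries[0])
```

The resulting lists can be passed to `model.encode` in the same way as the hand-prefixed strings in the usage example.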
+
+ ## Training
+
+ ### Method
+
+ LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using **listwise KL distillation** combined with a contrastive loss:
+
+ $$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$
+
+ - **Listwise KL divergence**: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05).
+ - **In-batch contrastive loss (InfoNCE)**: Adds diversity by using the positive passages of other queries in the batch as in-batch negatives.
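
The combined objective can be sketched in plain Python for a handful of scores. This is a minimal illustration, not the training code: it assumes the KL term is KL(teacher ‖ student), that teacher scores are already normalized to [0, 1], that InfoNCE reuses the student temperature, and that both terms are averaged over queries. None of these details are stated in the text. All function names are hypothetical.

```python
import math

TAU_TEACHER = 0.3     # teacher temperature from the text
TAU_STUDENT = 0.05    # student temperature from the text
INFONCE_WEIGHT = 0.1  # weight of the contrastive term from the loss formula


def log_softmax(scores, tau):
    """Temperature-scaled, numerically stable log-softmax over a list."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def listwise_kl(teacher_scores, student_scores):
    """KL(teacher || student) over one candidate list (assumed direction)."""
    log_p = log_softmax(teacher_scores, TAU_TEACHER)
    log_q = log_softmax(student_scores, TAU_STUDENT)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def info_nce(sim_matrix):
    """In-batch InfoNCE: row i is query i, column i is its positive passage.

    Reusing TAU_STUDENT here is an assumption.
    """
    losses = [-log_softmax(row, TAU_STUDENT)[i] for i, row in enumerate(sim_matrix)]
    return sum(losses) / len(losses)


def total_loss(teacher_lists, student_lists, sim_matrix):
    """Mean listwise KL plus the weighted in-batch contrastive term."""
    kl = sum(listwise_kl(t, s) for t, s in zip(teacher_lists, student_lists))
    kl /= len(teacher_lists)
    return kl + INFONCE_WEIGHT * info_nce(sim_matrix)


# One query: positive first, then two hard negatives (made-up scores).
teacher_lists = [[0.9, 0.4, 0.1]]
student_lists = [[0.8, 0.6, 0.3]]
# Query-to-positive cosine similarities for a batch of two queries.
sim_matrix = [[0.8, 0.2], [0.1, 0.7]]
print(total_loss(teacher_lists, student_lists, sim_matrix))
```

Because the student temperature is much smaller than the teacher's, small gaps in cosine similarity are stretched into a sharp distribution, which is what lets a bounded similarity score match the teacher's ranking distribution.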
+
+ This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical when fine-tuning an already strong pre-trained retriever.
+
+ ### Data
+
+ The model was trained on approximately **3.5 million** Azerbaijani query-passage pairs from four datasets:
+
+ | Dataset | Pairs | Domain | Type |
+ |:--------|------:|:-------|:-----|
+ | [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ |
+ | [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ |
+ | [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ |
+ | [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ |
+
+ All datasets include hard negatives scored by a cross-encoder reranker; these scores serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds.
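
The exact filtering rule is not given here. As a rough sketch of what threshold-based false-negative filtering can look like, the example below drops any negative whose normalized reranker score comes too close to the positive's score. The function name, the ratio-based rule, and the value `max_ratio=0.95` are all assumptions for illustration. Only the cap of 10 negatives per query comes from the text.

```python
def filter_false_negatives(pos_score, negatives, max_ratio=0.95, max_negatives=10):
    """Keep hard negatives whose score is safely below the positive's score.

    negatives: list of (passage_id, normalized_score) pairs, scores in [0, 1].
    max_ratio is a hypothetical threshold: a negative is dropped when its
    score is at least max_ratio * pos_score, since it is likely relevant.
    """
    kept = [(pid, s) for pid, s in negatives if s < max_ratio * pos_score]
    # Hardest (highest-scoring) negatives first, capped at max_negatives.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_negatives]


candidates = [("p1", 0.95), ("p2", 0.60), ("p3", 0.88), ("p4", 0.05)]
print(filter_false_negatives(pos_score=0.92, negatives=candidates))
```

Here `p1` and `p3` score close to or above the positive and are discarded as probable false negatives, while `p2` and `p4` remain as training negatives.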
+
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |:----------|:------|
+ | Base model | intfloat/multilingual-e5-small |
+ | Max sequence length | 512 |
+ | Effective batch size | 256 |
+ | Learning rate | 5e-5 |
+ | Schedule | Linear warmup (5%) + cosine decay |
+ | Precision | FP16 |
+ | Epochs | 1 |
+ | Training time | ~25 hours |
+ | Hardware | 4× NVIDIA RTX 5090 (32 GB) |
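
To make the schedule row concrete, the following sketch computes the learning rate at a given optimizer step. It assumes the cosine phase decays all the way to zero and derives the step count from the approximate dataset size. Neither detail is stated in the table, and `lr_at` is a hypothetical helper rather than the scheduler actually used.

```python
import math

PEAK_LR = 5e-5          # learning rate from the table
WARMUP_FRACTION = 0.05  # 5% linear warmup from the table


def lr_at(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Roughly 3.5M pairs at an effective batch size of 256 for one epoch.
total_steps = 3_500_000 // 256
for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```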
+
+ ### Training Insights
+
+ - **Listwise KL distillation outperforms standard contrastive training** (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274).
+ - **Retrieval pre-training matters more than language-specific pre-training** for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model.
+ - **A mix of translated and native data** prevents catastrophic forgetting while enabling language specialization.
+
+ ## Benchmark
+
+ ### AZ-MIRAGE
+
+ [AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) is a native Azerbaijani retrieval benchmark with 7,373 queries and 40,448 document chunks covering diverse topics. It evaluates retrieval quality on naturally written Azerbaijani text.
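
For readers unfamiliar with the reported metrics, the sketch below computes the per-query reciprocal rank, recall, and nDCG at a cutoff of 10. Averaging these over all queries gives MRR@10, R@10, and NDCG@10. It assumes binary relevance labels. The benchmark's own evaluation script may differ in details, and `metrics_at_k` is a hypothetical name.

```python
import math


def metrics_at_k(ranked_ids, relevant_ids, k=10):
    """Return (reciprocal rank, recall, nDCG) at cutoff k for one query.

    ranked_ids: document ids ordered by descending similarity.
    relevant_ids: non-empty set of ids judged relevant (binary relevance).
    """
    top_k = ranked_ids[:k]
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in top_k]

    # Reciprocal rank of the first relevant document, 0 if none in the top k.
    rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)

    recall = sum(hits) / len(relevant_ids)

    # Binary-gain DCG, normalized by the best achievable ordering.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))

    return rr, recall, dcg / idcg


# The only relevant chunk is retrieved at rank 2.
rr, recall, ndcg = metrics_at_k(["d7", "d3", "d9"], {"d3"})
print(rr, recall, round(ndcg, 4))
```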
+
+ ## Model Details
+
+ | Property | Value |
+ |:---------|:------|
+ | Architecture | BERT encoder (XLM-RoBERTa tokenizer) |
+ | Parameters | 118M |
+ | Embedding dimension | 384 |
+ | Max tokens | 512 |
+ | Vocabulary | SentencePiece (250K) |
+ | Similarity function | Cosine similarity |
+ | Language | Azerbaijani (az) |
+ | License | Apache 2.0 |
+
+ ## Limitations
+
+ - Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model.
+ - Requires `"query: "` and `"passage: "` prefixes for optimal performance.
+ - Maximum input length is 512 tokens. Longer documents should be chunked.
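
A simple way to chunk long documents is a sliding window with some overlap. The sketch below splits on whitespace, which is only a rough proxy for the 512-token limit because the tokenizer splits words into subword pieces. The `max_words` and `overlap` defaults are hypothetical, untuned values, and `chunk_words` is not part of any library.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping windows of whitespace-separated words.

    Each chunk is returned with the 'passage: ' prefix already attached.
    Word counts only approximate the model's token limit.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append("passage: " + " ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, max_words=4, overlap=1):
    print(chunk)
```

For stricter control, the window size could be measured with the model's own tokenizer instead of whitespace words.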
+
+ ## Citation
+
+ ```bibtex
+ @misc{locret-small-2026,
+   title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model},
+   author={LocalDoc},
+   year={2026},
+   url={https://huggingface.co/LocalDoc/LocRet-small}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
+ - Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
+ - Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research.