---
language:
- az
- en
license: apache-2.0
library_name: pytorch
tags:
- colbert
- late-interaction
- retrieval
- information-retrieval
- azerbaijani
- multilingual
base_model: LocalDoc/mmBERT-base-en-az
pipeline_tag: feature-extraction
---

# ColBERT-AZ

A late-interaction retrieval model for Azerbaijani built on top of [mmBERT-base-en-az](https://huggingface.co/LocalDoc/mmBERT-base-en-az), trained via cross-encoder distillation from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) on a mix of native Azerbaijani and translated retrieval data.

ColBERT-AZ uses **late interaction** (token-level MaxSim scoring) rather than dense single-vector retrieval, which yields higher retrieval precision than bi-encoder models of similar or larger size.

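Concretely, the MaxSim score embeds query and document tokens separately, then sums, over query tokens, each token's maximum similarity to any document token:

```
score(q, d) = Σ_{i ∈ q} max_{j ∈ d} E_q[i] · E_d[j]
```

where `E_q` and `E_d` are the L2-normalized 128-dimensional token embeddings of the query and the document.
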
## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 165M |
| **Embedding dim** | 128 (per token) |
| **Backbone** | mmBERT-base-en-az (ModernBERT) |
| **Architecture** | Late interaction (ColBERT) |
| **Query max length** | 32 tokens |
| **Document max length** | 256 tokens |
| **Languages** | Azerbaijani, English |
| **Training epochs** | 1 |

## Training

### Data

ColBERT-AZ was trained on **3 million triplets** sampled from a weighted mix of four reranked datasets:

| Source | Weight | Type |
|--------|--------|------|
| [LocalDoc/msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | 50% | Translated web search |
| [LocalDoc/azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | 25% | Native literature |
| [LocalDoc/azerbaijan_legislation_queries_passages](https://huggingface.co/datasets/LocalDoc/azerbaijan_legislation_queries_passages) | 15% | Native legal |
| [LocalDoc/azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | 10% | Native general |

All datasets include reranker scores from `bge-reranker-v2-m3`, used as the teacher signal for knowledge distillation.

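As an illustration of the mix, a source can be drawn per triplet according to the weights above; a minimal sketch (the actual sampling code is not published, and the labels here are just shorthand for the datasets):

```python
import random

# Training-mix weights from the table above
SOURCE_WEIGHTS = {
    "msmarco-az-reranked": 0.50,
    "azerbaijani_books_retriever_corpus-reranked": 0.25,
    "azerbaijan_legislation_queries_passages": 0.15,
    "azerbaijani_retriever_corpus-reranked": 0.10,
}

def sample_sources(n_triplets: int = 3_000_000):
    """Draw a source label for each of the n triplets according to the mix."""
    names = list(SOURCE_WEIGHTS)
    weights = list(SOURCE_WEIGHTS.values())
    return random.choices(names, weights=weights, k=n_triplets)
```
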
### Recipe

| Hyperparameter | Value |
|----------------|-------|
| Optimizer | AdamW |
| Learning rate | 1e-6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Schedule | Cosine |
| Batch size | 16 (effective 32 via gradient accumulation) |
| Negatives per query (K) | 8 |
| False negative filter threshold | 0.9 × pos_score |
| Distillation alpha (KL weight) | 0.7 |
| Contrastive temperature | 0.05 |
| Teacher temperature | 1.0 |
| Mixed precision | BF16 |
| Epochs | 1 |
| Hardware | NVIDIA RTX 5090 (32GB) |

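Two of these rows interact: the K = 8 negatives per query are kept only if their teacher score falls below 0.9 × the positive's score, which discards likely false negatives. A minimal sketch of that rule (the function and argument names are hypothetical, not from the training code):

```python
def select_negatives(pos_score: float, candidates: list[tuple[str, float]],
                     threshold: float = 0.9, k: int = 8):
    """candidates: (passage, teacher_score) pairs mined for one query.
    Drop near-positives (score >= threshold * pos_score), keep the k hardest."""
    kept = [(p, s) for p, s in candidates if s < threshold * pos_score]
    kept.sort(key=lambda x: x[1], reverse=True)  # hardest (highest-scoring) first
    return kept[:k]
```
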
### Loss

The training objective combines KL distillation with InfoNCE:

```
L = α × KL(softmax(student_scores) || softmax(teacher_scores)) + (1 − α) × InfoNCE
```

where `α = 0.7` and the student scores are computed via MaxSim over [pos, neg_1, ..., neg_K].

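A minimal PyTorch sketch of this objective, following the formula above literally; the exact temperature placement and reduction in the actual training code may differ (here the contrastive temperature scales the InfoNCE logits and the teacher temperature scales the teacher scores):

```python
import torch
import torch.nn.functional as F

def colbert_az_loss(student_scores, teacher_scores,
                    alpha=0.7, tau_contrastive=0.05, tau_teacher=1.0):
    """student_scores, teacher_scores: (batch, 1 + K); column 0 is the positive."""
    # KL(softmax(student) || softmax(teacher)), as written in the formula
    log_p_s = F.log_softmax(student_scores, dim=-1)
    log_p_t = F.log_softmax(teacher_scores / tau_teacher, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()

    # InfoNCE: the positive (index 0) against the K hard negatives
    labels = torch.zeros(student_scores.size(0), dtype=torch.long,
                         device=student_scores.device)
    info_nce = F.cross_entropy(student_scores / tau_contrastive, labels)

    return alpha * kl + (1 - alpha) * info_nce
```
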
## Evaluation

### Held-out validation

Evaluated on 4,500 held-out triplets (1,500 per native source). For each query, the positive passage is ranked against 8 hard negatives.

| Source | R@1 | R@3 | MRR | NDCG@10 |
|--------|-----|-----|-----|---------|
| Books | 0.5387 | 0.7693 | 0.6821 | 0.7584 |
| Legislation | 0.6633 | 0.8433 | 0.7679 | 0.8234 |
| Retriever (general) | 0.8340 | 0.9327 | 0.8901 | 0.9167 |
| **Macro average** | **0.6787** | **0.8484** | **0.7800** | **0.8328** |

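With a single relevant passage per query, all of these metrics reduce to simple functions of the positive's rank; a sketch of how such numbers can be computed from per-query ranks (illustrative, not the benchmark's own scorer):

```python
import math

def ranking_metrics(ranks, ks=(1, 3)):
    """ranks: 1-based rank of the positive among the 9 candidates, one per query."""
    n = len(ranks)
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["MRR"] = sum(1.0 / r for r in ranks) / n
    # With one relevant document, NDCG@10 is 1 / log2(rank + 1) (0 if rank > 10)
    metrics["NDCG@10"] = sum(1.0 / math.log2(r + 1) for r in ranks if r <= 10) / n
    return metrics
```
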
### AZ-MIRAGE benchmark

Evaluated on the [AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) retrieval benchmark (7,373 queries over a 40,448-document pool):

| Metric | Score |
|--------|-------|
| P@1 | 0.3058 |
| R@5 | 0.7518 |
| R@10 | 0.8054 |
| NDCG@5 | 0.5528 |
| NDCG@10 | 0.5704 |
| MRR@10 | 0.4930 |
| F1@10 | 0.1464 |

Comparison with bi-encoder models on AZ-MIRAGE:

| Model | Params | NDCG@10 | MRR@10 | P@1 |
|-------|--------|---------|--------|-----|
| **ColBERT-AZ (this model)** | **165M** | **0.5704** | **0.4930** | **0.3058** |
| BAAI/bge-m3 | 568M | 0.5079 | 0.4204 | 0.2310 |
| google/gemini-embedding-2-preview | API | 0.5309 | 0.4372 | 0.2338 |
| perplexity/pplx-embed-v1-4b | API | 0.5225 | 0.4361 | 0.2470 |
| microsoft/harrier-oss-v1-0.6b | 600M | 0.5168 | 0.4321 | 0.2535 |
| intfloat/multilingual-e5-large | 560M | 0.4875 | 0.4043 | 0.2264 |
| intfloat/multilingual-e5-base | 278M | 0.4672 | 0.3852 | 0.2116 |
| sentence-transformers/LaBSE | 471M | 0.2472 | 0.1944 | 0.0943 |

## Usage

This repository contains:

- `config.json`, `model.safetensors`, `tokenizer.*` — encoder backbone (mmBERT-base-en-az)
- `projection.pt` — ColBERT linear projection layer (768 → 128, no bias)

ColBERT-AZ requires both the backbone and the projection layer for correct inference.

### Loading the model

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

class ColBERT(nn.Module):
    def __init__(self, model_name: str, embedding_dim: int = 128):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size, embedding_dim, bias=False)

    @torch.no_grad()
    def encode(self, input_ids, attention_mask, keep_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        emb = self.projection(out.last_hidden_state)
        emb = F.normalize(emb, p=2, dim=-1)
        eff_mask = attention_mask if keep_mask is None else attention_mask * keep_mask
        emb = emb * eff_mask.unsqueeze(-1).float()
        return emb, eff_mask

# Load tokenizer and add the ColBERT marker tokens
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/colbert-az")
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})

model = ColBERT("LocalDoc/colbert-az")
model.backbone.resize_token_embeddings(len(tokenizer))

# Load the projection layer weights
proj_path = hf_hub_download(repo_id="LocalDoc/colbert-az", filename="projection.pt")
model.projection.load_state_dict(torch.load(proj_path, map_location="cpu"))

model = model.to(device).eval()
```

### Encoding queries and documents

```python
# Tokenization helpers
def tokenize_query(text: str, max_len: int = 32):
    text = f"[Q] {text}"
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    # ColBERT query augmentation: replace [PAD] with [MASK] so queries expand to max_len
    pad_mask = enc["input_ids"] == tokenizer.pad_token_id
    enc["input_ids"][pad_mask] = tokenizer.mask_token_id
    enc["attention_mask"] = torch.ones_like(enc["input_ids"])
    return enc

def tokenize_doc(text: str, max_len: int = 256):
    text = f"[D] {text}"
    return tokenizer(text, padding=True, truncation=True,
                     max_length=max_len, return_tensors="pt")

# Compute the MaxSim score between a query and a single document
def maxsim_score(query: str, document: str) -> float:
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    d_enc = {k: v.to(device) for k, v in tokenize_doc(document).items()}

    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])
    d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])

    # MaxSim: for each query token, take the max similarity over doc tokens, then sum
    sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)
    sim = sim.masked_fill(~d_mask.unsqueeze(0).unsqueeze(0).bool(), float("-inf"))
    max_per_token, _ = sim.max(dim=-1)
    return max_per_token.sum(dim=1).item()

# Example: "The history of Azerbaijani culture" vs. a passage on Azerbaijani cultural history
query = "Azərbaycan mədəniyyətinin tarixi"
doc = "Azərbaycan mədəniyyəti zəngin tarixə malikdir və qədim dövrlərdən başlayaraq inkişaf edib."
print(f"Score: {maxsim_score(query, doc):.4f}")
```

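The single-document scorer above extends naturally to ranking a small candidate set in batches; a sketch reusing the same `model`, `tokenizer`, and `tokenize_query` (the batching details are illustrative, not part of the released code):

```python
def rank_documents(query: str, documents: list[str], batch_size: int = 32):
    """Score every document against one query; return (document, score) sorted descending."""
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])  # (1, Lq, 128)

    scores = []
    for i in range(0, len(documents), batch_size):
        batch = [f"[D] {d}" for d in documents[i:i + batch_size]]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=256, return_tensors="pt").to(device)
        d_emb, d_mask = model.encode(enc["input_ids"], enc["attention_mask"])  # (B, Ld, 128)

        # Same MaxSim as above, computed over a whole batch of documents at once
        sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)             # (1, Lq, B, Ld)
        sim = sim.masked_fill(~d_mask.bool()[None, None], float("-inf"))
        scores.extend(sim.max(dim=-1).values.sum(dim=1).squeeze(0).tolist())

    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
```
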
### Recommended retrieval pipeline

For production retrieval, pair ColBERT-AZ with an indexing library that supports late interaction:

- [ColBERT / PLAID](https://github.com/stanford-futuredata/ColBERT) — the official ColBERT indexing stack
- [pylate](https://github.com/lightonai/pylate) — a modern ColBERT training and retrieval framework

These libraries handle efficient indexing, scalable MaxSim retrieval, and quantization for production deployment.

## Citation

```bibtex
@misc{colbert-az-2026,
  title  = {ColBERT-AZ: Late-Interaction Retrieval for Azerbaijani},
  author = {LocalDoc},
  year   = {2026},
  url    = {https://huggingface.co/LocalDoc/colbert-az}
}
```

## License

Apache 2.0

## Acknowledgements

- Built on top of [mmBERT-base-en-az](https://huggingface.co/LocalDoc/mmBERT-base-en-az)
- Distilled from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- Original ColBERT architecture: Khattab & Zaharia (2020), [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)
- ColBERTv2 distillation methodology: Santhanam et al. (2022), [ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://arxiv.org/abs/2112.01488)