ng3owb committed on
Commit
f6d0ad0
·
verified ·
1 Parent(s): c44759a

Upload folder using huggingface_hub

1_Pooling/config.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "word_embedding_dimension": 1024,
3
- "pooling_mode_cls_token": false,
4
- "pooling_mode_mean_tokens": true,
5
  "pooling_mode_max_tokens": false,
6
  "pooling_mode_mean_sqrt_len_tokens": false,
7
  "pooling_mode_weightedmean_tokens": false,
 
1
  {
2
  "word_embedding_dimension": 1024,
3
+ "pooling_mode_cls_token": true,
4
+ "pooling_mode_mean_tokens": false,
5
  "pooling_mode_max_tokens": false,
6
  "pooling_mode_mean_sqrt_len_tokens": false,
7
  "pooling_mode_weightedmean_tokens": false,
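The hunk above switches the pooling module from mean pooling to CLS pooling. A minimal sketch of the difference between the two modes, using plain Python lists as hypothetical token embeddings (the function names and the toy values are illustrative, not part of the repository):

```python
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[Vector]) -> Vector:
    """Return the embedding of the first token (the CLS / <s> position)."""
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the embeddings of non-padding tokens only."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / count for value in total]


# Hypothetical 2-dimensional token embeddings; the last row is padding.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(tokens))         # [1.0, 0.0]
print(mean_pool(tokens, mask))  # [1.0, 1.0]
```

The two modes generally give different sentence vectors for the same input, so embeddings produced before and after this commit should not be mixed in one index.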
README.md CHANGED
@@ -4,88 +4,297 @@ tags:
4
  - sentence-transformers
5
  - feature-extraction
6
  - sentence-similarity
7
-
8
  ---
9
 
10
- # {MODEL_NAME}
11
 
12
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
 
13
 
14
- <!--- Describe your model here -->
15
 
16
- ## Usage (Sentence-Transformers)
17
 
18
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
19
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
  ```
21
- pip install -U sentence-transformers
 
 
22
  ```
 
 
 
 
 
23
 
24
- Then you can use the model like this:
25
 
 
 
 
26
  ```python
27
- from sentence_transformers import SentenceTransformer
28
- sentences = ["This is an example sentence", "Each sentence is converted"]
 
 
 
 
 
 
29
 
30
- model = SentenceTransformer('{MODEL_NAME}')
31
- embeddings = model.encode(sentences)
32
- print(embeddings)
 
 
 
 
 
33
  ```
 
 
34
 
35
 
 
 
 
36
 
37
- ## Evaluation Results
38
 
39
- <!--- Describe how your model was evaluated -->
 
 
40
 
41
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
 
42
 
 
 
 
 
43
 
44
- ## Training
45
- The model was trained with the parameters:
46
 
47
- **DataLoader**:
 
 
 
48
 
49
- `torch.utils.data.dataloader.DataLoader` of length 252 with parameters:
50
- ```
51
- {'batch_size': 4, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
52
  ```
53
 
54
- **Loss**:
 
 
55
 
56
- `sentence_transformers.losses.MultipleNegativesSymmetricRankingLoss.MultipleNegativesSymmetricRankingLoss` with parameters:
57
- ```
58
- {'scale': 20.0, 'similarity_fct': 'cos_sim'}
59
- ```
60
 
61
- Parameters of the fit()-Method:
62
- ```
63
- {
64
- "epochs": 4,
65
- "evaluation_steps": 20,
66
- "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
67
- "max_grad_norm": 1,
68
- "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
69
- "optimizer_params": {
70
- "lr": 2e-05
71
- },
72
- "scheduler": "WarmupLinear",
73
- "steps_per_epoch": null,
74
- "warmup_steps": 10000,
75
- "weight_decay": 0.01
76
- }
77
- ```
78
 
 
 
79
 
80
- ## Full Model Architecture
 
 
 
81
  ```
82
- SentenceTransformer(
83
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
84
- (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
85
- (2): Normalize()
86
- )
87
  ```
88
 
89
- ## Citing & Authors
90
 
91
- <!--- Describe where people can find more information -->
4
  - sentence-transformers
5
  - feature-extraction
6
  - sentence-similarity
7
+ license: mit
8
  ---
9
 
10
+ For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
11
+
12
+ # BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
13
+
14
+ In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
15
+ - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
16
+ - Multi-Linguality: It can support more than 100 working languages.
17
+ - Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
18
+
19
+
20
+
21
+ **Some suggestions for retrieval pipeline in RAG**
22
+
23
+ We recommend the following pipeline: hybrid retrieval + re-ranking.
24
+ - Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.
25
+ A classic example: using both embedding retrieval and the BM25 algorithm.
26
+ Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
27
+ This allows you to obtain token weights (similar to BM25) at no additional cost when generating dense embeddings.
28
+ To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
30
+
31
+ - As cross-encoder models, re-rankers achieve higher accuracy than bi-encoder embedding models.
32
+ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text.
33
+
34
+
35
+ ## News:
36
+ - 2024/7/1: **We update the MIRACL evaluation results of BGE-M3**. To reproduce the new results, you can refer to: [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr). We have also updated our [paper](https://arxiv.org/pdf/2402.03216) on arXiv.
37
+ <details>
38
+ <summary> Details </summary>
39
+
40
+ The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.
41
+
42
+ </details>
43
+ - 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
45
+ - 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
46
+ - 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
47
+ - 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
48
+ - 2024/2/1: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
49
+
50
+
51
+ ## Specs
52
+
53
+ - Model
54
+
55
+ | Model Name | Dimension | Sequence Length | Introduction |
56
+ |:----:|:---:|:---:|:---:|
57
+ | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
58
+ | [BAAI/bge-m3-unsupervised](https://huggingface.co/BAAI/bge-m3-unsupervised) | 1024 | 8192 | multilingual; contrastive learning from bge-m3-retromae |
59
+ | [BAAI/bge-m3-retromae](https://huggingface.co/BAAI/bge-m3-retromae) | -- | 8192 | multilingual; extend the max_length of [xlm-roberta](https://huggingface.co/FacebookAI/xlm-roberta-large) to 8192 and further pretrained via [retromae](https://github.com/staoxiao/RetroMAE)|
60
+ | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | English model |
61
+ | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | English model |
62
+ | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | 384 | 512 | English model |
63
+
64
+ - Data
65
+
66
+ | Dataset | Introduction |
67
+ |:----------------------------------------------------------:|:-------------------------------------------------:|
68
+ | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages |
69
+ | [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) | Fine-tuning data used by bge-m3 |
70
+
71
+
72
+
73
+ ## FAQ
74
+
75
+ **1. Introduction for different retrieval methods**
76
 
77
+ - Dense retrieval: map the text into a single embedding, e.g., [DPR](https://arxiv.org/abs/2004.04906), [BGE-v1.5](https://github.com/FlagOpen/FlagEmbedding)
78
+ - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
79
+ - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
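The three retrieval methods listed above score a query against a passage in different ways. The sketch below illustrates one common formulation of each in plain Python. It is not the FlagEmbedding implementation: the function names are hypothetical, and averaging the per-query-vector maxima in the multi-vector case is an assumption of this sketch.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def dense_score(query: Vector, passage: Vector) -> float:
    """Dense retrieval: one vector per text, compared by cosine similarity."""
    return dot(normalize(query), normalize(passage))


def sparse_score(query_weights: Dict[str, float], passage_weights: Dict[str, float]) -> float:
    """Sparse retrieval: only tokens present in both texts contribute."""
    return sum(w * passage_weights[tok] for tok, w in query_weights.items() if tok in passage_weights)


def multi_vector_score(query_vecs: List[Vector], passage_vecs: List[Vector]) -> float:
    """Multi-vector retrieval: each query vector takes its best passage match."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in passage_vecs]
    return sum(max(dot(qv, pv) for pv in p) for qv in q) / len(q)


# Toy inputs, chosen only to make the arithmetic easy to follow.
print(dense_score([1.0, 0.0], [1.0, 0.0]))                             # 1.0
print(sparse_score({"bm": 0.5, "25": 0.5}, {"bm": 0.4, "of": 0.1}))    # 0.2
print(multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]))      # 0.5
```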
80
 
 
81
 
82
+ **2. How to use BGE-M3 in other projects?**
83
 
84
+ For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE.
85
+ The only difference is that the BGE-M3 model no longer requires adding instructions to the queries.
86
 
87
+ For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
89
+
90
+
91
+ **3. How to fine-tune the BGE-M3 model?**
92
+
93
+ You can follow the common procedure in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
94
+ to fine-tune the dense embedding.
95
+
96
+ If you want to fine-tune all embedding functions of M3 (dense, sparse, and colbert), you can refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).
97
+
98
+
99
+
100
+
101
+
102
+
103
+ ## Usage
104
+
105
+ Install:
106
  ```
107
+ git clone https://github.com/FlagOpen/FlagEmbedding.git
108
+ cd FlagEmbedding
109
+ pip install -e .
110
  ```
111
+ or:
112
+ ```
113
+ pip install -U FlagEmbedding
114
+ ```
115
+
116
 
 
117
 
118
+ ### Generate Embedding for text
119
+
120
+ - Dense Embedding
  ```python
+ from FlagEmbedding import BGEM3FlagModel
+
+ model = BGEM3FlagModel('BAAI/bge-m3',
+                        use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation
+
+ sentences_1 = ["What is BGE M3?", "Defination of BM25"]
+ sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
+                "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+ embeddings_1 = model.encode(sentences_1,
+                             batch_size=12,
+                             max_length=8192,  # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
+                             )['dense_vecs']
+ embeddings_2 = model.encode(sentences_2)['dense_vecs']
+ similarity = embeddings_1 @ embeddings_2.T
+ print(similarity)
+ # [[0.6265, 0.3477], [0.3499, 0.678 ]]
  ```
140
+ You can also use sentence-transformers and huggingface transformers to generate dense embeddings.
141
+ Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.
142
 
143
 
144
+ - Sparse Embedding (Lexical Weight)
+ ```python
+ from FlagEmbedding import BGEM3FlagModel

+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

+ sentences_1 = ["What is BGE M3?", "Defination of BM25"]
+ sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
+                "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+ output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
+ output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

+ # you can see the weight for each token:
+ print(model.convert_id_to_token(output_1['lexical_weights']))
+ # [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
+ #  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

+ # compute the scores via lexical matching
+ lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
+ print(lexical_scores)
+ # 0.19554901123046875

+ print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
+ # 0.0
  ```
171
 
172
+ - Multi-Vector (ColBERT)
+ ```python
+ from FlagEmbedding import BGEM3FlagModel

+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

+ sentences_1 = ["What is BGE M3?", "Defination of BM25"]
+ sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
+                "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

+ output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
+ output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

+ print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
+ print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
+ # 0.7797
+ # 0.4620
  ```
190
+
191
+
192
+ ### Compute score for text pairs
193
+ Given a list of text pairs, you can get the scores computed by the different methods.
+ ```python
+ from FlagEmbedding import BGEM3FlagModel
+
+ model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
+
+ sentences_1 = ["What is BGE M3?", "Defination of BM25"]
+ sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
+                "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
+
+ sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]
+
+ print(model.compute_score(sentence_pairs,
+                           max_passage_length=128,  # a smaller max length leads to a lower latency
+                           weights_for_different_modes=[0.4, 0.2, 0.4]))  # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
+
+ # {
+ #   'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
+ #   'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
+ #   'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
+ #   'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
+ #   'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
+ # }
  ```
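The combined scores in the output above can be checked by hand. The sketch below recomputes the first pair's `sparse+dense` and `colbert+sparse+dense` values from the individual scores printed in the README. Dividing by the sum of the weights used is an inference from those printed numbers, not something confirmed from the library source, and `fuse` is a hypothetical helper name.

```python
from typing import Sequence


def fuse(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-mode scores (assumed normalization by the weight sum)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Scores of the first sentence pair, copied from the README output.
dense, sparse, colbert = 0.6259765625, 0.195556640625, 0.7796499729156494
weights = [0.4, 0.2, 0.4]  # order: dense, sparse, colbert

all_three = fuse([dense, sparse, colbert], weights)
sparse_dense = fuse([dense, sparse], weights[:2])

print(round(all_three, 4))     # 0.6014
print(round(sparse_dense, 4))  # 0.4825
```

Both values agree with the printed `0.6013619303703308` and `0.482503205537796` to within floating-point noise, which is expected given that the model runs in fp16.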
217
 
 
218
 
219
+
220
+
221
+ ## Evaluation
222
+
223
+ We provide the evaluation script for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR)
224
+
225
+ ### Benchmarks from the open-source community
226
+ ![avatar](./imgs/others.webp)
227
+ The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI).
228
+ For more details, please refer to the [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) and [Github Repo](https://github.com/Yannael/multilingual-embeddings)
229
+
230
+
231
+ ### Our results
232
+ - Multilingual (Miracl dataset)
233
+
234
+ ![avatar](./imgs/miracl.jpg)
235
+
236
+ - Cross-lingual (MKQA dataset)
237
+
238
+ ![avatar](./imgs/mkqa.jpg)
239
+
240
+ - Long Document Retrieval
241
+ - MLDR:
242
+ ![avatar](./imgs/long.jpg)
243
+ Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
244
+ covering 13 languages, including test set, validation set, and training set.
245
+ We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
246
+ Therefore, comparing baselines with `Dense w.o.long` (fine-tuning without the long document dataset) is more equitable.
247
+ Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
248
+ We believe that this data will be helpful for the open-source community in training document retrieval models.
249
+
250
+ - NarrativeQA:
251
+ ![avatar](./imgs/nqa.jpg)
252
+
253
+ - Comparison with BM25
254
+
255
+ We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
256
+ We tested BM25 using two different tokenizers:
257
+ one using Lucene Analyzer and the other using the same tokenizer as M3 (i.e., the tokenizer of xlm-roberta).
258
+ The results indicate that BM25 remains a competitive baseline,
259
+ especially in long document retrieval.
260
+
261
+ ![avatar](./imgs/bm25.jpg)
262
+
263
+
264
+
265
+ ## Training
266
+ - Self-knowledge Distillation: combining multiple outputs from different
267
+ retrieval modes as a reward signal to enhance the performance of each single mode (especially sparse retrieval and multi-vector (ColBERT) retrieval)
268
+ - Efficient Batching: Improve the efficiency when fine-tuning on long text.
269
+ The small-batch strategy is simple but effective, and it can also be used to fine-tune large embedding models.
270
+ - MCLS: A simple method to improve the performance on long text without fine-tuning.
271
+ If you do not have enough resources to fine-tune the model on long text, this method is useful.
272
+
273
+ Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
274
+
275
+
276
+
277
+
278
+
279
+
280
+ ## Acknowledgement
281
+
282
+ Thanks to the authors of open-sourced datasets, including Miracl, MKQA, NarrativeQA, etc.
283
+ Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [Pyserini](https://github.com/castorini/pyserini).
284
+
285
+
286
+
287
+ ## Citation
288
+
289
+ If you find this repository useful, please consider giving it a star :star: and citing it.
290
+
291
+ ```
292
+ @misc{bge-m3,
293
+ title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
294
+ author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
295
+ year={2024},
296
+ eprint={2402.03216},
297
+ archivePrefix={arXiv},
298
+ primaryClass={cs.CL}
299
+ }
300
+ ```
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "ng3owb/financial_embedding_model_balanced",
3
  "architectures": [
4
  "XLMRobertaModel"
5
  ],
@@ -13,7 +13,7 @@
13
  "initializer_range": 0.02,
14
  "intermediate_size": 4096,
15
  "layer_norm_eps": 1e-05,
16
- "max_position_embeddings": 514,
17
  "model_type": "xlm-roberta",
18
  "num_attention_heads": 16,
19
  "num_hidden_layers": 24,
@@ -21,7 +21,7 @@
21
  "pad_token_id": 1,
22
  "position_embedding_type": "absolute",
23
  "torch_dtype": "float32",
24
- "transformers_version": "4.44.2",
25
  "type_vocab_size": 1,
26
  "use_cache": true,
27
  "vocab_size": 250002
 
1
  {
2
+ "_name_or_path": "BAAI/bge-m3",
3
  "architectures": [
4
  "XLMRobertaModel"
5
  ],
 
13
  "initializer_range": 0.02,
14
  "intermediate_size": 4096,
15
  "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 8194,
17
  "model_type": "xlm-roberta",
18
  "num_attention_heads": 16,
19
  "num_hidden_layers": 24,
 
21
  "pad_token_id": 1,
22
  "position_embedding_type": "absolute",
23
  "torch_dtype": "float32",
24
+ "transformers_version": "4.47.0",
25
  "type_vocab_size": 1,
26
  "use_cache": true,
27
  "vocab_size": 250002
config_sentence_transformers.json CHANGED
@@ -1,8 +1,8 @@
1
  {
2
  "__version__": {
3
  "sentence_transformers": "3.3.1",
4
- "transformers": "4.44.2",
5
- "pytorch": "2.4.1+cu121"
6
  },
7
  "prompts": {},
8
  "default_prompt_name": null,
 
1
  {
2
  "__version__": {
3
  "sentence_transformers": "3.3.1",
4
+ "transformers": "4.47.0",
5
+ "pytorch": "2.5.1+cu121"
6
  },
7
  "prompts": {},
8
  "default_prompt_name": null,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7405c39289911a3ac104ec423fe798e86893315bcee38ca2e9fa9f66959fc679
3
- size 2239607176
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:993b2248881724788dcab8c644a91dfd63584b6e5604ff2037cb5541e1e38e7e
3
+ size 2271064456
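The growth of `model.safetensors` is consistent with the larger position-embedding table introduced in `config.json`. The arithmetic below is a sanity check rather than a proof: it assumes float32 weights (matching `torch_dtype`) and a hidden size of 1024 (matching `word_embedding_dimension` in the pooling config).

```python
# Values copied from the diff.
old_positions, new_positions = 514, 8194
old_size, new_size = 2239607176, 2271064456

# Assumptions: hidden size equals word_embedding_dimension, weights are float32.
hidden_size = 1024
bytes_per_param = 4

extra_bytes = (new_positions - old_positions) * hidden_size * bytes_per_param

print(extra_bytes)          # 31457280
print(new_size - old_size)  # 31457280
```

The two numbers match exactly, which suggests the file sizes differ only by the 7680 additional position-embedding rows.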
regressor.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4b8c8871e89cabe6828e637c80eba8dda1bbb7173fd9951f6bf98fa4e994aa71
3
  size 7014200
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea9db9b8646a5d08df84f511732a7b45f9d02a5e0beed259281b7012c9d295f8
3
  size 7014200
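Each of these binary-file hunks is a Git LFS pointer: a small text file with `version`, `oid`, and `size` fields. A minimal sketch of reading the two `regressor.pt` pointers shown above (`parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling):

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split each 'key value' line of an LFS pointer file into a dict."""
    fields: Dict[str, Union[str, int]] = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


old = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:4b8c8871e89cabe6828e637c80eba8dda1bbb7173fd9951f6bf98fa4e994aa71
size 7014200
""")
new = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:ea9db9b8646a5d08df84f511732a7b45f9d02a5e0beed259281b7012c9d295f8
size 7014200
""")

print(old["size"] == new["size"])  # True: same byte length
print(old["oid"] == new["oid"])    # False: the contents changed
```

An identical size with a different hash means the file contents changed while its byte length stayed the same; the pointer alone does not say what changed.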
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
1
  {
2
- "max_seq_length": 512,
3
  "do_lower_case": false
4
  }
 
1
  {
2
+ "max_seq_length": 8192,
3
  "do_lower_case": false
4
  }
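`max_seq_length` is 8192 while `max_position_embeddings` in `config.json` is 8194. The two-row gap comes from the way RoBERTa-style models number positions: real tokens start counting at `padding_idx + 1`, and padding tokens keep `padding_idx`. The sketch below reproduces that scheme in plain Python, assuming `pad_token_id` is 1 as in the config; `roberta_position_ids` is an illustrative name, not a library function.

```python
from typing import List


def roberta_position_ids(attention_mask: List[int], padding_idx: int = 1) -> List[int]:
    """Number real tokens from padding_idx + 1; padding keeps padding_idx."""
    ids = []
    running = 0
    for keep in attention_mask:
        if keep:
            running += 1
            ids.append(running + padding_idx)
        else:
            ids.append(padding_idx)
    return ids


print(roberta_position_ids([1, 1, 1, 0]))  # [2, 3, 4, 1]

max_seq_length = 8192
largest_id = max(roberta_position_ids([1] * max_seq_length))
print(largest_id + 1)  # 8194 rows needed in the position-embedding table
```

The same reasoning explains the previous pair of values, 512 and 514.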
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
3
- size 17082987
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e4f7e21bec3fb0044ca0bb2d50eb5d4d8c596273c422baef84466d2c73748b9c
3
+ size 17083053
tokenizer_config.json CHANGED
@@ -47,16 +47,10 @@
47
  "eos_token": "</s>",
48
  "extra_special_tokens": {},
49
  "mask_token": "<mask>",
50
- "max_length": 512,
51
- "model_max_length": 512,
52
- "pad_to_multiple_of": null,
53
  "pad_token": "<pad>",
54
- "pad_token_type_id": 0,
55
- "padding_side": "right",
56
  "sep_token": "</s>",
57
- "stride": 0,
58
  "tokenizer_class": "XLMRobertaTokenizer",
59
- "truncation_side": "right",
60
- "truncation_strategy": "longest_first",
61
  "unk_token": "<unk>"
62
  }
 
47
  "eos_token": "</s>",
48
  "extra_special_tokens": {},
49
  "mask_token": "<mask>",
50
+ "model_max_length": 8192,
 
 
51
  "pad_token": "<pad>",
 
 
52
  "sep_token": "</s>",
53
+ "sp_model_kwargs": {},
54
  "tokenizer_class": "XLMRobertaTokenizer",
 
 
55
  "unk_token": "<unk>"
56
  }