izhx committed
Commit 80e535e
1 Parent(s): 9a28961

Update README.md

Files changed (1):
  1. README.md +82 -3

README.md CHANGED
---
license: apache-2.0
datasets:
- allenai/c4
language:
- en
pipeline_tag: fill-mask
---

## gte-multilingual-mlm-base

We introduce the `GTE-v1.5` series: new generalized text encoder, embedding, and reranking models that support a context length of up to 8192 tokens.
The models are built upon the transformer++ encoder backbone (BERT + RoPE + GLU; implementation available at [`Alibaba-NLP/new-impl`](https://huggingface.co/Alibaba-NLP/new-impl))
and use the vocabulary of `bert-base-uncased`.

This text encoder is the `GTEv1.5-en-MLM-large-8192` model in Table 13 of our [paper](https://arxiv.org/pdf/2407.19669).

- **Developed by**: Institute for Intelligent Computing, Alibaba Group
- **Model type**: Text Encoder
- **Paper**: [mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval](https://arxiv.org/pdf/2407.19669)

### Model list

| Models | Language | Model Size | Max Seq. Length | GLUE | XTREME-R |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| [`gte-multilingual-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base) | Multiple | 306M | 8192 | 83.47 | 64.44 |
| [`gte-en-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base) | English | 137M | 8192 | 85.61 | - |
| [`gte-en-mlm-large`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-large) | English | 435M | 8192 | 87.58 | - |
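
Below is a minimal usage sketch (not part of the original card) showing how a fill-mask checkpoint like this one is typically loaded with `transformers`. The `trust_remote_code=True` flag is an assumption based on the custom `new-impl` backbone, and the example sentence is purely illustrative.

```python
# Illustrative only: load the encoder and predict a masked token.
# trust_remote_code=True is assumed to be required for the custom backbone code.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Alibaba-NLP/gte-multilingual-mlm-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

text = f"Paris is the {tokenizer.mask_token} of France."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```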

## Training Details

### Training Data

- Masked language modeling (MLM): `c4-en`, the English portion of C4 (a loading snippet is sketched below)
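
As a point of reference only (not from the card), the corpus can be streamed from the Hugging Face Hub roughly as follows; the split and the streaming choice are assumptions:

```python
# Illustrative sketch: stream the English portion of C4 used for MLM pre-training.
from datasets import load_dataset

c4_en = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4_en))["text"][:200])  # peek at the first document
```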

### Training Procedure

To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy.
The model first undergoes preliminary MLM pre-training on shorter sequences.
We then resample the data, reducing the proportion of short texts, and continue MLM pre-training at longer lengths.

The entire training process is as follows (a sketch of the corresponding trainer settings appears after the list):
- MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
- MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
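
For concreteness, here is a minimal sketch of how the MLM-512 stage could be expressed with the Hugging Face `Trainer` API. Only the learning rate, masking probability, effective batch size, and step count come from the list above; the device/accumulation split, scheduler, warmup, and precision settings are assumptions, and the `rope_base` change for MLM-8192 would live in the model configuration rather than in these arguments.

```python
# Illustrative only: hyperparameters of the MLM-512 stage mapped onto
# DataCollatorForLanguageModeling + TrainingArguments. Values not listed in the
# card (warmup, scheduler, per-device split, precision) are assumptions.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # the stated vocabulary

# Mask 30% of input tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.3)

args = TrainingArguments(
    output_dir="gte-mlm-512",        # hypothetical output path
    learning_rate=2e-4,              # MLM-512 stage
    max_steps=300_000,
    per_device_train_batch_size=64,  # 4096 effective = 64 per device x 8 devices x 8 accumulation (assumed split)
    gradient_accumulation_steps=8,
    lr_scheduler_type="linear",      # assumption
    warmup_steps=10_000,             # assumption
    bf16=True,                       # assumption
)
```

The MLM-2048 and MLM-8192 stages would reuse the same collator with lr 5e-5, longer sequence lengths, and (for MLM-8192) rope_base 160000 set in the model configuration, per the list above.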

## Evaluation

| Models | Language | Model Size | Max Seq. Length | GLUE | XTREME-R |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| **[`gte-multilingual-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base)** | Multiple | 306M | 8192 | 83.47 | 64.44 |
| **[`gte-en-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base)** | English | 137M | 8192 | 85.61 | - |
| **[`gte-en-mlm-large`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-large)** | English | 435M | 8192 | 87.58 | - |
| [`MosaicBERT-base`](https://huggingface.co/mosaicml/mosaic-bert-base) | English | 137M | 128 | 85.4 | - |
| [`MosaicBERT-base-2048`](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-2048) | English | 137M | 2048 | 85 | - |
| `JinaBERT-base` | English | 137M | 512 | 85 | - |
| [`nomic-bert-2048`](https://huggingface.co/nomic-ai/nomic-bert-2048) | English | 137M | 2048 | 84 | - |
| `MosaicBERT-large` | English | 434M | 128 | 86.1 | - |
| `JinaBERT-large` | English | 434M | 512 | 83.7 | - |
| [`XLM-R-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) | Multiple | 279M | 512 | 80.44 | 62.02 |
| [`RoBERTa-base`](https://huggingface.co/FacebookAI/roberta-base) | English | 125M | 512 | 86.4 | - |
| [`RoBERTa-large`](https://huggingface.co/FacebookAI/roberta-large) | English | 355M | 512 | 88.9 | - |

## Citation

If you find our paper or models helpful, please consider citing them as follows:

```bibtex
@misc{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}
```