Update README.md
README.md CHANGED
@@ -49,6 +49,9 @@ language:
 This model has been trained on the EntityCS corpus, a multilingual corpus from Wikipedia with entities replaced in different languages.
 The corpus can be found at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); check the link for more details.
 
+Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
+with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
+
 To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
 to an entity. By predicting the masked entities in ENTITYCS sentences, we expect the model to capture the semantics of the same entity in different
 languages.
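To make the two masking schemes above concrete, here is a minimal, illustrative sketch (not the authors' code): the same 80-10-10 procedure is applied to a set of candidate positions, and restricting that set to subwords inside entities turns plain MLM into an Entity Prediction style objective. The function name and the `candidate_mask` input are assumptions for illustration.

```python
import torch

def mask_candidates_80_10_10(input_ids, candidate_mask, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply 80-10-10 masking to the positions allowed by `candidate_mask`.

    Plain MLM: `candidate_mask` marks every non-special subword.
    Entity Prediction style: `candidate_mask` marks only subwords inside entities.
    """
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Sample ~15% of the allowed positions as prediction targets.
    probs = torch.full(input_ids.shape, mlm_prob) * candidate_mask.float()
    targets = torch.bernoulli(probs).bool()
    labels[~targets] = -100  # non-targets are ignored by the loss

    # 80% of targets become [MASK].
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & targets
    input_ids[to_mask] = mask_token_id

    # 10% of targets become a random subword (half of the remaining 20%).
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & targets & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # The remaining 10% of targets are left unchanged ("Same").
    return input_ids, labels
```

Note that the exact ratios used inside entities for the PEP variants follow the EntityCS paper and may differ from this generic 80-10-10 split.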
@@ -86,7 +89,7 @@ This model was trained with the **PEP<sub>MS</sub> + MLM** objective on the Enti
 We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
 We set the batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256.
 For speed-up, we use fp16 mixed precision.
-We use the sampling strategy proposed by [Conneau and Lample (2019)](), where high resource languages are down-sampled and low
+We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high resource languages are down-sampled and low
 resource languages get sampled more frequently.
 We only train the embedding layer and the last two layers of the model.
 We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
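For context, the sampling strategy referenced in the hunk above smooths each language's empirical share of the corpus with an exponent alpha < 1, so high-resource languages are down-sampled and low-resource ones sampled more often. A minimal sketch follows, assuming the usual formulation q_i proportional to p_i^alpha; the value of alpha is an assumption here (related work uses values such as 0.5 or 0.3) and is not stated in this README.

```python
def language_sampling_probs(sentence_counts, alpha=0.5):
    """Smoothed multinomial over languages: q_i proportional to p_i ** alpha.

    `sentence_counts` maps language code -> number of sentences; alpha < 1
    boosts low-resource languages relative to their raw share of the corpus.
    """
    total = sum(sentence_counts.values())
    p = {lang: n / total for lang, n in sentence_counts.items()}
    z = sum(v ** alpha for v in p.values())
    return {lang: (v ** alpha) / z for lang, v in p.items()}

# Example: a 100:1 imbalance between two languages shrinks to 10:1 with alpha = 0.5.
print(language_sampling_probs({"en": 1_000_000, "sw": 10_000}))
```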
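Since only the embedding layer and the last two Transformer layers are updated, the fine-tuning setup can be approximated as below. This is a hedged sketch using the `transformers` XLM-R checkpoint named above; whether the MLM head is also kept trainable is an assumption, not something the README states.

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Re-enable the embeddings, the last two encoder layers and (assumed) the MLM head.
trainable_modules = [
    model.roberta.embeddings,
    model.roberta.encoder.layer[-2],
    model.roberta.encoder.layer[-1],
    model.lm_head,
]
for module in trainable_modules:
    for param in module.parameters():
        param.requires_grad = True

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_trainable:,} trainable parameters")
```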