Update README.md
README.md CHANGED
@@ -49,6 +49,9 @@ language:
 This model has been trained on the EntityCS corpus, a multilingual corpus from Wikipedia with entities replaced in different languages.
 The corpus can be found at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); check the link for more details.
 
+Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
+with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
+
 To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
 to an entity. By predicting the masked entities in ENTITYCS sentences, we expect the model to capture the semantics of the same entity in different
 languages.
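To make the two masking schemes above concrete, here is a minimal, illustrative sketch (not the authors' code): the same 80-10-10 procedure is applied to a set of candidate positions, and restricting that set to subwords inside entities turns plain MLM into an Entity Prediction style objective. The function name and the `candidate_mask` input are assumptions for illustration.

```python
import torch

def mask_candidates_80_10_10(input_ids, candidate_mask, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply 80-10-10 masking to the positions allowed by `candidate_mask`.

    Plain MLM: `candidate_mask` marks every non-special subword.
    Entity Prediction style: `candidate_mask` marks only subwords inside entities.
    """
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Sample ~15% of the allowed positions as prediction targets.
    probs = torch.full(input_ids.shape, mlm_prob) * candidate_mask.float()
    targets = torch.bernoulli(probs).bool()
    labels[~targets] = -100  # non-targets are ignored by the loss

    # 80% of targets become [MASK].
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & targets
    input_ids[to_mask] = mask_token_id

    # 10% of targets become a random subword (half of the remaining 20%).
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & targets & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # The remaining 10% of targets are left unchanged ("Same").
    return input_ids, labels
```

Note that the exact ratios used inside entities for the PEP variants follow the EntityCS paper and may differ from this generic 80-10-10 split.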
@@ -86,7 +89,7 @@ This model was trained with the **PEP<sub>MS</sub> + MLM** objective on the Enti
 We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
 We set the batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256.
 For speed-up, we use fp16 mixed precision.
-We use the sampling strategy proposed by [Conneau and Lample (2019)](), where high resource languages are down-sampled and low
+We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high resource languages are down-sampled and low
 resource languages get sampled more frequently.
 We only train the embedding layer and the last two layers of the model.
 We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
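For context, the sampling strategy referenced in the hunk above smooths each language's empirical share of the corpus with an exponent alpha < 1, so high-resource languages are down-sampled and low-resource ones sampled more often. A minimal sketch follows, assuming the usual formulation q_i proportional to p_i^alpha; the value of alpha is an assumption here (related work uses values such as 0.5 or 0.3) and is not stated in this README.

```python
def language_sampling_probs(sentence_counts, alpha=0.5):
    """Smoothed multinomial over languages: q_i proportional to p_i ** alpha.

    `sentence_counts` maps language code -> number of sentences; alpha < 1
    boosts low-resource languages relative to their raw share of the corpus.
    """
    total = sum(sentence_counts.values())
    p = {lang: n / total for lang, n in sentence_counts.items()}
    z = sum(v ** alpha for v in p.values())
    return {lang: (v ** alpha) / z for lang, v in p.items()}

# Example: a 100:1 imbalance between two languages shrinks to 10:1 with alpha = 0.5.
print(language_sampling_probs({"en": 1_000_000, "sw": 10_000}))
```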
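Since only the embedding layer and the last two Transformer layers are updated, the fine-tuning setup can be approximated as below. This is a hedged sketch using the `transformers` XLM-R checkpoint named above; whether the MLM head is also kept trainable is an assumption, not something the README states.

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Re-enable the embeddings, the last two encoder layers and (assumed) the MLM head.
trainable_modules = [
    model.roberta.embeddings,
    model.roberta.encoder.layer[-2],
    model.roberta.encoder.layer[-1],
    model.lm_head,
]
for module in trainable_modules:
    for param in module.parameters():
        param.requires_grad = True

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_trainable:,} trainable parameters")
```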