fenchri committed
Commit 737d296
1 Parent(s): c3fa365

Update README.md

Files changed (1)
  1. README.md +4 -1
README.md CHANGED
@@ -49,6 +49,9 @@ language:
  This model has been trained on the EntityCS corpus, a multilingual corpus from Wikipedia with entities replaced in different languages.
  The corpus can be found at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); check the link for more details.
 
+ Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
+ with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
+
  To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
  to an entity. By predicting the masked entities in ENTITYCS sentences, we expect the model to capture the semantics of the same entity in different
  languages.
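
The added lines above describe two ways of choosing what to predict: the standard 80-10-10 MLM rule over 15% of subwords, and entity-only masking for the Entity Prediction objectives. Below is a minimal sketch of both, assuming a plain list of token ids and an `entity_spans` list of (start, end) indices; the helper names and the XLM-R vocabulary constants are illustrative assumptions, not details taken from the training code.

```python
# Illustrative sketch only: candidate selection and the 80-10-10 rule.
import random

MASK_TOKEN_ID = 250001   # assumption: <mask> id of the XLM-R tokenizer
VOCAB_SIZE = 250002      # assumption: XLM-R vocabulary size


def select_candidates(input_ids, entity_spans=None, mlm_prob=0.15):
    """Choose positions to corrupt.

    MLM: every subword is a candidate with probability mlm_prob (15%).
    Entity Prediction: only subwords inside entity spans are candidates.
    """
    if entity_spans is None:  # conventional MLM
        return [i for i in range(len(input_ids)) if random.random() < mlm_prob]
    return [i for start, end in entity_spans for i in range(start, end)]


def corrupt_80_10_10(input_ids, candidates):
    """80-10-10 rule: 80% [MASK], 10% random subword, 10% left unchanged."""
    corrupted = list(input_ids)
    labels = [-100] * len(input_ids)   # -100 = position ignored by the loss
    for i in candidates:
        labels[i] = input_ids[i]       # the model must predict the original subword
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN_ID                 # Mask
        elif r < 0.9:
            corrupted[i] = random.randrange(VOCAB_SIZE)  # Random
        # else: leave the subword as it is (Same)
    return corrupted, labels
```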
@@ -86,7 +89,7 @@ This model was trained with the **PEP<sub>MS</sub> + MLM** objective on the Enti
  We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
  We set batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256.
  For speedup we use fp16 mixed precision.
- We use the sampling strategy proposed by [Conneau and Lample (2019)](), where high resource languages are down-sampled and low
+ We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high resource languages are down-sampled and low
  resource languages get sampled more frequently.
  We only train the embedding and the last two layers of the model.
  We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
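
The sampling strategy referenced in the hunk above draws languages from a multinomial whose probabilities are the corpus proportions raised to a power alpha < 1, which flattens the distribution so high-resource languages are down-sampled and low-resource ones up-sampled. A rough sketch follows; the alpha value (0.5 here) and the function name are assumptions, since the card does not state them.

```python
# Illustrative sketch of exponentially smoothed language sampling
# (Conneau and Lample, 2019). The alpha value is an assumption.
import random


def language_sampling_probs(sentence_counts, alpha=0.5):
    """Smoothed multinomial: q_i proportional to p_i ** alpha (alpha < 1)."""
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Toy example: the high-resource language gets a smaller share than its raw
# proportion (~99%), the low-resource one a larger share (~9% instead of ~1%).
counts = {"en": 1_000_000, "sw": 10_000}
probs = language_sampling_probs(counts)
lang = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```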
 
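For the partial training mentioned above ("We only train the embedding and the last two layers of the model"), a possible setup with the Hugging Face transformers API is sketched below; whether the LM head also stays trainable is an assumption.

```python
# Minimal sketch: freeze everything in XLM-R-base except the embeddings and
# the last two encoder layers. Keeping the LM head trainable is an assumption.
from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Freeze everything first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze the embeddings and the last two transformer layers.
trainable_modules = [model.roberta.embeddings,
                     *model.roberta.encoder.layer[-2:],
                     model.lm_head]
for module in trainable_modules:
    for param in module.parameters():
        param.requires_grad = True

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {num_trainable:,}")
```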
 
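Finally, the validation step mentioned in the hunk (100 held-out sentences per language, evaluated every 10K steps) computes perplexity as the exponential of the mean masked-LM cross-entropy. A rough sketch, assuming a torch DataLoader that yields already-masked batches with `labels`; all names are illustrative.

```python
# Rough sketch: validation perplexity as exp(mean cross-entropy loss).
import math

import torch


@torch.no_grad()
def validation_perplexity(model, val_loader, device="cuda"):
    """Perplexity over the validation batches.

    Approximation: averages per-batch losses rather than weighting by the
    number of masked tokens in each batch.
    """
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # labels present -> outputs.loss is the MLM loss
        total_loss += outputs.loss.item()
        num_batches += 1
    return math.exp(total_loss / num_batches)
```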