fenchri committed
Commit 89848d6
1 Parent(s): 4a3c72a

Update README.md

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -48,6 +48,9 @@ language:
 This model has been trained on the EntityCS corpus, a multilingual corpus from Wikipedia with entities replaced in different languages.
 The corpus can be found at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); check the link for more details.
 
+Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
+with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
+
 To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
 to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different
 languages.
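
For illustration, below is a minimal sketch of the conventional 80-10-10 masking the added paragraph describes, in the style of Hugging Face's MLM data collators. The function name, the precomputed boolean `special_tokens_mask`, and the tensor shapes are illustrative assumptions, not this repository's actual training code:

```python
import torch

def mask_tokens_80_10_10(input_ids, mask_token_id, vocab_size, special_tokens_mask):
    """80-10-10 MLM masking sketch: 15% of subwords become candidates;
    of those, 80% -> [MASK], 10% -> Random subword, 10% -> Same (unchanged).

    `input_ids` is a (batch, seq_len) LongTensor; `special_tokens_mask`
    is a BoolTensor of the same shape marking positions never to mask.
    """
    labels = input_ids.clone()

    # Sample 15% of positions as masking candidates, never special tokens.
    probability_matrix = torch.full(labels.shape, 0.15)
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    candidates = torch.bernoulli(probability_matrix).bool()
    labels[~candidates] = -100  # only candidates contribute to the MLM loss

    # 80% of candidates are replaced with [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & candidates
    input_ids[replaced] = mask_token_id

    # Half of the remaining 20% (i.e. 10% of candidates) get a Random subword.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & candidates & ~replaced
    random_words = torch.randint(vocab_size, labels.shape, dtype=torch.long)
    input_ids[randomized] = random_words[randomized]

    # The final 10% of candidates stay unchanged (Same).
    return input_ids, labels
```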
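The Entity Prediction objectives come in several variants (see the EntityCS paper for the exact formulations). As a rough sketch only, restricting masking to entity subwords could look like the following, where `entity_spans` (per-sequence (start, end) subword offsets) and `mask_prob` are assumptions about how the corpus annotations are consumed:

```python
import torch

def mask_entity_subwords(input_ids, entity_spans, mask_token_id, mask_prob=1.0):
    """Mask only subwords that belong to an entity, so the model must
    recover the (possibly code-switched) entity from its context.

    Simplified sketch: `entity_spans` holds (start, end) subword offsets
    per sequence, assumed to come from the corpus' Wikipedia anchors;
    `mask_prob` controls what fraction of entities is masked.
    """
    labels = torch.full_like(input_ids, -100)  # non-entity positions are ignored by the loss
    for seq_idx, spans in enumerate(entity_spans):
        for start, end in spans:
            if torch.rand(1).item() < mask_prob:
                labels[seq_idx, start:end] = input_ids[seq_idx, start:end]
                input_ids[seq_idx, start:end] = mask_token_id
    return input_ids, labels
```

Note that, unlike standard MLM where positions are sampled independently, all subwords of a chosen entity are masked together, which is what pushes the model to learn entity-level cross-lingual semantics.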