fenchri committed on
Commit c1f1a7d
1 Parent(s): 3f1a185

Update README.md

Files changed (1)
  1. README.md +15 -9
README.md CHANGED
@@ -46,10 +46,13 @@ language:
 
 # Model Card for EntityCS-39-MLM-xlmr-base
 
+ - Paper: https://aclanthology.org/2022.findings-emnlp.499.pdf
+ - Repository: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
+ - Point of Contact: [Fenia Christopoulou](mailto:efstathia.christopoulou@huawei.com), [Chenxi Whitehouse](mailto:chenxi.whitehouse@gmail.com)
+
 This model has been trained on the EntityCS corpus, an English corpus from Wikipedia with replaced entities in different languages.
 The corpus can be found in [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs), check the link for more details.
-
- Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
+ To train models on the corpus, we first employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
 with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
 
 To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
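
For illustration, here is a minimal sketch of the 80-10-10 corruption described in this hunk. It is not the authors' implementation; it assumes the caller supplies tokenised input IDs, a mask token id, the vocabulary size, and a set of special token ids.

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, special_ids=frozenset()):
    """Standard 80-10-10 MLM corruption: 15% of subwords become candidates;
    of those, 80% -> [MASK], 10% -> random subword, 10% -> unchanged (Same)."""
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 positions are ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if tok in special_ids or random.random() >= 0.15:
            continue                   # not selected as a masking candidate
        labels[i] = tok                # the model must recover the original subword
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = mask_id     # Mask: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # Random subword
        # else: leave the subword unchanged (Same)
    return corrupted, labels
```
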
@@ -82,14 +85,12 @@ This results into the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM,
 This model was trained with the **MLM** objective on the EntityCS corpus with 39 languages.
 
 
- ## Model Details
-
- ### Training Details
+ ## Training Details
 
 We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
 We set batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256.
 For speedup we use fp16 mixed precision.
- We use the sampling strategy proposed by [Conneau and Lample (2019)](), where high resource languages are down-sampled and low
+ We use the sampling strategy proposed by [Conneau and Lample (2019)](https://dl.acm.org/doi/pdf/10.5555/3454287.3454921), where high resource languages are down-sampled and low
 resource languages get sampled more frequently.
 We only train the embedding and the last two layers of the model.
 We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
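
A rough sketch of the training setup in this hunk, assuming the Hugging Face transformers library. It is not the authors' training script: it freezes everything in xlm-roberta-base except the embeddings and the last two encoder layers, and builds smoothed language-sampling weights in the spirit of Conneau and Lample (2019). The exponent ALPHA is a placeholder, not a value taken from the paper.

```python
from transformers import AutoModelForMaskedLM

ALPHA = 0.5  # placeholder smoothing exponent; the value used for this model may differ

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Train only the embeddings and the last two encoder layers: freeze everything,
# then re-enable gradients for those modules.
for param in model.parameters():
    param.requires_grad = False
for module in (model.roberta.embeddings, *model.roberta.encoder.layer[-2:]):
    for param in module.parameters():
        param.requires_grad = True

# Effective batch size: 16 (per GPU) x 2 (gradient accumulation) x 8 (GPUs) = 256.

def language_sampling_probs(sentence_counts, alpha=ALPHA):
    """Down-sample high-resource languages: p_i is proportional to q_i ** alpha,
    where q_i is language i's share of the corpus."""
    total = sum(sentence_counts.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(smoothed.values())
    return {lang: w / norm for lang, w in smoothed.items()}
```
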
@@ -104,9 +105,12 @@ In the paper, we focused on entity-related tasks, such as NER, Word Sense Disamb
 
 Alternatively, it can be used directly (no fine-tuning) for probing tasks, i.e. predict missing words, such as [X-FACTR](https://aclanthology.org/2020.emnlp-main.479/).
 
+ For results on each downstream task, please refer to the paper.
+
+
 ## How to Get Started with the Model
 
- Use the code below to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
+ Use the code below to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
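
Beyond the training scripts in the linked repository, the checkpoint can also be queried directly for masked-word prediction. A minimal usage sketch, assuming the hub id matches this card's title (huawei-noah/EntityCS-39-MLM-xlmr-base):

```python
from transformers import pipeline

# Hub id assumed from the card title; adjust if the repository path differs.
fill_mask = pipeline("fill-mask", model="huawei-noah/EntityCS-39-MLM-xlmr-base")

# XLM-R based checkpoints use "<mask>" as the mask token.
print(fill_mask("The capital of France is <mask>."))
```
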
 
 ## Citation
 
@@ -128,6 +132,8 @@ Use the code below to get started with the model: https://github.com/huawei-noah
 }
 ```
 
- ## Model Card Contact
+ **APA:**
 
- [Fenia Christopoulou](mailto:efstathia.christopoulou@huawei.com)
+ ```html
+ Whitehouse, C., Christopoulou, F., & Iacobacci, I. (2022). EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 6698–6714). Association for Computational Linguistics.
+ ```