huawei-noah/EntityCS-39-WEP-xlmr-base

 ---
 license: apache-2.0
+language:
+- af
+- ar
+- bg
+- bn
+- de
+- el
+- en
+- es
+- et
+- eu
+- fa
+- fi
+- fr
+- he
+- hi
+- hu
+- id
+- it
+- ja
+- jv
+- ka
+- kk
+- ko
+- ml
+- mr
+- ms
+- my
+- nl
+- pt
+- ru
+- sw
+- ta
+- te
+- th
+- tl
+- tr
+- ur
+- vi
+- yo
+- zh
 ---
+# Model Card for EntityCS-39-WEP-xlmr-base
+This model has been trained on the EntityCS corpus, a multilingual corpus from Wikipedia with replaced entities in different languages.
+The corpus is available at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); see the link for more details.
+First, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. Of those candidates, we replace subwords
+with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
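The 80-10-10 scheme described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `mlm_mask`, the toy vocabulary, and the choice to select candidates independently per subword (rather than picking exactly 15% of each sentence) are assumptions made for the example. The `[MASK]` string follows the card's notation.

```python
import random

MASK = "[MASK]"  # mask symbol as written in the card


def mlm_mask(tokens, vocab, candidate_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) under the 80-10-10 scheme.

    labels[i] is the original subword at candidate positions, None elsewhere.
    Assumption: each subword becomes a candidate independently.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= candidate_prob:
            # Not a masking candidate: keep as-is, nothing to predict.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: Random subword
        else:
            corrupted.append(tok)                # 10%: Same (unchanged)
    return corrupted, labels
```

Positions whose label is `None` would be excluded from the loss; only the candidate positions are predicted.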
+To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
+to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different
+languages.
+Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (`WEP`) and Partial Entity Prediction (`PEP`).
+In WEP, motivated by [Sun et al. (2019)](https://arxiv.org/abs/1904.09223) where whole word masking is also adopted, we consider all the words (and consequently subwords) inside
+an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and
+20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked
+entity, we do not allow replacing with Random subwords, since it can introduce noise and result
+in the model predicting incorrect entities. After entities are masked, we remove the entity indicators
+`<e>`, `</e>` from the sentences before feeding them to the model.
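As a rough sketch of WEP, the function below takes a subword sequence in which entities are wrapped in `<e>` and `</e>`, decides once per entity whether to mask it, and drops the indicators from the output. The name `wep_mask` is hypothetical, and the sketch assigns prediction targets only to masked subwords; whether the entities left intact also contribute to the loss is not stated here, so treat that as a simplifying assumption.

```python
import random

MASK = "[MASK]"  # mask symbol as written in the card


def wep_mask(tokens, mask_prob=0.8, rng=None):
    """Whole Entity Prediction sketch: mask all subwords of an entity or none.

    Returns (corrupted_tokens, labels) with the <e>, </e> indicators removed.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    inside_entity = False
    mask_this_entity = False
    for tok in tokens:
        if tok == "<e>":
            inside_entity = True
            # One decision per entity, shared by all of its subwords.
            mask_this_entity = rng.random() < mask_prob
            continue
        if tok == "</e>":
            inside_entity = False
            continue
        if inside_entity and mask_this_entity:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels
```

For example, `wep_mask(["The", "<e>", "New", "York", "</e>", "subway"], mask_prob=1.0)` masks both entity subwords and leaves the other words untouched. No Random replacement branch exists, matching the description above.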
+For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force
+subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual
+entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. The first,
+PEP<sub>MRS</sub>, corresponds to the conventional 80-10-10 masking strategy, where 10% of the candidate
+subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second
+setting, PEP<sub>MS</sub>, we remove the 10% Random subword substitution, i.e., we predict the 80% masked
+subwords and the 10% Same subwords from the masking candidates. In the third setting, PEP<sub>M</sub>, we
+further remove the 10% Same subword prediction, so only the masked subwords are predicted.
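The three PEP settings differ only in what happens to the 20% of entity subwords that are not masked. A minimal sketch, assuming a hypothetical `pep_mask` helper, a toy vocabulary, and that the `<e>`, `</e>` indicators are stripped as in WEP:

```python
import random

MASK = "[MASK]"  # mask symbol as written in the card

# (mask, random, same) probabilities per entity subword. Any leftover
# probability mass means "left unchanged and not predicted".
PEP_VARIANTS = {
    "MRS": (0.8, 0.1, 0.1),
    "MS": (0.8, 0.0, 0.1),
    "M": (0.8, 0.0, 0.0),
}


def pep_mask(tokens, vocab, variant="MRS", rng=None):
    """Partial Entity Prediction sketch: each entity subword is treated independently."""
    rng = rng or random.Random()
    p_mask, p_rand, p_same = PEP_VARIANTS[variant]
    corrupted, labels = [], []
    inside_entity = False
    for tok in tokens:
        if tok in ("<e>", "</e>"):
            inside_entity = tok == "<e>"
            continue
        if not inside_entity:
            corrupted.append(tok)
            labels.append(None)
            continue
        roll = rng.random()
        if roll < p_mask:
            corrupted.append(MASK)               # masked and predicted
            labels.append(tok)
        elif roll < p_mask + p_rand:
            corrupted.append(rng.choice(vocab))  # Random subword, predicted
            labels.append(tok)
        elif roll < p_mask + p_rand + p_same:
            corrupted.append(tok)                # Same subword, predicted
            labels.append(tok)
        else:
            corrupted.append(tok)                # unchanged, not predicted
            labels.append(None)
    return corrupted, labels
```

Encoding the variants as a probability table keeps the masking loop identical across PEP<sub>MRS</sub>, PEP<sub>MS</sub> and PEP<sub>M</sub>.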
+Prior work has shown that combining
+Entity Prediction with MLM is effective for cross-lingual transfer ([Jiang et al., 2020](https://aclanthology.org/2020.emnlp-main.479/)); we therefore investigate the
+combination of the Entity Prediction objectives with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the
+entity masking probability (p) to 50% to roughly keep the same overall masking percentage.
+This results in the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM, PEP<sub>MS</sub> + MLM, and PEP<sub>M</sub> + MLM.
+This model was trained with the **WEP** objective on the EntityCS corpus with 39 languages.
+## Model Details
+### Training Details
+We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
+We set the per-GPU batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256 (16 × 2 × 8 GPUs).
+For speedup we use fp16 mixed precision.
+We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high-resource languages are down-sampled and
+low-resource languages are sampled more frequently.
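In Conneau and Lample (2019), languages are sampled from a multinomial with probabilities q_i = p_i^α / Σ_j p_j^α, where p_i is language i's share of the data and α < 1 flattens the distribution. The sketch below computes these probabilities; the function name, the sentence counts, and the value of α are illustrative only, since the α used for this model is not given here.

```python
def language_sampling_probs(sentence_counts, alpha):
    """Exponentially smoothed sampling probabilities per language.

    sentence_counts: dict mapping language code -> number of sentences.
    alpha: smoothing exponent; alpha=1 keeps the raw proportions,
           smaller values up-weight low-resource languages.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical counts, for illustration only.
example = language_sampling_probs({"en": 1_000_000, "yo": 10_000}, alpha=0.5)
```

With these made-up counts, the low-resource language receives a noticeably larger share than its raw proportion of the data, and the high-resource language a smaller one.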
+We only train the embedding and the last two layers of the model.
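One way to express this kind of partial fine-tuning is a filter over parameter names. The helper below is a sketch under assumptions: `is_trainable` is a hypothetical name, the name patterns follow the Hugging Face XLM-R naming (`embeddings.*`, `encoder.layer.<i>.*`), and XLM-R-base is taken to have 12 encoder layers. How the LM head was handled is not stated here; this sketch leaves it frozen.

```python
import re


def is_trainable(param_name, num_layers=12, num_top_layers=2):
    """Return True for embedding parameters and the top encoder layers.

    Assumes Hugging Face-style parameter names such as
    'roberta.embeddings.word_embeddings.weight' or
    'roberta.encoder.layer.11.output.dense.weight'.
    """
    if re.search(r"(^|\.)embeddings\.", param_name):
        return True
    match = re.search(r"encoder\.layer\.(\d+)\.", param_name)
    # Layers are 0-indexed, so the last two of 12 are layers 10 and 11.
    return bool(match) and int(match.group(1)) >= num_layers - num_top_layers
```

With a PyTorch model, the filter could be applied as `for name, p in model.named_parameters(): p.requires_grad = is_trainable(name)`.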
+We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
+**This checkpoint corresponds to the one with the lowest perplexity on the validation set.**
+## Usage
+The current model can be used for further fine-tuning on downstream tasks.
+In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation, Fact Retrieval and Slot Filling.
+## How to Get Started with the Model
+Code for getting started with the model is available at: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
+## Citation
+**BibTeX:**
+```bibtex
+@inproceedings{whitehouse-etal-2022-entitycs,
+    title = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
+    author = "Whitehouse, Chenxi  and
+      Christopoulou, Fenia  and
+      Iacobacci, Ignacio",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.findings-emnlp.499",
+    pages = "6698--6714"
+}
+```
+## Model Card Contact
+[Fenia Christopoulou](mailto:efstathia.christopoulou@huawei.com)