---
license: apache-2.0
language:
- af
- ar
- bg
- bn
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- he
- hi
- hu
- id
- it
- ja
- jv
- ka
- kk
- ko
- ml
- mr
- ms
- my
- nl
- pt
- ru
- sw
- ta
- te
- th
- tl
- tr
- ur
- vi
- yo
- zh
---


# Model Card for EntityCS-39-PEP_MS_MLM-xlmr-base

This model has been trained on the EntityCS corpus, an English corpus from Wikipedia in which entities have been replaced with their counterparts in other languages.
The corpus can be found at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); see the link for more details.

First, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. Of those, we replace subwords with `[MASK]` 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
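
A minimal sketch of this 80-10-10 assignment, assuming the masking candidates have already been sampled; the helper below is illustrative and not taken from the training code:

```python
import random

def mask_80_10_10(token_ids, candidate_positions, mask_id, vocab_size):
    """Apply the conventional 80-10-10 scheme to pre-selected masking candidates."""
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position ignored by the MLM loss
    for pos in candidate_positions:           # the 15% of subwords chosen as candidates
        labels[pos] = token_ids[pos]          # the original subword is always the target
        r = random.random()
        if r < 0.8:                           # 80%: replace with [MASK]
            input_ids[pos] = mask_id
        elif r < 0.9:                         # 10%: replace with a Random subword
            input_ids[pos] = random.randrange(vocab_size)
        # remaining 10%: keep the subword unchanged (Same)
    return input_ids, labels
```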

To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different languages.
Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (`WEP`) and Partial Entity Prediction (`PEP`).

In WEP, motivated by [Sun et al. (2019)](https://arxiv.org/abs/1904.09223), where whole word masking is also adopted, we consider all the words (and consequently subwords) inside an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and 20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked entity, we do not allow replacing with Random subwords, since this can introduce noise and lead the model to predict incorrect entities. After entities are masked, we remove the entity indicators `<e>`, `</e>` from the sentences before feeding them to the model.
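
The WEP decision could be sketched as follows, assuming each entity is given as a list of subword positions; treating unmasked entities as prediction targets is a simplification here, and the `<e>`/`</e>` handling is omitted:

```python
import random

def wep_mask(token_ids, entity_spans, mask_id):
    """Whole Entity Prediction: mask all subwords of an entity together, or none of them."""
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for span in entity_spans:                 # span = subword positions of one entity
        mask_entity = random.random() < 0.8   # 80%: mask every subword of the entity
        for pos in span:
            labels[pos] = token_ids[pos]      # the entity is kept as a prediction target
            if mask_entity:
                input_ids[pos] = mask_id      # no Random replacement is used in WEP
    return input_ids, labels
```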

For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. First, PEP<sub>MRS</sub> corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second setting, PEP<sub>MS</sub>, we remove the 10% Random subword substitution, i.e. we predict the 80% masked subwords and the 10% Same subwords from the masking candidates. In the third setting, PEP<sub>M</sub>, we further remove the 10% Same subword prediction, essentially predicting only the masked subwords.
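
Since this checkpoint was trained with PEP<sub>MS</sub>, here is a corresponding sketch under the same assumptions as above; leaving the final 10% of candidates untouched (rather than replaced) reflects our reading of the description, not the exact implementation:

```python
import random

def pep_ms_mask(token_ids, entity_positions, mask_id):
    """Partial Entity Prediction (MS): mask entity subwords independently, no Random replacement."""
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for pos in entity_positions:              # every subword that lies inside an entity
        r = random.random()
        if r < 0.8:                           # 80%: mask and predict
            labels[pos] = token_ids[pos]
            input_ids[pos] = mask_id
        elif r < 0.9:                         # 10%: keep unchanged but still predict (Same)
            labels[pos] = token_ids[pos]
        # remaining 10%: left untouched, with no Random replacement and no prediction
    return input_ids, labels
```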

Prior work has shown that combining Entity Prediction with MLM is effective for cross-lingual transfer ([Jiang et al., 2020](https://aclanthology.org/2020.emnlp-main.479/)); therefore, we investigate combining the Entity Prediction objectives with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the entity masking probability (p) to 50% to keep roughly the same overall masking percentage.
This results in the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM, PEP<sub>MS</sub> + MLM, PEP<sub>M</sub> + MLM.

This model was trained with the **PEP<sub>MS</sub> + MLM** objective on the EntityCS corpus with 39 languages.


## Model Details

### Training Details

We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
We set the batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256.
For speed-up we use fp16 mixed precision.
We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high-resource languages are down-sampled and low-resource languages are sampled more frequently.
We train only the embeddings and the last two layers of the model.
We randomly choose 100 sentences from each language to serve as a validation set, on which we measure perplexity every 10K training steps.

**This checkpoint corresponds to the one with the lowest perplexity on the validation set.**
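
As an illustration, the partial training setup (updating only the embeddings and the last two layers) could be configured roughly as below; the parameter-name patterns are assumptions based on XLM-R's module naming in `transformers`, not the actual training script:

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# xlm-roberta-base has 12 encoder layers, so layers 10 and 11 are the last two.
trainable = ("embeddings", "encoder.layer.10.", "encoder.layer.11.")
for name, param in model.named_parameters():
    param.requires_grad = any(pattern in name for pattern in trainable)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_trainable:,}")
```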


## Usage

This model can be used for further fine-tuning on downstream tasks.
In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation and Slot Filling.
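
For example, loading the checkpoint for a NER-style fine-tuning run might look as follows; the repository id is assumed from the model name on this card, and the label set is a placeholder:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Repository id assumed from the model name on this card.
model_name = "huawei-noah/EntityCS-39-PEP_MS_MLM-xlmr-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder NER label set for illustration.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Fine-tune on a labelled dataset, e.g. with transformers.Trainer.
```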

Alternatively, it can be used directly (without fine-tuning) for probing tasks, i.e. predicting missing words, as in [X-FACTR](https://aclanthology.org/2020.emnlp-main.479/).

## How to Get Started with the Model

To get started with the model, see the official code repository: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
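
As a quick sanity check, the checkpoint can also be queried directly through the fill-mask pipeline (again assuming the repository id from the model name):

```python
from transformers import pipeline

# Repository id assumed from the model name on this card.
fill_mask = pipeline("fill-mask", model="huawei-noah/EntityCS-39-PEP_MS_MLM-xlmr-base")

# XLM-R-based models use "<mask>" as the mask token.
print(fill_mask("The capital of France is <mask>."))
```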

## Citation

**BibTeX:**

```bibtex
@inproceedings{whitehouse-etal-2022-entitycs,
    title = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
    author = "Whitehouse, Chenxi  and
      Christopoulou, Fenia  and
      Iacobacci, Ignacio",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.499",
    pages = "6698--6714"
}
```

## Model Card Contact

[Fenia Christopoulou](mailto:efstathia.christopoulou@huawei.com)