fenchri committed on
Commit
db83a3b
1 Parent(s): addd81c

Update README.md

  1. README.md +128 -0
README.md CHANGED
@@ -1,3 +1,131 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - af
5
+ - ar
6
+ - bg
7
+ - bn
8
+ - de
9
+ - el
10
+ - en
11
+ - es
12
+ - et
13
+ - eu
14
+ - fa
15
+ - fi
16
+ - fr
17
+ - he
18
+ - hi
19
+ - hu
20
+ - id
21
+ - it
22
+ - ja
23
+ - jv
24
+ - ka
25
+ - kk
26
+ - ko
27
+ - ml
28
+ - mr
29
+ - ms
30
+ - my
31
+ - nl
32
+ - pt
33
+ - ru
34
+ - sw
35
+ - ta
36
+ - te
37
+ - th
38
+ - tl
39
+ - tr
40
+ - ur
41
+ - vi
42
+ - yo
43
+ - zh
44
  ---
45
+
46
+
47
+ # Model Card for EntityCS-39-MLM-xlmr-base
48
+
49
+ This model has been trained on the EntityCS corpus, a multilingual corpus from Wikipedia with replaced entities in different languages.
50
+ The corpus can be found at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); see the link for more details.
51
+
52
+ Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
53
+ with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
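
The 80-10-10 scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function name `mlm_mask`, the toy vocabulary, and the use of `None` as the "not predicted" label are assumptions made for the example, and the actual mask token string depends on the tokenizer.

```python
import random


def mlm_mask(tokens, vocab, p=0.15, mask_token="[MASK]", seed=None):
    """Apply 80-10-10 MLM corruption to a list of subwords.

    Returns (corrupted, labels); labels[i] is the original subword when
    position i is a masking candidate, otherwise None (not predicted).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        # Each subword becomes a masking candidate with probability p.
        if rng.random() >= p:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        r = rng.random()
        if r < 0.8:
            corrupted.append(mask_token)          # Mask: 80%
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))   # Random: 10%
        else:
            corrupted.append(tok)                 # Same: 10%
    return corrupted, labels
```

All candidates, including those left unchanged, keep their original subword as the prediction target.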
54
+
55
+ To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
56
+ to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different
57
+ languages.
58
+ Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (`WEP`) and Partial Entity Prediction (`PEP`).
59
+
60
+ In WEP, motivated by [Sun et al. (2019)](https://arxiv.org/abs/1904.09223) where whole word masking is also adopted, we consider all the words (and consequently subwords) inside
61
+ an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and
62
+ 20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked
63
+ entity, we do not allow replacing with Random subwords, since it can introduce noise and result
64
+ in the model predicting incorrect entities. After entities are masked, we remove the entity indicators
65
+ `<e>`, `</e>` from the sentences before feeding them to the model.
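
A rough sketch of WEP, assuming the input is already split into subwords with `<e>` and `</e>` as standalone marker tokens. The function name `wep_mask` is hypothetical, and treating the 20% of entities that are kept intact as prediction targets is an assumption rather than something stated in the text.

```python
import random


def wep_mask(tokens, mask_token="[MASK]", mask_prob=0.8, seed=None):
    """Whole Entity Prediction: mask every subword of an entity, or none.

    Entity indicators <e> and </e> are dropped from the output.
    labels[i] is the original subword for entity positions, else None.
    """
    rng = random.Random(seed)
    out, labels = [], []
    inside = False
    mask_entity = False
    for tok in tokens:
        if tok == "<e>":
            inside = True
            # One decision per entity, shared by all of its subwords.
            mask_entity = rng.random() < mask_prob
            continue
        if tok == "</e>":
            inside = False
            continue
        if inside:
            labels.append(tok)  # assumption: kept entities are still predicted
            out.append(mask_token if mask_entity else tok)
        else:
            labels.append(None)
            out.append(tok)
    return out, labels
```

No Random replacement branch exists here, which mirrors the choice above to avoid noisy entity targets.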
66
+
67
+ For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force
68
+ subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual
69
+ entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. The first,
70
+ PEP<sub>MRS</sub>, corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining
71
+ subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second
72
+ setting, PEP<sub>MS</sub>, we remove the 10% Random subwords substitution, i.e. we predict the 80% masked
73
+ subwords and 10% Same subwords from the masking candidates. In the third setting, PEP<sub>M</sub>, we
74
+ further remove the 10% Same subwords prediction, essentially predicting only the masked subwords.
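
The three PEP settings differ only in how the non-masked 20% of entity subwords is handled, which the sketch below tries to make explicit. It is an illustration under assumptions: the function name `pep_mask` and the `variant` argument are hypothetical, and removing the `<e>`/`</e>` indicators is carried over from the WEP description.

```python
import random


def pep_mask(tokens, vocab, variant="MRS", mask_token="[MASK]", seed=None):
    """Partial Entity Prediction over subwords marked with <e> ... </e>.

    variant "MRS": 80% Mask, 10% Random, 10% Same, all predicted.
    variant "MS":  80% Mask and 10% Same predicted; the rest is left alone.
    variant "M":   only the 80% masked subwords are predicted.
    """
    if variant not in {"MRS", "MS", "M"}:
        raise ValueError(f"unknown PEP variant: {variant}")
    rng = random.Random(seed)
    out, labels = [], []
    inside = False
    for tok in tokens:
        if tok in ("<e>", "</e>"):
            inside = tok == "<e>"
            continue
        if not inside:
            out.append(tok)
            labels.append(None)
            continue
        r = rng.random()  # independent decision per entity subword
        if r < 0.8:
            out.append(mask_token)
            labels.append(tok)
        elif r < 0.9 and variant == "MRS":
            out.append(rng.choice(vocab))  # Random replacement
            labels.append(tok)
        elif r >= 0.9 and variant in ("MRS", "MS"):
            out.append(tok)                # Same, but still predicted
            labels.append(tok)
        else:
            out.append(tok)                # untouched and not predicted
            labels.append(None)
    return out, labels
```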
75
+
76
+ Prior work has shown that combining
76
+ Entity Prediction with MLM is effective for cross-lingual transfer ([Jiang et al., 2020](https://aclanthology.org/2020.emnlp-main.479/)); we therefore investigate the
78
+ combination of the Entity Prediction objectives together with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the
79
+ entity masking probability (p) to 50% to roughly keep the same overall masking percentage.
80
+ This results in the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM, PEP<sub>MS</sub> + MLM, PEP<sub>M</sub> + MLM.
81
+
82
+ This model was trained with the **MLM** objective on the EntityCS corpus with 39 languages.
83
+
84
+
85
+ ## Model Details
86
+
87
+ ### Training Details
88
+
89
+ We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
90
+ We set the batch size to 16 per GPU and gradient accumulation steps to 2, resulting in an effective batch size of 256 (16 × 2 × 8 GPUs).
91
+ For speedup we use fp16 mixed precision.
92
+ We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high resource languages are down-sampled and low
93
+ resource languages get sampled more frequently.
94
+ We only train the embedding and the last two layers of the model.
95
+ We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
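
The language sampling strategy mentioned above rescales each language's share of the data with an exponent `alpha` below 1, which flattens the distribution. The sketch below assumes `alpha=0.5`, the value used by Conneau and Lample (2019); the value used for this model is not stated here, and the sentence counts in the usage note are made up for illustration.

```python
def sampling_probs(sentence_counts, alpha=0.5):
    """Return per-language sampling probabilities q_i ∝ p_i ** alpha,

    where p_i is the language's share of all sentences. With alpha < 1,
    high-resource languages are down-sampled and low-resource ones up-sampled.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}
```

For example, `sampling_probs({"en": 900, "yo": 100})` moves the split from 90/10 to roughly 75/25.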
96
+
97
+ **This checkpoint corresponds to the one with the lowest perplexity on the validation set.**
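
Training only the embeddings and the last two layers, as described in the training details, amounts to selecting a subset of parameters by name. The helper below is a hypothetical sketch: it assumes the parameter naming used by the Hugging Face XLM-R implementation (`...embeddings...`, `...encoder.layer.<i>...`) and the 12 encoder layers of XLM-R-base, and it leaves the LM head out of the selection, which may not match the original setup.

```python
import re


def trainable_parameter_names(param_names, num_layers=12, last_n=2):
    """Select embedding parameters and those of the last `last_n` layers."""
    keep_layers = set(range(num_layers - last_n, num_layers))
    selected = []
    for name in param_names:
        is_embedding = re.search(r"(^|\.)embeddings\.", name) is not None
        layer = re.search(r"encoder\.layer\.(\d+)\.", name)
        in_last_layers = layer is not None and int(layer.group(1)) in keep_layers
        if is_embedding or in_last_layers:
            selected.append(name)
    return selected


# With a loaded PyTorch model, the selection could be applied like this:
#   names = set(trainable_parameter_names(n for n, _ in model.named_parameters()))
#   for n, p in model.named_parameters():
#       p.requires_grad = n in names
```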
98
+
99
+
100
+ ## Usage
101
+
102
+ The current model can be used for further fine-tuning on downstream tasks.
103
+ In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation, Fact Retrieval and Slot Filling.
104
+
105
+ ## How to Get Started with the Model
106
+
107
+ Use the code in the following repository to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
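
For a quick check outside that repository, the checkpoint can probably be loaded with the `transformers` Auto classes. This is a sketch under assumptions: the repository id `huawei-noah/EntityCS-39-MLM-xlmr-base` is inferred from the model card title and the corpus location, not confirmed here, and the helper names are hypothetical. The heavy imports are kept inside the functions so the snippet can be imported without `transformers` installed.

```python
MODEL_ID = "huawei-noah/EntityCS-39-MLM-xlmr-base"  # assumed repository id


def load_entitycs(model_id=MODEL_ID):
    """Load the tokenizer and masked-LM model from the Hugging Face Hub."""
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


def predict_mask(text_with_mask, tokenizer, model):
    """Return the top prediction for the first mask token in the input."""
    inputs = tokenizer(text_with_mask, return_tensors="pt")
    logits = model(**inputs).logits
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_pos = is_mask.nonzero()[0].item()
    top_id = logits[0, mask_pos].argmax(-1).item()
    return tokenizer.decode([top_id])
```

A call such as `predict_mask(f"Paris is the capital of {tokenizer.mask_token}.", tokenizer, model)` would then return the model's top guess for the masked position.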
108
+
109
+ ## Citation
110
+
111
+ **BibTeX:**
112
+
113
+ ```bibtex
114
+ @inproceedings{whitehouse-etal-2022-entitycs,
115
+ title = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
116
+ author = "Whitehouse, Chenxi and
117
+ Christopoulou, Fenia and
118
+ Iacobacci, Ignacio",
119
+ booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
120
+ month = dec,
121
+ year = "2022",
122
+ address = "Abu Dhabi, United Arab Emirates",
123
+ publisher = "Association for Computational Linguistics",
124
+ url = "https://aclanthology.org/2022.findings-emnlp.499",
125
+ pages = "6698--6714"
126
+ }
127
+ ```
128
+
129
+ ## Model Card Contact
130
+
131
+ [Fenia Christopoulou](mailto:efstathia.christopoulou@huawei.com)