fenchri committed on
Commit
c3fa365
1 Parent(s): 89c8c13

Update README.md

  1. README.md +127 -0
README.md CHANGED
@@ -1,3 +1,130 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - af
5
+ - ar
6
+ - bg
7
+ - bn
8
+ - de
9
+ - el
10
+ - en
11
+ - es
12
+ - et
13
+ - eu
14
+ - fa
15
+ - fi
16
+ - fr
17
+ - he
18
+ - hi
19
+ - hu
20
+ - id
21
+ - it
22
+ - ja
23
+ - jv
24
+ - ka
25
+ - kk
26
+ - ko
27
+ - ml
28
+ - mr
29
+ - ms
30
+ - my
31
+ - nl
32
+ - pt
33
+ - ru
34
+ - sw
35
+ - ta
36
+ - te
37
+ - th
38
+ - tl
39
+ - tr
40
+ - ur
41
+ - vi
42
+ - yo
43
+ - zh
44
  ---
45
+
46
+
47
+ # Model Card for EntityCS-39-PEP_MS_MLM-xlmr-base
48
+
49
+ This model was trained on the EntityCS corpus, a multilingual corpus derived from Wikipedia with entities replaced by their counterparts in different languages.
50
+ The corpus is available at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); see the link for more details.
51
+
52
+ To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
53
+ to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different
54
+ languages.
55
+ Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (`WEP`) and Partial Entity Prediction (`PEP`).
56
+
57
+ In WEP, motivated by [Sun et al. (2019)](https://arxiv.org/abs/1904.09223) where whole word masking is also adopted, we consider all the words (and consequently subwords) inside
58
+ an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and
59
+ 20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked
60
+ entity, we do not allow replacing with Random subwords, since it can introduce noise and result
61
+ in the model predicting incorrect entities. After entities are masked, we remove the entity indicators
62
+ `<e>`, `</e>` from the sentences before feeding them to the model.
63
+
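The WEP procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `wep_mask`, the `<mask>` string, and the `predict_kept` flag are assumptions made for the example. In particular, the text does not say whether entities kept intact still serve as prediction targets, so that choice is exposed as a parameter.

```python
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def wep_mask(tokens, p=0.8, predict_kept=True, rng=None):
    """Whole Entity Prediction sketch.

    `tokens` is a list of subwords in which entities are wrapped in the
    indicators "<e>" and "</e>". One decision is drawn per entity: with
    probability `p` every subword of the entity is masked, otherwise the
    entity is kept intact. The indicators are dropped from the output.
    Returns (inputs, labels); labels are None where nothing is predicted.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    inside, mask_entity = False, False
    for tok in tokens:
        if tok == "<e>":
            inside = True
            mask_entity = rng.random() < p  # single draw for the whole entity
            continue  # entity indicators are removed from the model input
        if tok == "</e>":
            inside = False
            continue
        if inside:
            inputs.append(MASK if mask_entity else tok)
            # Assumption: intact entities are predicted only if predict_kept.
            labels.append(tok if (mask_entity or predict_kept) else None)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

No Random replacement branch appears in the sketch, which mirrors the statement that random substitution is not allowed in WEP.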
64
+ For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force
65
+ subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual
66
+ entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. First,
67
+ PEP<sub>MRS</sub> corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining
68
+ subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second
69
+ setting, PEP<sub>MS</sub>, we remove the 10% Random subwords substitution, i.e. we predict the 80% masked
70
+ subwords and 10% Same subwords from the masking candidates. In the third setting, PEP<sub>M</sub>, we
71
+ further remove the 10% Same subwords prediction, essentially predicting only the masked subwords.
72
+
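The three PEP variants above differ only in what happens to the 20% of entity subwords that are not masked. The sketch below is one possible reading of that description, under stated assumptions: the names `PEP_VARIANTS` and `pep_mask`, the `<mask>` string, and the `vocab` argument are hypothetical, and the entity masking probability p discussed for the MLM combination is not modelled here.

```python
import random

MASK = "<mask>"  # assumed mask token string, for illustration only

# (masked, random, same) shares per variant, read off the description above.
PEP_VARIANTS = {
    "MRS": (0.8, 0.1, 0.1),
    "MS": (0.8, 0.0, 0.1),
    "M": (0.8, 0.0, 0.0),
}


def pep_mask(subwords, is_entity, variant="MS", vocab=None, rng=None):
    """Partial Entity Prediction sketch: each entity subword is treated
    independently. Returns (inputs, labels); labels are None where the
    position is not predicted."""
    rng = rng or random.Random()
    p_mask, p_rand, p_same = PEP_VARIANTS[variant]
    if p_rand > 0 and not vocab:
        raise ValueError("a vocabulary is needed for Random replacement")
    inputs = list(subwords)
    labels = [None] * len(subwords)
    for i, (tok, ent) in enumerate(zip(subwords, is_entity)):
        if not ent:
            continue  # non-entity subwords are never candidates here
        r = rng.random()
        if r < p_mask:
            inputs[i], labels[i] = MASK, tok  # masked and predicted
        elif r < p_mask + p_rand:
            inputs[i], labels[i] = rng.choice(vocab), tok  # Random subword
        elif r < p_mask + p_rand + p_same:
            labels[i] = tok  # Same subword: unchanged but predicted
        # otherwise: left unchanged and not predicted
    return inputs, labels
```

With `variant="MS"`, roughly 80% of entity subwords are masked, 10% are predicted unchanged, and the remaining 10% are left out of the loss.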
73
+ Prior work has shown that it is effective to combine
74
+ Entity Prediction with MLM for cross-lingual transfer ([Jiang et al., 2020](https://aclanthology.org/2020.emnlp-main.479/)), so we investigate the
75
+ combination of the Entity Prediction objectives together with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the
76
+ entity masking probability (p) to 50% to roughly keep the same overall masking percentage.
77
+ This results in the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM, PEP<sub>MS</sub> + MLM, PEP<sub>M</sub> + MLM.
78
+
79
+ This model was trained with the **PEP<sub>MS</sub> + MLM** objective on the EntityCS corpus with 39 languages.
80
+
81
+
82
+ ## Model Details
83
+
84
+ ### Training Details
85
+
86
+ We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
87
+ We set the per-GPU batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256 (16 × 2 × 8 GPUs).
88
+ To speed up training, we use fp16 mixed precision.
89
+ We use the sampling strategy proposed by [Conneau and Lample (2019)](), where high-resource languages are down-sampled and
90
+ low-resource languages are sampled more frequently.
91
+ We train only the embeddings and the last two layers of the model.
92
+ We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
93
+
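The language sampling strategy mentioned in the training details can be illustrated with the exponentiated multinomial from Conneau and Lample (2019), where a language with data share p_i is sampled with probability proportional to p_i^alpha. The sketch below is only an illustration: the function name, the sentence counts, and the value `alpha=0.5` are assumptions, since the exponent used for this model is not stated here.

```python
def sampling_probs(sentence_counts, alpha=0.5):
    """Return per-language sampling probabilities q_i proportional to
    p_i ** alpha, where p_i is the language's share of all sentences.
    alpha < 1 down-samples high-resource languages and up-samples
    low-resource ones; alpha = 1 recovers the raw data distribution."""
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical counts, chosen only to make the effect visible.
counts = {"en": 900, "yo": 100}
probs = sampling_probs(counts)  # en drops from 0.90 to 0.75, yo rises from 0.10 to 0.25
```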
94
+ **This checkpoint corresponds to the one with the lowest perplexity on the validation set.**
95
+
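Training only the embeddings and the last two layers, as described in the training details, amounts to freezing every other parameter. A minimal sketch of how this might be done is shown below; it is not the authors' code. It assumes Hugging Face style parameter names such as `roberta.embeddings.word_embeddings.weight` and `roberta.encoder.layer.11...`, treats all embedding parameters as trainable, and leaves the LM head frozen, none of which is confirmed by the text. The helper names are hypothetical.

```python
import re

_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, num_layers=12, last_n=2):
    """Decide from a parameter name whether it should be updated.
    XLM-R base has 12 encoder layers, so the last two are 10 and 11."""
    if "embeddings." in param_name:
        return True
    match = _LAYER_RE.search(param_name)
    return bool(match) and int(match.group(1)) >= num_layers - last_n


def freeze_all_but_embeddings_and_last_layers(model, num_layers=12, last_n=2):
    """Set requires_grad on every parameter of a PyTorch-style model
    (anything exposing named_parameters())."""
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, num_layers, last_n)
```

Filtering by name keeps the sketch independent of a particular model class; an optimizer would then be built only over parameters whose `requires_grad` is true.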
96
+
97
+ ## Usage
98
+
99
+ The current model can be used for further fine-tuning on downstream tasks.
100
+ In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation, Fact Retrieval and Slot Filling.
101
+
102
+ ## How to Get Started with the Model
103
+
104
+ Use the code in the following repository to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
105
+
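For readers who only want to load the checkpoint for fine-tuning, a minimal sketch with the `transformers` library might look like the following. The Hub id in `MODEL_ID` is an assumption inferred from the model card title and the `huawei-noah` organisation, and the helper name is hypothetical; the linked repository remains the reference for the actual training and evaluation scripts.

```python
# Assumed Hub id, inferred from the model card title; adjust if it differs.
MODEL_ID = "huawei-noah/EntityCS-39-PEP_MS_MLM-xlmr-base"


def load_for_token_classification(num_labels, model_id=MODEL_ID):
    """Load the tokenizer and the checkpoint with a fresh token
    classification head (e.g. for NER fine-tuning)."""
    # Imported lazily so the module can be inspected without transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_token_classification(num_labels=9)` would download the weights and attach a randomly initialised classification layer, which then needs task-specific fine-tuning.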
106
+ ## Citation
107
+
108
+ **BibTeX:**
109
+
110
+ ```bibtex
111
+ @inproceedings{whitehouse-etal-2022-entitycs,
112
+ title = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
113
+ author = "Whitehouse, Chenxi and
114
+ Christopoulou, Fenia and
115
+ Iacobacci, Ignacio",
116
+ booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
117
+ month = dec,
118
+ year = "2022",
119
+ address = "Abu Dhabi, United Arab Emirates",
120
+ publisher = "Association for Computational Linguistics",
121
+ url = "https://aclanthology.org/2022.findings-emnlp.499",
122
+ pages = "6698--6714"
123
+ }
124
+ ```
125
+
126
+ ## Model Card Contact
127
+
128
+ [Fenia Christopoulou](mailto:efstathia.christopoulou@huawei.com)
129
+
130
+