Commit c93b198 by nicoladecao (parent: bd13c54): Update README.md
---

# mGENRE

The mGENRE (Multilingual Generative ENtity REtrieval) system, as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528), implemented in PyTorch.

In a nutshell, mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking) based on the fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture. mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to only generate valid identifiers. The model was first released in the [facebookresearch/GENRE](https://github.com/facebookresearch/GENRE) repository using `fairseq` (the `transformers` models are obtained with a conversion script similar to [this one](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)).

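The constraint works through a prefix tree (trie) built over all valid entity names: at each decoding step, beam search may only emit tokens that extend some valid name. Here is a minimal, toy sketch of such a dict-based trie (an illustration of the idea, not the repository's implementation):

```python
# Toy sketch of a dict-based prefix tree for constrained decoding.
# get(prefix) returns the token ids allowed after that prefix, so a
# decoder can only ever extend a valid entity name.

class DictTrie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for token in seq:
                node = node.setdefault(token, {})

    def get(self, prefix):
        node = self.root
        for token in prefix:
            if token not in node:
                return []  # prefix is not part of any valid name
            node = node[token]
        return list(node.keys())

# toy token ids standing in for subwords of two entity names (2 = EOS)
trie = DictTrie([[0, 5, 7, 2], [0, 5, 9, 2]])
print(trie.get([0, 5]))  # -> [7, 9]: the only tokens allowed to follow
```

A function with this `get` signature is exactly what `generate`'s `prefix_allowed_tokens_fn` hook (used in the example below) expects to call at each step.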
This model was trained on 105 languages from Wikipedia.

## BibTeX entry and citation info

**Please consider citing our works if you use code from this repository.**

```bibtex
@article{decao2020multilingual,
    author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel
        and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke
        and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
    title = "{Multilingual Autoregressive Entity Linking}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {274-290},
    year = {2022},
    month = {03},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00460},
    url = {https://doi.org/10.1162/tacl\_a\_00460},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00460/2004070/tacl\_a\_00460.pdf},
}
```

## Usage

Here is an example of generation for Wikipedia page disambiguation:

```python
from transformers import AutoModelForSeq2SeqLM, XLMRobertaTokenizer

# OPTIONAL: load the prefix tree (trie); you need to additionally download
# https://huggingface.co/facebook/mgenre-wiki/blob/main/trie.py and
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_trie_with_redirect.pkl
# This is a fast but memory-inefficient prefix tree (trie), implemented with nested python `dict`s.
# NOTE: loading this map may take up to 10 minutes and occupy a lot of RAM!
# import pickle
# from trie import Trie
# with open("titles_lang_all105_trie_with_redirect.pkl", "rb") as f:
#     trie = Trie.load_from_dict(pickle.load(f))

# Or use a memory-efficient but slightly slower prefix tree (trie), implemented with `marisa_trie`, from
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_marisa_trie_with_redirect.pkl
# import pickle
# from genre.trie import MarisaTrie
# with open("titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
#     trie = pickle.load(f)

tokenizer = XLMRobertaTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

sentences = ["[START] Einstein [END] era un fisico tedesco."]
# Italian for "[START] Einstein [END] was a German physicist."

outputs = model.generate(
    **tokenizer(sentences, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=5,
    # OPTIONAL: use constrained beam search
    # prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)

tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

which outputs the following top-5 predictions (using constrained beam search):

```
['Albert Einstein >> it',
 'Albert Einstein (disambiguation) >> en',
 'Alfred Einstein >> it',
 'Alberto Einstein >> it',
 'Einstein >> it']
```
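
Each prediction is a Wikipedia title followed by ` >> ` and the language code of the Wikipedia edition it comes from. A small sketch for splitting predictions into `(title, language)` pairs; `parse_prediction` is a hypothetical helper, not part of the model or this repository:

```python
# Split an mGENRE prediction "Title >> lang" into its parts.
# rpartition splits on the last occurrence, so a " >> " inside the
# title itself (however unlikely) would not break the parse.

def parse_prediction(pred: str):
    title, _, lang = pred.rpartition(" >> ")
    return title, lang

preds = ["Albert Einstein >> it", "Albert Einstein (disambiguation) >> en"]
print([parse_prediction(p) for p in preds])
# -> [('Albert Einstein', 'it'), ('Albert Einstein (disambiguation)', 'en')]
```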