Commit c93b198 by nicoladecao (1 parent: bd13c54)

Update README.md
Files changed: README.md (+77 −1)
---


# mGENRE

The mGENRE (multilingual Generative ENtity REtrieval) system, as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528), implemented in PyTorch.

In a nutshell, mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking) based on the fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture. mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. The model was first released in the [facebookresearch/GENRE](https://github.com/facebookresearch/GENRE) repository using `fairseq` (the `transformers` models are obtained with a conversion script similar to [this one](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)).
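The idea behind constrained beam search is that, at each decoding step, the set of tokens the beam may emit is restricted to those that continue some valid entity identifier, typically by looking up the partially decoded sequence in a prefix tree (trie). A minimal sketch of that lookup, using a toy nested-`dict` trie over made-up token IDs (not the real mBART vocabulary or the released trie code):

```python
# Toy prefix tree over token-ID sequences, illustrating how a trie lookup
# can drive `prefix_allowed_tokens_fn` in constrained beam search.
# Token IDs here are made up for illustration.

class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def get(self, prefix):
        """Return the token IDs allowed after `prefix` (empty list if invalid)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())

# Two "entity identifiers" encoded as token-ID sequences.
trie = Trie([[5, 7, 2], [5, 9, 2]])

print(trie.get([]))      # [5] -- both identifiers start with token 5
print(trie.get([5]))     # [7, 9] -- two valid continuations
print(trie.get([5, 7]))  # [2]
print(trie.get([1]))     # [] -- not a valid prefix
```

During generation, the beam score of any token outside the returned set is effectively masked, so the decoder can only ever emit complete, valid identifiers.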

This model was trained on 105 languages from Wikipedia.

## BibTeX entry and citation info

**Please consider citing our works if you use code from this repository.**

```bibtex
@article{decao2020multilingual,
    author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel
        and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke
        and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
    title = "{Multilingual Autoregressive Entity Linking}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {274-290},
    year = {2022},
    month = {03},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00460},
    url = {https://doi.org/10.1162/tacl\_a\_00460},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00460/2004070/tacl\_a\_00460.pdf},
}
```

## Usage

Here is an example of generation for Wikipedia page disambiguation:

```python
from transformers import AutoModelForSeq2SeqLM, XLMRobertaTokenizer

# OPTIONAL: load the prefix tree (trie). You need to additionally download
# https://huggingface.co/facebook/mgenre-wiki/blob/main/trie.py and
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_trie_with_redirect.pkl
# This trie is fast but memory-inefficient: it is implemented with nested python `dict`s.
# NOTE: loading this map may take up to 10 minutes and occupy a lot of RAM!
# import pickle
# from trie import Trie
# with open("titles_lang_all105_trie_with_redirect.pkl", "rb") as f:
#     trie = Trie.load_from_dict(pickle.load(f))

# Alternatively, use a memory-efficient but slightly slower trie implemented
# with `marisa_trie`, available from
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_marisa_trie_with_redirect.pkl
# import pickle
# from genre.trie import MarisaTrie
# with open("titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
#     trie = pickle.load(f)

tokenizer = XLMRobertaTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

# Italian for "[START] Einstein [END] was a German physicist."
sentences = ["[START] Einstein [END] era un fisico tedesco."]

outputs = model.generate(
    **tokenizer(sentences, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=5,
    # OPTIONAL: use constrained beam search
    # prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)

tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
which outputs the following top-5 predictions (using constrained beam search):

```
['Albert Einstein >> it',
 'Albert Einstein (disambiguation) >> en',
 'Alfred Einstein >> it',
 'Alberto Einstein >> it',
 'Einstein >> it']
```
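
Each prediction pairs a Wikipedia page title with a two-letter language code, separated by `>>`. A small helper to split predictions into `(title, language)` pairs (illustrative only, not part of the released mGENRE code):

```python
# Split mGENRE-style predictions of the form "Title >> lang" into
# (title, language) pairs. rpartition splits at the LAST occurrence of
# " >> ", so titles that themselves contain the separator stay intact.

def parse_prediction(pred: str) -> tuple[str, str]:
    title, _, lang = pred.rpartition(" >> ")
    return title, lang

predictions = [
    "Albert Einstein >> it",
    "Albert Einstein (disambiguation) >> en",
]

for title, lang in map(parse_prediction, predictions):
    print(f"{lang}: {title}")
```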