Commit c93b198 by nicoladecao (1 parent: bd13c54)

Update README.md
Files changed: README.md (+77 −1)
---


# mGENRE

The mGENRE (multilingual Generative ENtity REtrieval) system, as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528), implemented in PyTorch.

In a nutshell, mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking) based on the fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture. mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. The model was first released in the [facebookresearch/GENRE](https://github.com/facebookresearch/GENRE) repository using `fairseq` (the `transformers` models are obtained with a conversion script similar to [this one](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)).
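The idea behind constrained beam search is that, at each decoding step, the set of tokens the beam may emit is restricted to those that continue some valid entity identifier, typically by looking up the partially decoded sequence in a prefix tree (trie). A minimal sketch of that lookup, using a toy nested-`dict` trie over made-up token IDs (not the real mBART vocabulary or the released trie code):

```python
# Toy prefix tree over token-ID sequences, illustrating how a trie lookup
# can drive `prefix_allowed_tokens_fn` in constrained beam search.
# Token IDs here are made up for illustration.

class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def get(self, prefix):
        """Return the token IDs allowed after `prefix` (empty list if invalid)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())

# Two "entity identifiers" encoded as token-ID sequences.
trie = Trie([[5, 7, 2], [5, 9, 2]])

print(trie.get([]))      # [5] -- both identifiers start with token 5
print(trie.get([5]))     # [7, 9] -- two valid continuations
print(trie.get([5, 7]))  # [2]
print(trie.get([1]))     # [] -- not a valid prefix
```

During generation, the beam score of any token outside the returned set is effectively masked, so the decoder can only ever emit complete, valid identifiers.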

This model was trained on 105 languages from Wikipedia.

## BibTeX entry and citation info

**Please consider citing our works if you use code from this repository.**

```bibtex
@article{decao2020multilingual,
    author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel
        and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke
        and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
    title = "{Multilingual Autoregressive Entity Linking}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {274-290},
    year = {2022},
    month = {03},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00460},
    url = {https://doi.org/10.1162/tacl\_a\_00460},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00460/2004070/tacl\_a\_00460.pdf},
}
```

## Usage

Here is an example of generation for Wikipedia page disambiguation:

```python
from transformers import AutoModelForSeq2SeqLM, XLMRobertaTokenizer

# OPTIONAL: load the prefix tree (trie). You need to additionally download
# https://huggingface.co/facebook/mgenre-wiki/blob/main/trie.py and
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_trie_with_redirect.pkl
# This trie is fast but memory-inefficient: it is implemented with nested python `dict`s.
# NOTE: loading this map may take up to 10 minutes and occupy a lot of RAM!
# import pickle
# from trie import Trie
# with open("titles_lang_all105_trie_with_redirect.pkl", "rb") as f:
#     trie = Trie.load_from_dict(pickle.load(f))

# Alternatively, use a memory-efficient but slightly slower trie implemented
# with `marisa_trie`, available from
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_marisa_trie_with_redirect.pkl
# import pickle
# from genre.trie import MarisaTrie
# with open("titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
#     trie = pickle.load(f)

tokenizer = XLMRobertaTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

# Italian for "[START] Einstein [END] was a German physicist."
sentences = ["[START] Einstein [END] era un fisico tedesco."]

outputs = model.generate(
    **tokenizer(sentences, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=5,
    # OPTIONAL: use constrained beam search
    # prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)

tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
which outputs the following top-5 predictions (using constrained beam search):

```
['Albert Einstein >> it',
 'Albert Einstein (disambiguation) >> en',
 'Alfred Einstein >> it',
 'Alberto Einstein >> it',
 'Einstein >> it']
```
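
Each prediction pairs a Wikipedia page title with a two-letter language code, separated by `>>`. A small helper to split predictions into `(title, language)` pairs (illustrative only, not part of the released mGENRE code):

```python
# Split mGENRE-style predictions of the form "Title >> lang" into
# (title, language) pairs. rpartition splits at the LAST occurrence of
# " >> ", so titles that themselves contain the separator stay intact.

def parse_prediction(pred: str) -> tuple[str, str]:
    title, _, lang = pred.rpartition(" >> ")
    return title, lang

predictions = [
    "Albert Einstein >> it",
    "Albert Einstein (disambiguation) >> en",
]

for title, lang in map(parse_prediction, predictions):
    print(f"{lang}: {title}")
```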