Babelscape/mrebel-large-32

@@ -51,9 +51,9 @@ mREBEL is introduced in the ACL 2023 paper [RED^{FM}: a Filtered and Multilingua
         url = "https://arxiv.org/abs/2306.09802",
     }
-The original repository for the paper can be found [here](https://github.com/Babelscape/rebel)
-Be aware that the inference widget at the right does not output special tokens, which are necessary to distinguish the subject, object and relation types. For a demo of REBEL and its pre-training dataset check the [Spaces demo](https://huggingface.co/spaces/Babelscape/rebel-demo).
 ## Pipeline usage
@@ -146,7 +146,11 @@ def extract_triplets_typed(text):
     return triplets
 # Load model and tokenizer
-tokenizer = AutoTokenizer.from_pretrained("Babelscape/mrebel-large-32", src_lang="en_XX", "tgt_lang": "tp_XX") # Here we set English as source language. To change the source language just change it here or swap the first token of the input for your desired language
 model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large-32")
 gen_kwargs = {
     "max_length": 256,
@@ -166,7 +170,7 @@ model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, re
 generated_tokens = model.generate(
     model_inputs["input_ids"].to(model.device),
     attention_mask=model_inputs["attention_mask"].to(model.device),
-    decoder_start_token_id = self.tokenizer.convert_tokens_to_ids("tp_XX"),
     **gen_kwargs,
 )
@@ -179,6 +183,7 @@ for idx, sentence in enumerate(decoded_preds):
     print(extract_triplets_typed(sentence))
 ```
 ## License
 This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).

         url = "https://arxiv.org/abs/2306.09802",
     }
+The original repository for the paper can be found [here](https://github.com/Babelscape/rebel#REDFM).
+Be aware that the inference widget on the right does not output special tokens, which are needed to distinguish the subject, object, and relation types. For a demo of mREBEL and its pre-training dataset, check the [Spaces demo](https://huggingface.co/spaces/Babelscape/mrebel-demo).
 ## Pipeline usage
     return triplets
 # Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("Babelscape/mrebel-large-32", src_lang="en_XX", tgt_lang="tp_XX")
+# Here we set English ("en_XX") as the source language. To change it, pass another supported language code as src_lang or swap the first token of the input for the desired language code. Catalan ("ca_XX") and Greek ("el_EL") were not included in mBART pretraining, so they need a workaround:
+# tokenizer._src_lang = "ca_XX"
+# tokenizer.cur_lang_code_id = tokenizer.convert_tokens_to_ids("ca_XX")
+# tokenizer.set_src_lang_special_tokens("ca_XX")
 model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large-32")
 gen_kwargs = {
     "max_length": 256,
 generated_tokens = model.generate(
     model_inputs["input_ids"].to(model.device),
     attention_mask=model_inputs["attention_mask"].to(model.device),
+    decoder_start_token_id = tokenizer.convert_tokens_to_ids("tp_XX"),
     **gen_kwargs,
 )
     print(extract_triplets_typed(sentence))
 ```
 ## License
 This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).
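
The tokenizer comment in the diff above mentions swapping the first token of the input to change the source language. A minimal sketch of that idea follows, assuming the language code is the first id in the encoded sequence, as the comment states. The helper name `swap_source_language` and the toy vocabulary with its ids are hypothetical and used only for illustration; they are not part of the model card or of `transformers`.

```python
from typing import Dict, List


def swap_source_language(
    input_ids: List[int], lang_code: str, vocab: Dict[str, int]
) -> List[int]:
    """Return a copy of input_ids whose first token is the id of lang_code.

    Hypothetical helper: assumes the first position holds the language code.
    """
    if not input_ids:
        raise ValueError("input_ids is empty")
    if lang_code not in vocab:
        raise KeyError(f"unknown language code: {lang_code}")
    # Replace only the leading language token; keep the rest untouched.
    return [vocab[lang_code]] + list(input_ids[1:])


# Toy vocabulary with made-up ids, for illustration only.
toy_vocab = {"en_XX": 10, "ca_XX": 11, "el_EL": 12}

# A pretend encoded sentence that starts with the English language code.
english_ids = [toy_vocab["en_XX"], 101, 102, 2]
catalan_ids = swap_source_language(english_ids, "ca_XX", toy_vocab)
print(catalan_ids)
```

With the real tokenizer, the id lookup would presumably go through `tokenizer.convert_tokens_to_ids("ca_XX")`, the same call the diff uses for `"tp_XX"`, instead of a hand-written dictionary.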