PereLluis13 committed on
Commit
7e209f6
1 Parent(s): 080970f

Create README.md

Files changed (1): README.md added (+158 lines)
---
language:
- ar
- ca
- de
- en
- es
- fa
- fr
- it
- ko
- nl
- pl
- pt
- ru
- sv
- uk
widget:
- text: >-
    The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.
tags:
- seq2seq
- relation-extraction
license: cc-by-nc-sa-4.0
---
# RED^{FM}: a Filtered and Multilingual Relation Extraction Dataset

This is a multilingual version of [REBEL](https://huggingface.co/Babelscape/rebel-large). It can be used as a standalone multilingual Relation Extraction system, or as a pretrained model to be fine-tuned on multilingual Relation Extraction datasets.
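
As a rough illustration of what fine-tuning could look like, here is a minimal sketch using `Seq2SeqTrainer`. This is not the training code from the paper: the toy dataset, the column names and the linearized target format are assumptions (mREBEL's actual training targets may also contain language and entity-type tokens), and the hyperparameters are placeholders.

```python
# Minimal, illustrative fine-tuning sketch; data format and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Babelscape/mrebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large")

# Toy dataset: each example pairs a sentence with a linearized triplet string.
train_data = Dataset.from_dict({
    "text": ["The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons."],
    "triplets": ["<triplet> Red Hot Chili Peppers <subj> Los Angeles <obj> location of formation"],
})

def preprocess(examples):
    # Tokenize the input sentences and the linearized triplet targets.
    model_inputs = tokenizer(examples["text"], max_length=256, truncation=True)
    labels = tokenizer(text_target=examples["triplets"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="mrebel-finetuned",
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```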

mREBEL is introduced in the ACL 2023 paper [RED^{FM}: a Filtered and Multilingual Relation Extraction Dataset](https://github.com/Babelscape/rebel/blob/main/docs/). We present a new multilingual Relation Extraction dataset and train a multilingual version of REBEL, which reframed Relation Extraction as a seq2seq task. The paper can be found [here](https://github.com/Babelscape/rebel/blob/main/docs/). If you use the code or model, please reference this work in your paper:

    @inproceedings{huguet-cabot-et-al-2023-red,
        title = "RED^{FM}: a Filtered and Multilingual Relation Extraction Dataset",
        author = "Huguet Cabot, Pere-Llu{\'\i}s and
          Navigli, Roberto",
        booktitle = "ACL 2023",
        month = jul,
        year = "2023",
        address = "Toronto, Canada",
        publisher = "Association for Computational Linguistics",
    }

The original repository for the paper can be found [here](https://github.com/Babelscape/rebel).

Be aware that the inference widget on the right does not output the special tokens that are needed to distinguish the subject, object and relation types. For a demo of REBEL and its pre-training dataset, check out the [Spaces demo](https://huggingface.co/spaces/Babelscape/rebel-demo).

## Pipeline usage

```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/mrebel-large', tokenizer='Babelscape/mrebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor("The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.", return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])

# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                # Tokens after <triplet> form the head (subject) entity
                subject += ' ' + token
            elif current == 's':
                # Tokens after <subj> form the tail (object) entity
                object_ += ' ' + token
            elif current == 'o':
                # Tokens after <obj> form the relation label
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
```
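
The decoded string is a linearized sequence of special tokens and surface text. As a purely illustrative example of the format that `extract_triplets` parses (the relation label below is made up, and the actual mREBEL output may additionally contain language or entity-type markers that this simple parser does not handle):

```python
# Hypothetical linearized string, used only to show how extract_triplets reads it.
example = "<s><triplet> Red Hot Chili Peppers <subj> Los Angeles <obj> location of formation </s>"
print(extract_triplets(example))
# [{'head': 'Red Hot Chili Peppers', 'type': 'location of formation', 'tail': 'Los Angeles'}]
```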

## Model and Tokenizer using transformers

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                # Tokens after <triplet> form the head (subject) entity
                subject += ' ' + token
            elif current == 's':
                # Tokens after <subj> form the tail (object) entity
                object_ += ' ' + token
            elif current == 'o':
                # Tokens after <obj> form the relation label
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/mrebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Text to extract triplets from
text = 'The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.'

# Tokenize text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors='pt')

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(model.device),
    attention_mask=model_inputs["attention_mask"].to(model.device),
    **gen_kwargs,
)

# Extract text
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

# Extract triplets
for idx, sentence in enumerate(decoded_preds):
    print(f'Prediction triplets sentence {idx}')
    print(extract_triplets(sentence))
```
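
Since `num_return_sequences` is set to 3, the loop above prints three candidate decodings for the input sentence. If a single set of triplets is preferred, one simple post-processing option (not prescribed by the paper) is to merge the parsed triplets and drop exact duplicates:

```python
# Merge triplets from all returned sequences and remove exact duplicates.
# This is just one possible post-processing choice, not part of the original paper.
unique_triplets = []
seen = set()
for sentence in decoded_preds:
    for triplet in extract_triplets(sentence):
        key = (triplet['head'], triplet['type'], triplet['tail'])
        if key not in seen:
            seen.add(key)
            unique_triplets.append(triplet)
print(unique_triplets)
```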