---
language:
- ar
- ca
- de
- el
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pl
- pt
- ru
- sv
- vi
- zh
widget:
- text: >-
    The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.
tags:
- seq2seq
- relation-extraction
license: cc-by-nc-sa-4.0
---

# RED^{FM}: a Filtered and Multilingual Relation Extraction Dataset

This is a multilingual version of [REBEL](https://huggingface.co/Babelscape/rebel-large). It can be used as a standalone multilingual Relation Extraction system, or as a pretrained model to be fine-tuned on multilingual Relation Extraction datasets.

mREBEL is introduced in the ACL 2023 paper [RED^{FM}: a Filtered and Multilingual Relation Extraction Dataset](https://github.com/Babelscape/rebel/blob/main/docs/). We present a new multilingual Relation Extraction dataset and train a multilingual version of REBEL, which reframes Relation Extraction as a seq2seq task. The paper can be found [here](https://github.com/Babelscape/rebel/blob/main/docs/). If you use the code or model, please reference this work in your paper:

```bibtex
@inproceedings{huguet-cabot-et-al-2023-red,
    title = "RED^{FM}: a Filtered and Multilingual Relation Extraction Dataset",
    author = "Huguet Cabot, Pere-Llu{\'\i}s and
      Navigli, Roberto",
    booktitle = "ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}
```

The original repository for the paper can be found [here](https://github.com/Babelscape/rebel).

Be aware that the inference widget on the right does not output the special tokens, which are necessary to distinguish the subject, object and relation types. For a demo of REBEL and its pre-training dataset, check the [Spaces demo](https://huggingface.co/spaces/Babelscape/rebel-demo).

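To illustrate why those special tokens matter, here is a minimal, standalone sketch of the linearization the parser below relies on: the generated sequence marks each triplet with `<triplet>` (followed by the head entity), `<subj>` (followed by the tail entity) and `<obj>` (followed by the relation label). The sample string is hypothetical model output for the widget sentence, shown only to demonstrate the format:

```python
import re

# Illustrative sketch only: parse a REBEL-style linearized string, where each
# triplet is "<triplet> head <subj> tail <obj> relation".
def parse_linearized(text):
    triplets = []
    # Strip framing special tokens, then capture the three spans per triplet
    cleaned = text.replace("<s>", "").replace("</s>", "").replace("<pad>", "")
    pattern = re.compile(r"<triplet>(.*?)<subj>(.*?)<obj>(.*?)(?=<triplet>|$)", re.S)
    for head, tail, relation in pattern.findall(cleaned):
        triplets.append({"head": head.strip(), "type": relation.strip(), "tail": tail.strip()})
    return triplets

# Hypothetical decoded output for the widget sentence
sample = "<s><triplet> Red Hot Chili Peppers <subj> Los Angeles <obj> location of formation</s>"
print(parse_linearized(sample))
# [{'head': 'Red Hot Chili Peppers', 'type': 'location of formation', 'tail': 'Los Angeles'}]
```

Decoding with `skip_special_tokens=True` would collapse this structure into plain words, which is why the examples below decode with the special tokens kept.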
## Pipeline usage

```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/mrebel-large', tokenizer='Babelscape/mrebel-large')
# We need to decode with the tokenizer manually since we need the special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor("The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.", return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])

# Function to parse the generated text and extract the triplets.
# Tokens after <triplet> are the head entity, after <subj> the tail entity,
# and after <obj> the relation label.
def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
```

## Model and Tokenizer using transformers

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Function to parse the generated text and extract the triplets.
# Tokens after <triplet> are the head entity, after <subj> the tail entity,
# and after <obj> the relation label.
def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/mrebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Text to extract triplets from
text = 'The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.'

# Tokenize text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors='pt')

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(model.device),
    attention_mask=model_inputs["attention_mask"].to(model.device),
    **gen_kwargs,
)

# Decode with the special tokens kept, since they delimit the triplets
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

# Extract triplets
for idx, sentence in enumerate(decoded_preds):
    print(f'Prediction triplets sentence {idx}')
    print(extract_triplets(sentence))
```
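
Since `num_return_sequences=3` yields one decoded string per beam, the same triplet often appears in several predictions. A small helper (a sketch, not part of the original card; the sample data below is hypothetical) can merge the per-beam results and drop duplicates:

```python
# Sketch: flatten per-beam triplet lists, keeping the first occurrence of each
# unique (head, type, tail) combination.
def dedupe_triplets(predictions):
    seen = set()
    merged = []
    for triplets in predictions:
        for t in triplets:
            key = (t['head'], t['type'], t['tail'])
            if key not in seen:
                seen.add(key)
                merged.append(t)
    return merged

# Hypothetical per-beam outputs of extract_triplets above
predictions = [
    [{'head': 'Red Hot Chili Peppers', 'type': 'location of formation', 'tail': 'Los Angeles'}],
    [{'head': 'Red Hot Chili Peppers', 'type': 'location of formation', 'tail': 'Los Angeles'},
     {'head': 'Red Hot Chili Peppers', 'type': 'has part', 'tail': 'Flea'}],
]
print(dedupe_triplets(predictions))
# Two unique triplets remain.
```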