flan-rebel-nl

This model is a fine-tuned version of flan-t5-base on the rebel-short dataset. It achieves the following results on the evaluation set:

Loss: 0.1029
Rouge1: 51.5716
Rouge2: 40.2152
Rougel: 49.9941
Rougelsum: 49.9767
Gen Len: 18.5898

Model description

This is a flan-t5-base model fine-tuned on a Dutch version of the dataset used for REBEL: Relation Extraction By End-to-end Language generation. The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch text and triplets were generated with the code released by the original REBEL authors, available at https://github.com/Babelscape/crocodile.
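For orientation, the parser in the usage section implies that the generated text is a linearized sequence of the form `<triplet> head <subj> tail <obj> relation`. The sketch below builds such a string from triplet dictionaries. The function name `linearize_triplets` and the sample triplet are illustrative assumptions and are not taken from the training code.

```python
def linearize_triplets(triplets):
    """Build a '<triplet> head <subj> tail <obj> relation' string (assumed format)."""
    parts = []
    for t in triplets:
        parts.append(f"<triplet> {t['head']} <subj> {t['tail']} <obj> {t['type']}")
    return " ".join(parts)

# Hypothetical example triplet, not actual model output.
example = [{"head": "Nederland", "type": "continent", "tail": "Europa"}]
linearized = linearize_triplets(example)
print(linearized)
```

The parser also accepts several `<subj> ... <obj> ...` pairs after a single `<triplet>` marker, which this simplified sketch does not produce.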

Pipeline usage

The code below is adapted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large.

from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
# Generate the linearized triplets for a Dutch example text.
extracted_text = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee. ", max_length = 512, num_beams = 3, temperature = 1)
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
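# The pipeline returns a list of dicts; the text is stored under "generated_text".
extracted_triplets = extract_triplets(extracted_text[0]["generated_text"])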
print(extracted_triplets)

A trick that may improve results is to extract entities with an NER pipeline and force those entity tokens to appear in the generated output.


from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
ner_extractor = pipeline("ner", "Babelscape/wikineural-multilingual-ner", aggregation_strategy = "simple")

input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."

# Extract the entities to force into the output
ner_output = ner_extractor(input_text)
ents = [i["word"] for i in ner_output]

if len(ents) > 0:
    # force_words_ids triggers constrained beam search, which requires num_beams > 1
    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
    extracted_text = triplet_extractor(input_text, max_length = 512, num_beams = 3, force_words_ids = tokens)
else:
    extracted_text = triplet_extractor(input_text, max_length = 512, temperature = 1)
# extract_triplets is the parser defined in the previous snippet
triplets = extract_triplets(extracted_text[0]["generated_text"])
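An NER pipeline can return the same entity several times, and forcing the same phrase repeatedly adds nothing. Below is a minimal sketch of a clean-up step, assuming the NER output is a list of dicts with a `word` key (as read by the snippet above). The helper name `unique_entities` and the sample data are hypothetical.

```python
def unique_entities(ner_output):
    """Return entity strings in first-seen order, without blanks or duplicates."""
    seen = set()
    ents = []
    for item in ner_output:
        word = item["word"].strip()
        if word and word not in seen:
            seen.add(word)
            ents.append(word)
    return ents

# Hand-written stand-in for the NER pipeline output (illustrative only).
sample_ner_output = [
    {"entity_group": "LOC", "word": "Nederland"},
    {"entity_group": "LOC", "word": "Koninkrijk der Nederlanden"},
    {"entity_group": "LOC", "word": "Nederland"},
    {"entity_group": "LOC", "word": "Europa"},
]
ents = unique_entities(sample_ner_output)
print(ents)
```

The resulting list can be passed to the tokenizer in place of the raw `ents` list.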

Training and evaluation data

The data used to train and evaluate this model was generated with https://github.com/Babelscape/crocodile.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 4
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 5
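
To make the linear schedule concrete, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (the card lists none) and takes the total of 110235 steps from the training results. `linear_lr` is an illustrative helper, not part of the training script.

```python
BASE_LR = 5e-05
TOTAL_STEPS = 110235  # final step reported in the training results

def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Decay linearly from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps

# Learning rate at the start and at the end of epochs 1, 3 and 5.
for step in (0, 22047, 66141, 110235):
    print(step, linear_lr(step))
```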

Training results

Training Loss	Epoch	Step	Validation Loss	Rouge1	Rouge2	Rougel	Rougelsum	Gen Len
0.1256	1.0	22047	0.1206	50.3892	38.2761	48.7657	48.7444	18.6112
0.1091	2.0	44094	0.1112	50.9615	39.2843	49.3865	49.3674	18.5447
0.0875	3.0	66141	0.1047	51.2045	39.7598	49.6483	49.6317	18.5763
0.0841	4.0	88188	0.1036	51.3543	39.9776	49.8528	49.8223	18.6178
0.0806	5.0	110235	0.1029	51.5716	40.2152	49.9941	49.9767	18.5898
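
The step counts also give a rough idea of the training set size. The estimate below assumes a single device, no gradient accumulation, and that the last partial batch of an epoch is kept; none of these are stated in the card.

```python
steps_per_epoch = 22047   # step count after epoch 1 in the table
train_batch_size = 4      # from the hyperparameters
num_epochs = 5

# The last batch of an epoch may hold between 1 and 4 examples,
# so the training set size lies in a small range.
max_train_examples = steps_per_epoch * train_batch_size
min_train_examples = (steps_per_epoch - 1) * train_batch_size + 1
total_steps = steps_per_epoch * num_epochs

print(min_train_examples, max_train_examples, total_steps)
```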

Framework versions

Transformers 4.27.2
Pytorch 1.13.1+cu117
Datasets 2.10.1
Tokenizers 0.12.1
