Fine-tuning mrebel-large for named graphs/quads

#4
by aschimmenti - opened

Hello,
I was testing the model with sentences like:
"According to Martin Kemp, Leonardo was the sole author of Salvator Mundi, while according to Jacques Franck the Salvator Mundi was the result of a collaboration between Leonardo da Vinci and Salai"
And got the following output:
[{'head': 'Salvator Mundi', 'head_type': 'media', 'type': 'author', 'tail': 'Leonardo da Vinci', 'tail_type': 'per'}, {'head': 'Salvator Mundi', 'head_type': 'media', 'type': 'author', 'tail': 'Leonardo da Vinci', 'tail_type': 'per'}, {'head': 'Leonardo da Vinci', 'head_type': 'per', 'type': 'notable work', 'tail': 'Salvator Mundi', 'tail_type': 'media'}, {'head': 'Leonardo da Vinci', 'head_type': 'per', 'type': 'notable work', 'tail': 'Salvator Mundi', 'tail_type': 'media'}]
Technically it's right, but it completely skipped the provenance mentions.
I would like to add this capability through fine-tuning (which would also require substantial changes to the output format), but I don't think I can achieve it with fine-tuning alone. Do you have any suggestions?
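
For context, here is a sketch of how output like the above can be obtained, adapted from the snippet on the Babelscape/mrebel-large model card (the `decoder_start_token_id`, language codes and parsing logic follow that card and should be double-checked against its current version):

```python
from transformers import pipeline

# mREBEL is exposed as a multilingual "translation" pipeline on the model card.
extractor = pipeline(
    "translation_xx_to_yy",
    model="Babelscape/mrebel-large",
    tokenizer="Babelscape/mrebel-large",
)

text = (
    "According to Martin Kemp, Leonardo was the sole author of Salvator Mundi, "
    "while according to Jacques Franck the Salvator Mundi was the result of a "
    "collaboration between Leonardo da Vinci and Salai"
)

# Generate the linearized triplet sequence, keeping the special tokens.
out = extractor(
    text,
    decoder_start_token_id=250058,  # value from the model card; verify for your version
    src_lang="en_XX",               # change to the language of the source text
    tgt_lang="<triplet>",
    return_tensors=True,
    return_text=False,
)
decoded = extractor.tokenizer.batch_decode([out[0]["translation_token_ids"]])[0]

def extract_triplets_typed(seq):
    """Parse the linearized <triplet>/<type> sequence into typed triplet dicts
    (adapted from the parsing function on the model card)."""
    triplets, current = [], "x"
    subject, relation, object_, subject_type, object_type = "", "", "", "", ""
    tokens = (seq.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
                 .replace("tp_XX", "").replace("__en__", "").split())
    for token in tokens:
        if token in ("<triplet>", "<relation>"):
            current = "t"
            if relation:
                triplets.append({"head": subject.strip(), "head_type": subject_type,
                                 "type": relation.strip(), "tail": object_.strip(),
                                 "tail_type": object_type})
                relation = ""
            subject = ""
        elif token.startswith("<") and token.endswith(">"):
            if current in ("t", "o"):  # closes the head span: this token is the head type
                current = "s"
                if relation:
                    triplets.append({"head": subject.strip(), "head_type": subject_type,
                                     "type": relation.strip(), "tail": object_.strip(),
                                     "tail_type": object_type})
                object_ = ""
                subject_type = token[1:-1]
            else:                      # closes the tail span: this token is the tail type
                current = "o"
                object_type = token[1:-1]
                relation = ""
        else:
            if current == "t":
                subject += " " + token
            elif current == "s":
                object_ += " " + token
            elif current == "o":
                relation += " " + token
    if subject and relation and object_ and subject_type and object_type:
        triplets.append({"head": subject.strip(), "head_type": subject_type,
                         "type": relation.strip(), "tail": object_.strip(),
                         "tail_type": object_type})
    return triplets

print(extract_triplets_typed(decoded))
```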

There's already the "performer" relation, which the model emits when a person makes a statement; however, it overrides the stated relationship (Sabrina - is - beautifulGirl), or it produces completely irrational outputs. A couple of examples:
"Marco says Sabrina is a beautiful girl" => [{'head': 'Sabrina', 'head_type': 'media', 'type': 'performer', 'tail': 'Marco', 'tail_type': 'per'}]
"Al Jazeera reports about the IDF occupying Gaza" => [{'head': 'Al Jazeera', 'head_type': 'concept', 'type': 'part of', 'tail': 'IDF', 'tail_type': 'concept'}]

I have access to a big dataset (a few million triples) of statements about statements (mostly conflicting provenance for contested art attributions) that could be leveraged for this purpose.
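
To make the output-format change concrete, this is roughly how I picture linearizing such quads as seq2seq targets. It is purely a hypothetical sketch: the `<source>` marker and the field names are my own invention and not part of mREBEL's current vocabulary.

```python
# Hypothetical linearization of a "statement about a statement" (a quad) for fine-tuning.
# The <source> marker and the dict keys are invented for illustration; mREBEL's actual
# target format only contains <triplet>, entity-type and relation tokens.

def linearize_quad(quad):
    """Turn one attributed statement into a target string for seq2seq training."""
    return (f"<triplet> {quad['head']} <{quad['head_type']}> "
            f"{quad['tail']} <{quad['tail_type']}> {quad['type']} "
            f"<source> {quad['source']} <{quad['source_type']}>")

quad = {
    "head": "Salvator Mundi", "head_type": "media",
    "type": "author",
    "tail": "Leonardo da Vinci", "tail_type": "per",
    "source": "Martin Kemp", "source_type": "per",
}

print(linearize_quad(quad))
# <triplet> Salvator Mundi <media> Leonardo da Vinci <per> author <source> Martin Kemp <per>
```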

Babelscape org

If you want to leverage the pretraining, it is a good idea to preserve the format, but be aware that if you train only on new data with new relation types, you will run into catastrophic forgetting issues, and the model will probably predict only the new relations from the new training data. You can always combine your data with that of SREDFM to avoid that; since you already have triplets, I am sure you could adapt them to follow a similar format.
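
A minimal sketch of what that mixing could look like with the `datasets` library (the dataset id/config, file path and sampling ratio are assumptions; the point is simply to interleave old and new relation types so the original ones are not forgotten):

```python
from datasets import load_dataset, interleave_datasets

# A subset of the original pretraining data (id/config assumed; check the Hub).
sredfm = load_dataset("Babelscape/SREDFM", "en", split="train", streaming=True)

# Your provenance quads, already converted to the same column layout (hypothetical file).
custom = load_dataset("json", data_files="my_quads.jsonl", split="train", streaming=True)

# Keep showing the model the original relations alongside the new ones.
mixed = interleave_datasets([sredfm, custom], probabilities=[0.7, 0.3], seed=42)
```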

About your specific use case/example: "performer" is not what you describe, but rather https://www.wikidata.org/wiki/Property:P175, i.e. the actor, musician, band or other performer associated with a role or musical work, since all training data comes from Wikidata properties.

The two examples you give are indeed errors/irrational outputs, due to the fact that the training data was quite different in nature. One of the shortcomings of REBEL/mREBEL is that it is not trained to give "no prediction" when there is nothing to predict within the relation types it was trained on. Instead, due to its autoregressive nature, it is "biased" (in the sense that at training time it was always given a sequence to generate) towards giving a prediction no matter what, leading to weird outputs. Perplexity can be a good tool to assess the "quality" of the generated sequence, as I expect it to be high for those two examples.
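
For instance, a rough way to get such a score with plain `transformers` (a sketch, not something from the mREBEL codebase; the `decoder_start_token_id` is the value given on the model card):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Babelscape/mrebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large")

def generation_perplexity(text, src_lang="en_XX"):
    """Greedily generate the linearized triplets and return (decoded text, perplexity)."""
    tok.src_lang = src_lang                 # mBART-50-style source language code
    inputs = tok(text, return_tensors="pt")
    out = model.generate(
        **inputs,
        decoder_start_token_id=250058,      # from the model card; verify for your version
        num_beams=1,
        max_length=256,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Per-token log-probabilities of the generated sequence.
    scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
    ppl = torch.exp(-scores.mean()).item()
    return tok.batch_decode(out.sequences)[0], ppl

for s in ["Marco says Sabrina is a beautiful girl",
          "Al Jazeera reports about the IDF occupying Gaza"]:
    decoded, ppl = generation_perplexity(s)
    print(round(ppl, 2), decoded)
```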

Hope this helps,
Pere-Lluis.
