datasets: https://github.com/omrikeren/ParaShoot/

metrics: f1 49.612 exact_match 26.439

language: he

pipeline_tag: question-answering

license: unknown

mT5-small-Hebrew-ParaShoot-QA

This repository contains an mT5-small (Multilingual Text-to-Text Transfer Transformer) model fine-tuned on the ParaShoot dataset (GitHub: https://github.com/omrikeren/ParaShoot/). To enhance its performance, a "domain-specific" fine-tuning approach was employed: the model was first adapted on a Hebrew dataset to capture Hebrew linguistic nuances, and then further fine-tuned on ParaShoot to improve its proficiency on the question-answering task. This model builds upon the original work by imvladikon, who initially fine-tuned mT5-small for the summarization task.
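As a rough illustration of what the second stage (ParaShoot fine-tuning) could look like, here is a minimal sketch using Hugging Face's Seq2SeqTrainer. The starting checkpoint, data file paths, and hyperparameters below are placeholders and assumptions, not the exact recipe used to produce this model; the question/context encoding mirrors the inference snippet further down.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    MT5ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder: in practice the starting point was a Hebrew-adapted mT5-small checkpoint.
BASE_CHECKPOINT = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
model = MT5ForConditionalGeneration.from_pretrained(BASE_CHECKPOINT)

# ParaShoot uses a SQuAD-like schema (question, context, answers);
# the file paths below are placeholders for the files from the GitHub repo.
raw = load_dataset("json", data_files={"train": "train.jsonl", "validation": "dev.jsonl"})

def preprocess(example):
    # Encode the question/context pair as the source and the first gold
    # answer string as the generation target.
    model_inputs = tokenizer(
        example["question"],
        example["context"],
        max_length=512,
        truncation="only_second",
    )
    labels = tokenizer(text_target=example["answers"]["text"][0], max_length=32, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, remove_columns=raw["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-hebrew-parashoot-qa",
    learning_rate=3e-4,               # assumed hyperparameters
    num_train_epochs=5,
    per_device_train_batch_size=8,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()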

Model Details

Google's mT5

mT5 is pretrained on the mC4 corpus, which covers 101 languages. Note: mT5 was only pretrained on mC4, with no supervised training, so the base model has to be fine-tuned before it is usable on a downstream task.

Related papers:

  • mT5: A massively multilingual pre-trained text-to-text transformer (Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel)
  • Multilingual Sequence-to-Sequence Models for Hebrew NLP (Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty)
  • PARASHOOT: A Hebrew Question Answering Dataset (Omri Keren, Omer Levy)

This model achieves the following results on the test set:

  • Overall F1: 49.612
  • Overall EM: 26.439
  • Loss: 1.346

Note: In the paper "Multilingual Sequence-to-Sequence Models for Hebrew NLP", the reported results were F1 48.71 and EM 24.52.
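F1 and exact match follow the standard SQuAD-style definitions. As a rough sketch (not the exact evaluation script used for the numbers above), such scores can be computed with the Hugging Face evaluate library; the ids and answer strings below are made up for illustration, and the official ParaShoot evaluation script may apply its own normalization, so treat this as an approximation.

import evaluate

squad_metric = evaluate.load("squad")

# One entry per test question: the generated answer string and the gold answer(s).
predictions = [{"id": "q1", "prediction_text": "讻-90 诪讬谞讬诐"}]
references = [{"id": "q1", "answers": {"text": ["讻-90 诪讬谞讬诐"], "answer_start": [0]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}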

How to use the model:

Use the code below to get started with the model.

import torch
from transformers import MT5ForConditionalGeneration, AutoTokenizer

MODEL_NAME = "Livyatan/mT5-small-Hebrew-ParaShoot-QA"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(DEVICE)

def generate_answer(question, context):
    # Encode the question/context pair; the context is truncated if it exceeds max_length.
    input_encoding = tokenizer(
        question,
        context,
        max_length=len(context),
        padding="max_length",
        truncation="only_second",
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt",
    ).to(DEVICE)

    # Generate the answer text (up to 20 tokens) without tracking gradients.
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_encoding["input_ids"],
            attention_mask=input_encoding["attention_mask"],
            max_length=20,
        )

    # Decode the generated token ids back into a string.
    preds = [
        tokenizer.decode(generated_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for generated_id in generated_ids
    ]

    return "".join(preds)

context = '住讚专转 讛诇讜讜讬讬转谞讗讬诐 讻讜诇诇转 讻-90 诪讬谞讬诐, 砖讻讜诇诐 讞讬讬诐 讘讗讜拽讬讬谞讜住讬诐 诪诇讘讚 讞诪讬砖讛 诪讬谞讬 讚讜诇驻讬谞讬诐 讛讞讬讬诐 讘诪讬诐 诪转讜拽讬诐. 讛诇讜讜讬讬转谞讗讬诐 讛讞讬讬诐 诪讞讜诇拽讬诐 诇砖转讬 转转-住讚专讜转: 诇讜讜讬讬转谞讬 诪讝讬驻讜转 (Mysticeti) 讜诇讜讜讬讬转谞讬 砖讬谞讬讬诐 (Odontoceti; 讜讘讛诐 讙诐 讚讜诇驻讬谞讬诐); 讘注讘专 讛转拽讬讬诪讛 转转-住讚专讛 谞讜住驻转: 诇讜讜讬讬转谞讬诐 拽讚讜诪讬诐 (Archaeoceti), 砖谞讻讞讚讛. 讘诪专讘讬转 讛诪拽专讬诐 诇讜讜讬讬转谞讬 讛诪讝讬驻讜转 讙讚讜诇讬诐 讘讗讜驻谉 诪砖诪注讜转讬 诪诇讜讜讬讬转谞讬 讛砖讬谞讬讬诐, 讛拽讟谞讬诐 讜讛诪讛讬专讬诐 讬讜转专, 讜讻诪讛 诪诇讜讜讬讬转谞讬 讛诪讝讬驻讜转 讛诐 诪讘注诇讬 讛讞讬讬诐 讛讙讚讜诇讬诐 讘讬讜转专 讘讻讚讜专 讛讗专抓. 诇讜讜讬讬转谞讬 讛砖讬谞讬讬诐 诪转讗驻讬讬谞讬诐 讘砖讬谞讬讬诐 讞讚讜转, 讜讛诐 爪讬讬讚讬诐 诪讛讬专讬诐 砖谞讬讝讜谞讬诐 诪讚讙讬诐 讜诪讬爪讜专讬诐 讬诪讬讬诐 讗讞专讬诐. 诇注讜诪转诐 诇讜讜讬讬转谞讬 讛诪讝讬驻讜转 讛诐 讞住专讬 砖讬谞讬讬诐 讜讘诪拽讜诐 讝讗转 讬砖 诇讛诐 诪讝讬驻讜转 讗专讜讻讜转 讚诪讜讬讜转 诪住谞谞转, 砖讘注讝专转谉 讛诐 诪住谞谞讬诐 驻诇谞拽讟讜谉 诪讛诪讬诐.'
# The question asks: "How many species does the order Cetacea include?"
question = '讻诪讛 诪讬谞讬诐 讻讜诇诇转 住讚专转 讛诇讜讜讬讬转谞讗讬诐?'
answer = generate_answer(question, context)
print(answer)
>>> '讻-90 诪讬谞讬诐'  # "about 90 species"