## ParsBert Fine-Tuned for Question Answering Task

ParsBERT is a monolingual language model based on Google’s BERT architecture. It is pre-trained on large Persian corpora covering various writing styles and numerous subjects (e.g., scientific, novels, news), with more than 3.9M documents, 73M sentences, and 1.3B words.

In this project I fine-tuned [ParsBert](https://huggingface.co/HooshvareLab/bert-fa-base-uncased) on the [PQuAD dataset](https://huggingface.co/datasets/newsha/PQuAD/tree/main) for the extractive question answering task.

Our source code is [here](https://github.com/pedramyazdipoor/ParsBert_QA_PQuAD).

Paper presenting ParsBert: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515).

Paper presenting the PQuAD dataset: [arXiv:2202.06219](https://arxiv.org/abs/2202.06219).

---

## Introduction

This model is fine-tuned on the PQuAD train set and is ready to use. Training takes a long time, so I am publishing the model to make life easier for those who need it.

## Hyperparameters

I set batch_size to 32 due to the limitations of GPU memory in Google Colab.

```
batch_size = 32,
n_epochs = 2,
max_seq_len = 256,
learning_rate = 5e-5
```

## Performance

Evaluated on the PQuAD Persian test set.

I also trained for more than 2 epochs, but the results got worse.

Our [XLM-Roberta Large](https://huggingface.co/pedramyazdipoor/persian_xlm_roberta_large) outperforms our ParsBert, but it is more than 3 times larger, so comparing the two is not fair.

### Question Answering On Test Set of PQuAD Dataset

| Metric      | Our XLM-Roberta Large | Our ParsBert |
|:-----------:|:---------------------:|:------------:|
| Exact Match | 66.56*                | 47.44        |
| F1          | 87.31*                | 81.96        |
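
For readers unfamiliar with the two metrics in the table, the sketch below shows how Exact Match and token-level F1 are commonly computed for extractive question answering, in the style of SQuAD evaluation. It is an illustration under assumptions, not the script used to produce the numbers above: the helper names (`normalize`, `exact_match`, `f1_score`) and the normalization rules (lowercasing, whitespace tokenization, a small punctuation set) are hypothetical, and the official PQuAD evaluation may normalize Persian text differently.

```python
import string
from collections import Counter

# Assumption: ASCII punctuation plus a few common Persian marks is stripped.
PUNCTUATION = set(string.punctuation) | {"؟", "،", "؛", "«", "»"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus one extra word
# fails Exact Match but still earns partial F1 credit.
print(exact_match("26 سال", "26"))
print(f1_score("26 سال", "26"))
```

This also explains why F1 can sit well above Exact Match: a predicted span that overlaps the reference without matching it exactly scores zero on Exact Match but still contributes to F1.
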
## How to use

### PyTorch

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

path = 'pedramyazdipoor/parsbert_question_answering_PQuAD'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)
```

### TensorFlow

```python
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

path = 'pedramyazdipoor/parsbert_question_answering_PQuAD'
tokenizer = AutoTokenizer.from_pretrained(path)
model = TFAutoModelForQuestionAnswering.from_pretrained(path)
```

## Inference for PyTorch

Inference for TensorFlow is left as an exercise for you :)

There are some considerations for inference:

1) The start index of the answer must not be greater than the end index.
2) The answer span must lie within the context.
3) The selected span must be the most probable choice among the candidate pairs formed from the N highest-scoring start positions and the N highest-scoring end positions.

```python
def generate_indexes(start_logits, end_logits, N, max_index):
    """Return the best (start, end) token pair among the top-N start and end candidates."""
    start_scores = start_logits.tolist()
    end_scores = end_logits.tolist()

    # Pair each token index with its score and sort by descending score.
    sorted_start_list = sorted(enumerate(start_scores), key=lambda x: x[1], reverse=True)
    sorted_end_list = sorted(enumerate(end_scores), key=lambda x: x[1], reverse=True)

    # Default to token 0 unless a valid pair with a higher combined score is found.
    start_idx, end_idx, prob = 0, 0, start_scores[0] + end_scores[0]
    n = min(N, len(start_scores))
    for a in range(n):
        for b in range(n):
            score = sorted_start_list[a][1] + sorted_end_list[b][1]
            if score > prob:
                # The span must be ordered (start <= end) and must end before max_index.
                if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_end_list[b][0] < max_index):
                    prob = score
                    start_idx = sorted_start_list[a][0]
                    end_idx = sorted_end_list[b][0]

    return start_idx, end_idx
```

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)

text = 'سلام من پدرامم 26 سالمه'
question = 'چند سالمه؟'
encoding = tokenizer(text, question,
                     add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)

with torch.no_grad():
    out = model(input_ids=encoding['input_ids'].to(device),
                attention_mask=encoding['attention_mask'].to(device),
                token_type_ids=encoding['token_type_ids'].to(device))

# The code was adapted to generate one answer at a time.
# If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:],
# because the first token ([CLS]) stands for "no answer" and must be compared with the other tokens.
# max_index keeps the chosen span within the context: the end index must be less than the index
# of the first separator token (counted after the first token has been sliced off).
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0].tolist())[1:]
max_index = tokens.index(tokenizer.sep_token)
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, max_index)
print(tokens)
print(tokens[answer_start_index : (answer_end_index + 1)])

# Output as reported in the original card (the exact tokens depend on the tokenizer and inputs, so yours may differ):
# ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']
# ['▁26']
```
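
The tokenizer call above requests `return_offsets_mapping`, which gives a `(char_start, char_end)` pair for each token. A predicted token span can therefore be mapped back to a substring of the original context instead of a list of subword pieces. The sketch below illustrates the idea with hand-built offsets so that it runs without downloading the model. The helpers `whitespace_offsets` and `span_to_text` are hypothetical names introduced for illustration only. With the real tokenizer you would presumably pass `encoding['offset_mapping'][0].tolist()` and add 1 to each index when the first token was sliced off the logits.

```python
import re


def whitespace_offsets(text):
    """Stand-in for a tokenizer's offset mapping: one (start, end) pair per whitespace-separated word."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def span_to_text(context, offsets, start_token, end_token):
    """Return the substring of `context` covered by tokens start_token..end_token (inclusive)."""
    char_start = offsets[start_token][0]
    char_end = offsets[end_token][1]
    return context[char_start:char_end]


context = 'سلام من پدرامم 26 سالمه'
offsets = whitespace_offsets(context)

# Suppose the model selected token 3 as both the start and the end of the answer.
print(span_to_text(context, offsets, 3, 3))
```

Slicing the original string this way keeps the original spacing and characters, which a join over subword tokens does not guarantee.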
## Acknowledgments

This project was made possible by the fantastic work of [HooshvareLab](https://huggingface.co/HooshvareLab/bert-fa-base-uncased).

We also express our gratitude to [Newsha Shahbodaghkhan](https://huggingface.co/datasets/newsha/PQuAD/tree/main) for facilitating dataset gathering.

## Contributors

- Pedram Yazdipoor: [LinkedIn](https://www.linkedin.com/in/pedram-yazdipour/)

## Releases

### Release v0.2 (Sep 19, 2022)

This is the second version of our ParsBert for Question Answering on PQuAD.