## Persian XLM-RoBERTa Large for Question Answering

XLM-RoBERTa is a multilingual language model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116v2) by Conneau et al.

The multilingual [XLM-RoBERTa large for QA on various languages](https://huggingface.co/deepset/xlm-roberta-large-squad2) is fine-tuned on several QA datasets, but not on PQuAD, which is the largest Persian QA dataset so far. That model is the base model we fine-tune here.

The paper presenting the PQuAD dataset: [arXiv:2202.06219](https://arxiv.org/abs/2202.06219)

---

## Introduction

This model is fine-tuned on the PQuAD train set and is ready to use.
Its very long training time encouraged me to publish this model to make life easier for those who need it.


## Training hyperparameters
I set the batch size to 4 due to GPU memory limitations in Google Colab.
```
batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
```
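
For reference, these settings map roughly onto the Hugging Face `TrainingArguments` API. The sketch below only illustrates how such a run could be configured; the output directory is a placeholder, and `max_seq_len` is applied at tokenization time rather than here.

```python
# Minimal sketch, assuming the standard Hugging Face Trainer API.
# The output directory is a placeholder; the tokenized PQuAD splits are not shown.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm_roberta_large_pquad",   # placeholder
    per_device_train_batch_size=4,          # batch_size = 4 (Colab GPU memory limit)
    num_train_epochs=1,
    learning_rate=3e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    warmup_ratio=0.1,
    gradient_accumulation_steps=8,          # effective batch size of 32
    weight_decay=0.01,
)
# max_seq_len = 256 is enforced when tokenizing PQuAD (max_length=256), not here.
# A Trainer built from the model, training_args and the tokenized splits then runs trainer.train().
```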
## Performance
Evaluated on the PQuAD Persian test set ([official PQuAD dataset](https://huggingface.co/datasets/newsha/PQuAD)).
I also trained for more than one epoch, but got worse results.
Our XLM-RoBERTa outperforms [our ParsBERT on PQuAD](https://huggingface.co/pedramyazdipoor/parsbert_question_answering_PQuAD), but the former is more than three times larger than the latter, so the comparison is not entirely fair.
### Question Answering on the PQuAD Test Set

|      Metric      | Our XLM-RoBERTa Large | Our ParsBERT |
|:----------------:|:---------------------:|:------------:|
| Exact Match      |   66.56*              | 47.44        |
|      F1          |   87.31*              | 81.96        |
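
Exact Match and F1 are the standard SQuAD-style span metrics. As a point of reference only, the sketch below shows how they are typically computed for a single prediction/gold pair; the real PQuAD evaluation additionally normalizes the text and takes the maximum over multiple gold answers.

```python
# Sketch of SQuAD-style Exact Match and token-level F1 for one example.
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip() == gold.strip())

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # overlapping tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```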



## How to use

### PyTorch
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
path = 'pedramyazdipoor/persian_xlm_roberta_large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)
```
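
If you only need end-to-end answers rather than the raw logits used below, the checkpoint should also work with the high-level `pipeline` helper; a minimal sketch:

```python
# Sketch: high-level alternative to the manual span decoding shown below.
from transformers import pipeline

path = 'pedramyazdipoor/persian_xlm_roberta_large'
qa = pipeline('question-answering', model=path, tokenizer=path)
result = qa(question='چند سالمه؟', context='سلام من پدرامم 26 سالمه')
print(result)  # dict with 'score', 'start', 'end' and 'answer'
```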
### Inference
There are a few considerations for inference:
1) The start index of the answer must not come after the end index.
2) The answer span must lie within the context.
3) The selected span must be the most probable choice among the top N candidate pairs.

The helper below implements these rules.

```python
import numpy as np

def generate_indexes(start_logits, end_logits, N, min_index):
    """Pick the most probable (start, end) pair among the top-N start and end candidates."""

    # Pair each position with its logit score.
    start_indexes = np.arange(len(start_logits))
    list_start = dict(zip(start_indexes, start_logits.tolist()))
    end_indexes = np.arange(len(end_logits))
    list_end = dict(zip(end_indexes, end_logits.tolist()))

    sorted_start_list = sorted(list_start.items(), key=lambda x: x[1], reverse=True)  # descending sort by score
    sorted_end_list = sorted(list_end.items(), key=lambda x: x[1], reverse=True)

    # Initialize with the first positions as a fallback.
    start_idx, end_idx, prob = 0, 0, start_logits.tolist()[0] + end_logits.tolist()[0]

    # Search the top-N x top-N candidate pairs for the best valid span:
    # the start must not come after the end, and the start must lie after min_index.
    for a in range(N):
        for b in range(N):
            if (sorted_start_list[a][1] + sorted_end_list[b][1]) > prob:
                if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_start_list[a][0] > min_index):
                    prob = sorted_start_list[a][1] + sorted_end_list[b][1]
                    start_idx = sorted_start_list[a][0]
                    end_idx = sorted_end_list[b][0]

    return start_idx, end_idx
```

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)
text = 'سلام من پدرامم 26 سالمه'
question = 'چند سالمه؟'
encoding = tokenizer(question, text, add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)
out = model(encoding['input_ids'].to(device), encoding['attention_mask'].to(device), encoding['token_type_ids'].to(device))
# Some pieces of the original code were changed to generate one answer at a time.
# If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:],
# because <s> (the first token) represents the "no answer" case and must be compared with the other tokens.
# You can set min_index in generate_indexes() to force the chosen start index to lie within the
# context (the start index must be greater than the separator token's index).
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, 0)
print(tokenizer.tokenize(text + question))
print(tokenizer.tokenize(text + question)[answer_start_index : (answer_end_index + 1)])
>>> ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']
>>> ['▁26']
```
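
The last comment above mentions using `min_index` to force the predicted start inside the context. One way to derive that boundary (a sketch, not part of the original code) is to locate the separator tokens in the encoded input; with pair encoding the layout is `<s> question </s></s> context </s>`, so the context begins right after the second `</s>`.

```python
# Sketch: compute min_index from the separator positions.
sep_positions = (encoding['input_ids'][0] == tokenizer.sep_token_id).nonzero(as_tuple=True)[0]
context_boundary = int(sep_positions[1])   # index of the second </s>, just before the context
# The call above slices the logits with [1:], which shifts indexes by one,
# hence the "- 1" when passing the boundary as min_index.
answer_start_index, answer_end_index = generate_indexes(
    out['start_logits'][0][1:], out['end_logits'][0][1:], 5, context_boundary - 1
)
```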

## Acknowledgments
We hereby express our gratitude to [Newsha Shahbodaghkhan](https://huggingface.co/datasets/newsha/PQuAD/tree/main) for facilitating dataset gathering.
## Contributors
- Pedram Yazdipoor: [LinkedIn](https://www.linkedin.com/in/pedram-yazdipour/)
## Releases
### Release v0.2 (Sep 18, 2022)
This is the second version of our Persian XLM-RoBERTa Large; it addresses some problems encountered with the previous version.