---
license: apache-2.0
language:
- it
datasets:
- squad_it
widget:
- text: quale libro fu scritto da alessandro manzoni?
  context: alessandro manzoni pubblicò la prima versione de i promessi sposi nel 1827
- text: in quali competizioni gareggia la ferrari?
  context: la scuderia ferrari è una squadra corse italiana di formula 1 con sede a maranello
- text: quale sport è riferito alla serie a?
  context: il campionato di serie a è la massima divisione professionistica del campionato italiano di calcio maschile
model-index:
- name: osiria/bert-italian-uncased-question-answering
  results:
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: squad_it
      type: squad_it
    metrics:
    - type: exact-match
      value: 0.6560
      name: Exact Match
    - type: f1
      value: 0.7716
      name: F1
pipeline_tag: question-answering
---

--------------------------------------------------------------------------------------------------

<body>
<span class="vertical-text" style="background-color:lightgreen;border-radius: 3px;padding: 3px;"></span>
<br>
<span class="vertical-text" style="background-color:orange;border-radius: 3px;padding: 3px;">    Task: Question Answering</span>
<br>
<span class="vertical-text" style="background-color:lightblue;border-radius: 3px;padding: 3px;">    Model: BERT</span>
<br>
<span class="vertical-text" style="background-color:tomato;border-radius: 3px;padding: 3px;">    Lang: IT</span>
<br>
<span class="vertical-text" style="background-color:lightgrey;border-radius: 3px;padding: 3px;">  Type: Uncased</span>
<br>
<span class="vertical-text" style="background-color:#CF9FFF;border-radius: 3px;padding: 3px;"></span>
</body>

--------------------------------------------------------------------------------------------------

<h3>Model description</h3>

This is an uncased <b>BERT</b> <b>[1]</b> model for the <b>Italian</b> language, fine-tuned for <b>Extractive Question Answering</b> on the [SQuAD-IT](https://huggingface.co/datasets/squad_it) dataset <b>[2]</b>.

If you are looking for a more accurate (but slightly heavier) model, you can refer to: https://huggingface.co/osiria/deberta-italian-question-answering

<b>Update: version 2.0</b>

Version 2.0 further improves performance through a two-phase fine-tuning strategy: the model is first fine-tuned on the English SQuAD v2 (1 epoch, 20% warmup ratio, maximum learning rate of 3e-5), then further fine-tuned on the Italian SQuAD (2 epochs, no warmup, initial learning rate of 3e-5).
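The two learning-rate schedules described above can be illustrated with a minimal sketch. It assumes a linear warmup and a linear decay to zero in both phases; the decay shape of the first phase and the step counts are assumptions made for illustration, not values taken from the actual training run.

```python
def linear_schedule(step, total_steps, peak_lr=3e-5, warmup_ratio=0.0):
    """Learning rate at `step`: linear warmup, then linear decay towards zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: go linearly from the peak towards 0 over the remaining steps.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical step counts, chosen only to keep the example readable.
phase1_steps = 1000  # English SQuAD v2: 1 epoch, 20% warmup
phase2_steps = 2000  # Italian SQuAD: 2 epochs, no warmup

phase1 = [linear_schedule(s, phase1_steps, warmup_ratio=0.2) for s in range(phase1_steps)]
phase2 = [linear_schedule(s, phase2_steps) for s in range(phase2_steps)]

print(phase1[0], max(phase1))  # phase 1 starts at zero and peaks when warmup ends
print(phase2[0], phase2[-1])   # phase 2 starts at the peak and decays from there
```

With no warmup, the second phase starts directly at 3e-5, which matches the "initial learning rate" wording used for the Italian fine-tuning.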

To maximize the benefits of this multilingual procedure, [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased) is used as the pre-trained model. Once the two fine-tuning phases are complete, the embedding layer is compressed as in [bert-base-italian-uncased](https://huggingface.co/osiria/bert-base-italian-uncased) to bring the model down to a monolingual size.
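The exact compression procedure is not detailed in this card. As a rough illustration only, the sketch below assumes that compression means keeping the embedding rows of a reduced vocabulary and re-indexing the tokens; the function name, the toy vocabulary and the vectors are all hypothetical.

```python
def compress_embeddings(embeddings, vocab, kept_tokens):
    """Keep only the embedding rows of `kept_tokens` and rebuild the vocabulary.

    embeddings:  list of vectors, one row per token id of the original vocabulary
    vocab:       dict mapping token -> original id
    kept_tokens: tokens to retain, in the order they take in the new vocabulary
    """
    new_vocab = {}
    new_embeddings = []
    for new_id, token in enumerate(kept_tokens):
        new_vocab[token] = new_id
        new_embeddings.append(embeddings[vocab[token]])
    return new_embeddings, new_vocab


# Toy multilingual vocabulary with 2-dimensional embeddings (made-up values).
vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "il": 3, "haus": 4, "casa": 5}
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]

# Keep the special tokens plus the tokens assumed to matter for Italian.
kept = ["[PAD]", "[UNK]", "il", "casa"]
small_embeddings, small_vocab = compress_embeddings(embeddings, vocab, kept)

print(len(embeddings), "->", len(small_embeddings))
print(small_vocab)
```

Under this assumption only the embedding table shrinks, so the rest of the network and its fine-tuned weights are left untouched.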


<h3>Training and Performance</h3>

The model is trained to perform question answering given a context and a question, under the assumption that the context contains the answer to the question. It has been fine-tuned for Extractive Question Answering on the SQuAD-IT dataset for 2 epochs, with a linearly decaying learning rate starting from 3e-5, a maximum sequence length of 384 and a document stride of 128.
<br>The dataset includes 54,159 training instances and 7,609 test instances.
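To show how a maximum sequence length and a document stride interact on long contexts, here is a minimal sliding-window sketch. It assumes the stride of 128 is the number of tokens shared by consecutive windows (the convention of the Hugging Face tokenizer `stride` argument), and it ignores the question and special tokens that take up part of each 384-token sequence in practice.

```python
def sliding_windows(tokens, max_len=384, overlap=128):
    """Split `tokens` into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens, so an answer cut at the edge of
    one window can still appear whole in the next one.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows


# A hypothetical 900-token context, represented here by integer ids.
context_tokens = list(range(900))
for start, window in sliding_windows(context_tokens):
    print(start, len(window))
```

Each window is scored independently, and the best-scoring span across windows is usually taken as the final answer.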

Performance on the test set is reported in the following table:

| EM | F1 |
| ------ | ------ |
| 65.60 | 77.16 |

Testing notebook: https://huggingface.co/osiria/bert-italian-uncased-question-answering/blob/main/osiria_bert_italian_uncased_qa_evaluation.ipynb
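For readers unfamiliar with the two metrics, the sketch below shows the usual SQuAD-style definitions of Exact Match and token-level F1 for a single prediction. The normalization is deliberately simplified (lowercasing, ASCII punctuation removal, whitespace collapsing) and is an assumption: the testing notebook may normalize answers differently.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (simplified)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the reference.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Milano", "milano"))
print(f1_score("a milano", "milano"))
```

Per-example scores like these are typically averaged over the test set and reported as percentages.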

<h3>Quick usage</h3>

```python
from transformers import BertTokenizerFast, BertForQuestionAnswering
from transformers import pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = BertTokenizerFast.from_pretrained("osiria/bert-italian-uncased-question-answering")
model = BertForQuestionAnswering.from_pretrained("osiria/bert-italian-uncased-question-answering")

# Build a question-answering pipeline and run it on a context/question pair
pipeline_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
pipeline_qa(context="alessandro manzoni è nato a milano nel 1785", question="dove è nato manzoni?")

# Output:
# {'score': 0.9905025959014893, 'start': 28, 'end': 34, 'answer': 'milano'}
```

<h3>References</h3>

[1] https://arxiv.org/abs/1810.04805

[2] https://link.springer.com/chapter/10.1007/978-3-030-03840-3_29

<h3>Limitations</h3>

This model was trained on SQuAD-IT, which is mainly a machine-translated version of the original SQuAD v1.1, so the quality of the training set is limited by the quality of the machine translation.
Moreover, the model is meant to answer questions under the assumption that the required information is actually contained in the given context (the underlying assumption of SQuAD v1.1).
If this assumption is violated, the model will still return an answer, and that answer will be incorrect.
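One possible mitigation is to discard low-confidence predictions on the caller's side. The sketch below assumes a result dict in the format returned by the question-answering pipeline; the helper name and the 0.5 threshold are hypothetical, and the score is not a calibrated probability, so a suitable cut-off would need to be tuned on held-out data.

```python
def answer_or_none(result, min_score=0.5):
    """Return the predicted answer, or None when the pipeline score is too low."""
    if result["score"] < min_score:
        return None
    return result["answer"]


# The first dict mirrors the pipeline output format; the second one is a
# made-up low-confidence prediction used only for illustration.
confident = {"score": 0.99, "start": 28, "end": 34, "answer": "milano"}
doubtful = {"score": 0.03, "start": 0, "end": 10, "answer": "alessandro"}

print(answer_or_none(confident))
print(answer_or_none(doubtful))
```

A filter like this reduces, but does not eliminate, wrong answers on contexts that do not contain the requested information.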

<h3>License</h3>

The model is released under the <b>Apache-2.0</b> license.