Giyaseddin committed on
Commit
5272121
1 Parent(s): 4be5347

Add README.md

---
license: apache-2.0
language: en
library: transformers
other: distilroberta
datasets:
- Short Question Answer Assessment Dataset
---

# DistilRoBERTa base model for Short Question Answer Assessment

## Model description

The pre-trained model is a distilled version of the [RoBERTa-base model](https://huggingface.co/roberta-base). It follows the same training procedure as [DistilBERT](https://huggingface.co/distilbert-base-uncased).
The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
This model is case-sensitive: it makes a difference between english and English.

The model has 6 layers, a hidden size of 768 and 12 attention heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base).
On average, DistilRoBERTa is twice as fast as RoBERTa-base.
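
These figures can be verified directly from the model configuration, for example for the upstream `distilroberta-base` checkpoint:

```python
from transformers import AutoConfig

# Inspect the architecture of the distilled base model.
config = AutoConfig.from_pretrained("distilroberta-base")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# -> 6 768 12
```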

We encourage you to check the [RoBERTa-base model](https://huggingface.co/roberta-base) card to learn more about usage, limitations and potential biases.

This is a classification model that solves the Short Question Answer Assessment task by fine-tuning the [pretrained DistilRoBERTa model](https://huggingface.co/distilroberta-base) on the
[Question Answer Assessment dataset](#).

## Intended uses & limitations

This model can only be used for questions and answers similar to those in the dataset of [Banjade et al.](https://aclanthology.org/W16-0520.pdf).

### How to use

You can use this model directly with a text-classification pipeline:

```python
>>> from transformers import pipeline

>>> # Load the fine-tuned model as a text-classification pipeline.
>>> classifier = pipeline("text-classification", model="Giyaseddin/distilroberta-base-finetuned-short-answer-assessment", return_all_scores=True)

>>> context = "To rescue a child who has fallen down a well, rescue workers fasten him to a rope, the other end of which is then reeled in by a machine. The rope pulls the child straight upward at steady speed."
>>> question = "How does the amount of tension in the rope compare to the downward force of gravity acting on the child?"
>>> ref_answer = "Since the child is being raised straight upward at a constant speed, the net force on the child is zero and all the forces balance. That means that the tension in the rope balances the downward force of gravity."
>>> student_answer = "The tension force is higher than the force of gravity."

>>> # Concatenate context, question, reference answer and student answer with the [SEP] separator.
>>> body = " [SEP] ".join([context, question, ref_answer, student_answer])
>>> raw_results = classifier([body])
>>> raw_results
[[{'label': 'LABEL_0', 'score': 0.0004029414849355817},
  {'label': 'LABEL_1', 'score': 0.0005476847873069346},
  {'label': 'LABEL_2', 'score': 0.998059093952179},
  {'label': 'LABEL_3', 'score': 0.0009902542224153876}]]

>>> # Map the raw label ids to the human-readable class names.
>>> _LABELS_ID2NAME = {0: "correct", 1: "correct_but_incomplete", 2: "contradictory", 3: "incorrect"}
>>> results = []
>>> for result in raw_results:
...     for score in result:
...         results.append([
...             {_LABELS_ID2NAME[int(score["label"][-1:])]: "%.2f" % score["score"]}
...         ])
>>> results
[[{'correct': '0.00'}],
 [{'correct_but_incomplete': '0.00'}],
 [{'contradictory': '1.00'}],
 [{'incorrect': '0.00'}]]
```
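
The predicted class is the one with the highest score: here the model judges the student answer _contradictory_, since it claims the tension exceeds gravity while the reference answer states that the two forces balance.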

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions. It also inherits some of
[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).

This bias will also affect all fine-tuned versions of this model.

Another limitation of this model is sequence length: long inputs can lead to wrong predictions, because during preprocessing (after the input sequences are concatenated) the student answer, which matters most, may be truncated away.
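
A quick way to check whether an input risks this truncation is to compare its token count against the model's maximum sequence length. A minimal sketch (the placeholder `body` stands in for the concatenated input from the usage example above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Giyaseddin/distilroberta-base-finetuned-short-answer-assessment")

# `body` would be the concatenated input from the usage example above;
# a short placeholder is used here.
body = "Context [SEP] Question [SEP] Reference answer [SEP] Student answer"

n_tokens = len(tokenizer(body)["input_ids"])
if n_tokens > tokenizer.model_max_length:
    print(f"{n_tokens} tokens: everything past {tokenizer.model_max_length} tokens is cut off.")
```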
76
+
77
+ ## Pre-training data
78
+
79
+ ## Training data
80
+
81
+ The RoBERTa model was pretrained on the reunion of five datasets:
82
+ - [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
83
+ - [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers) ;
84
+ - [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 millions English news
85
+ articles crawled between September 2016 and February 2019.
86
+ - [OpenWebText](https://github.com/jcpeterson/openwebtext), an opensource recreation of the WebText dataset used to
87
+ train GPT-2,
88
+ - [Stories](https://arxiv.org/abs/1806.02847) a dataset containing a subset of CommonCrawl data filtered to match the
89
+ story-like style of Winograd schemas.
90
+
91
+ Together theses datasets weight 160GB of text.
92
+
93
+ ## Fine-tuning data
94
+
95
+ The annotated dataset consists of 900 students’ short constructed answers and their correctness in the given context. Four qualitative levels of correctness are defined, correct, correct-but-incomplete, contradictory and Incorrect.
96
+
97
+
98
+ ## Training procedure
99
+
100
+ ### Preprocessing
101
+
102
+ In the preprocessing phase, the following parts are concatenated: _question context_, _question_, _reference_answer_, and _student_answer_ using the separator `[SEP]`.
103
+ This makes the full text as:
104
+
105
+ ```
106
+ [CLS] Context Sentence [SEP] Question Sentence [SEP] Reference Answer Sentence [SEP] Student Answer Sentence [CLS]
107
+ ```
108
+
109
+ The data are splitted according to the following ratio:
110
+ - Training set 80%.
111
+ - Test set 20%.
112
+
113
+ Lables are mapped as: `{0: "correct", 1: "correct_but_incomplete", 2: "contradictory", 3: "incorrect"}`
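
A minimal sketch of this preprocessing, assuming a scikit-learn split; the original preprocessing script is not published, so the row format and variable names below are illustrative:

```python
from sklearn.model_selection import train_test_split

label2id = {"correct": 0, "correct_but_incomplete": 1, "contradictory": 2, "incorrect": 3}

# Hypothetical stand-ins for the 900 annotated (context, question, reference, student, label) rows.
rows = [
    ("Context A", "Question A", "Reference answer A", "Student answer A", "correct"),
    ("Context B", "Question B", "Reference answer B", "Student answer B", "incorrect"),
]

# Concatenate the four text parts with the [SEP] separator, as described above.
texts = [" [SEP] ".join(row[:4]) for row in rows]
labels = [label2id[row[4]] for row in rows]

# 80/20 train/test split.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
```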

### Fine-tuning

The model was fine-tuned on a GeForce GTX 960M for 20 minutes. The hyperparameters are:

| Parameter           | Value |
|:-------------------:|:-----:|
| Learning rate       | 5e-5  |
| Weight decay        | 0.01  |
| Training batch size | 8     |
| Epochs              | 4     |
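
Assuming the standard `transformers` `Trainer` API (the actual training script is not published), a setup matching these hyperparameters could look like:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 4-way sequence classification head on top of the distilled encoder.
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=4)

args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    num_train_epochs=4,
    evaluation_strategy="epoch",  # evaluate after every epoch, as in the table below
)

# `train_ds` and `eval_ds` are assumed to be the tokenized 80%/20% splits from the preprocessing step.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```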

Here are the scores during training:

| Epoch | Training Loss | Validation Loss | Accuracy | F1       | Precision | Recall   |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:---------:|:--------:|
| 1     | No log        | 0.773334        | 0.713706 | 0.711398 | 0.746059  | 0.713706 |
| 2     | 1.069200      | 0.404932        | 0.885279 | 0.884592 | 0.886699  | 0.885279 |
| 3     | 0.473700      | 0.247099        | 0.931980 | 0.931675 | 0.933794  | 0.931980 |
| 4     | 0.228000      | 0.205577        | 0.954315 | 0.954210 | 0.955258  | 0.954315 |

## Evaluation results

When fine-tuned on the downstream task of 4-class Question Answer Assessment classification, the model achieved the following results
(scores are rounded to 3 decimal places):

|                          | precision | recall | f1-score | support |
|:------------------------:|:---------:|:------:|:--------:|:-------:|
| _correct_                | 0.933     | 0.992  | 0.962    | 366     |
| _correct_but_incomplete_ | 0.976     | 0.934  | 0.954    | 257     |
| _contradictory_          | 0.938     | 0.929  | 0.933    | 113     |
| _incorrect_              | 0.975     | 0.932  | 0.953    | 249     |
| accuracy                 | -         | -      | 0.954    | 985     |
| macro avg                | 0.955     | 0.947  | 0.950    | 985     |
| weighted avg             | 0.955     | 0.954  | 0.954    | 985     |

Confusion matrix:

| Actual \ Predicted       | _correct_ | _correct_but_incomplete_ | _contradictory_ | _incorrect_ |
|:------------------------:|:---------:|:------------------------:|:---------------:|:-----------:|
| _correct_                | 363       | 3                        | 0               | 0           |
| _correct_but_incomplete_ | 14        | 240                      | 0               | 3           |
| _contradictory_          | 5         | 0                        | 105             | 3           |
| _incorrect_              | 7         | 3                        | 7               | 232         |

The AUC scores are: micro = **0.9695** and macro = **0.9650**.
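
For reference, metrics of this kind can be computed with scikit-learn; the snippet below is a sketch using hypothetical stand-in predictions, not the author's evaluation script:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

class_names = ["correct", "correct_but_incomplete", "contradictory", "incorrect"]

# Hypothetical stand-ins: true labels, predicted labels, and per-class softmax scores.
y_true = np.array([0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 0])
y_score = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.60, 0.10, 0.10, 0.20],
])

# Produces a per-class report and a confusion matrix analogous to the tables above.
print(classification_report(y_true, y_pred, target_names=class_names, digits=3))
print(confusion_matrix(y_true, y_pred))

# Micro/macro AUC over the one-vs-rest binarized labels.
y_true_bin = label_binarize(y_true, classes=[0, 1, 2, 3])
print(roc_auc_score(y_true_bin, y_score, average="micro"))
print(roc_auc_score(y_true_bin, y_score, average="macro"))
```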