klasocki committed
Commit
65977ce
1 Parent(s): 7918af7

Add training notebook

README.md CHANGED
@@ -88,9 +88,11 @@ dataset:
 The results of our evaluation of the baseline model out of domain on the English wikitext-103-raw-v1 validation
 dataset are as follows:
 
-| precision | recall | F1 | support |
-|-----------|--------|------|---------|
-| 0.79 | 0.72 | 0.75 | 10079 |
+| Model    | precision | recall | F1   | support |
+|----------|-----------|--------|------|---------|
+| baseline | 0.79      | 0.72   | 0.75 | 10079   |
+| ours*    | 0.86      | 0.85   | 0.85 | 10079   |
+*Details of the fine-tuning process are given in the next section.
 
 We treat each comma as one token instance, as opposed to the original paper, which NER-tags whole multi-token
 spans of preceding words as comma-class tokens.
@@ -100,17 +102,26 @@ In our approach, for each comma from the prediction text obtained from the model
 * If a comma from ground truth is not predicted, it counts as a false negative.
 
 ## Training
-While fine-tuning an encoder BERT-like pre-trained model for NER seems like the best approach to the problem,
-since it preserves the sentence structure and only focuses on commas,
-with limited GPU resources, we doubt we could beat the baseline model with a similar approach.
-We could fine-tune the baseline on our data, focusing on commas, and see if it brings any improvement.
+The fine-tuned model can be found [here](https://huggingface.co/klasocki/roberta-large-lora-ner-comma-fixer).
 
-However, we thought that trying out pre-trained text-to-text or decoder-only LLMs for this task using PEFT could be
+To compare with the baseline, we fine-tune the same model, RoBERTa large, on the English wikitext dataset.
+We use a similar approach, treating comma fixing as a NER problem: for each token, we predict whether a comma
+should be inserted after it.
+
+The biggest differences are the dataset, the focus on commas alone, and the use of [LoRA](https://arxiv.org/pdf/2106.09685.pdf)
+for parameter-efficient fine-tuning of the base model.
+
+The main advantage of this approach is that it preserves the input structure and only touches commas,
+ensuring that nothing else is changed and that the model does not have to learn to repeat the input back when
+no commas should be inserted.
+
+
+We also thought that trying out pre-trained text-to-text or decoder-only LLMs for this task using PEFT could be
 interesting, and wanted to check if we have enough resources for low-rank adaptation or prefix-tuning.
+While such a model would have to learn not to change anything other than commas, and its free-form output could
+make evaluation difficult, this approach offers added flexibility in case we later decide to fix errors other
+than just commas.
 
-We adapt the code from [this tutorial](https://www.youtube.com/watch?v=iYr1xZn26R8) in order to fine-tune a
-[bloom LLM](https://huggingface.co/bigscience/bloom-560m) to our task using
-[LoRa](https://arxiv.org/pdf/2106.09685.pdf).
 However, even with the smallest model from the BLOOM family, we struggled with CUDA memory errors using the free Google
 Colab GPU quotas, and could only train with a batch size of two.
 After a short training, it seems the loss keeps fluctuating and the model is only able to learn to repeat the
@@ -118,8 +129,6 @@ original phrase back.
 
 If time permits, we plan to experiment with seq2seq pre-trained models, increasing gradient accumulation steps, and the
 percentage of
-data with commas.
-The latter could help since wikitext contains highly diverse data, with many rows being empty strings,
-headers, or short paragraphs.
+data with commas, as well as artificially inserting mistaken commas instead of removing them in preprocessing.
 
 
notebooks/evaluation.ipynb CHANGED
The diff for this file is too large to render. See raw diff
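As the evaluation notebook is too large to render here, the following is a minimal illustrative sketch of the comma-level matching the README describes: each ground-truth comma either matches a predicted comma or counts as a false negative, and unmatched predicted commas count as false positives. It works on character offsets and assumes the prediction differs from the reference only in comma placement, so it is not the notebook's actual token-level implementation.

```python
# Illustrative comma-level precision/recall/F1, not the notebook's code.
# Assumes prediction and reference are identical except for comma placement.
def comma_positions(text: str) -> set[int]:
    """Indices in the comma-stripped character stream after which a comma appears."""
    positions, offset = set(), 0
    for ch in text:
        if ch == ",":
            positions.add(offset)  # a comma follows the offset-th non-comma character
        else:
            offset += 1
    return positions


def comma_scores(prediction: str, reference: str) -> tuple[float, float, float]:
    pred, ref = comma_positions(prediction), comma_positions(reference)
    tp = len(pred & ref)                    # predicted commas confirmed by the ground truth
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0  # missed ground-truth commas are false negatives
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(comma_scores("I came, I saw I conquered,", "I came, I saw, I conquered"))  # (0.5, 0.5, 0.5)
```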
 
notebooks/finetuning_commafixer_with_LoRa.ipynb CHANGED
The diff for this file is too large to render. See raw diff
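The fine-tuning notebook is likewise too large to render. Below is a minimal sketch of the kind of setup the README's training section describes, namely LoRA adaptation of roberta-large for token-level comma prediction via `peft`; the label scheme and LoRA hyperparameters are illustrative assumptions, not the notebook's exact configuration.

```python
# Hedged sketch of LoRA fine-tuning for comma prediction as token classification.
# Labels, rank, and other hyperparameters are illustrative, not the notebook's exact values.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-COMMA"]  # assumed scheme: does a comma follow this token?
tokenizer = AutoTokenizer.from_pretrained("roberta-large", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("roberta-large", num_labels=len(labels))

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,  # keeps the token-classification head trainable
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of roberta-large's weights are updated
```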
 
setup.py CHANGED
@@ -21,8 +21,10 @@ setup(
     extras_require={
         'training': [
             'datasets==2.14.4',
+            'notebook',
+            'peft==0.5.0',
             'seqeval',
-            'notebook'
+            'evaluate==0.4.0'
         ],
         'test': [
             'pytest',
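For context, once the expanded 'training' extras are installed (for example via an editable install of the package), the new dependencies can be smoke-tested as below; the snippet is illustrative and not part of the commit.

```python
# Illustrative smoke test of the 'training' extras; not part of the commit.
import datasets
import evaluate
import peft

seqeval_metric = evaluate.load("seqeval")  # token-classification metrics backed by the pinned seqeval package
print(datasets.__version__, peft.__version__, evaluate.__version__)
```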