
Corsican RobertaTokenizer

Model Details

Model Description

We converted our dataset of 32K sentences/words into a Corsican tokenizer (a training sketch follows the details below). The main goal was to use it for multiple tasks:

  1. Fill in the mask
  2. ...
  • Developed by: Parolla Team
  • Model type: RobertaTokenizer
  • Language (NLP): Corsican
  • License: MIT
  • Model size: 83.5M parameters (F32 tensors)
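
The tokenizer itself can be reproduced along these lines. This is a minimal sketch assuming a byte-level BPE tokenizer trained with the tokenizers library; the corpus file name, vocabulary size, and minimum frequency are assumptions, since the card does not document the exact settings:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the Corsican corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corsican_corpus.txt"],  # hypothetical corpus file (32K sentences/words)
    vocab_size=52_000,              # assumed; a common RoBERTa-style choice
    min_frequency=2,                # assumed
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save vocab.json and merges.txt so RobertaTokenizer can load them.
tokenizer.save_model("./CorseBERTO")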

Training

Training logs
   [52440/52440 1:22:42, Epoch 12/12]
Step 	Training Loss
500 	6.381300
1000 	6.437300
1500 	6.377700
2000 	6.300200
2500 	6.258800
3000 	6.237300
3500 	6.180200
4000 	6.173800
4500 	5.994900
5000 	5.810800
5500 	5.707800
6000 	5.893600
6500 	5.839200
7000 	5.685100
7500 	5.678000
8000 	5.603100
8500 	5.527700
9000 	5.514400
9500 	5.431100
10000 	5.361000
10500 	5.411500
11000 	5.381900
11500 	5.297600
12000 	5.250500
12500 	5.276900
13000 	5.338400
13500 	5.212400
14000 	5.186500
14500 	5.142100
15000 	5.136100
15500 	5.053000
16000 	5.029200
16500 	5.079200
17000 	4.997900
17500 	4.981500
18000 	4.903900
18500 	4.982700
19000 	4.995200
19500 	4.930600
20000 	4.793900
20500 	4.899000
21000 	4.831100
21500 	4.950000
22000 	4.866000
22500 	4.910700
23000 	4.725100
23500 	4.826900
24000 	4.782700
24500 	4.812400
25000 	4.701700
25500 	4.753000
26000 	4.756300
26500 	4.623100
27000 	4.739700
27500 	4.623200
28000 	4.582700
28500 	4.620300
29000 	4.659800
29500 	4.640600
30000 	4.576400
30500 	4.533000
31000 	4.505500
31500 	4.562300
32000 	4.613800
32500 	4.574300
33000 	4.655400
33500 	4.498400
34000 	4.466300
34500 	4.533500
35000 	4.527000
35500 	4.399500
36000 	4.419100
36500 	4.447800
37000 	4.422600
37500 	4.358000
38000 	4.424800
38500 	4.429200
39000 	4.352600
39500 	4.363300
40000 	4.362100
40500 	4.413200
41000 	4.297300
41500 	4.355300
42000 	4.332100
42500 	4.303100
43000 	4.303400
43500 	4.299000
44000 	4.383100
44500 	4.251600
45000 	4.330600
45500 	4.259000
46000 	4.273200
46500 	4.226100
47000 	4.205100
47500 	4.285500
48000 	4.322100
48500 	4.287600
49000 	4.231200
49500 	4.247200
50000 	4.252100
50500 	4.316100
51000 	4.188300
51500 	4.185700
52000 	4.232600

CPU times: user 1h 20min 10s, sys: 32.2 s, total: 1h 20min 42s
Wall time: 1h 22min 43s

TrainOutput(global_step=52440, training_loss=4.877706548369267, metrics={'train_runtime': 4962.5389, 'train_samples_per_second': 84.523, 'train_steps_per_second': 10.567, 'total_flos': 5036102603465856.0, 'train_loss': 4.877706548369267, 'epoch': 12.0})
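
For reference, a training setup along the following lines would produce a log like the one above. This is a minimal sketch, not the exact script: the corpus file and model configuration are assumptions, while the 12 epochs, the 500-step logging interval, and a batch size of 8 (consistent with the reported 52,440 steps, ~84.5 samples/s, and ~10.6 steps/s) are taken from the numbers above:

from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the trained tokenizer and build a fresh RoBERTa MLM on top of it.
tokenizer = RobertaTokenizerFast.from_pretrained("./CorseBERTO", model_max_length=512)
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# Hypothetical corpus file; one sentence per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="corsican_corpus.txt", block_size=128
)

# Standard 15% dynamic masking for masked language modeling.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./CorseBERTO",
        num_train_epochs=12,            # matches the log above
        per_device_train_batch_size=8,  # consistent with 52,440 total steps
        logging_steps=500,              # matches the 500-step log interval
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()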

Evaluation

Current version

We trained on the same dataset but with more epochs (12) and got the following result; it is not better than the baseline.

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./CorseBERTO",
    tokenizer="./CorseBERTO"
)
## Expected "Tandu facia guasgi notte."
fill_mask("Tandu facia guasgi <mask>.")


[{'score': 0.030648713931441307,
  'token': 1635,
  'token_str': ' fà',
  'sequence': 'Tandu facia guasgi fà.'},
 {'score': 0.01834064908325672,
  'token': 30963,
  'token_str': ' Orsu',
  'sequence': 'Tandu facia guasgi Orsu.'},
 {'score': 0.018018996343016624,
  'token': 2373,
  'token_str': ' tuttu',
  'sequence': 'Tandu facia guasgi tuttu.'},
 {'score': 0.014401531778275967,
  'token': 2298,
  'token_str': ' pocu',
  'sequence': 'Tandu facia guasgi pocu.'},
 {'score': 0.013076471164822578,
  'token': 8484,
  'token_str': ' quì',
  'sequence': 'Tandu facia guasgi quì.'}]
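
A rough way to compare the two runs is to convert the reported average training loss into a training perplexity. This is only a sanity check on the log above, not a held-out evaluation:

import math

# Final average training loss from the TrainOutput above (12-epoch run).
train_loss = 4.877706548369267
print(f"training perplexity ~ {math.exp(train_loss):.1f}")  # ~ 131.3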

Baseline

For the first model, we trained for 5 epochs and got the following result on the fill-mask task for the same sentence:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./CorseBERTO",
    tokenizer="./CorseBERTO"
)
## Expected "Tandu facia guasgi notte."
fill_mask("Tandu facia guasgi <mask>.")

[{'score': 0.04370326176285744,
  'token': 330,
  'token_str': 'à',
  'sequence': 'Tandu facia guasgià.'},
 {'score': 0.025496652349829674,
  'token': 283,
  'token_str': ' di',
  'sequence': 'Tandu facia guasgi di.'},
 {'score': 0.014682493172585964,
  'token': 446,
  'token_str': ' à',
  'sequence': 'Tandu facia guasgi à.'},
 {'score': 0.013385714031755924,
  'token': 344,
  'token_str': ' in',
  'sequence': 'Tandu facia guasgi in.'},
 {'score': 0.013206811621785164,
  'token': 361,
  'token_str': ' hè',
  'sequence': 'Tandu facia guasgi hè.'}]

Model Card Contact

Parolla Team (https://www.parolla.chat)
