bert-base-multilingual-cased-finetuned-yiddish-experiment-3

This model is a fine-tuned version of bert-base-multilingual-cased, trained on raw Yiddish HTR lines paired with their human corrections (see Training and evaluation data below). It achieves the following results on the evaluation set:

  • Loss: 1.4254

Model description

More information needed

Intended uses & limitations

Intended for use in a chatbot pipeline that corrects raw Yiddish machine transcriptions generated by Transkribus.
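A minimal loading sketch is given below. The repository id matches this card; the choice of AutoModelForMaskedLM is an assumption, since the card does not state which task head the checkpoint was saved with, and the masked Yiddish line is purely illustrative.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Repository id as listed on this card; the masked-LM head is an assumption.
model_id = "MarineLives/mBert-finetuned-yiddish-experiment-3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Example: score a raw transcription line with one uncertain token masked out.
# The Yiddish text here is purely illustrative.
line = "דאס איז א " + tokenizer.mask_token + " פון א בריוו"
inputs = tokenizer(line, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    logits = model(**inputs).logits

# Report the top candidate token for the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))
```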

Training and evaluation data

Training dataset: Gavin model fine tuning_lines.csv
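A sketch of how this file could be loaded with the datasets library follows; the column structure is not documented on the card, so inspect the actual headers before relying on them.

```python
from datasets import load_dataset

# Load the fine-tuning CSV named above; the split name "train" is just a label.
dataset = load_dataset(
    "csv",
    data_files={"train": "Gavin model fine tuning_lines.csv"},
)["train"]

# Each row is expected to pair one raw Transkribus HTR line with its
# human-corrected counterpart; check the actual column names first.
print(dataset.column_names)
print(dataset[0])
```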

Training procedure

The training process in Experiment 3 fine-tunes the pre-trained mBERT (multilingual BERT) model to correct raw handwritten text recognition (HTR) output. The fine-tuning dataset consists of raw HTR lines paired with their human-corrected ground truth, as contained in the lines CSV file referenced above.

Key Parameters and Rationale:

1. Model Selection: The use of bert-base-multilingual-cased leverages the multilingual capabilities of BERT to accommodate the linguistic diversity likely present in the handwritten text dataset. This choice aligns well with the need to handle potentially mixed-language inputs or varying character distributions.

2. Data Handling:

  • The dataset is loaded and structured into columns for the raw HTR text and its hand-corrected counterpart.
  • Tokenization is performed with the mBERT tokenizer, using a maximum sequence length of 64 tokens. This length captures sufficient context while keeping memory overhead low (a tokenization sketch along these lines appears after the compute setup note below).

3. Training Configuration:

  • Batch Size and Gradient Accumulation: A batch size of 4 with a gradient accumulation step of 1 is chosen, likely due to the memory limitations of the L4 GPU, ensuring stable training while processing smaller data chunks.

  • Learning Rate and Weight Decay: A low learning rate of 5e-6 allows for gradual updates to the pre-trained weights, preserving the pre-trained linguistic knowledge while adapting to the new task. Weight decay is set to 0 to avoid penalizing model parameters unnecessarily for this specific task.

  • Gradient Clipping: A maximum gradient norm of 1 prevents exploding gradients, which could otherwise destabilize training given the small batch size and the model's sensitivity to the learning rate.

  • Warm-Up Steps: 300 warm-up steps allow the optimizer to start with smaller updates, reducing initial instability.

  • Epochs and Logging: The model is trained for 10 epochs, with evaluation loss logged every 100 steps, balancing sufficient training time with regular monitoring.

Compute Setup:

The process was executed on an L4 GPU, which handles such NLP workloads efficiently and provides faster training iterations.
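The data handling step described above corresponds roughly to the tokenization sketch below. The column names raw_htr and corrected are assumptions (the card does not document the CSV headers), and using the corrected token ids as labels is one possible formulation of the correction objective, not necessarily the one used in this experiment.

```python
from transformers import AutoTokenizer

# Tokenizer matching the base checkpoint named on this card.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

MAX_LENGTH = 64  # maximum sequence length used in this experiment

def tokenize_pair(example):
    """Tokenize one raw HTR line and its corrected counterpart.

    The column names "raw_htr" and "corrected" are assumptions; the card
    does not document the CSV headers. Using the corrected token ids as
    labels is likewise an assumed formulation of the correction objective.
    """
    model_inputs = tokenizer(
        example["raw_htr"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LENGTH,
    )
    labels = tokenizer(
        example["corrected"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LENGTH,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized = dataset.map(tokenize_pair, remove_columns=dataset.column_names)
```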

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 300
  • num_epochs: 10
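Taken together, these values map onto a Hugging Face TrainingArguments configuration roughly as follows. The output_dir is a placeholder, the evaluation/logging cadence of 100 steps comes from the training procedure description above, and anything not listed on the card is left at its library default; this is a sketch rather than the exact training script.

```python
from transformers import TrainingArguments

# Sketch of TrainingArguments matching the hyperparameters listed above.
# output_dir is a placeholder; unspecified options keep library defaults.
training_args = TrainingArguments(
    output_dir="mbert-yiddish-experiment-3",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    weight_decay=0.0,
    max_grad_norm=1.0,
    warmup_steps=300,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    seed=42,
)
```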

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 11.143        | 0.2364 | 100  | 7.6591          |
| 4.1737        | 0.4728 | 200  | 2.2642          |
| 2.0579        | 0.7092 | 300  | 1.7710          |
| 1.6963        | 0.9456 | 400  | 1.6712          |
| 1.5705        | 1.1820 | 500  | 1.6379          |
| 1.5353        | 1.4184 | 600  | 1.6003          |
| 1.5213        | 1.6548 | 700  | 1.5273          |
| 1.4387        | 1.8913 | 800  | 1.5415          |
| 1.3973        | 2.1277 | 900  | 1.5530          |
| 1.4266        | 2.3641 | 1000 | 1.5328          |
| 1.3365        | 2.6005 | 1100 | 1.5154          |
| 1.4423        | 2.8369 | 1200 | 1.4662          |
| 1.3948        | 3.0733 | 1300 | 1.5041          |
| 1.3244        | 3.3097 | 1400 | 1.4530          |
| 1.3645        | 3.5461 | 1500 | 1.4656          |
| 1.329         | 3.7825 | 1600 | 1.4542          |
| 1.3326        | 4.0189 | 1700 | 1.5293          |
| 1.2768        | 4.2553 | 1800 | 1.4575          |
| 1.3125        | 4.4917 | 1900 | 1.4638          |
| 1.2925        | 4.7281 | 2000 | 1.4867          |
| 1.281         | 4.9645 | 2100 | 1.4827          |
| 1.2966        | 5.2009 | 2200 | 1.4359          |
| 1.28          | 5.4374 | 2300 | 1.4761          |
| 1.2436        | 5.6738 | 2400 | 1.5006          |
| 1.2787        | 5.9102 | 2500 | 1.4511          |
| 1.2344        | 6.1466 | 2600 | 1.4430          |
| 1.199         | 6.3830 | 2700 | 1.4254          |
| 1.2899        | 6.6194 | 2800 | 1.4339          |
| 1.2637        | 6.8558 | 2900 | 1.4609          |
| 1.2186        | 7.0922 | 3000 | 1.4300          |
| 1.181         | 7.3286 | 3100 | 1.4407          |
| 1.2815        | 7.5650 | 3200 | 1.4471          |
| 1.2161        | 7.8014 | 3300 | 1.4413          |
| 1.1562        | 8.0378 | 3400 | 1.4695          |
| 1.1668        | 8.2742 | 3500 | 1.4940          |
| 1.2557        | 8.5106 | 3600 | 1.4430          |
| 1.1985        | 8.7470 | 3700 | 1.4562          |
| 1.2051        | 8.9835 | 3800 | 1.4412          |
| 1.1588        | 9.2199 | 3900 | 1.4421          |
| 1.2002        | 9.4563 | 4000 | 1.4477          |
| 1.2339        | 9.6927 | 4100 | 1.4573          |
| 1.1918        | 9.9291 | 4200 | 1.4463          |

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1+cu121
  • Datasets 3.1.0
  • Tokenizers 0.21.0
