first

This model is a fine-tuned version of longformer-gottbert-base-8192-aw512- on the a 500 million token subset of the german parts of the OSCAR dataset. It achieves the following results on the custom evaluation set:

Loss: 1.4981

Model description

The weights of the model are initialized from the german version of Roberta gottbert-base. The local attention windows have a fixed size of 512 tokens across all layers. The maximum sequence length is 8192.

Intended uses & limitations

Longformer models enable processing long texts using a mixture of local attention on each subword token and task specific global attention on a subset of the tokens.

Training and evaluation data

The OSCAR dataset is freely avaible corpus of filtered web texts from the Common Crawl in various languages. We used the 2017 version of the dataset.

Training procedure

The model was trained with masked language modeling for 3 epochs on a customly created 500 million tokens subset of the german proportion of the OSCAR dataset. It was validated using 5% of the original subset.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 2
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 3.0
mixed_precision_training: Native AMP

Training results

Training Loss	Epoch	Step	Validation Loss
2.5636	0.1	500	2.2399
2.0426	0.2	1000	1.8841
1.9653	0.3	1500	1.7807
1.9422	0.4	2000	1.7206
1.9323	0.49	2500	1.6800
1.7587	0.59	3000	1.6507
1.7239	0.69	3500	1.6316
1.7452	0.79	4000	1.6137
1.7415	0.89	4500	1.5983
1.7733	0.99	5000	1.5830
1.7656	1.09	5500	1.5735
1.6543	1.19	6000	1.5643
1.7131	1.28	6500	1.5546
1.6456	1.38	7000	1.5503
1.716	1.48	7500	1.5422
1.806	1.58	8000	1.5377
1.8407	1.68	8500	1.5327
1.6371	1.78	9000	1.5278
1.6453	1.88	9500	1.5231
1.7754	1.98	10000	1.5214
1.7695	2.08	10500	1.5165
1.7109	2.17	11000	1.5138
1.6992	2.27	11500	1.5107
1.6707	2.37	12000	1.5097
1.6835	2.47	12500	1.5040
1.7171	2.57	13000	1.5041
1.7257	2.67	13500	1.4990
1.6287	2.77	14000	1.5017
1.7737	2.87	14500	1.4983
1.4002	2.96	15000	1.4992

Framework versions

Transformers 4.15.0
Pytorch 1.10.1+cu113
Datasets 1.17.0
Tokenizers 0.10.3

LennartKeller
/

longformer-gottbert-base-8192-aw512