DTAEC Type Normalizer

This model is trained from scratch to normalize historic German spelling to contemporary orthography. It is type-based: it takes a single token (without whitespace) as input and generates its normalized variant. It achieves the following results on the evaluation set:

Loss: 0.0308
Wordacc: 0.9546
Wordacc Oov: 0.9096

Note: This model is part of a larger system, which uses an additional GPT-based model to disambiguate different normalization forms by taking in the full context. See https://github.com/aehrm/hybrid_textnorm.

Training and evaluation data

The model was trained on the DTA-EC Parallel Corpus Lexicon (aehrm/dtaec-lexica), which is derived from a parallel corpus of the Deutsche Textarchiv (German Text Archive) that aligns historic prints of documents with their modern editions in contemporary orthography.

Training was done at the type level: given the historic form of a type, the model must predict the normalized type that most frequently corresponds to it in the parallel corpus.
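To illustrate the type-level setup, the following is a minimal sketch of how such training targets could be derived from aligned token pairs. The function name `build_type_targets` and the toy pairs are hypothetical and are not taken from the actual preprocessing code or the corpus; the sketch only shows the idea of picking the most frequent normalization per historic type.

```python
from collections import Counter, defaultdict


def build_type_targets(aligned_tokens):
    """Map each historic type to its most frequent normalized type.

    `aligned_tokens` is an iterable of (historic, normalized) pairs,
    one pair per aligned token occurrence in a parallel corpus.
    """
    counts = defaultdict(Counter)
    for historic, normalized in aligned_tokens:
        counts[historic][normalized] += 1
    # most_common(1) yields the highest-count normalization;
    # ties fall back to the form that was encountered first.
    return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}


# Toy data, invented for illustration only.
pairs = [
    ("seyn", "sein"),
    ("seyn", "sein"),
    ("seyn", "seien"),
    ("Freyheit", "Freiheit"),
]
targets = build_type_targets(pairs)
print(targets)
# {'seyn': 'sein', 'Freyheit': 'Freiheit'}
```

Collapsing each historic type to a single target loses the less frequent readings (here 'seien'), which is why a type-based model alone cannot resolve context-dependent cases.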

Demo Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('aehrm/dtaec-type-normalizer')
model = AutoModelForSeq2SeqLM.from_pretrained('aehrm/dtaec-type-normalizer')

# Note: you CANNOT normalize full sentences, only word for word!
model_in = tokenizer(['Freyheit', 'seyn', 'ſelbstthätig'], return_tensors='pt', padding=True)
model_out = model.generate(**model_in)

print(tokenizer.batch_decode(model_out, skip_special_tokens=True))
# >>> ['Freiheit', 'sein', 'selbsttätig']

Or, more compactly, using the Hugging Face pipeline:

from transformers import pipeline

pipe = pipeline(model="aehrm/dtaec-type-normalizer")
out = pipe(['Freyheit', 'seyn', 'ſelbstthätig'])

print(out)
# >>> [{'generated_text': 'Freiheit'}, {'generated_text': 'sein'}, {'generated_text': 'selbsttätig'}]
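Because the model only accepts single tokens, running text has to be split into tokens first and normalized type by type. The following is a rough sketch of that loop, not part of the model's API: `normalize_tokens` and `toy_normalizer` are hypothetical names, and the toy lexicon stands in for the model so that the example runs without downloading weights.

```python
def normalize_tokens(tokens, normalize_types):
    """Normalize a token sequence, querying each distinct type only once.

    `normalize_types` is any callable that takes a list of types and
    returns the normalized types in the same order (for example, a thin
    wrapper around the model or pipeline).
    """
    types = list(dict.fromkeys(tokens))  # distinct types, order preserved
    normalized = dict(zip(types, normalize_types(types)))
    return [normalized[tok] for tok in tokens]


# Stand-in normalizer for illustration; swap in a call to the model.
toy_lexicon = {"Freyheit": "Freiheit", "seyn": "sein"}


def toy_normalizer(types):
    return [toy_lexicon.get(t, t) for t in types]


tokens = "Freyheit soll seyn".split()
print(normalize_tokens(tokens, toy_normalizer))
# ['Freiheit', 'soll', 'sein']
```

With the pipeline from the example above, a wrapper such as `lambda types: [o['generated_text'] for o in pipe(types)]` could be passed as `normalize_types`. Splitting on whitespace is a simplification; punctuation would need to be separated from the tokens beforehand.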

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 64
seed: 12345
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 20
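
As a rough illustration of what a linear schedule means for these settings, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (no warmup value is listed above) and takes the total step count of 252560 from the training results; the function name `linear_lr` is hypothetical.

```python
def linear_lr(step, base_lr=1e-4, total_steps=252_560, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero.

    warmup_steps=0 is an assumption; the listed hyperparameters
    do not mention warmup.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, at the end of epoch 10, and at the end of training.
for step in (0, 126_280, 252_560):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the rate starts at 1e-4, has halved by the end of epoch 10, and reaches zero at the final step.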

Training results

Training Loss	Epoch	Step	Validation Loss	Wordacc	Wordacc Oov	Gen Len
0.0912	1.0	12628	0.0698	0.8984	0.8421	12.3456
0.0746	2.0	25256	0.0570	0.9124	0.8584	12.3442
0.0622	3.0	37884	0.0493	0.9195	0.8717	12.3512
0.0584	4.0	50512	0.0465	0.9221	0.8749	12.3440
0.0497	5.0	63140	0.0436	0.9274	0.8821	12.3552
0.0502	6.0	75768	0.0411	0.9311	0.8858	12.3519
0.0428	7.0	88396	0.0396	0.9336	0.8878	12.3444
0.0416	8.0	101024	0.0372	0.9339	0.8887	12.3471
0.0420	9.0	113652	0.0365	0.9396	0.8944	12.3485
0.0376	10.0	126280	0.0353	0.9412	0.8962	12.3485
0.0310	11.0	138908	0.0339	0.9439	0.9008	12.3519
0.0298	12.0	151536	0.0337	0.9454	0.9013	12.3479
0.0302	13.0	164164	0.0322	0.9470	0.9043	12.3483
0.0277	14.0	176792	0.0316	0.9479	0.9040	12.3506
0.0277	15.0	189420	0.0323	0.9488	0.9030	12.3514
0.0245	16.0	202048	0.0314	0.9513	0.9072	12.3501
0.0235	17.0	214676	0.0313	0.9520	0.9071	12.3511
0.0206	18.0	227304	0.0310	0.9531	0.9084	12.3502
0.0178	19.0	239932	0.0307	0.9545	0.9094	12.3507
0.0160	20.0	252560	0.0308	0.9546	0.9096	12.3516
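
The metric definitions are not spelled out here. The sketch below shows one plausible reading, offered as an assumption: Wordacc as the share of exact matches between predicted and reference types, and Wordacc Oov as the same measure restricted to historic types that did not occur in the training data. The function names and toy data are hypothetical.

```python
def word_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return float("nan")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def oov_word_accuracy(inputs, predictions, references, train_vocab):
    """Word accuracy over inputs that are absent from the training vocabulary."""
    rows = [
        (p, r)
        for x, p, r in zip(inputs, predictions, references)
        if x not in train_vocab
    ]
    return word_accuracy([p for p, _ in rows], [r for _, r in rows])


# Toy data, invented for illustration only.
inputs = ["seyn", "Freyheit", "Thür", "Theil"]
predictions = ["sein", "Freiheit", "Tür", "Theil"]
references = ["sein", "Freiheit", "Tür", "Teil"]
train_vocab = {"seyn", "Freyheit"}

print(word_accuracy(predictions, references))
# 0.75
print(oov_word_accuracy(inputs, predictions, references, train_vocab))
# 0.5
```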

Framework versions

Transformers 4.41.2
Pytorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1

License

The model weights are marked with CC0 1.0 Universal.

NOTE: This model and its inferences or derivative works may be considered an Adaptation of:

the DTA EvalCorpus by Bryan Jurish, Henriette Ast, Marko Drotschmann, and Christian Thomas, licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License,
historical source text by the Deutsche Textarchiv, licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License,
contemporary target text by TextGrid, licensed under the Creative Commons Attribution 3.0 Unported License,
contemporary target text by Project Gutenberg, licensed under the Project Gutenberg License.

Conditions on attribution and/or restrictions to commercial use may apply.
