	---
	datasets:
	- aehrm/dtaec-lexica
	language: de
	pipeline_tag: translation
	model-index:
	- name: aehrm/dtaec-type-normalizer
	  results:
	  - task:
	      name: Historic Text Normalization (type-level)
	      type: translation
	    dataset:
	      name: DTA EvalCorpus Lexicon
	      type: aehrm/dtaec-lexicon
	      split: dev
	    metrics:
	    - name: Word Accuracy
	      type: accuracy
	      value: 0.9546
	    - name: Word Accuracy OOV
	      type: accuracy
	      value: 0.9096
	license: cc0-1.0
	---

	# DTAEC Type Normalizer

	This model is trained from scratch to normalize historical German spelling to contemporary orthography. It is type-based: it takes a single token (without whitespace) as input and generates its normalized variant.
	It achieves the following results on the evaluation set:
	- Loss: 0.0308
	- Wordacc: 0.9546
	- Wordacc Oov: 0.9096
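
For orientation, the two accuracy figures can be read as exact-match rates over types. The sketch below is not the evaluation code used for this model; it assumes that "OOV" refers to historic types that were not seen during training, and the names `word_accuracy`, `oov_word_accuracy`, and `train_vocab` are hypothetical.

```python
def word_accuracy(predictions, references):
    """Fraction of predicted types that exactly match the reference type."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def oov_word_accuracy(sources, predictions, references, train_vocab):
    """Word accuracy restricted to source types absent from `train_vocab`.

    Assumption: OOV means "historic type not seen during training".
    """
    oov = [(p, r) for s, p, r in zip(sources, predictions, references)
           if s not in train_vocab]
    if not oov:
        return 0.0
    return word_accuracy([p for p, _ in oov], [r for _, r in oov])


# Toy data, made up for illustration only.
sources = ["Freyheit", "seyn", "Thür", "vnd"]
predictions = ["Freiheit", "sein", "Thür", "und"]
references = ["Freiheit", "sein", "Tür", "und"]
train_vocab = {"Freyheit", "seyn"}  # hypothetical set of types seen in training

print(word_accuracy(predictions, references))  # 0.75
print(oov_word_accuracy(sources, predictions, references, train_vocab))  # 0.5
```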

	Note: This model is part of a larger system, which uses an additional GPT-based model to disambiguate different normalization forms by taking in the full context. See <https://github.com/aehrm/hybrid_textnorm>.

	## Training and evaluation data

	The model has been trained on the DTA-EC Parallel Corpus Lexicon ([aehrm/dtaec-lexica](https://huggingface.co/datasets/aehrm/dtaec-lexicon)), which is derived from a [parallel corpus](https://kaskade.dwds.de/~moocow/software/dtaec/) of the Deutsche Textarchiv (German Text Archive) that aligns historic prints of documents with their modern editions in contemporary orthography.

	Training was done at the type level: given the historic form of a type, the model must predict the corresponding normalized type that appeared most frequently in the parallel corpus.
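
To illustrate what a "most frequent normalization" target looks like, here is a minimal sketch that derives type-level training pairs from token-aligned data. It is not the preprocessing code behind the published lexicon; the function name `build_type_lexicon` and the toy alignment are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_type_lexicon(aligned_pairs):
    """Map each historic type to its most frequent normalized type.

    `aligned_pairs` is an iterable of (historic_token, normalized_token)
    tuples, one per aligned token occurrence in a parallel corpus.
    """
    counts = defaultdict(Counter)
    for historic, normalized in aligned_pairs:
        counts[historic][normalized] += 1
    # most_common(1) returns the single highest-count normalization.
    return {h: c.most_common(1)[0][0] for h, c in counts.items()}


# Toy alignment, made up for illustration only.
pairs = [
    ("seyn", "sein"),
    ("seyn", "sein"),
    ("seyn", "seien"),
    ("Freyheit", "Freiheit"),
]
lexicon = build_type_lexicon(pairs)
print(lexicon)  # {'seyn': 'sein', 'Freyheit': 'Freiheit'}
```

Because only the most frequent target is kept, a historic type with several valid normalizations (here `seyn`) loses its minority readings at this stage.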

	## Demo Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained('aehrm/dtaec-type-normalizer')
	model = AutoModelForSeq2SeqLM.from_pretrained('aehrm/dtaec-type-normalizer')

	# Note: you CANNOT normalize full sentences, only word for word!
	model_in = tokenizer(['Freyheit', 'seyn', 'ſelbstthätig'], return_tensors='pt', padding=True)
	model_out = model.generate(**model_in)

	print(tokenizer.batch_decode(model_out, skip_special_tokens=True))
	# >>> ['Freiheit', 'sein', 'selbsttätig']
	```

	Or, more compactly, using the Hugging Face `pipeline`:

	```python
	from transformers import pipeline

	pipe = pipeline(model="aehrm/dtaec-type-normalizer")
	out = pipe(['Freyheit', 'seyn', 'ſelbstthätig'])

	print(out)
	# >>> [{'generated_text': 'Freiheit'}, {'generated_text': 'sein'}, {'generated_text': 'selbsttätig'}]
	```
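
Since the model only accepts single tokens, running text has to be split into tokens first and reassembled afterwards. The following is a minimal sketch of that wrapper, assuming the text is already tokenized. `normalize_tokens` and `toy_normalizer` are hypothetical names; the toy lookup table merely stands in for the model so the example runs without downloading weights.

```python
def normalize_tokens(tokens, normalize_types):
    """Normalize a token sequence by normalizing each distinct type once.

    `normalize_types` is any callable that takes a list of types and
    returns a list of normalized types in the same order.
    """
    types = sorted(set(tokens))
    normalized = normalize_types(types)
    mapping = dict(zip(types, normalized))
    return [mapping[t] for t in tokens]


def toy_normalizer(types):
    """Stand-in for the model: a lookup table with pass-through fallback."""
    table = {"Freyheit": "Freiheit", "seyn": "sein"}
    return [table.get(t, t) for t in types]


tokens = ["Freyheit", "und", "Freyheit", "seyn"]
print(normalize_tokens(tokens, toy_normalizer))
# ['Freiheit', 'und', 'Freiheit', 'sein']
```

To use the actual model, `normalize_types` could wrap the `tokenizer(...)`, `model.generate(...)`, and `tokenizer.batch_decode(...)` calls from the first example. Punctuation handling and tokenization are left open here and would need to match how the input text was prepared.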


	## Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0001
	- train_batch_size: 8
	- eval_batch_size: 64
	- seed: 12345
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 20
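
As a rough illustration of what `lr_scheduler_type: linear` implies, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (the card lists none) and takes the total step count from the final row of the training results table; `linear_lr` is a hypothetical helper, not part of the training code.

```python
def linear_lr(step, base_lr=1e-4, total_steps=252_560, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero.

    Assumptions: warmup_steps=0 and total_steps=252_560 (20 epochs of
    12,628 steps each, as reported in the training results).
    """
    if warmup_steps and step < warmup_steps:
        return base_lr * (step / warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


print(linear_lr(0))        # 0.0001
print(linear_lr(126_280))  # 5e-05
print(linear_lr(252_560))  # 0.0
```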

	## Training results

	| Training Loss | Epoch | Step   | Validation Loss | Wordacc | Wordacc Oov | Gen Len |
	|:-------------:|:-----:|:------:|:---------------:|:-------:|:-----------:|:-------:|
	| 0.0912        | 1.0   | 12628  | 0.0698          | 0.8984  | 0.8421      | 12.3456 |
	| 0.0746        | 2.0   | 25256  | 0.0570          | 0.9124  | 0.8584      | 12.3442 |
	| 0.0622        | 3.0   | 37884  | 0.0493          | 0.9195  | 0.8717      | 12.3512 |
	| 0.0584        | 4.0   | 50512  | 0.0465          | 0.9221  | 0.8749      | 12.3440 |
	| 0.0497        | 5.0   | 63140  | 0.0436          | 0.9274  | 0.8821      | 12.3552 |
	| 0.0502        | 6.0   | 75768  | 0.0411          | 0.9311  | 0.8858      | 12.3519 |
	| 0.0428        | 7.0   | 88396  | 0.0396          | 0.9336  | 0.8878      | 12.3444 |
	| 0.0416        | 8.0   | 101024 | 0.0372          | 0.9339  | 0.8887      | 12.3471 |
	| 0.042         | 9.0   | 113652 | 0.0365          | 0.9396  | 0.8944      | 12.3485 |
	| 0.0376        | 10.0  | 126280 | 0.0353          | 0.9412  | 0.8962      | 12.3485 |
	| 0.031         | 11.0  | 138908 | 0.0339          | 0.9439  | 0.9008      | 12.3519 |
	| 0.0298        | 12.0  | 151536 | 0.0337          | 0.9454  | 0.9013      | 12.3479 |
	| 0.0302        | 13.0  | 164164 | 0.0322          | 0.9470  | 0.9043      | 12.3483 |
	| 0.0277        | 14.0  | 176792 | 0.0316          | 0.9479  | 0.9040      | 12.3506 |
	| 0.0277        | 15.0  | 189420 | 0.0323          | 0.9488  | 0.9030      | 12.3514 |
	| 0.0245        | 16.0  | 202048 | 0.0314          | 0.9513  | 0.9072      | 12.3501 |
	| 0.0235        | 17.0  | 214676 | 0.0313          | 0.9520  | 0.9071      | 12.3511 |
	| 0.0206        | 18.0  | 227304 | 0.0310          | 0.9531  | 0.9084      | 12.3502 |
	| 0.0178        | 19.0  | 239932 | 0.0307          | 0.9545  | 0.9094      | 12.3507 |
	| 0.016         | 20.0  | 252560 | 0.0308          | 0.9546  | 0.9096      | 12.3516 |


	### Framework versions

	- Transformers 4.41.2
	- Pytorch 2.3.0+cu121
	- Datasets 2.19.1
	- Tokenizers 0.19.1

	### License

	The model weights are marked with [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/?ref=chooser-v1).

	NOTE: This model and its inferences or derivative works may be considered an Adaptation of
	- the DTA EvalCorpus by Bryan Jurish, Henriette Ast, Marko Drotschmann, and Christian Thomas, licensed under the [Creative Commons Attribution-NonCommercial 3.0 Unported License](http://creativecommons.org/licenses/by-nc/3.0/),
	- historical source text by the Deutsche Textarchiv, licensed under the [Creative Commons Attribution-NonCommercial 3.0 Unported License](http://creativecommons.org/licenses/by-nc/3.0/),
	- contemporary target text by TextGrid, licensed under the [Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/),
	- contemporary target text by Project Gutenberg, licensed under the [Project Gutenberg License](https://www.gutenberg.org/policy/license.html).

	Conditions on attribution and/or restrictions to commercial use may apply.