yhavinga/gpt2-medium-dutch-nedd

	---
	language: nl
	widget:
	- text: "In het jaar 2030 zullen we"
	- text: "Toen ik gisteren volledig in de ban was van"
	- text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
	- text: "In Israël was een strenge lockdown"
	tags:
	- gpt2-medium
	- gpt2
	pipeline_tag: text-generation
	datasets:
	- yhavinga/mc4_nl_cleaned
	---
	# GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

	Datasets:

	* [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned), dataset config: full (33B tokens)
	* A recreation of the Toronto BookCorpus (TBC) for the Dutch language (see e.g.
	https://github.com/sgraaf/Replicate-Toronto-BookCorpus)

	Tokenizer:

	* Tokenizer trained on mC4 with scripts from the Hugging Face
	Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
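
As a rough illustration of what training such a tokenizer can look like, the sketch below trains a byte-level BPE tokenizer from an iterator of texts with the `tokenizers` library. It is not the script used for this model: the function names, the vocabulary size (GPT-2's default of 50257), `min_frequency`, the special token, and the output path are all assumptions, since the card states none of them.

```python
from typing import Iterable, Iterator, List


def batch_iterator(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` texts (hypothetical helper)."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def train_tokenizer(texts: Iterable[str],
                    vocab_size: int = 50257,          # assumption: GPT-2 default
                    out_path: str = "tokenizer.json"):  # assumption: output file
    """Train a byte-level BPE tokenizer on `texts` and save it to `out_path`."""
    # Imported lazily so batch_iterator works without the `tokenizers` package.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        batch_iterator(texts),
        vocab_size=vocab_size,
        min_frequency=2,                     # assumption
        special_tokens=["<|endoftext|>"],    # assumption: GPT-2 style end-of-text token
    )
    tokenizer.save(out_path)
    return tokenizer
```

In practice `texts` would be an iterator over the documents of the training corpus rather than an in-memory list.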

	Training details:

	* Trained for 320k steps (30 Dec 2021)
	* Block size: 512
	* Optimizer: Adam, lr 8e-4, beta1 0.9, beta2 0.98
	* Warmup steps: 5000
	* Weight decay: 0.01
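
To make the numbers above concrete, here is a minimal sketch of how the learning rate could evolve over training. The card does not say which schedule was used; the sketch assumes a linear warmup from 0 to the peak rate over the 5000 warmup steps, followed by a linear decay to 0 at step 320k. The function name and constants are illustrative only.

```python
PEAK_LR = 8e-4          # from the card
WARMUP_STEPS = 5_000    # from the card
TOTAL_STEPS = 320_000   # from the card


def learning_rate(step: int) -> float:
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp up linearly from 0 to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from PEAK_LR to 0 over the remaining steps.
    remaining = TOTAL_STEPS - step
    return max(0.0, PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS))


for step in (0, 2_500, 5_000, 162_500, 320_000):
    print(step, learning_rate(step))
```

Under this assumed schedule the rate reaches its peak exactly when warmup ends and is back at half the peak midway through the decay phase (step 162,500).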

	Further fine-tuned on a Dutch book corpus.
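
A minimal sketch of generating text with the model through the `transformers` text-generation pipeline follows. The repository id is the one shown for this card and the example prompt in the usage note is one of its widget texts. The function name and the sampling settings (`max_new_tokens`, `top_k`, `top_p`) are illustrative assumptions, not recommendations from the author, and the sketch assumes the repository ships weights that the installed backend can load.

```python
MODEL_ID = "yhavinga/gpt2-medium-dutch-nedd"  # repository id of this model card


def generate_dutch(prompt: str, max_new_tokens: int = 50) -> str:
    """Return a sampled continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,   # sample instead of greedy decoding (assumption)
        top_k=50,         # assumption
        top_p=0.95,       # assumption
    )
    # The pipeline returns a list with one dict per generated sequence.
    return outputs[0]["generated_text"]
```

Calling `generate_dutch("In het jaar 2030 zullen we")` downloads the model on first use and returns the prompt followed by the generated continuation. Because sampling is enabled, the output differs between runs.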

	Work in progress, Dec 2021 to Jan 2022.

	* Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
	* Thanks to @gsarti for creating the [t5-flax-gcp
	repository](https://github.com/gsarti/t5-flax-gcp).
	* Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
	[gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
	for sharing their training scripts!