|
---
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
---
|
|
|
# BLOOM-CLP German (6.4B parameters) |
|
|
|
This is a monolingual German language model trained with the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method, using [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1) as the source model.
|
|
|
You can try out the model at [European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/20825/try%20out/). |
|
|
|
## How to use
|
|
|
You can use this model directly with a pipeline for text generation. Since the model is monolingual German, prompts should be written in German. Because generation relies on sampling, we set a seed for reproducibility (the prompt below is illustrative, and the exact completions will vary):
|
|
|
```python |
|
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='malteos/bloom-6b4-clp-german')
>>> set_seed(42)
>>> generator("Berlin ist die Hauptstadt von", max_length=30, num_return_sequences=3)
[{'generated_text': '...'}, {'generated_text': '...'}, {'generated_text': '...'}]  # three German completions; exact text depends on your environment
|
``` |
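If you need more control over generation, you can also load the tokenizer and model through the generic auto classes. The following is a minimal sketch, not part of the original card; the `float16` dtype and the sampling parameters are assumptions chosen to keep the 6.4B-parameter model manageable, so adjust them to your hardware:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "malteos/bloom-6b4-clp-german"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# float16 roughly halves the memory footprint compared to float32.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "Berlin ist die Hauptstadt von"  # illustrative German prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,  # sample instead of greedy decoding
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```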
|
|
|
|
|
## Training dataset |
|
|
|
- ca. 50B German tokens |
|
- Web-crawled content from the German subset of [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
|
- Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts) |
|
- Both web-crawled datasets were deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
|
- German court decisions from [Open Legal Data](http://openlegaldata.io/) |
|
|
|
## Code |
|
|
|
- [BigScience's Megatron-DeepSpeed fork](https://github.com/bigscience-workshop/Megatron-DeepSpeed)
|
|
|
## Hardware |
|
|
|
- 32x A100-40GB GPUs

- 12.5 days of training
|
- [Tensorboard logs](https://huggingface.co/malteos/bloom-6b4-clp-german-logs/tensorboard) |
|
|
|
## Evaluation |
|
|
|
Validation perplexity (PPL) compared to training from scratch (lower is better):
|
|
|
<img alt="Tokens vs PPL" src="https://github.com/malteos/clp-transfer/raw/main/german-6b-ppl.png"> |
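
Perplexity is the exponential of the average token-level cross-entropy. As a rough illustration (this is not the original evaluation code, and the sample text is a placeholder rather than the validation set), it can be computed with `transformers` like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "malteos/bloom-6b4-clp-german"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Placeholder text; the reported numbers were computed on a held-out validation set.
text = "Berlin ist die Hauptstadt der Bundesrepublik Deutschland."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy
    # over the predicted tokens; PPL is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"PPL: {torch.exp(loss).item():.2f}")
```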
|
|
|
Additional evaluations can be found in [our paper](https://arxiv.org/abs/2301.09626). |
|
|
|
## How to cite |
|
|
|
If you are using our code or models, please cite [our paper](https://arxiv.org/abs/2301.09626): |
|
|
|
```bibtex
@misc{Ostendorff2023clp,
  doi = {10.48550/ARXIV.2301.09626},
  author = {Ostendorff, Malte and Rehm, Georg},
  title = {Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning},
  publisher = {arXiv},
  year = {2023}
}
```
|
|
|
## License |
|
|
|
[BigScience BLOOM RAIL 1.0](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) |
|
|