metadata
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
BLOOM-CLP German (6.4B parameters)
This is a monolingual German language model derived from BLOOM-7b1 with the CLP-Transfer method.
You can try out the model on the European Language Grid.
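For local use, a minimal loading sketch with the `transformers` library (listed as `library_name` in the metadata) might look like the following. The helper name `generate_german` and the default of 50 new tokens are illustrative assumptions. The Hub repository ID is not stated in this card, so it is left as a required argument rather than guessed.

```python
def generate_german(model_id: str, prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation for `prompt` with a causal LM checkpoint.

    `model_id` is a placeholder for the Hub ID or local path of this model;
    it is not given in the card and must be supplied by the caller.
    """
    # Imported lazily so that defining the helper does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize the prompt into PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Default decoding settings; sampling options can be passed to generate().
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_german("<model-id-or-path>", "Die Hauptstadt von Deutschland ist")` would return the prompt followed by the generated continuation. At 4 bytes per parameter, 6.4B parameters occupy roughly 26 GB in float32, so loading in half precision or on a suitably large GPU is likely needed in practice.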
Training dataset
- Approx. 50B German tokens
- Web-crawled content from the German subset of OSCAR v22.01 (excluding content tagged as header, footer, noisy, or adult)
- Web-crawled content from the GC4 Corpus (including only the head and middle parts)
- Both web-crawled datasets were deduplicated with Google's suffix array implementation
- German court decisions from Open Legal Data
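The tag-based exclusion applied to the OSCAR subset can be sketched as a simple filter. This is not the authors' preprocessing code. It assumes each record stores its tags as a list (or `None`) under `metadata.annotation`, and the sample records are hypothetical stand-ins for lines of a corpus shard.

```python
# Tags that the card lists as excluded from the OSCAR v22.01 subset.
EXCLUDED_TAGS = {"header", "footer", "noisy", "adult"}


def keep_document(record: dict) -> bool:
    """Return True if the record carries none of the excluded annotation tags."""
    # Assumed layout: record["metadata"]["annotation"] is a list of tags or None.
    annotations = record.get("metadata", {}).get("annotation") or []
    return EXCLUDED_TAGS.isdisjoint(annotations)


# Hypothetical in-memory records standing in for lines of a JSONL shard.
records = [
    {"content": "Ein sauberer Absatz.", "metadata": {"annotation": None}},
    {"content": "Impressum | Kontakt", "metadata": {"annotation": ["footer"]}},
    {"content": "Kurzer Text.", "metadata": {"annotation": ["short_sentences"]}},
    {"content": "%%% ### !!!", "metadata": {"annotation": ["noisy", "header"]}},
]

kept = [r["content"] for r in records if keep_document(r)]
print(kept)
```

Records with other tags, such as `short_sentences` in the sample, pass the filter because the card only names four excluded tags.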
Code
Hardware
- 32xA100-40GB GPUs
- 12.5 days
- TensorBoard logs
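The figures above imply a compute budget of 9,600 GPU-hours. The short calculation below reproduces that number and adds a rough throughput estimate. The throughput part assumes that each of the roughly 50B tokens was processed exactly once, which the card does not state, so treat it as an order-of-magnitude figure only.

```python
NUM_GPUS = 32           # from the card: 32xA100-40GB
TRAINING_DAYS = 12.5    # from the card
DATASET_TOKENS = 50e9   # from the card: approx. 50B German tokens

gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
training_seconds = TRAINING_DAYS * 24 * 3600

# Assumption (not stated in the card): each token was processed exactly once.
tokens_per_second = DATASET_TOKENS / training_seconds
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

print(f"GPU-hours: {gpu_hours:.0f}")
print(f"Tokens/s (all GPUs): {tokens_per_second:.0f}")
print(f"Tokens/s per GPU: {tokens_per_second_per_gpu:.0f}")
```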
Evaluation
TBA (see paper)