---
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
---

# BLOOM-CLP German (6.4B parameters)

This is a monolingual German language model trained with the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method, starting from the multilingual [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1) model.

You can try out the model at [European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/20825/try%20out/).

<span style="color:blue">UPDATE: We recently released an instruction-tuned version of this model: [malteos/bloom-6b4-clp-german-oasst-v0.1](https://huggingface.co/malteos/bloom-6b4-clp-german-oasst-v0.1)</span>.
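
The [CLP-Transfer](https://arxiv.org/abs/2301.09626) initialization, as we read it from the paper, works roughly as follows: embeddings of tokens shared between the BLOOM tokenizer and the German tokenizer are copied from BLOOM-7b1, and embeddings of German-only tokens are built as similarity-weighted combinations of those copied embeddings, with similarities computed in the embedding space of a smaller model that already uses the German tokenizer. The toy sketch below only illustrates this idea; the function name, shapes, and softmax weighting are our own assumptions, not the released training code.

```python
# Toy sketch of CLP-style embedding initialization (illustrative assumptions only).
# `src_emb`:    input embedding matrix of the large source model (e.g. BLOOM-7b1)
# `helper_emb`: embedding matrix of a small model that already uses the German tokenizer
# `src_vocab` / `tgt_vocab`: dicts mapping token string -> row index
import numpy as np

def clp_init_embeddings(src_emb, src_vocab, helper_emb, tgt_vocab):
    tgt_emb = np.zeros((len(tgt_vocab), src_emb.shape[1]), dtype=src_emb.dtype)
    shared = [t for t in tgt_vocab if t in src_vocab]

    # 1) Tokens present in both vocabularies: copy the source embeddings.
    for tok in shared:
        tgt_emb[tgt_vocab[tok]] = src_emb[src_vocab[tok]]

    # 2) German-only tokens: weighted combination of the copied embeddings,
    #    with weights derived from similarities in the helper model's embedding space.
    shared_helper = np.stack([helper_emb[tgt_vocab[t]] for t in shared])
    shared_src = np.stack([src_emb[src_vocab[t]] for t in shared])
    for tok, idx in tgt_vocab.items():
        if tok in src_vocab:
            continue
        sims = shared_helper @ helper_emb[idx]      # similarity to every shared token
        weights = np.exp(sims - sims.max())
        weights /= weights.sum()                    # softmax over shared tokens
        tgt_emb[idx] = weights @ shared_src         # combine in the source space
    return tgt_emb
```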

### How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we
set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='malteos/bloom-6b4-clp-german')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3)

[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
 {'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
 {'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},]
```
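
Since the model is monolingual German, prompting in German is the more natural use. The snippet below is a minimal sketch of loading the model directly with `AutoTokenizer` and `AutoModelForCausalLM`; the prompt and generation settings are only examples.

```python
# Minimal sketch: direct loading without the pipeline helper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "malteos/bloom-6b4-clp-german"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# For GPU inference, consider passing torch_dtype=torch.float16 to reduce memory.
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Die Hauptstadt von Deutschland ist"  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```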


## Training dataset

- Approx. 50B German tokens in total
- Web-crawled content from the German subset of [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult); see the loading sketch after this list
- Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
- Both web-crawled datasets were deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
- German court decisions from [Open Legal Data](http://openlegaldata.io/)
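
To get a feel for the web-crawled portion of the training data, the snippet below streams a German OSCAR subset with the `datasets` library. Note the assumption: it uses the older public `oscar` configuration `unshuffled_deduplicated_de` for illustration, not the annotated OSCAR v22.01 release that was actually used for training.

```python
# Sketch: stream a German OSCAR subset (older public release, illustration only;
# recent versions of `datasets` may additionally require trust_remote_code=True).
from datasets import load_dataset

oscar_de = load_dataset("oscar", "unshuffled_deduplicated_de",
                        split="train", streaming=True)

for i, example in enumerate(oscar_de):
    print(example["text"][:200])  # peek at the raw web text
    if i >= 2:
        break
```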

## Code

- [BigScience's Megatron-Deepspeed fork](https://github.com/bigscience-workshop/Megatron-DeepSpeed)

## Hardware

- 32x NVIDIA A100-40GB GPUs
- 12.5 days of training
- [Tensorboard logs](https://huggingface.co/malteos/bloom-6b4-clp-german-logs/tensorboard)

## Evaluation

Validation perplexity (PPL) compared to from-scratch training (lower is better):

<img alt="Tokens vs PPL" src="https://github.com/malteos/clp-transfer/raw/main/german-6b-ppl.png">

Additional evaluations can be found in [our paper](https://arxiv.org/abs/2301.09626).
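
For reference, validation perplexity can be computed roughly as below. This is a generic per-document sketch with made-up example sentences, not the validation split or evaluation code behind the plot above.

```python
# Rough perplexity sketch (illustrative texts, not the paper's validation data).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "malteos/bloom-6b4-clp-german"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

texts = [
    "Berlin ist die Hauptstadt Deutschlands.",
    "Das Gericht wies die Klage ab.",
]

total_loss, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])  # returns mean cross-entropy
        n = enc["input_ids"].size(1) - 1             # loss is averaged over shifted tokens
        total_loss += out.loss.item() * n
        total_tokens += n

print("Validation-style PPL:", math.exp(total_loss / total_tokens))
```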

## How to cite

If you are using our code or models, please cite [our paper](https://arxiv.org/abs/2301.09626):

```bibtex
@misc{Ostendorff2023clp,
  doi = {10.48550/ARXIV.2301.09626},
  author = {Ostendorff, Malte and Rehm, Georg},
  title = {Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning},
  publisher = {arXiv},
  year = {2023}
}
```

## License

[BigScience BLOOM RAIL 1.0](https://bigscience.huggingface.co/blog/the-bigscience-rail-license)