---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---
# BübleLM
BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.
## Model Details
- **Architecture**: Based on Gemma-2B
- **Parameters**: 2 billion
- **Training**: Trans-tokenization from Gemma-2B using a German SentencePiece tokenizer (vocabulary size: 20k); see the tokenizer sketch after this list
- **Context Length**: Same as Gemma-2B
- **Input**: Text (German)
- **Output**: Text (German)
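
The trans-tokenization step relies on a dedicated 20k-vocabulary German SentencePiece tokenizer. As a rough illustration (not the released training configuration; the corpus path and options below are assumptions), such a tokenizer could be trained like this:

```python
# Sketch: training a 20k-vocabulary German SentencePiece tokenizer.
# The corpus path and most options are illustrative assumptions,
# not the exact configuration used for BübleLM.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="german_corpus.txt",      # hypothetical plain-text German corpus
    model_prefix="bueble_de",       # writes bueble_de.model / bueble_de.vocab
    vocab_size=20000,               # matches the 20k vocabulary noted above
    model_type="unigram",           # assumption: SentencePiece default
    character_coverage=0.9995,      # common setting for European languages
)

sp = spm.SentencePieceProcessor(model_file="bueble_de.model")
print(sp.encode("Guten Morgen, Berlin!", out_type=str))
```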
## Training Data
Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources
Data sampling weights (a toy weighted-sampling sketch follows this list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
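
As a toy illustration of how these weights translate into sampling, the sketch below draws documents from each source group in proportion to its weight; the source groupings and helper function are hypothetical and do not reproduce the actual data pipeline:

```python
# Toy sketch: weighted sampling across source groups (Wikipedia 4x,
# news/parliamentary 2x, everything else 1x). Source contents and the
# helper function are hypothetical placeholders.
import random

sources = {
    "wikipedia": {"weight": 4, "docs": ["<wikipedia article>"]},
    "news_parliamentary": {"weight": 2, "docs": ["<tagesschau article>", "<parlamint speech>"]},
    "other": {"weight": 1, "docs": ["<oscar web page>", "<eurlex document>"]},
}

def sample_documents(sources, n, seed=0):
    """Draw n documents, choosing each source group proportionally to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [sources[name]["weight"] for name in names]
    picks = rng.choices(names, weights=weights, k=n)
    return [rng.choice(sources[name]["docs"]) for name in picks]

print(sample_documents(sources, n=5))
```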
## Performance
Key improvements over Gemma-2B baseline:
- HellaSwag-DE: +71% relative (47.9% vs. 28.0%)
- ARC-DE: +41% relative (32.3% vs. 22.9%)
- Average zero-shot: +40% relative (35.8% vs. 25.5%)
## Safety & Ethics
### Toxicity
- Toxicity score of 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution than for the Gemma-2B baseline
### Gender Bias
- Evaluated using perplexity differences between traditional (generic-masculine) and gender-inclusive forms; a minimal sketch of this comparison follows below
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)
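
The comparison described above can be reproduced in spirit as a perplexity difference between paired sentences; the example sentences below are assumptions and do not reproduce the evaluation set behind the reported ∆PPL:

```python
# Sketch: perplexity difference between a generic-masculine sentence and its
# gender-inclusive counterpart. Example sentences are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b", device_map="auto", torch_dtype=torch.bfloat16
)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of a single sentence under the causal LM."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

traditional = "Die Lehrer treffen sich morgen im Konferenzraum."
inclusive = "Die Lehrer*innen treffen sich morgen im Konferenzraum."

delta = perplexity(inclusive) - perplexity(traditional)
print(f"Delta PPL (inclusive - traditional): {delta:.2f}")
```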
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build the prompt from the chat template and generate on the model's device
messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
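
For quick experiments, the generic `transformers` text-generation pipeline should also work; the sampling parameters below are illustrative defaults rather than settings tuned for this model:

```python
# Alternative: high-level text-generation pipeline.
# Sampling parameters are illustrative, not tuned for BübleLM.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="flair/bueble-lm-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator(
    "Schreibe ein Gedicht über Berlin.",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```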
## Limitations
- Limited vocabulary size (20k tokens) compared to multilingual models
- Performance may vary on specialized domains not well-represented in training data
- Inherits the base limitations of the Gemma-2B architecture