---
language:
  - de
tags:
  - german
  - causal-lm
  - text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM

[Image: BübleLM logo]

BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B-parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.

## Model Details

- **Architecture**: Based on Gemma-2B
- **Parameters**: 2 billion
- **Training**: Trans-tokenization from Gemma-2B using a German SentencePiece tokenizer (vocab size: 20k); see the sketch after this list
- **Context Length**: Same as Gemma-2B
- **Input**: Text (German)
- **Output**: Text (German)
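
Trans-tokenization swaps in the new tokenizer and warm-starts the input embeddings of the new vocabulary from the old ones before continued training. A minimal sketch of that initialization step, assuming each new German token comes with weighted alignments to Gemma tokens; the `alignment` mapping and `init_target_embeddings` are illustrative, not the actual training code:

```python
import torch

def init_target_embeddings(
    source_emb: torch.Tensor,                       # (source_vocab, dim)
    alignment: dict[int, list[tuple[int, float]]],  # tgt id -> [(src id, weight)]
    target_vocab_size: int,
) -> torch.Tensor:
    """Warm-start new-vocabulary embeddings as weighted averages of the
    source-token embeddings they align to."""
    target_emb = torch.zeros(target_vocab_size, source_emb.shape[1])
    for tgt_id, pairs in alignment.items():
        total = sum(weight for _, weight in pairs)
        for src_id, weight in pairs:
            target_emb[tgt_id] += (weight / total) * source_emb[src_id]
    return target_emb
```

The warm-started model is then trained further on German text (the 3.5B tokens described below) so the new embeddings and the transformer body adapt to each other.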

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (see the sketch after this list):

- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
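
One way to read these factors is as relative probabilities for drawing training documents from each source. A minimal illustration, assuming the weights are applied at the document level; the source names and helper are placeholders, not the actual data pipeline:

```python
import random

# Up-weighting factors from the card; source names are placeholders.
SOURCE_WEIGHTS = {"wikipedia": 4, "news_parliamentary": 2, "other": 1}

def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n corpus sources proportionally to their sampling weights."""
    rng = random.Random(seed)
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

With a 4:2:1 ratio, a Wikipedia document is drawn four times as often as an equally sized document from the remaining sources.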

## Performance

[INSERT FIGURE: Performance comparison across models]

Key improvements over the Gemma-2B baseline (relative gains):

- HellaSwag-DE: +71% (47.9% vs. 28.0%)
- ARC-DE: +41% (32.3% vs. 22.9%)
- Average zero-shot: +40% (35.8% vs. 25.5%)

## Safety & Ethics

### Toxicity

- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution to the model than it does to the Gemma-2B baseline

### Gender Bias

- Evaluated using perplexity differences between traditional and gender-inclusive forms (see the sketch after this list)
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs. "Lehrer*innen" (∆PPL = -9.61)
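
A minimal sketch of such a perplexity probe, assuming ∆PPL = PPL(inclusive form) − PPL(traditional form), so that negative values favor the inclusive form; the sentence pair is made up for illustration and this is not the evaluation code behind the reported numbers:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean token cross-entropy) of the text under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # labels are shifted internally
    return torch.exp(loss).item()

# Hypothetical sentence pair; a negative ∆PPL favors the inclusive form.
delta_ppl = (perplexity("Die Lehrer*innen planen den Unterricht.")
             - perplexity("Die Lehrer planen den Unterricht."))
print(f"∆PPL = {delta_ppl:.2f}")
```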

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build a chat-formatted prompt and move it to the model's device.
messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
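
For quick experiments, the high-level `pipeline` API is a common alternative; a sketch, with generation settings that are illustrative rather than the card's recommendation:

```python
from transformers import pipeline

# Plain-prompt generation; this skips the chat template above.
generator = pipeline("text-generation", model="flair/bueble-lm-2b", device_map="auto")
result = generator("Schreibe ein Gedicht über Berlin.", max_new_tokens=256)
print(result[0]["generated_text"])
```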

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models
- Performance may vary on specialized domains that are under-represented in the training data
- Inherits the base limitations of the Gemma architecture

## Citation