---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM
[Figure: BübleLM logo]
BübleLM is a German language model based on Gemma-2B, adapted via trans-tokenization with a German-specific SentencePiece tokenizer. This 2B-parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.

## Model Details

- **Architecture**: Based on Gemma-2B
- **Parameters**: 2 billion
- **Training**: Trans-tokenization from Gemma-2B using a German SentencePiece tokenizer (vocab size: 20k); see the embedding-transfer sketch in the appendix below
- **Context Length**: Same as Gemma-2B
- **Input**: Text (German)
- **Output**: Text (German)

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlaMint)
- News data (Tagesschau)
- Wiki sources

Data sampling weights:

- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x

## Performance

[INSERT FIGURE: Performance comparison across models]

Key improvements over the Gemma-2B baseline (a zero-shot evaluation setup is sketched in the appendix below):

| Benchmark | BübleLM | Gemma-2B | Relative improvement |
|---|---|---|---|
| HellaSwag-DE | 47.9% | 28.0% | +71% |
| ARC-DE | 32.3% | 22.9% | +41% |
| Zero-shot average | 35.8% | 25.5% | +40% |

## Safety & Ethics

### Toxicity

- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution than for the baseline model

### Gender Bias

- Evaluated using perplexity differences between traditional and gender-inclusive forms (see the ΔPPL sketch in the appendix below)
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs. "Lehrer*innen" (ΔPPL = -9.61)

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# "Write a poem about Berlin."
messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models
- Performance may vary on specialized domains that are not well represented in the training data
- The model inherits the base limitations of the Gemma architecture

## Citation

```bibtex
```
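## Appendix: Code Sketches

The snippets below illustrate techniques referenced in this card. They are minimal, hedged sketches, not the actual training or evaluation code used for BübleLM.

### Trans-tokenization embedding transfer

Trans-tokenization swaps the tokenizer and re-initializes the embedding matrix for the new vocabulary instead of training from scratch: each new German token's embedding starts as a weighted average of the source-model embeddings it aligns to. In the sketch below, the `alignment` mapping is a hypothetical input standing in for the token alignment derived from parallel data; only the averaging step is shown.

```python
import torch

def init_target_embeddings(
    source_emb: torch.Tensor,                       # (source_vocab, dim)
    alignment: dict[int, list[tuple[int, float]]],  # target id -> [(source id, weight), ...]
    target_vocab_size: int,
) -> torch.Tensor:
    """Initialize target-vocab embeddings as weighted averages of the
    aligned source-token embeddings (the core idea of trans-tokenization)."""
    dim = source_emb.shape[1]
    target_emb = torch.zeros(target_vocab_size, dim)
    for tgt_id, pairs in alignment.items():
        total = sum(weight for _, weight in pairs)
        for src_id, weight in pairs:
            # Normalize alignment weights so each new embedding is a convex
            # combination of the source embeddings it aligns to.
            target_emb[tgt_id] += (weight / total) * source_emb[src_id]
    return target_emb
```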
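### Measuring ΔPPL between gendered forms

The gender-bias probe compares perplexities of minimal pairs such as "Lehrer" vs. "Lehrer*innen". Below is a minimal sketch of one way to compute such a ΔPPL with this model; the probe sentences and the sign convention (negative means the gender-inclusive form is preferred, matching the reported interpretation) are assumptions, not the exact evaluation protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b")
model.eval()

def perplexity(text: str) -> float:
    """Standard causal-LM perplexity: exp of the mean per-token NLL."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical minimal pair: "The teachers discuss the curriculum."
traditional = "Die Lehrer besprechen den Lehrplan."
inclusive = "Die Lehrer*innen besprechen den Lehrplan."

# Negative ΔPPL: the gender-inclusive form is assigned higher probability.
delta = perplexity(inclusive) - perplexity(traditional)
print(f"ΔPPL = {delta:.2f}")
```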
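### Reproducing the zero-shot numbers

The zero-shot scores in the Performance section can in principle be reproduced with EleutherAI's lm-evaluation-harness. A sketch using its Python API follows; the German task names (`hellaswag_de`, `arc_de`) are assumptions and may differ depending on your harness version.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=flair/bueble-lm-2b,dtype=bfloat16",
    tasks=["hellaswag_de", "arc_de"],  # hypothetical task names
    num_fewshot=0,  # zero-shot, matching the numbers reported above
)
print(results["results"])
```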