Commit bf6ea27 (verified) by alanakbik · Parent: 8eda4a6

Update README.md

Files changed (1): README.md (+52 −26)
 
---
language:
- de
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- nlp
- custom_code
- german
---

# Boldt-DC-1B

<img src="logo.png" width="500" alt="Boldt Logo">

**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes four models:

- [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M)
- **Boldt-DC-1B** *(this model)*
- [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
- [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

### Repetition over Diversity

The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.

Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters (sketched in code below the list):

- **Coherence:** Eliminates structurally fragmented or incoherent documents.
- **Information Value:** Isolates content-rich and fact-bearing texts.
- **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.
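
A minimal, illustrative version of such a filter cascade is shown below; the scoring functions and thresholds are hypothetical placeholders, not the actual Dense-Core classifiers, which are described in the preprint.

```python
# Illustrative sketch: a document is kept only if it passes all three stages.
from typing import Callable, Iterable, List

def dense_core_filter(
    documents: Iterable[str],
    score_coherence: Callable[[str], float],           # hypothetical classifier
    score_information_value: Callable[[str], float],    # hypothetical classifier
    score_educational_quality: Callable[[str], float],  # hypothetical classifier
    thresholds: tuple = (0.5, 0.5, 0.5),                # placeholder cutoffs
) -> List[str]:
    t_coh, t_inf, t_edu = thresholds
    kept = []
    for doc in documents:
        if score_coherence(doc) < t_coh:            # drop fragmented or incoherent text
            continue
        if score_information_value(doc) < t_inf:    # drop content-poor text
            continue
        if score_educational_quality(doc) < t_edu:  # drop text without explanatory depth
            continue
        kept.append(doc)
    return kept
```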

We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).

**Boldt-DC-1B** is the 1-billion-parameter model built on this methodology, trained for multiple epochs on the Dense-Core subset for a total of 200B tokens; a minimal sketch of such a multi-epoch setup follows.
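
The sketch below uses the Hugging Face `Trainer` only to illustrate the idea of repeating epochs over a fixed, pre-filtered corpus; the dataset file, hyperparameters, and the choice to continue from the released checkpoint are placeholder assumptions rather than the actual training recipe.

```python
# Illustrative sketch only: repeat epochs over a small, filtered corpus
# instead of a single pass over a huge one. All names and values are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Boldt/Boldt-DC-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical local text file standing in for the Dense-Core subset.
dataset = load_dataset("text", data_files={"train": "dense_core_sample.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="boldt-multi-epoch",
    num_train_epochs=4,              # repeated exposure to the same high-quality data
    per_device_train_batch_size=8,   # placeholder value
    learning_rate=3e-4,              # placeholder value
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```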

## Model Architecture

- **Parameters:** ~1 billion
- **Context Window:** 2048 tokens
- **Training Data:** German Dense-Core subset of FineWeb-2 (200B tokens)
- **Language:** German
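
These figures can be checked directly against the published configuration and weights; a short, illustrative snippet (the config attribute name is the one most causal LM configs use and is an assumption here):

```python
# Illustrative check of context window and parameter count.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Boldt/Boldt-DC-1B")
# Most causal LM configs expose the context window as `max_position_embeddings`.
print("context window:", getattr(config, "max_position_embeddings", "unknown"))

model = AutoModelForCausalLM.from_pretrained("Boldt/Boldt-DC-1B")
print("parameters:", f"{sum(p.numel() for p in model.parameters()):,}")
```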

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Boldt/Boldt-DC-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example prompt (illustrative); the generation call follows the card's original snippet.
inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
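
Equivalently, the high-level `pipeline` API can be used for a quick test (the prompt is illustrative):

```python
from transformers import pipeline

# Text-generation pipeline wrapping the same checkpoint.
generator = pipeline("text-generation", model="Boldt/Boldt-DC-1B")
print(generator("Die drei größten Städte Deutschlands sind", max_new_tokens=32)[0]["generated_text"])
```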
 
 

## Evaluation

![Boldt performance comparison](boldt_1b_evaluation.png)

We evaluate Boldt-DC-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). It comprises the German subset of [Global MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) and updated translations of widely used English benchmarks, produced with [Tower+ 72B](https://huggingface.co/Unbabel/Tower-Plus-72B); see our paper [(Aynetdinov et al., 2026)](https://arxiv.org/abs/2604.28075) for details on the structural and translation corrections we performed.

Despite being trained on substantially fewer tokens, Boldt-DC-1B outperforms other 1B-class models capable of German and performs competitively with much larger multilingual models.

### 1B Weight Class (Direct Comparison)

*Note: Bold text indicates the best score in the 1B category.*

| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
| **[Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B)** *(this model)* | 200B | 31.06 | **35.99** | **57.30** | 48.69 | 42.80 | 48.48 | 44.05 |
| [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B) | 230B | **31.42** | 34.11 | 55.78 | **48.77** | 44.70 | **52.32** | **44.52** |
| [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
| [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
| [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |

### 1.7B - 2B Weight Class (Larger Reference Models)

| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | **50.50** | 42.94 |
| [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | **34.17** | **37.49** | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
| [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
| [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | **57.47** | **49.62** | **52.64** | 48.89 | **46.62** |
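
The Avg. column in both tables is the unweighted mean of the six task scores; a quick arithmetic check against the Boldt-DC-1B row:

```python
# Boldt-DC-1B scores in the order MMLU, ARC-C, ARC-E, H-Swag, LAMBADA, OBQA.
scores = [31.06, 35.99, 57.30, 48.69, 42.80, 48.48]
print(round(sum(scores) / len(scores), 2))  # 44.05, matching the Avg. column
```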

## Safety & Ethics

We have not conducted systematic model evaluations of toxicity, demographic bias, …

```
…
  url={https://arxiv.org/abs/2604.28075},
}
```