---
license: mit
metrics:
- accuracy
---

# BabyLM 2025 GPT-2 with BPE Tokenizer (Strict Small Track)

## Model Description

This is a GPT-2 language model trained by adapting the baseline model built for the **BabyLM 2025 Challenge**.

- **Developed by:** NeTS Lab
- **Model type:** Autoregressive Language Model (GPT-2 architecture)
- **Language(s):** Italian
- **License:** MIT
- **Parent Model:** GPT-2
- **Tokenizer:** BPE

## Key Features

- **Strict data constraints:** 3M-word child-directed speech corpus
- **Optimized for data efficiency:** trained with the default BabyLM 2025 baseline hyperparameters
- **768-dimensional embeddings** with 12 attention heads and 12 layers

## Model Details

### Architecture
- **Base Architecture:** GPT-2 (12 layers, 12 attention heads)
- **Hidden Size:** 768
- **Vocabulary Size:** ~16K
- **Context Length:** 1,024 tokens
- **Parameters:** ~104M (estimated; a verification sketch follows this list)

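The parameter count above is an estimate; a quick way to check it is to load the released checkpoint and sum the sizes of its parameter tensors. This is a minimal sketch, assuming the repository id used in the Usage section below.

```python
from transformers import GPT2LMHeadModel

# Load the released checkpoint (repository id taken from the Usage section below).
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Count all parameters; tied input/output embeddings are counted once.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```
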
### Training Configuration
- **Training Type:** Strict (BabyLM 2025 guidelines)
- **Dataset Size:** 3M words maximum
- **Sequence Length:** 512 tokens
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Training Steps:** 200,000
- **Warmup Steps:** 2,000
- **Epochs:** 10
- **Weight Decay:** 0.0
- **Gradient Clipping:** 1.0

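For orientation, the hyperparameters above map onto a Hugging Face `TrainingArguments` object roughly as sketched below. This is not the original training script: the output directory, the logging backend, and the use of `Trainer` itself are assumptions; the numeric values are taken from the list above.

```python
from transformers import TrainingArguments

# Sketch of the configuration implied by the list above; output_dir and
# report_to are assumptions (the card mentions Weights & Biases monitoring).
training_args = TrainingArguments(
    output_dir="babylm_ita-bpe-3m-gpt2",  # hypothetical output path
    per_device_train_batch_size=16,       # Batch Size
    learning_rate=5e-5,                   # Learning Rate
    max_steps=200_000,                    # Training Steps (takes precedence over epochs)
    warmup_steps=2_000,                   # Warmup Steps
    num_train_epochs=10,                  # Epochs
    weight_decay=0.0,                     # Weight Decay
    max_grad_norm=1.0,                    # Gradient Clipping
    report_to="wandb",                    # Weights & Biases integration (see below)
)
# The 512-token sequence length is applied when the corpus is tokenized and
# chunked, not via TrainingArguments.
```
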
## Training Data

The model was trained on a small Italian dataset (Fusco et al. 2024), which includes:
- **Size:** 3M words maximum
- **Sources:** Child-directed speech and age-appropriate text
- **Language:** Italian

## Intended Uses

### Primary Use Cases
- **Research** into data-efficient language modeling
- **Comparative studies** of tokenization methods in low-resource settings
- **Baseline model** for BabyLM 2025 Challenge participants

### Out-of-Scope Uses
- **Production deployments** requiring robust, general-purpose language understanding
- **Safety-critical applications**
- **Tasks requiring knowledge beyond the training data scope**

## Performance

The model was trained following BabyLM 2025 Challenge protocols:
- **Training loss:** 2.51947 (see the conversion note below)
- **Convergence:** Achieved after 200,000 training steps

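If the reported training loss is the usual mean per-token cross-entropy in nats (an assumption; the card does not state the units), it corresponds to a training perplexity of roughly exp(2.51947) ≈ 12.4:

```python
import math

# Assumption: the reported loss is mean per-token cross-entropy in nats.
train_loss = 2.51947
print(f"training perplexity ≈ {math.exp(train_loss):.2f}")  # ≈ 12.42
```
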
## Usage

### Loading the Model

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Generate text
input_text = "Il bambino gioca con"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(
    inputs,
    max_length=50,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid a warning
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Text Generation Parameters
- **Max Length:** 50 tokens (as used in the example above)
- **Sampling:** Enabled in the example (`do_sample=True`)
- **Temperature:** Adjustable (0.8 recommended)

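For quick experiments, the same checkpoint can also be driven through the `transformers` text-generation pipeline. This is a minimal sketch equivalent to the loading example above, using the same repository id and decoding settings.

```python
from transformers import pipeline

# Text-generation pipeline over the same checkpoint as in the example above
generator = pipeline("text-generation", model="NeTS-lab/babylm_ita-bpe-3m-gpt2")

result = generator(
    "Il bambino gioca con",
    max_length=50,
    do_sample=True,
    temperature=0.8,
)
print(result[0]["generated_text"])
```
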
## Limitations and Biases

### Known Limitations
- **Limited training data** (3M words) may result in knowledge gaps
- **Domain specificity** due to child-directed speech focus
- **Context window** limited to 1,024 tokens

### Potential Biases
- **Age-appropriate content bias** from training data selection
- **Italian language bias** (monolingual training)
- **Morphological bias** toward Indo-European language patterns

## Technical Specifications

### Training Infrastructure
- **Framework:** PyTorch + Transformers
- **Precision:** float32
- **Gradient Accumulation:** Used to reach the target effective batch size
- **Monitoring:** Weights & Biases integration

### Model Configuration
```json
{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 16384
}
```

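The JSON above corresponds to a standard `GPT2Config`. As a sketch, an equivalent configuration can be rebuilt in Python as follows; the context length appears as `n_ctx` in the JSON and is passed as `n_positions` here, and fields not listed keep their library defaults.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Rebuild the configuration shown above; unlisted fields keep transformers defaults.
config = GPT2Config(
    vocab_size=16384,
    n_positions=1024,        # the JSON's "n_ctx"
    n_embd=768,
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",
    attn_pdrop=0.1,
    embd_pdrop=0.1,
    layer_norm_epsilon=1e-5,
)
model = GPT2LMHeadModel(config)  # randomly initialised model with this architecture
```
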
## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{fusco-etal-2024-recurrent,
    title = "Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-{LM} Training on Child-Directed Speech in {I}talian",
    author = "Fusco, Achille and
      Barbini, Matilde and
      Piccini Bianchessi, Maria Letizia and
      Bressan, Veronica and
      Neri, Sofia and
      Rossi, Sarah and
      Sgrizzi, Tommaso and
      Chesi, Cristiano",
    editor = "Dell'Orletta, Felice and
      Lenci, Alessandro and
      Montemagni, Simonetta and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.46/",
    pages = "382--389",
    ISBN = "979-12-210-7060-6"
}
```

## Acknowledgments

- **BabyLM 2025 Challenge** organizers for providing the framework
- **Hugging Face Transformers** team for the modeling infrastructure

## Contact

For questions about this model or the training process, please contact cristiano.chesi@iusspavia.it.

---

*This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.*