---
license: mit
metrics:
- accuracy
---

# BabyLM 2025 GPT-2 with BPE Tokenizer (Strict Small Track)

## Model Description

This is a GPT-2 language model trained by adapting the baseline model built for the **BabyLM 2025 Challenge**.

- **Developed by:** NeTS Lab
- **Model type:** Autoregressive Language Model (GPT-2 architecture)
- **Language(s):** Italian
- **License:** MIT
- **Parent Model:** GPT-2
- **Tokenizer:** BPE

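As a quick check of the tokenizer shipped with this repository, the sketch below (the Italian example sentence is arbitrary, chosen only for illustration) loads it and prints the vocabulary size and the subword segmentation of a short phrase.

```python
from transformers import GPT2Tokenizer

# Load the BPE tokenizer released with this model
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Vocabulary size (expected to be in the ~16K range, see Model Details below)
print(len(tokenizer))

# Subword segmentation of a short Italian phrase
print(tokenizer.tokenize("Il bambino gioca con la palla"))
```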
## Key Features

- **Strict data constraints:** 3M-word child-directed speech corpus
- **Optimized for data efficiency:** trained with the default BabyLM 2025 baseline hyperparameters
- **768-dimensional embeddings** with 12 attention heads and 12 layers

## Model Details

### Architecture
- **Base Architecture:** GPT-2 (12 layers, 12 attention heads)
- **Hidden Size:** 768
- **Vocabulary Size:** ~16K
- **Context Length:** 1,024 tokens
- **Parameters:** ~104M (estimated)

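The parameter count above is an estimate; assuming the checkpoint loads as a standard `GPT2LMHeadModel` (as the configuration below indicates), it can be verified directly:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Count all parameters in the checkpoint
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```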
### Training Configuration
- **Training Type:** Strict (BabyLM 2025 guidelines)
- **Dataset Size:** 3M words maximum
- **Sequence Length:** 512 tokens
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Training Steps:** 200,000
- **Warmup Steps:** 2,000
- **Epochs:** 10
- **Weight Decay:** 0.0
- **Gradient Clipping:** 1.0

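For reference, these hyperparameters map onto the standard `transformers.TrainingArguments` API roughly as sketched below. This is an illustrative reconstruction, not the original training script.

```python
from transformers import TrainingArguments

# Rough mapping of the hyperparameters listed above onto the Trainer API;
# illustrative only, not the original training configuration file.
training_args = TrainingArguments(
    output_dir="./babylm_ita-bpe-3m-gpt2",
    per_device_train_batch_size=16,  # Batch Size
    learning_rate=5e-5,              # Learning Rate
    max_steps=200_000,               # Training Steps
    warmup_steps=2_000,              # Warmup Steps
    num_train_epochs=10,             # Epochs (max_steps takes precedence when set)
    weight_decay=0.0,                # Weight Decay
    max_grad_norm=1.0,               # Gradient Clipping
)
```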
## Training Data

The model was trained on a small Italian dataset (Fusco et al., 2024), which includes:
- **Size:** 3M words maximum
- **Sources:** Child-directed speech and age-appropriate text
- **Language:** Italian

## Intended Uses

### Primary Use Cases
- **Research** into data-efficient language modeling
- **Comparative studies** of tokenization methods in low-resource settings
- **Baseline model** for BabyLM 2025 Challenge participants

### Out-of-Scope Uses
- **Production deployments** requiring robust, general-purpose language understanding
- **Safety-critical applications**
- **Tasks requiring knowledge beyond the training data scope**

## Performance

The model was trained following BabyLM 2025 Challenge protocols:
- **Training loss:** 2.51947
- **Convergence:** Achieved after 200,000 training steps

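Assuming the reported loss is the usual mean cross-entropy per token in nats (the `transformers` convention), it corresponds to a training perplexity of roughly 12.4:

```python
import math

# Perplexity = exp(cross-entropy loss in nats per token)
train_loss = 2.51947
print(math.exp(train_loss))  # ~12.42
```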
## Usage

### Loading the Model

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Generate text
input_text = "Il bambino gioca con"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
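The same generation can also be run through the `pipeline` helper, shown here as a convenience sketch with the same sampling settings:

```python
from transformers import pipeline

# Text-generation pipeline wrapping the same checkpoint
generator = pipeline("text-generation", model="NeTS-lab/babylm_ita-bpe-3m-gpt2")

result = generator("Il bambino gioca con", max_length=50, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
```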
### Text Generation Parameters
- **Max Length:** 50 tokens (as in the example above)
- **Sampling:** Enabled in the example (`do_sample=True`)
- **Temperature:** Adjustable (0.8 recommended)

## Limitations and Biases

### Known Limitations
- **Limited training data** (3M words) may result in knowledge gaps
- **Domain specificity** due to child-directed speech focus
- **Context window** limited to 1,024 tokens

### Potential Biases
- **Age-appropriate content bias** from training data selection
- **Italian language bias** (monolingual training)
- **Morphological bias** toward Indo-European language patterns

## Technical Specifications

### Training Infrastructure
- **Framework:** PyTorch + Transformers
- **Precision:** float32
- **Gradient Accumulation:** Configured for effective batch size
- **Monitoring:** Weights & Biases integration

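The exact gradient-accumulation factor is not reported here; the sketch below only illustrates how an effective batch size and W&B logging are typically wired into `TrainingArguments` (the accumulation value of 4 is a hypothetical placeholder, and float32 is the `Trainer` default when `fp16`/`bf16` are left unset).

```python
from transformers import TrainingArguments

# Illustrative only: the accumulation factor below is a placeholder,
# not the value actually used for this model.
args = TrainingArguments(
    output_dir="./babylm_ita-bpe-3m-gpt2",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch size = 16 * 4 = 64 (hypothetical)
    report_to="wandb",              # Weights & Biases monitoring
)
```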
### Model Configuration
```json
{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 16384
}
```
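If only the architecture is needed (for example, to train from scratch on a different corpus), the configuration above can be loaded on its own and used to build a randomly initialized model of the same shape:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Load just the configuration shown above
config = GPT2Config.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
print(config.n_layer, config.n_head, config.n_embd, config.vocab_size)

# Randomly initialized model with the same architecture (no pretrained weights)
model = GPT2LMHeadModel(config)
```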
## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{fusco-etal-2024-recurrent,
    title = "Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-{LM} Training on Child-Directed Speech in {I}talian",
    author = "Fusco, Achille and
      Barbini, Matilde and
      Piccini Bianchessi, Maria Letizia and
      Bressan, Veronica and
      Neri, Sofia and
      Rossi, Sarah and
      Sgrizzi, Tommaso and
      Chesi, Cristiano",
    editor = "Dell'Orletta, Felice and
      Lenci, Alessandro and
      Montemagni, Simonetta and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.46/",
    pages = "382--389",
    ISBN = "979-12-210-7060-6"
}
```
## Acknowledgments

- **BabyLM 2025 Challenge** organizers for providing the framework
- **Hugging Face Transformers** team for the modeling infrastructure

## Contact

For questions about this model or the training process, please contact [cristiano.chesi@iusspavia.it](mailto:cristiano.chesi@iusspavia.it).

---

*This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.*