---
license: mit
metrics:
- accuracy
---

# BabyLM 2025 GPT-2 with BPE Tokenizer (Strict Small Track)

## Model Description

This is a GPT-2 language model trained by adapting the baseline model built for the **BabyLM 2025 Challenge**.

- **Developed by:** NeTS Lab
- **Model type:** Autoregressive Language Model (GPT-2 architecture)
- **Language(s):** Italian
- **License:** MIT
- **Parent Model:** GPT-2
- **Tokenizer:** BPE

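As a quick check of the tokenizer shipped with this repository, the sketch below (the Italian example sentence is arbitrary, chosen only for illustration) loads it and prints the vocabulary size and the subword segmentation of a short phrase.

```python
from transformers import GPT2Tokenizer

# Load the BPE tokenizer released with this model
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Vocabulary size (expected to be in the ~16K range, see Model Details below)
print(len(tokenizer))

# Subword segmentation of a short Italian phrase
print(tokenizer.tokenize("Il bambino gioca con la palla"))
```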
## Key Features

- **Strict data constraints:** 3M-word child-directed speech corpus
- **Optimized for data efficiency:** trained with the default BabyLM 2025 baseline hyperparameters
- **768-dimensional embeddings** with 12 attention heads and 12 layers

## Model Details

### Architecture
- **Base Architecture:** GPT-2 (12 layers, 12 attention heads)
- **Hidden Size:** 768
- **Vocabulary Size:** ~16K
- **Context Length:** 1,024 tokens
- **Parameters:** ~104M (estimated)

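The parameter count above is an estimate; assuming the checkpoint loads as a standard `GPT2LMHeadModel` (as the configuration below indicates), it can be verified directly:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Count all parameters in the checkpoint
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```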
### Training Configuration
- **Training Type:** Strict (BabyLM 2025 guidelines)
- **Dataset Size:** 3M words maximum
- **Sequence Length:** 512 tokens
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Training Steps:** 200,000
- **Warmup Steps:** 2,000
- **Epochs:** 10
- **Weight Decay:** 0.0
- **Gradient Clipping:** 1.0

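For reference, these hyperparameters map onto the standard `transformers.TrainingArguments` API roughly as sketched below. This is an illustrative reconstruction, not the original training script.

```python
from transformers import TrainingArguments

# Rough mapping of the hyperparameters listed above onto the Trainer API;
# illustrative only, not the original training configuration file.
training_args = TrainingArguments(
    output_dir="./babylm_ita-bpe-3m-gpt2",
    per_device_train_batch_size=16,  # Batch Size
    learning_rate=5e-5,              # Learning Rate
    max_steps=200_000,               # Training Steps
    warmup_steps=2_000,              # Warmup Steps
    num_train_epochs=10,             # Epochs (max_steps takes precedence when set)
    weight_decay=0.0,                # Weight Decay
    max_grad_norm=1.0,               # Gradient Clipping
)
```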
## Training Data

The model was trained on a small Italian dataset (Fusco et al., 2024), which includes:
- **Size:** 3M words maximum
- **Sources:** Child-directed speech and age-appropriate text
- **Language:** Italian

## Intended Uses

### Primary Use Cases
- **Research** into data-efficient language modeling
- **Comparative studies** of tokenization methods in low-resource settings
- **Baseline model** for BabyLM 2025 Challenge participants

### Out-of-Scope Uses
- **Production deployments** requiring robust, general-purpose language understanding
- **Safety-critical applications**
- **Tasks requiring knowledge beyond the training data scope**

## Performance

The model was trained following BabyLM 2025 Challenge protocols:
- **Training loss:** 2.51947
- **Convergence:** Achieved after 200,000 training steps

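Assuming the reported loss is the usual mean cross-entropy per token in nats (the `transformers` convention), it corresponds to a training perplexity of roughly 12.4:

```python
import math

# Perplexity = exp(cross-entropy loss in nats per token)
train_loss = 2.51947
print(math.exp(train_loss))  # ~12.42
```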
## Usage

### Loading the Model

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Generate text
input_text = "Il bambino gioca con"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
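The same generation can also be run through the `pipeline` helper, shown here as a convenience sketch with the same sampling settings:

```python
from transformers import pipeline

# Text-generation pipeline wrapping the same checkpoint
generator = pipeline("text-generation", model="NeTS-lab/babylm_ita-bpe-3m-gpt2")

result = generator("Il bambino gioca con", max_length=50, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
```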
### Text Generation Parameters
- **Max Length:** 50 tokens (as in the example above)
- **Sampling:** Enabled in the example (`do_sample=True`)
- **Temperature:** Adjustable (0.8 recommended)

## Limitations and Biases

### Known Limitations
- **Limited training data** (3M words) may result in knowledge gaps
- **Domain specificity** due to child-directed speech focus
- **Context window** limited to 1,024 tokens

### Potential Biases
- **Age-appropriate content bias** from training data selection
- **Italian language bias** (monolingual training)
- **Morphological bias** toward Indo-European language patterns

## Technical Specifications

### Training Infrastructure
- **Framework:** PyTorch + Transformers
- **Precision:** float32
- **Gradient Accumulation:** Configured for effective batch size
- **Monitoring:** Weights & Biases integration

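The exact gradient-accumulation factor is not reported here; the sketch below only illustrates how an effective batch size and W&B logging are typically wired into `TrainingArguments` (the accumulation value of 4 is a hypothetical placeholder, and float32 is the `Trainer` default when `fp16`/`bf16` are left unset).

```python
from transformers import TrainingArguments

# Illustrative only: the accumulation factor below is a placeholder,
# not the value actually used for this model.
args = TrainingArguments(
    output_dir="./babylm_ita-bpe-3m-gpt2",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch size = 16 * 4 = 64 (hypothetical)
    report_to="wandb",              # Weights & Biases monitoring
)
```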
### Model Configuration
```json
{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 16384
}
```
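If only the architecture is needed (for example, to train from scratch on a different corpus), the configuration above can be loaded on its own and used to build a randomly initialized model of the same shape:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Load just the configuration shown above
config = GPT2Config.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
print(config.n_layer, config.n_head, config.n_embd, config.vocab_size)

# Randomly initialized model with the same architecture (no pretrained weights)
model = GPT2LMHeadModel(config)
```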
## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{fusco-etal-2024-recurrent,
    title = "Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-{LM} Training on Child-Directed Speech in {I}talian",
    author = "Fusco, Achille and
      Barbini, Matilde and
      Piccini Bianchessi, Maria Letizia and
      Bressan, Veronica and
      Neri, Sofia and
      Rossi, Sarah and
      Sgrizzi, Tommaso and
      Chesi, Cristiano",
    editor = "Dell'Orletta, Felice and
      Lenci, Alessandro and
      Montemagni, Simonetta and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.46/",
    pages = "382--389",
    ISBN = "979-12-210-7060-6"
}
```
## Acknowledgments

- **BabyLM 2025 Challenge** organizers for providing the framework
- **Hugging Face Transformers** team for the modeling infrastructure

## Contact

For questions about this model or the training process, please contact [cristiano.chesi@iusspavia.it](mailto:cristiano.chesi@iusspavia.it).

---

*This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.*