---
title: README
emoji: π
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---

# Pico: Tiny Language Models for Learning Dynamics Research

Pico consists of two key components:
1. **Pre-trained Model Suite** (hosted here on HuggingFace)
2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

## 🤗 HuggingFace Resources (You Are Here)

> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!

### Pre-trained Model Suite (Releasing January 2025)
The suite spans five model sizes, from 1M to 1B parameters:
- **pico-tiny** (1M parameters)
- **pico-small** (10M parameters)
- **pico-medium** (100M parameters)
- **pico-large** (500M parameters)
- **pico-xl** (1B parameters)

Each model includes:
- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores
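
Perplexity is conventionally the exponential of the mean negative log-likelihood per token. The snippet below is a minimal sketch of that standard definition, not the evaluation code used for the Pico models; the function name `perplexity` and the choice of natural-log inputs are assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token is "choosing" among 4 options.
uniform_log_probs = [math.log(0.25)] * 8
print(round(perplexity(uniform_log_probs), 6))  # 4.0
```

Lower values mean the model assigns higher probability to the evaluation text, which is why perplexity tracked across checkpoints gives a simple view of learning progress.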

### Available Datasets
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed text
   - Cleaned and shuffled Dolma corpus

2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
   - Smaller version for quick experiments

3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
   - Batch of eval data for generating model activations
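
Because the full corpus is large, streaming is one way to inspect a few records without downloading everything. The sketch below uses `load_dataset(..., streaming=True)` from the `datasets` library; the split name `"train"` and the helper names `stream_split` and `peek` are assumptions, and the record fields are not documented here, so the code avoids relying on any column name.

```python
from itertools import islice


def peek(examples, n=2):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))


def stream_split(repo_id, split="train"):
    """Open a dataset repository in streaming mode (split name is an assumption)."""
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def describe(example):
    """Map each field of a record to the name of its Python type."""
    return {key: type(value).__name__ for key, value in example.items()}
```

For example, `[describe(ex) for ex in peek(stream_split("pico-lm/pretokenized-dolma-tiny"))]` would show the fields of the first two records, provided the `datasets` package is installed and the split name matches.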

## 🔧 GitHub Training Framework

Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics

The training framework makes it easy to:
1. Train multiple models of different sizes
2. Ensure consistent training across all models
3. Save rich checkpoint data for learning dynamics analysis
4. Compare learning dynamics across scales

## 🛠️ Using the Resources

### Using Pre-trained Models (HuggingFace)
```python
from transformers import AutoModelForCausalLM

# Load a pre-trained model (default revision)
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Load a specific training checkpoint by revision
# ("step-xyz" is a placeholder for a checkpoint revision name)
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz"
)
```
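
To pick a `revision`, it helps to know which checkpoints a repository exposes. The sketch below assumes checkpoints are published as branches named `step-<number>`, which is a guess based on the `step-xyz` placeholder above and may not match the released layout. It relies on `list_repo_refs` from `huggingface_hub`; the helper names are hypothetical.

```python
import re

# Assumed naming scheme for checkpoint branches, e.g. "step-1000".
STEP_PATTERN = re.compile(r"^step-(\d+)$")


def sort_checkpoint_revisions(branch_names):
    """Keep names matching step-<number> and order them by training step."""
    steps = []
    for name in branch_names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def list_checkpoint_revisions(repo_id):
    """Fetch branch names from the Hub and return the checkpoint-like ones."""
    # Imported lazily so the sorting helper works without huggingface_hub.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return sort_checkpoint_revisions(branch.name for branch in refs.branches)
```

Each returned name could then be passed as `revision=` to `from_pretrained`, which makes it straightforward to loop over checkpoints when comparing a model at different points in training.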

### Training Your Own Suite (GitHub)
```bash
# Clone the repository and set up the environment
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml
```

## Model Details

### Architecture
All models (both pre-trained and self-trained) use:
- LLaMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function
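
For readers unfamiliar with these building blocks, here is a pure-Python sketch of RMSNorm, the SwiGLU gating step, and RoPE applied to a single vector. It follows the commonly published definitions and is not taken from the Pico codebase. The `eps` and `base` defaults, the interleaved pairing convention in `rope`, and the omission of the linear projections around SwiGLU are all assumptions made to keep the example short.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-dimension gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """SwiGLU gating step: SiLU(gate) multiplied elementwise with up.

    In a full feed-forward block, gate and up come from two linear
    projections of the input and the result feeds a third projection.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle position * base**(-i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out += [a * cos_t - b * sin_t, a * sin_t + b * cos_t]
    return out


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
rotated = rope([1.0, 2.0, 3.0, 4.0], position=7)
gated = swiglu(gate=[0.0, 1.0], up=[5.0, 2.0])
```

Because RoPE only rotates pairs of coordinates, it leaves the vector norm unchanged, and at position 0 it returns the input as-is. Both properties are handy sanity checks when inspecting saved activations.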

### Training Configuration
Standard configuration (customizable in the GitHub training framework):
- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training
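
As an illustration, the values above can be collected in a small config object. The sketch below is not the schema of `configs/train.yaml`: the class and field names are hypothetical, and it assumes that "Gradient clipping: 1.0" refers to a maximum global gradient norm, which is a common convention but is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the standard hyperparameters listed above."""

    batch_size: int = 1024
    learning_rate: float = 1e-3
    weight_decay: float = 0.1
    max_grad_norm: float = 1.0  # assumed meaning of "Gradient clipping: 1.0"
    mixed_precision: bool = True


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


config = TrainingConfig()
clipped, norm_before = clip_by_global_norm([3.0, 4.0], config.max_grad_norm)
print(norm_before)  # 5.0
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes on unusually large updates.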

## 🔬 Research Applications

Pico is designed for researchers studying:
- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors

Whether you use our pre-trained models or train your own suite, Pico provides the tools needed for in-depth learning dynamics research.

## 🤝 Contributing

Contributions are welcome on both platforms:
- **HuggingFace**: Model weights, datasets, and evaluation results
- **GitHub**: Training framework improvements, analysis tools, and documentation

## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## Citation

```bibtex
@software{pico2024,
    author = {Diehl Martinez, Richard},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}
```