---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---

# 📈 Pico: Tiny Language Models for Learning Dynamics Research

Pico consists of two key components:
1. **Pre-trained Model Suite** (hosted here on HuggingFace)
2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

## 🤗 HuggingFace Resources (You Are Here)

> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!

### Pre-trained Model Suite (Releasing January 2025)
Our complete suite of models from 1M to 1B parameters:
- **pico-tiny** (1M parameters) 
- **pico-small** (10M parameters)
- **pico-medium** (100M parameters)
- **pico-large** (500M parameters)
- **pico-xl** (1B parameters)

Each model includes:
- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores
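
Perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that definition in plain Python. The function name and the input format (a list of natural-log token probabilities) are assumptions for illustration and do not describe the format of the released evaluation scores.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood over tokens.

    `token_log_probs` holds natural-log probabilities that a model assigned
    to each observed token (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity 4
# (up to floating-point rounding).
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower perplexity means the model assigned higher probability to the evaluation text, so tracking it across checkpoints gives a simple view of training progress.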

### Available Datasets
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed text
   - Cleaned and shuffled Dolma corpus

2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
   - Smaller version for quick experiments

3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
   - Batch of eval data for generating model activations
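
A minimal sketch of inspecting one of these datasets without downloading it in full, using the streaming mode of the `datasets` library. The helper names are hypothetical, and the `"train"` split name is an assumption; check the dataset card for the actual splits and column names.

```python
from itertools import islice


def first_n(iterable, n):
    """Return a list with at most `n` leading items of any iterable."""
    return list(islice(iterable, n))


def stream_examples(dataset_name, num_examples, split="train"):
    """Return the first `num_examples` records of a Hub dataset via streaming.

    Assumes the `datasets` library is installed and that the dataset has a
    split with the given name ("train" is an assumption, not confirmed here).
    """
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(dataset_name, split=split, streaming=True)
    return first_n(stream, num_examples)
```

For example, `stream_examples("pico-lm/pretokenized-dolma-tiny", 2)` would fetch two records, and calling `.keys()` on one of them shows which columns the dataset provides.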

## 🔧 GitHub Training Framework

Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics

The training framework makes it easy to:
1. Train multiple models of different sizes
2. Ensure consistent training across all models
3. Save rich checkpoint data for learning dynamics analysis
4. Compare learning dynamics across scales

## 🛠️ Using the Resources

### Using Pre-trained Models (HuggingFace)
```python
from transformers import AutoModelForCausalLM

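# Load a pre-trained model (default revision)
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Load a specific training checkpoint; "step-xyz" is a placeholder for a
# revision (branch, tag, or commit) in the model repository
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz",
)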
```
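
Rather than guessing a revision name, the available revisions can be queried from the Hub. The sketch below uses `list_repo_refs` from `huggingface_hub`. It assumes checkpoints are published as branches named `step-<integer>`, which is inferred from the `step-xyz` placeholder and not confirmed; both helper names are hypothetical.

```python
import re


def list_branches(repo_id):
    """Return the branch names of a Hub model repository.

    Assumes `huggingface_hub` is installed and the repository is public.
    """
    # Imported lazily so this module loads without `huggingface_hub` installed.
    from huggingface_hub import list_repo_refs

    return [branch.name for branch in list_repo_refs(repo_id).branches]


def sort_step_revisions(names):
    """Keep names shaped like 'step-<integer>' and sort them by step number.

    The 'step-<integer>' pattern is an assumption; adjust the regex to the
    naming scheme the repository actually uses.
    """
    pattern = re.compile(r"^step-(\d+)$")
    steps = []
    for name in names:
        match = pattern.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]
```

Once the models are released, `sort_step_revisions(list_branches("pico-lm/pico-small"))` would give checkpoint names in training order, each usable as the `revision` argument of `from_pretrained`.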

### Training Your Own Suite (GitHub)
```bash
# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml
```

## 📊 Model Details

### Architecture
All models (both pre-trained and self-trained) use:
- LLaMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function
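
To make two of these components concrete, here is a plain-Python sketch of RMSNorm and the SwiGLU gating step as they are commonly defined. It operates on small lists for readability and is not taken from the Pico codebase; the function names and the `eps` default are assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale `x` by the reciprocal of its root mean square, then by `weight`."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """Elementwise SwiGLU gating: silu(gate) * up.

    In a full feed-forward block, `gate` and `up` are two linear projections
    of the same input, and the result passes through a third projection.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


hidden = [3.0, 4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0], eps=0.0)
print(normed)  # the mean of the squared outputs is 1 (up to rounding)
print(swiglu_gate([0.0, 1.0], [2.0, 2.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one pass over the squared values and has no bias term.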

### Training Configuration
Standard configuration (customizable in GitHub training):
- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training
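
The values above can be captured in a small configuration object, sketched here in plain Python. The field names are hypothetical and do not mirror the framework's config schema. The clipping helper assumes that "gradient clipping: 1.0" refers to clipping by global L2 norm, which is a common convention but is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values from the standard configuration; field names are hypothetical."""

    batch_size: int = 1024
    learning_rate: float = 1e-3
    weight_decay: float = 0.1
    max_grad_norm: float = 1.0
    mixed_precision: bool = True


def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so their global L2 norm is at most `max_norm`.

    Assumes clipping by global norm; gradients already within the limit
    are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


config = TrainingConfig()
# A gradient with norm 5 is scaled down to norm 1; its direction is preserved.
clipped = clip_by_global_norm([3.0, 4.0], config.max_grad_norm)
print(clipped)
```

Clipping by norm bounds the size of each update while keeping the gradient direction, which helps when a few batches produce unusually large gradients.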

## 🔬 Research Applications

Pico is designed for researchers studying:
- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors

Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.

## 🀝 Contributing

Contributions are welcome on both platforms:
- **HuggingFace**: Model weights, datasets, and evaluation results
- **GitHub**: Training framework improvements, analysis tools, and documentation

## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

```bibtex
@software{pico2024,
    author = {Diehl Martinez, Richard},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}
```