# 🚀 ZeroGPU Setup Guide: Free H200 Training

## 🎯 What is ZeroGPU?

**ZeroGPU** is Hugging Face's **FREE** compute service that provides:
- **Nvidia H200 GPU** (70GB memory)
- **No time limits** (unlike the 4-minute daily limit of the previous HF Spaces approach)
- **No credit card required**
- **Perfect for training** nanoGPT models

## 📊 ZeroGPU vs Previous Approach

| Feature | Previous (HF Spaces) | ZeroGPU |
|---------|---------------------|---------|
| **GPU** | H200 (4 min/day) | H200 (unlimited) |
| **Memory** | Limited | 70GB |
| **Time** | 4 minutes daily | No limits |
| **Cost** | Free | Free |
| **Use Case** | Demos/Testing | Real Training |

## 🚀 How to Use ZeroGPU

### Option 1: Hugging Face Training Cluster (Recommended)

1. **Create an HF model repository:**
   ```bash
   huggingface-cli repo create nano-coder-zerogpu --type model
   ```

2. **Upload the training files** (a sketch of this helper follows the list):
   ```bash
   python upload_to_zerogpu.py
   ```

3. **Launch ZeroGPU training:**
   ```bash
   python launch_zerogpu.py
   ```

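The `upload_to_zerogpu.py` helper is not reproduced in this guide. A minimal sketch of what it plausibly does with the `huggingface_hub` API (the repo id and file list here are assumptions):

```python
# upload_to_zerogpu.py -- hypothetical sketch: push the training scripts to the
# model repository created in step 1. Repo id and file list are assumptions.
from huggingface_hub import HfApi

REPO_ID = "your-username/nano-coder-zerogpu"  # assumed repo id
FILES = ["zerogpu_training.py", "launch_zerogpu.py", "prepare_code_dataset.py"]

api = HfApi()  # picks up the token from HF_TOKEN or the cached login
for path in FILES:
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=path,
        repo_id=REPO_ID,
        repo_type="model",
    )
    print(f"Uploaded {path} to {REPO_ID}")
```
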
### Option 2: Direct ZeroGPU API

1. **Install the HF Hub client:**
   ```bash
   pip install huggingface_hub
   ```

2. **Set your HF token:**
   ```bash
   export HF_TOKEN="your_token_here"
   ```

3. **Run ZeroGPU training** (a quick token check follows the list):
   ```bash
   python zerogpu_training.py
   ```

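Before launching, it is worth confirming that the exported token is actually picked up. A small check using only documented `huggingface_hub` calls:

```python
# check_token.py -- sanity check that HF_TOKEN is set and valid
from huggingface_hub import whoami

info = whoami()  # raises an error if the token is missing or invalid
print(f"Authenticated as: {info['name']}")
```
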
## 📁 Files for ZeroGPU

- `zerogpu_training.py` - Main training script
- `upload_to_zerogpu.py` - Uploads the files to HF
- `launch_zerogpu.py` - Launches the training job
- `ZEROGPU_SETUP.md` - This guide

## ⚙️ ZeroGPU Configuration

### Model Settings (Full Power!)
- **Layers**: 12 (full model)
- **Heads**: 12 (full model)
- **Embedding**: 768 (full model)
- **Context**: 1024 tokens
- **Parameters**: ~124M (full GPT-2 size)

### Training Settings
- **Batch Size**: 48 (optimized for H200)
- **Learning Rate**: 6e-4 (standard GPT-2)
- **Iterations**: 10,000 (no time limits!)
- **Checkpoints**: Every 1000 iterations

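If `zerogpu_training.py` follows nanoGPT's config-file convention, the settings above map onto config variables roughly like this (the variable names follow nanoGPT's `train.py`; the file name is an assumption):

```python
# config/train_nano_coder_zerogpu.py -- sketch in nanoGPT's config style
n_layer = 12            # full GPT-2 depth
n_head = 12             # full GPT-2 attention heads
n_embd = 768            # full GPT-2 embedding width (~124M parameters)
block_size = 1024       # context length in tokens

batch_size = 48         # sized for the H200's 70GB
learning_rate = 6e-4    # standard GPT-2 peak learning rate
max_iters = 10000       # train to completion; no time limits
eval_interval = 1000    # evaluate (and checkpoint) every 1000 iterations
always_save_checkpoint = True
```
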
## 🎯 Expected Results

With ZeroGPU H200 (no time limits):
- **Training Time**: 2-4 hours
- **Final Loss**: ~1.8-2.2
- **Model Quality**: Production-ready
- **Code Generation**: High-quality Python code

## 🔧 Setup Steps

### Step 1: Create HF Repository
```bash
huggingface-cli repo create nano-coder-zerogpu --type model
```

### Step 2: Prepare Dataset
```bash
python prepare_code_dataset.py
```

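`prepare_code_dataset.py` is not shown here. A minimal sketch following nanoGPT's usual data-prep pattern (GPT-2 BPE via `tiktoken`, tokens written to `train.bin`/`val.bin`; the corpus file name is an assumption):

```python
# prepare_code_dataset.py -- hypothetical sketch in nanoGPT's data-prep style
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")         # GPT-2 BPE tokenizer
with open("code_corpus.txt") as f:          # assumed raw corpus file
    text = f.read()
ids = np.array(enc.encode_ordinary(text), dtype=np.uint16)

split = int(0.9 * len(ids))                 # 90/10 train/val split
ids[:split].tofile("train.bin")
ids[split:].tofile("val.bin")
print(f"train: {split:,} tokens, val: {len(ids) - split:,} tokens")
```
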
### Step 3: Launch Training
```bash
python zerogpu_training.py
```

## 📊 Monitoring

### Wandb Dashboard
- Real-time training metrics
- Loss curves
- Model performance

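nanoGPT-style scripts typically gate Weights & Biases logging behind a `wandb_log` flag. A minimal, runnable sketch of the kind of metrics the dashboard would show (the project and run names are assumptions; the values are placeholders, not real training output):

```python
# wandb_sketch.py -- placeholder run showing the metrics the dashboard plots
import wandb

wandb.init(project="nano-coder-zerogpu", name="h200-run-1")  # assumed names
for it in range(0, 3000, 1000):
    fake_loss = 3.0 * 0.9 ** (it / 1000)  # placeholder loss curve
    wandb.log({"iter": it, "train/loss": fake_loss})
wandb.finish()
```
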
### HF Hub
- Automatic checkpoint uploads
- Model versioning
- Training logs

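How the automatic checkpoint uploads might be wired up, assuming the training script saves nanoGPT's usual `ckpt.pt` (the output directory and repo id are assumptions):

```python
# checkpoint upload sketch -- push the latest checkpoint to the Hub after eval
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="out-nano-coder/ckpt.pt",    # assumed nanoGPT out_dir
    path_in_repo="checkpoints/ckpt-latest.pt",
    repo_id="your-username/nano-coder-zerogpu",  # assumed repo id
    repo_type="model",
)
```
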
## 💰 Cost: **$0** (Completely Free!)

- **No credit card required**
- **No time limits**
- **H200 GPU access**
- **70GB memory**

## 🌟 Benefits of ZeroGPU

1. **No Time Limits** - Train for hours, not minutes
2. **Full Model** - Use the complete GPT-2 architecture
3. **Better Results** - Production-quality models
4. **Real Training** - Not just demos
5. **Automatic Saving** - Models are saved to the HF Hub

## 🚨 Troubleshooting

### If Training Won't Start
1. Check that the HF token is set
2. Verify that the repository exists
3. Check that the dataset is prepared

### If Out of Memory
1. Reduce `batch_size` to 32
2. Reduce `gradient_accumulation_steps`
3. Use a smaller model (but why would you?)

### If Upload Fails
1. Check your internet connection
2. Verify the HF token's permissions
3. Check repository access

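A quick preflight covering the first and last checklists above (`whoami` and `repo_info` are standard `huggingface_hub` calls; the repo id and data file names are assumptions):

```python
# preflight.py -- sanity checks before launching training
import os
from huggingface_hub import HfApi

api = HfApi()
print("token ok:", api.whoami()["name"])          # HF token set and valid?
info = api.repo_info("your-username/nano-coder-zerogpu",  # assumed repo id
                     repo_type="model")
print("repo ok:", info.id)                        # repository reachable?
print("data ok:", os.path.exists("train.bin") and os.path.exists("val.bin"))
```
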
## 🎯 Use Cases

### Perfect For:
- ✅ **Production Training** - Real model training
- ✅ **Research** - Experimenting with different configs
- ✅ **Learning** - Understanding the full training process
- ✅ **Model Sharing** - Uploading to the HF Hub

### Not Suitable For:
- ❌ **Quick Demos** - Use HF Spaces for that
- ❌ **Testing** - Use a local GPU for that

## 🔄 Workflow

1. **Setup**: Create the HF repo and prepare the data
2. **Train**: Launch ZeroGPU training
3. **Monitor**: Watch progress on Wandb
4. **Save**: Models are uploaded automatically
5. **Share**: Use your trained models

## 📈 Performance

Expected training performance on the ZeroGPU H200:
- **Iterations/second**: ~0.7-1.5 (sustained, consistent with the 2-4 hour total below)
- **Memory usage**: ~40-50GB
- **Training time**: 2-4 hours for 10k iterations
- **Final model**: Production quality

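A quick sanity check that the throughput and wall-clock figures agree (the iteration rate is the assumed range above):

```python
# sanity check: wall-clock estimate from the throughput figures above
max_iters = 10_000
for ips in (0.7, 1.5):                     # assumed iterations/second range
    hours = max_iters / ips / 3600
    print(f"{ips} it/s -> {hours:.1f} h")  # prints 4.0 h and 1.9 h: the 2-4 h estimate
```
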
## 🎉 Success!

ZeroGPU is the **proper way** to use Hugging Face's free compute for real training. No more 4-minute limits - train your nano-coder model properly!

**Next Steps:**
1. Create the HF repository
2. Upload the files
3. Launch training
4. Monitor progress
5. Use your trained model!

Happy ZeroGPU training! 🚀