|
# ZeroGPU Setup Guide: Free H200 Training
|
|
|
## What is ZeroGPU?
|
|
|
**ZeroGPU** is Hugging Face's **FREE** compute service that provides:

- **NVIDIA H200 GPU** (70GB memory)

- **No time limits** (unlike the 4-minute daily limit of the previous HF Spaces approach)

- **No credit card required**

- **Perfect for training** nanoGPT models
|
|
|
## ZeroGPU vs Previous Approach
|
|
|
| Feature | Previous (HF Spaces) | ZeroGPU |
|---------|---------------------|---------|
| **GPU** | H200 (4 min/day) | H200 (unlimited) |
| **Memory** | Limited | 70GB |
| **Time** | 4 minutes daily | No limits |
| **Cost** | Free | Free |
| **Use Case** | Demos/Testing | Real Training |
|
|
|
## How to Use ZeroGPU
|
|
|
### Option 1: Hugging Face Training Cluster (Recommended) |
|
|
|
1. **Create HF Model Repository:**

   ```bash
   huggingface-cli repo create nano-coder-zerogpu --type model
   ```

2. **Upload Training Files:**

   ```bash
   python upload_to_zerogpu.py
   ```

3. **Launch ZeroGPU Training:**

   ```bash
   python launch_zerogpu.py
   ```
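
For reference, here is a minimal sketch of what an upload script like `upload_to_zerogpu.py` might contain, using `huggingface_hub`'s `upload_folder`. The repo name and file patterns are assumptions based on this guide, not the actual script:

```python
# Hypothetical sketch of an upload script; repo name and file
# patterns are assumptions, not taken from the real project.
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN or your cached `huggingface-cli login`

api.upload_folder(
    folder_path=".",                             # project directory
    repo_id="your-username/nano-coder-zerogpu",  # repo created in step 1
    repo_type="model",
    allow_patterns=["*.py", "data/**"],          # assumed file layout
)
```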
|
|
|
### Option 2: Direct ZeroGPU API |
|
|
|
1. **Install HF Hub:**

   ```bash
   pip install huggingface_hub
   ```

2. **Set HF Token:**

   ```bash
   export HF_TOKEN="your_token_here"
   ```

3. **Run ZeroGPU Training:**

   ```bash
   python zerogpu_training.py
   ```
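
Before launching, you can confirm the token is picked up with a quick `huggingface_hub` check (just a sanity check, not part of the training scripts):

```python
# Sanity check: verify HF_TOKEN (or a cached login) is valid.
from huggingface_hub import whoami

info = whoami()  # raises an error if no valid token is found
print(f"Authenticated as: {info['name']}")
```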
|
|
|
## Files for ZeroGPU
|
|
|
- `zerogpu_training.py` - Main training script |
|
- `upload_to_zerogpu.py` - Upload files to HF |
|
- `launch_zerogpu.py` - Launch training job |
|
- `ZEROGPU_SETUP.md` - This guide |
|
|
|
## ZeroGPU Configuration
|
|
|
### Model Settings (Full Power!) |
|
- **Layers**: 12 (full model) |
|
- **Heads**: 12 (full model) |
|
- **Embedding**: 768 (full model) |
|
- **Context**: 1024 tokens |
|
- **Parameters**: ~124M (full GPT-2 size) |
|
|
|
### Training Settings |
|
- **Batch Size**: 48 (optimized for H200) |
|
- **Learning Rate**: 6e-4 (standard GPT-2) |
|
- **Iterations**: 10,000 (no time limits!) |
|
- **Checkpoints**: Every 1000 iterations |
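
In nanoGPT, settings like these live in a config file passed to `train.py`. Here is a hypothetical config matching the values above (variable names follow nanoGPT's convention; the actual file in this project may differ):

```python
# Hypothetical nanoGPT-style config reflecting the settings above;
# the filename and exact contents in this repo may differ.

# model: full GPT-2 size, ~124M parameters
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024          # context length in tokens

# training
batch_size = 48            # sized for the H200's 70GB memory
learning_rate = 6e-4       # standard GPT-2 learning rate
max_iters = 10000
eval_interval = 1000       # evaluate and checkpoint every 1000 iterations
always_save_checkpoint = True
```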
|
|
|
## Expected Results
|
|
|
With ZeroGPU H200 (no time limits): |
|
- **Training Time**: 2-4 hours |
|
- **Final Loss**: ~1.8-2.2 |
|
- **Model Quality**: Production-ready |
|
- **Code Generation**: High-quality Python code
|
|
|
## Setup Steps
|
|
|
### Step 1: Create HF Repository |
|
```bash
huggingface-cli repo create nano-coder-zerogpu --type model
```
|
|
|
### Step 2: Prepare Dataset |
|
```bash
python prepare_code_dataset.py
```
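
`prepare_code_dataset.py` presumably follows nanoGPT's data-prep pattern: tokenize the corpus with GPT-2 BPE and write flat binary token files. A simplified sketch under that assumption (`code_corpus.txt` is a placeholder input name):

```python
# Simplified nanoGPT-style dataset prep; the input filename is a
# placeholder and the real script's input format may differ.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer

with open("code_corpus.txt", "r", encoding="utf-8") as f:
    data = f.read()

tokens = enc.encode_ordinary(data)  # tokenize without special tokens
split = int(len(tokens) * 0.9)      # 90/10 train/val split

# nanoGPT's training loop reads these .bin files via np.memmap.
np.array(tokens[:split], dtype=np.uint16).tofile("train.bin")
np.array(tokens[split:], dtype=np.uint16).tofile("val.bin")
```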
|
|
|
### Step 3: Launch Training |
|
```bash
python zerogpu_training.py
```
|
|
|
## Monitoring
|
|
|
### Wandb Dashboard |
|
- Real-time training metrics |
|
- Loss curves |
|
- Model performance |
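
Wiring this up takes only a few `wandb` calls in the training loop; a minimal sketch (the project name and logged values are placeholders):

```python
# Minimal wandb logging sketch; project name and values are placeholders.
import wandb

wandb.init(project="nano-coder-zerogpu", name="h200-run")

# In the real loop you would log the evaluated loss each eval_interval;
# dummy values here just keep the example runnable.
for iter_num in range(0, 3000, 1000):
    wandb.log({"iter": iter_num, "train/loss": 2.5 - iter_num / 10000})

wandb.finish()
```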
|
|
|
### HF Hub |
|
- Automatic checkpoint uploads |
|
- Model versioning |
|
- Training logs |
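
A checkpoint push can be a single `huggingface_hub` call after each eval interval; a hedged sketch (the checkpoint path and repo name are assumptions):

```python
# Sketch: push the latest checkpoint to the Hub after an eval interval.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="out/ckpt.pt",               # nanoGPT's default output path
    path_in_repo="checkpoints/ckpt.pt",
    repo_id="your-username/nano-coder-zerogpu",  # assumed repo name
    repo_type="model",
)
```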
|
|
|
## Cost: **$0** (Completely Free!)
|
|
|
- **No credit card required** |
|
- **No time limits** |
|
- **H200 GPU access** |
|
- **70GB memory** |
|
|
|
## Benefits of ZeroGPU
|
|
|
1. **No Time Limits** - Train for hours, not minutes |
|
2. **Full Model** - Use complete GPT-2 architecture |
|
3. **Better Results** - Production-quality models |
|
4. **Real Training** - Not just demos |
|
5. **Automatic Saving** - Models saved to HF Hub |
|
|
|
## Troubleshooting
|
|
|
### If Training Won't Start

1. Check that `HF_TOKEN` is set

2. Verify the repository exists and you have write access

3. Check that the dataset has been prepared (`python prepare_code_dataset.py`)
|
|
|
### If Out of Memory

1. Reduce `batch_size` to 32 or lower

2. Raise `gradient_accumulation_steps` to keep the effective batch size (see the sketch below)

3. Use a smaller model (but why give up the full GPT-2?)
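
The effective batch size is `batch_size × gradient_accumulation_steps`, and peak memory tracks the per-step `batch_size`, so you can cut memory without changing the optimization. A quick check over some hypothetical splits:

```python
# Effective tokens per iteration stay constant while the per-step
# batch (and therefore peak activation memory) shrinks.
block_size = 1024

for batch_size, grad_accum in [(48, 1), (24, 2), (12, 4)]:  # hypothetical splits
    tokens_per_iter = batch_size * grad_accum * block_size
    print(f"batch={batch_size:2d} accum={grad_accum} -> {tokens_per_iter} tokens/iter")
```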
|
|
|
### If Upload Fails |
|
1. Check your internet connection

2. Verify your HF token has **write** permission

3. Check that you have access to the repository
|
|
|
## Use Cases
|
|
|
### Perfect For: |
|
- **Production Training** - Real model training

- **Research** - Experiment with different configs

- **Learning** - Understand the full training process

- **Model Sharing** - Upload to HF Hub
|
|
|
### Not Suitable For: |
|
- **Quick Demos** - Use HF Spaces for that

- **Testing** - Use a local GPU for that
|
|
|
## Workflow
|
|
|
1. **Setup**: Create HF repo and prepare data |
|
2. **Train**: Launch ZeroGPU training |
|
3. **Monitor**: Watch progress on Wandb |
|
4. **Save**: Models automatically uploaded |
|
5. **Share**: Use trained models |
|
|
|
## Performance
|
|
|
Expected training performance on ZeroGPU H200: |
|
- **Iterations/second**: ~2-3 |
|
- **Memory usage**: ~40-50GB |
|
- **Training time**: 2-4 hours for 10k iterations |
|
- **Final model**: Production quality |
|
|
|
## Success!
|
|
|
ZeroGPU is the **proper way** to use Hugging Face's free compute for real training. No more 4-minute limits - train your nano-coder model end to end!
|
|
|
**Next Steps:** |
|
1. Create HF repository |
|
2. Upload files |
|
3. Launch training |
|
4. Monitor progress |
|
5. Use your trained model! |
|
|
|
Happy ZeroGPU training!