# Best Datasets for SFT Fine-Tuning — Verified Guide

## Dataset Rankings (Quality → Model Performance)

### #1: allenai/tulu-3-sft-mixture — THE BEST
- **Size**: 939K examples from 19 curated sources
- **Format**: messages column (role/content) - ZERO PREPROCESSING (see the sketch after this list)
- **Sources**: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
- **Proven Results on Llama-3.1-8B**: MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
- **Training Recipe**: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
- **Status**: VALIDATED - column format confirmed via hf_inspect_dataset
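
Since "zero preprocessing" is the headline feature, here is what it looks like in practice: the messages column goes straight into TRL's SFTTrainer, which applies the tokenizer's chat template to the role/content turns. A minimal sketch; the model name and the pared-down SFTConfig are illustrative placeholders, not this repo's train_tulu3.py:

```python
# Sketch only: load the dataset and hand the messages column to TRL untouched.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")
print(dataset[0]["messages"][0])  # {'role': 'user', 'content': '...'}

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any model whose tokenizer has a chat template
    train_dataset=dataset,
    args=SFTConfig(output_dir="tulu3-sft", learning_rate=5e-6, num_train_epochs=2),
)
trainer.train()
```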

### #2: open-thoughts/OpenThoughts-114k — REASONING CoT
- **Size**: 114K examples with DeepSeek-R1 reasoning traces
- **Format**: conversations column (from/value ShareGPT) - NEEDS CONVERSION
- **Best For**: Math, code, science with chain-of-thought
- **Conversion**: See train_openthoughts.py (rough shape sketched below)
- **Training Recipe**: LR=2e-4, batch=16, epochs=2, cosine schedule
- **Status**: VALIDATED - format confirmed, converter tested
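
The from/value to role/content conversion is mechanical. train_openthoughts.py holds the repo's tested converter; the rough shape is below, with a ROLE_MAP covering the common ShareGPT role names (confirm the exact keys with hf_inspect_dataset before trusting it):

```python
# Rough shape of a ShareGPT -> messages converter; see train_openthoughts.py
# for the tested version. ROLE_MAP keys cover the usual ShareGPT variants.
from datasets import load_dataset

ROLE_MAP = {"system": "system", "human": "user", "user": "user",
            "gpt": "assistant", "assistant": "assistant"}

def to_messages(example):
    example["messages"] = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return example

dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
dataset = dataset.map(to_messages, remove_columns=["conversations"])
```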

### #3: HuggingFaceH4/ultrachat_200k — GENERAL CHAT
- **Size**: 208K multi-turn conversations
- **Format**: messages column - ZERO PREPROCESSING (use the train_sft split; see below)
- **Best For**: General conversational ability
- **Training Recipe**: LR=2e-4, batch=16, epochs=1
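
The one detail worth flagging: the SFT data lives in the train_sft split, not a default train split, so name it explicitly when loading:

```python
# ultrachat_200k ships train_sft/test_sft (plus *_gen splits for generation).
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
print(dataset.column_names)  # ['prompt', 'prompt_id', 'messages']
```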

### #4: mlabonne/FineTome-100k — CURATED COMPACT
- **Size**: 100K quality-scored examples
- **Format**: conversations (ShareGPT) - NEEDS CONVERSION (same converter pattern as #2)
- **Best For**: Quick fine-tune with curated quality

### #5: HuggingFaceH4/no_robots — HUMAN-WRITTEN
- **Size**: 9.5K examples (all human-written)
- **Format**: messages column - ZERO PREPROCESSING
- **Best For**: High-quality instruction following

## How to Train

### Full Training (Tulu 3 - 940K) — A100 80GB, ~6h
```
python ai-ml/hf-finetuning/train_tulu3.py
```

### Reasoning Training (OpenThoughts - 114K) — A100 80GB, ~2h
```
python ai-ml/hf-finetuning/train_openthoughts.py
```

### Quick Test (100 steps) — Any GPU
```
python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
```

## LoRA Config (LoRA Without Regret - Schulman 2025)

| Parameter | Tulu 3 Recipe | OpenThoughts Recipe |
|-----------|---------------|---------------------|
| lora_r | 256 | 256 |
| lora_alpha | 16 | 16 |
| target_modules | all-linear | all-linear |
| learning_rate | 5e-6 | 2e-4 |
| effective_batch | 128 | 16 |
| epochs | 2 | 2 |
| max_seq_length | 4096 | 4096 |
| lr_schedule | linear | cosine |
| packing | True (bfd_split) | True (bfd_split) |
| assistant_only_loss | True | True |
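
As a sanity check, here is one way the Tulu 3 column of this table could map onto peft.LoraConfig plus TRL's SFTConfig. A sketch under stated assumptions, not the repo's train_tulu3.py: the per-device/accumulation split of the effective batch is a guess, and kwarg names (max_length rather than the older max_seq_length, assistant_only_loss) track recent TRL releases:

```python
# Sketch of the Tulu 3 column in peft/TRL terms; values from the table above.
from peft import LoraConfig
from trl import SFTConfig

peft_config = LoraConfig(
    r=256,                          # lora_r
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="tulu3-sft-lora",
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # assumption: 8 x 16 accumulation = 128 effective
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    max_length=4096,                # the table's max_seq_length, renamed in newer TRL
    packing=True,
    assistant_only_loss=True,
)
```

Both objects then go to SFTTrainer(..., args=training_args, peft_config=peft_config).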

## Key Research Sources
- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
- LoRA Without Regret: Schulman et al., 2025
- Data quality > quantity: arXiv 2402.05123
|