shaikhsalman committed
Commit 292504a · verified · 1 Parent(s): 3aaeeb3

docs: Best datasets guide with verified formats + proven recipes

ai-ml/hf-finetuning/BEST_DATASETS.md ADDED

# Best Datasets for SFT Fine-Tuning — Verified Guide

## Dataset Rankings (Quality → Model Performance)

### #1: allenai/tulu-3-sft-mixture — THE BEST
- **Size**: 939K examples from 19 curated sources
- **Format**: messages column (role/content) - ZERO PREPROCESSING (see the loading sketch below)
- **Sources**: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
- **Proven Results on Llama-3.1-8B**: MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
- **Training Recipe**: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
- **Status**: VALIDATED - column format confirmed via hf_inspect_dataset

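A minimal loading sanity check, assuming the train split and the role/content messages layout described above (streaming is used only to avoid pulling all 939K rows before the first run):

```
from datasets import load_dataset

# Stream a few rows to confirm the role/content layout without downloading the full 939K examples.
ds = load_dataset("allenai/tulu-3-sft-mixture", split="train", streaming=True)

example = next(iter(ds))
print(example["messages"][0])
# Expected shape per the format notes above: {"role": "...", "content": "..."}
```

Because the column is already chat-formatted, it can go straight into an SFT trainer that understands the messages convention, with no mapping step.
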
### #2: open-thoughts/OpenThoughts-114k — REASONING CoT
- **Size**: 114K examples with DeepSeek-R1 reasoning traces
- **Format**: conversations column (from/value ShareGPT) - NEEDS CONVERSION
- **Best For**: Math, code, science with chain-of-thought
- **Conversion**: See train_openthoughts.py (sketched below)
- **Training Recipe**: LR=2e-4, batch=16, epochs=2, cosine schedule
- **Status**: VALIDATED - format confirmed, converter tested

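The actual converter lives in train_openthoughts.py; the sketch below only illustrates the from/value to role/content mapping such a converter has to perform (the role map and helper name are assumptions, not the file's real code):

```
# Illustrative ShareGPT -> messages conversion; the real logic is in train_openthoughts.py.
ROLE_MAP = {"system": "system", "human": "user", "user": "user", "gpt": "assistant"}

def to_messages(example):
    # ShareGPT rows look like: {"conversations": [{"from": "human", "value": "..."}, ...]}
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }

# Typical usage with datasets:
# ds = ds.map(to_messages, remove_columns=["conversations"])
```

FineTome-100k (#4 below) also uses the ShareGPT conversations layout, so the same kind of conversion applies there too.
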
### #3: HuggingFaceH4/ultrachat_200k — GENERAL CHAT
- **Size**: 208K multi-turn conversations
- **Format**: messages column - ZERO PREPROCESSING (use the train_sft split)
- **Best For**: General conversational ability
- **Training Recipe**: LR=2e-4, batch=16, epochs=1

### #4: mlabonne/FineTome-100k — CURATED COMPACT
- **Size**: 100K quality-scored examples
- **Format**: conversations (ShareGPT) - NEEDS CONVERSION
- **Best For**: Quick fine-tunes with curated quality

### #5: HuggingFaceH4/no_robots — HUMAN-WRITTEN
- **Size**: 9.5K examples (all human-written)
- **Format**: messages column - ZERO PREPROCESSING
- **Best For**: High-quality instruction following

## How to Train

### Full Training (Tulu 3 - 940K) — A100 80GB, ~6h
```
python ai-ml/hf-finetuning/train_tulu3.py
```

### Reasoning Training (OpenThoughts - 114K) — A100 80GB, ~2h
```
python ai-ml/hf-finetuning/train_openthoughts.py
```

### Quick Test (100 steps) — Any GPU
```
python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
```

## LoRA Config (LoRA Without Regret - Schulman 2025)

| Parameter | Tulu 3 Recipe | OpenThoughts Recipe |
|-----------|---------------|---------------------|
| lora_r | 256 | 256 |
| lora_alpha | 16 | 16 |
| target_modules | all-linear | all-linear |
| learning_rate | 5e-6 | 2e-4 |
| effective_batch | 128 | 16 |
| epochs | 2 | 2 |
| max_seq_length | 4096 | 4096 |
| lr_schedule | linear | cosine |
| packing | True (bfd_split) | True (bfd_split) |
| assistant_only_loss | True | True |

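As rough orientation only (this is not the contents of train_tulu3.py), the Tulu 3 column of the table might map onto TRL + PEFT as below. Argument names follow recent TRL releases and can differ in older ones (e.g. max_length vs. max_seq_length); the 8 x 16 per-device/accumulation split is just one way to reach an effective batch of 128; the base model is assumed from the Llama-3.1-8B results above; and the bfd_split packing behaviour is assumed to come from TRL's packing support.

```
# Hedged sketch of the Tulu 3 recipe above using TRL + PEFT (not the actual train_tulu3.py).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

peft_config = LoraConfig(
    r=256,                        # lora_r
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="tulu3-sft-lora",
    learning_rate=5e-6,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = effective batch of 128
    max_length=4096,                  # called max_seq_length in older TRL versions
    packing=True,
    assistant_only_loss=True,         # requires a TRL version that exposes this flag
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # assumption: base model behind the results above
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```

For the OpenThoughts column, the same skeleton would swap in learning_rate=2e-4, lr_scheduler_type="cosine", a batch/accumulation split totalling 16, and the ShareGPT conversion sketched earlier.
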
## Key Research Sources
- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
- LoRA Without Regret: Schulman et al., 2025
- Data quality > quantity: arXiv 2402.05123