Aloukik21 committed
Commit 17236cc · verified · 1 Parent(s): f322631

Update README with full documentation

Files changed (1)
  1. README.md +64 -8
README.md CHANGED
@@ -1,6 +1,17 @@
  # AI Trainer - RunPod Serverless

- Multi-model LoRA training service using [ai-toolkit](https://github.com/ostris/ai-toolkit).

  ## Supported Models

@@ -17,23 +28,68 @@ Multi-model LoRA training service using [ai-toolkit](https://github.com/ostris/a
  ## API Usage

  ### List Models
-
  ### Check Status

  ### Train LoRA

  ## RunPod Deployment

  ### Environment Variables
- - HF_TOKEN: HuggingFace token for gated models

  ### Model Caching
- For faster cold starts, set one of these HuggingFace repos in the RunPod "Model" field:
- - FLUX: black-forest-labs/FLUX.1-dev
- - Wan 2.2: ai-toolkit/Wan2.2-T2V-A14B-Diffusers-bf16
- - Qwen: Qwen/Qwen-Image

- Note: RunPod only supports ONE cached model per endpoint.
+ ---
+ license: mit
+ tags:
+ - lora
+ - training
+ - runpod
+ - ai-toolkit
+ ---
+
  # AI Trainer - RunPod Serverless

+ Single-endpoint multi-model LoRA training service using [ai-toolkit](https://github.com/ostris/ai-toolkit).
+
+ Automatically cleans up GPU memory when switching between different models.

  ## Supported Models

  ## API Usage

  ### List Models
+ ```json
+ {"input": {"action": "list_models"}}
+ ```

  ### Check Status
+ ```json
+ {"input": {"action": "status"}}
+ ```

+ ### Manual Cleanup
+ ```json
+ {"input": {"action": "cleanup"}}
+ ```

  ### Train LoRA
+ ```json
+ {
+   "input": {
+     "action": "train",
+     "model": "flux_dev",
+     "params": {
+       "dataset_path": "/workspace/dataset",
+       "output_path": "/workspace/output",
+       "steps": 1000,
+       "batch_size": 1,
+       "learning_rate": 1e-4,
+       "lora_rank": 16
+     }
+   }
+ }
+ ```

+ ## Training Parameters
+
+ | Parameter | Description | Default |
+ |-----------|-------------|---------|
+ | dataset_path | Path to training images | /workspace/dataset |
+ | output_path | Output directory | /workspace/output |
+ | steps | Training steps | 2000 |
+ | batch_size | Batch size | 1 |
+ | learning_rate | Learning rate | 1e-4 |
+ | lora_rank | LoRA rank | 16-32 |
+ | save_every | Save checkpoint interval | 250 |
+ | sample_every | Sample generation interval | 250 |
+ | trigger_word | Trigger word for training | None |

  ## RunPod Deployment

  ### Environment Variables
+ - `HF_TOKEN`: HuggingFace token for gated models (required for FLUX, Qwen)

  ### Model Caching
+ Models are cached at `/runpod-volume/huggingface-cache/hub/` for faster subsequent loads.
+
+ For optimal cold starts, set the RunPod **Model** field to one of:
+ - `black-forest-labs/FLUX.1-dev` (for FLUX training)
+ - `ai-toolkit/Wan2.2-T2V-A14B-Diffusers-bf16` (for Wan 2.2)
+ - `Qwen/Qwen-Image` (for Qwen Image)

+ ## Auto-Cleanup

+ The handler automatically cleans up GPU memory when switching between models:
+ - Full cleanup when changing model types
+ - Light cleanup for same model
+ - Manual cleanup via `cleanup` action
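
The `train` payload added in this commit can be assembled programmatically. A minimal sketch in Python, using the defaults from the Training Parameters table; the helper name `build_train_request` is illustrative and not part of the service API:

```python
# Illustrative helper for building a "train" request payload.
# Defaults mirror the README's Training Parameters table; the
# function itself is hypothetical, not part of the service.
DEFAULTS = {
    "dataset_path": "/workspace/dataset",
    "output_path": "/workspace/output",
    "steps": 2000,
    "batch_size": 1,
    "learning_rate": 1e-4,
    "lora_rank": 16,
    "save_every": 250,
    "sample_every": 250,
}

def build_train_request(model: str, **overrides) -> dict:
    """Merge caller overrides onto the documented defaults."""
    return {
        "input": {
            "action": "train",
            "model": model,
            "params": {**DEFAULTS, **overrides},
        }
    }
```

Overrides win over defaults, so `build_train_request("flux_dev", steps=1000)` reproduces the example payload from the diff with the remaining parameters at their documented values.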
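
The documented cache path `/runpod-volume/huggingface-cache/hub/` matches the standard Hugging Face cache layout (`$HF_HOME/hub/`). A sketch, assuming the worker honours `HF_HOME` to persist downloads on the network volume across cold starts:

```shell
# Assumption: the worker resolves models through the standard HF cache,
# so HF_HOME/hub matches the documented cache path on the volume.
export HF_HOME=/runpod-volume/huggingface-cache
```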
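
The full-versus-light cleanup behaviour described in the new Auto-Cleanup section could be sketched as follows; this is an assumption about the handler, not its actual code, using the usual PyTorch cleanup calls (`gc.collect()`, `torch.cuda.empty_cache()`):

```python
import gc

def cleanup(full: bool = True) -> bool:
    """Hypothetical sketch of the handler's GPU cleanup step.

    Light cleanup only collects Python garbage; full cleanup also
    releases cached CUDA allocations when torch is available.
    """
    gc.collect()
    if full:
        try:
            import torch  # optional: only present in the GPU image
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass
    return True

# A model-type change would trigger cleanup(full=True); reusing the
# same model would trigger cleanup(full=False). The "cleanup" action
# exposes the full variant manually.
```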