Nanobit committed

Commit: c2b64e4
Parent: 5760099

Feat: update doc (#1475) [skip ci]


* feat: update doc contents

* chore: move batch vs ga docs

* feat: update lambdalabs instructions

* fix: refactor dev instructions

README.md CHANGED
@@ -221,23 +221,17 @@ For cloud GPU providers that support docker images, use [`winglian/axolotl-cloud
  python get-pip.py
  ```

- 3. Install torch
- ```bash
- pip3 install -U torch --index-url https://download.pytorch.org/whl/cu118
- ```
+ 3. Install PyTorch https://pytorch.org/get-started/locally/

- 4. Axolotl
- ```bash
- git clone https://github.com/OpenAccess-AI-Collective/axolotl
- cd axolotl
+ 4. Follow the quickstart instructions.

- pip3 install packaging
- pip3 install -e '.[flash-attn,deepspeed]'
+ 5. Run
+ ```bash
  pip3 install protobuf==3.20.3
  pip3 install -U --ignore-installed requests Pillow psutil scipy
  ```

- 5. Set path
+ 6. Set path
  ```bash
  export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
  ```
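For the new step 3, a minimal example of what the PyTorch install can look like (this mirrors the CUDA 11.8 command from the removed instructions; use whatever command https://pytorch.org/get-started/locally/ generates for your platform):

```bash
# Example only: CUDA 11.8 wheel, matching the previously documented command.
pip3 install -U torch --index-url https://download.pytorch.org/whl/cu118
```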
@@ -389,66 +383,6 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod

  See [these docs](docs/config.qmd) for all config options.

- <details>
- <summary> Understanding of batch size and gradient accumulation steps </summary>
- <br/>
- Gradient accumulation means accumulating gradients over several mini-batches and updating the model weights afterward. When the samples in each batch are diverse, this technique doesn't significantly impact learning.
-
- This method allows for effective training with larger effective batch sizes without needing proportionally larger memory. Here's why:
-
- 1. **Memory Consumption with Batch Size**: The primary reason increasing the batch size impacts memory is due to the storage requirements for intermediate activations. When you forward propagate a batch through a network, you have to store the activations at each layer for each sample in the batch, because these activations are used during backpropagation to compute gradients. Therefore, larger batches mean more activations, leading to greater GPU memory consumption.
-
- 2. **Gradient Accumulation**: With gradient accumulation, you're effectively simulating a larger batch size by accumulating gradients over several smaller batches (or micro-batches). However, at any given time, you're only forward and backward propagating a micro-batch. This means you only store activations for the micro-batch, not the full accumulated batch. As a result, you can simulate the effect of a larger batch size without the memory cost of storing activations for a large batch.
-
- **Example 1:**
- Micro batch size: 3
- Gradient accumulation steps: 2
- Number of GPUs: 3
- Total batch size = 3 * 2 * 3 = 18
-
- ```
- | GPU 1          | GPU 2          | GPU 3          |
- |----------------|----------------|----------------|
- | S1, S2, S3     | S4, S5, S6     | S7, S8, S9     |
- | e1, e2, e3     | e4, e5, e6     | e7, e8, e9     |
- |----------------|----------------|----------------|
- | → (accumulate) | → (accumulate) | → (accumulate) |
- |----------------|----------------|----------------|
- | S10, S11, S12  | S13, S14, S15  | S16, S17, S18  |
- | e10, e11, e12  | e13, e14, e15  | e16, e17, e18  |
- |----------------|----------------|----------------|
- | → (apply)      | → (apply)      | → (apply)      |
-
- Accumulated gradient for the weight w1 after the second iteration (considering all GPUs):
- Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6 + e7 + e8 + e9 + e10 + e11 + e12 + e13 + e14 + e15 + e16 + e17 + e18
-
- Weight update for w1:
- w1_new = w1_old - learning rate × (Total gradient for w1 / 18)
- ```
-
- **Example 2:**
- Micro batch size: 2
- Gradient accumulation steps: 1
- Number of GPUs: 3
- Total batch size = 2 * 1 * 3 = 6
-
- ```
- | GPU 1     | GPU 2     | GPU 3     |
- |-----------|-----------|-----------|
- | S1, S2    | S3, S4    | S5, S6    |
- | e1, e2    | e3, e4    | e5, e6    |
- |-----------|-----------|-----------|
- | → (apply) | → (apply) | → (apply) |
-
- Accumulated gradient for the weight w1 (considering all GPUs):
- Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6
-
- Weight update for w1:
- w1_new = w1_old - learning rate × (Total gradient for w1 / 6)
- ```
-
- </details>
-
  ### Train

  Run

@@ -678,14 +612,8 @@ Bugs? Please check the [open issues](https://github.com/OpenAccess-AI-Collective

  PRs are **greatly welcome**!

- Please run below to setup env
+ Please run the quickstart instructions followed by the below to set up the env:
  ```bash
- git clone https://github.com/OpenAccess-AI-Collective/axolotl
- cd axolotl
-
- pip3 install packaging ninja
- pip3 install -e '.[flash-attn,deepspeed]'
-
  pip3 install -r requirements-dev.txt -r requirements-tests.txt
  pre-commit install

docs/batch_vs_grad.qmd ADDED
@@ -0,0 +1,59 @@
+ ---
+ title: Batch size vs Gradient accumulation
+ description: Understanding of batch size and gradient accumulation steps
+ ---
+
+ Gradient accumulation means accumulating gradients over several mini-batches and updating the model weights afterward. When the samples in each batch are diverse, this technique doesn't significantly impact learning.
+
+ This method allows for effective training with larger effective batch sizes without needing proportionally larger memory. Here's why:
+
+ 1. **Memory Consumption with Batch Size**: The primary reason increasing the batch size impacts memory is due to the storage requirements for intermediate activations. When you forward propagate a batch through a network, you have to store the activations at each layer for each sample in the batch, because these activations are used during backpropagation to compute gradients. Therefore, larger batches mean more activations, leading to greater GPU memory consumption.
+
+ 2. **Gradient Accumulation**: With gradient accumulation, you're effectively simulating a larger batch size by accumulating gradients over several smaller batches (or micro-batches). However, at any given time, you're only forward and backward propagating a micro-batch. This means you only store activations for the micro-batch, not the full accumulated batch. As a result, you can simulate the effect of a larger batch size without the memory cost of storing activations for a large batch.
+
+ **Example 1:**
+ Micro batch size: 3
+ Gradient accumulation steps: 2
+ Number of GPUs: 3
+ Total batch size = 3 * 2 * 3 = 18
+
+ ```
+ | GPU 1          | GPU 2          | GPU 3          |
+ |----------------|----------------|----------------|
+ | S1, S2, S3     | S4, S5, S6     | S7, S8, S9     |
+ | e1, e2, e3     | e4, e5, e6     | e7, e8, e9     |
+ |----------------|----------------|----------------|
+ | → (accumulate) | → (accumulate) | → (accumulate) |
+ |----------------|----------------|----------------|
+ | S10, S11, S12  | S13, S14, S15  | S16, S17, S18  |
+ | e10, e11, e12  | e13, e14, e15  | e16, e17, e18  |
+ |----------------|----------------|----------------|
+ | → (apply)      | → (apply)      | → (apply)      |
+
+ Accumulated gradient for the weight w1 after the second iteration (considering all GPUs):
+ Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6 + e7 + e8 + e9 + e10 + e11 + e12 + e13 + e14 + e15 + e16 + e17 + e18
+
+ Weight update for w1:
+ w1_new = w1_old - learning rate × (Total gradient for w1 / 18)
+ ```
+
+ **Example 2:**
+ Micro batch size: 2
+ Gradient accumulation steps: 1
+ Number of GPUs: 3
+ Total batch size = 2 * 1 * 3 = 6
+
+ ```
+ | GPU 1     | GPU 2     | GPU 3     |
+ |-----------|-----------|-----------|
+ | S1, S2    | S3, S4    | S5, S6    |
+ | e1, e2    | e3, e4    | e5, e6    |
+ |-----------|-----------|-----------|
+ | → (apply) | → (apply) | → (apply) |
+
+ Accumulated gradient for the weight w1 (considering all GPUs):
+ Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6
+
+ Weight update for w1:
+ w1_new = w1_old - learning rate × (Total gradient for w1 / 6)
+ ```
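For reference, the quantities in these examples map onto training config options. A minimal sketch matching Example 1, assuming the `micro_batch_size` and `gradient_accumulation_steps` keys from the config reference and a 3-GPU launch:

```yaml
# Sketch for Example 1: with 3 GPUs, the effective total batch size is
# micro_batch_size * gradient_accumulation_steps * num_gpus = 3 * 2 * 3 = 18
micro_batch_size: 3
gradient_accumulation_steps: 2
```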
docs/dataset-formats/conversation.qmd CHANGED
@@ -1,12 +1,10 @@
  ---
  title: Conversation
  description: Conversation format for supervised fine-tuning.
- order: 1
+ order: 3
  ---

- ## Formats
-
- ### sharegpt
+ ## sharegpt

  conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)

@@ -14,15 +12,33 @@ conversations where `from` is `human`/`gpt`. (optional: first row with role `sys
  {"conversations": [{"from": "...", "value": "..."}]}
  ```

- Note: `type: sharegpt` opens a special config `conversation:` that enables conversions to many Conversation types. See [the docs](../docs/config.qmd) for all config options.
+ Note: `type: sharegpt` opens special configs:
+ - `conversation`: enables conversions to many Conversation types. Refer to the 'name' [here](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py) for options.
+ - `roles`: allows you to specify the roles for input and output. This is useful for datasets with custom roles such as `tool` etc. to support masking.
+ - `field_human`: specify the key to use instead of `human` in the conversation.
+ - `field_model`: specify the key to use instead of `gpt` in the conversation.
+
+ ```yaml
+ datasets:
+   path: ...
+   type: sharegpt
+
+   conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
+   field_human: # Optional[str]. Human key to use for conversation.
+   field_model: # Optional[str]. Assistant key to use for conversation.
+   # Add additional keys from your dataset as input or output roles
+   roles:
+     input: # Optional[List[str]]. These will be masked based on train_on_input
+     output: # Optional[List[str]].
+ ```

- ### pygmalion
+ ## pygmalion

  ```{.json filename="data.jsonl"}
  {"conversations": [{"role": "...", "value": "..."}]}
  ```

- ### sharegpt.load_role
+ ## sharegpt.load_role

  conversations where `role` is used instead of `from`

@@ -30,7 +46,7 @@ conversations where `role` is used instead of `from`
  {"conversations": [{"role": "...", "value": "..."}]}
  ```

- ### sharegpt.load_guanaco
+ ## sharegpt.load_guanaco

  conversations where `from` is `prompter` `assistant` instead of default sharegpt

@@ -38,34 +54,10 @@ conversations where `from` is `prompter` `assistant` instead of default sharegpt
  {"conversations": [{"from": "...", "value": "..."}]}
  ```

- ### sharegpt_jokes
+ ## sharegpt_jokes

  creates a chat where bot is asked to tell a joke, then explain why the joke is funny

  ```{.json filename="data.jsonl"}
  {"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
  ```
-
- ## How to add custom prompts for instruction-tuning
-
- For a dataset that is preprocessed for instruction purposes:
-
- ```{.json filename="data.jsonl"}
- {"input": "...", "output": "..."}
- ```
-
- You can use this example in your YAML config:
-
- ```{.yaml filename="config.yaml"}
- datasets:
-   - path: repo
-     type:
-       system_prompt: ""
-       field_system: system
-       field_instruction: input
-       field_output: output
-       format: "[INST] {instruction} [/INST]"
-       no_input_format: "[INST] {instruction} [/INST]"
- ```
-
- See full config options under [here](../docs/config.qmd).
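To make the new sharegpt options concrete, a hypothetical filled-in config might look like the following (the dataset path and role names are placeholders; `chatml` is one of the conversation names registered in FastChat's conversation.py):

```yaml
datasets:
  - path: my-org/my-sharegpt-dataset   # hypothetical dataset path
    type: sharegpt
    conversation: chatml               # any 'name' from FastChat's conversation.py
    field_human: user                  # this dataset uses "user" instead of "human"
    field_model: assistant             # and "assistant" instead of "gpt"
    roles:
      input: ["tool"]                  # extra roles treated as input, masked based on train_on_input
```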
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/dataset-formats/inst_tune.qmd CHANGED
@@ -163,3 +163,27 @@ instruction, adds additional eos tokens
  ```{.json filename="data.jsonl"}
  {"prompt": "...", "generation": "..."}
  ```
+
+ ## How to add a custom prompt format
+
+ For a dataset that is preprocessed for instruction purposes:
+
+ ```{.json filename="data.jsonl"}
+ {"input": "...", "output": "..."}
+ ```
+
+ You can use this example in your YAML config:
+
+ ```{.yaml filename="config.yaml"}
+ datasets:
+   - path: repo
+     type:
+       system_prompt: ""
+       field_system: system
+       field_instruction: input
+       field_output: output
+       format: "[INST] {instruction} [/INST]"
+       no_input_format: "[INST] {instruction} [/INST]"
+ ```
+
+ See the full config options [here](../config.qmd).
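As a concrete illustration of the field mapping in the config above, a made-up row like the one below would be rendered roughly as `[INST] What is the capital of France? [/INST]`, with `Paris` as the completion the model learns to produce (values here are purely illustrative):

```json
{"input": "What is the capital of France?", "output": "Paris"}
```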
docs/dataset-formats/pretraining.qmd CHANGED
@@ -1,7 +1,7 @@
  ---
  title: Pre-training
  description: Data format for a pre-training completion task.
- order: 3
+ order: 1
  ---

  For pretraining, there is no prompt template or roles. The only required field is `text`:
docs/input_output.qmd CHANGED
@@ -43,7 +43,7 @@ labels so that your model can focus on predicting the outputs only.
  ### You may not want prompt templates

  However, there are many situations where you don't want to use one of
- these formats or templates (I usually don't!). This is because they can:
+ these formats or templates. This is because they can:

  - Add unnecessary boilerplate to your prompts.
  - Create artifacts like special delimiters `<|im_start|>` that can