Add docs (#947)
* move section
* update README
* update README
* update README
* update README
* update README
* Update README.md
Co-authored-by: Wing Lian <wing.lian@gmail.com>
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
README.md CHANGED
@@ -36,7 +36,9 @@ Features:
 - [Train](#train)
 - [Inference](#inference)
 - [Merge LORA to Base](#merge-lora-to-base)
+- [Special Tokens](#special-tokens)
 - [Common Errors](#common-errors-)
+- [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
 - [Need Help?](#need-help-)
 - [Badge](#badge-)
 - [Community Showcase](#community-showcase)
@@ -251,6 +253,13 @@ Have dataset(s) in one of the following format (JSONL recommended):
   ```json
   {"conversations": [{"from": "...", "value": "..."}]}
   ```
+- `llama-2`: the json is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
+  ```yml
+  datasets:
+    - path: <your-path>
+      type: sharegpt
+      conversation: llama-2
+  ```
 - `completion`: raw corpus
   ```json
   {"text": "..."}
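Before pointing a config at a new dataset, it can be worth validating that every row actually matches the `sharegpt` shape shown in the hunk above. A minimal Python sketch, assuming a local JSONL file (the path is a placeholder, not anything axolotl requires):

```python
import json

# Illustrative path; point this at your own dataset file.
DATASET_PATH = "data/sharegpt_train.jsonl"

with open(DATASET_PATH) as f:
    for line_no, line in enumerate(f, start=1):
        row = json.loads(line)
        conversations = row.get("conversations")
        assert isinstance(conversations, list), f"line {line_no}: expected a 'conversations' list"
        for turn in conversations:
            # Each turn should look like {"from": "...", "value": "..."}.
            assert {"from", "value"} <= set(turn), f"line {line_no}: turn missing 'from'/'value'"
```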
@@ -970,6 +979,22 @@ wandb_name:
 wandb_log_model:
 ```
 
+##### Special Tokens
+
+It is important to have special tokens like delimiters, end-of-sequence, and beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
+
+```yml
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
+tokens: # these are delimiters
+  - "<|im_start|>"
+  - "<|im_end|>"
+```
+
+When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
+
 ### Inference
 
 Pass the appropriate flag to the train command:
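To see the effect of that `special_tokens` config, here is a minimal sketch of roughly equivalent operations using the Hugging Face `transformers` tokenizer API directly. The base model name is a placeholder, and this approximates the behavior described above rather than reproducing axolotl's internal code:

```python
from transformers import AutoTokenizer

# Placeholder base model; substitute the one from your axolotl config.
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")

# Roughly what the `special_tokens:` block requests.
tokenizer.add_special_tokens(
    {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}
)

# Roughly what the `tokens:` list (delimiters) requests.
num_added = tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])
print(f"added {num_added} new tokens")

# Each delimiter should now map to a single dedicated id, not the unk id
# and not a sequence of sub-word pieces.
for tok in ("<|im_start|>", "<|im_end|>"):
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))
```

Note that if the vocabulary grows, the model's embedding matrix must grow with it; in plain `transformers` you would also call `model.resize_token_embeddings(len(tokenizer))` so the two stay in sync.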
@@ -1048,6 +1073,20 @@ It's safe to ignore it.
 
 See the [NCCL](docs/nccl.md) guide.
 
+
+### Tokenization Mismatch b/w Inference & Training
+
+For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
+
+If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following (see the sketch after this diff):
+
+1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
+2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
+3. Make sure the inference string from #2 looks **exactly** like the data you fine-tuned on from #1, including spaces and new lines. If they aren't the same, adjust your inference server accordingly.
+4. As an additional troubleshooting step, you can look at the token ids between 1 and 2 to make sure they are identical.
+
+Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
+
 ## Need help? 🙋♂️
 
 Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
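To make the failure mode concrete, a minimal sketch contrasting the two construction orders the section describes: tokenize-then-concatenate (the training-side behavior) versus concatenate-then-tokenize (what a naive inference server might do). The model name and prompt fragments are illustrative assumptions:

```python
from transformers import AutoTokenizer

# Placeholder model; use the tokenizer of the model you actually fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")

# Illustrative prompt fragments, stand-ins for a template's pieces.
prefix = "[INST] How are you? [/INST]"
response = "I'm doing great!"

# Training-style construction: tokenize the pieces, then concatenate the ids.
ids_by_parts = tokenizer.encode(prefix, add_special_tokens=False) + tokenizer.encode(
    response, add_special_tokens=False
)

# Naive inference-style construction: concatenate strings, then tokenize.
ids_by_string = tokenizer.encode(prefix + response, add_special_tokens=False)

# Steps 2-4: decode both back to strings and compare the ids; any divergence
# in the repr output (often a single space) is the mismatch described above.
print(repr(tokenizer.decode(ids_by_parts)))
print(repr(tokenizer.decode(ids_by_string)))
print("ids identical:", ids_by_parts == ids_by_string)
```

If the two id sequences differ, the `repr` output usually shows exactly where a space or special token diverges, which tells you what to adjust on the inference side.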