RicardoLee
commited on
Commit
•
e3db504
1
Parent(s):
4a9a17f
README rectify
Browse files
README.md
CHANGED
@@ -39,7 +39,7 @@ Some details in training:
|
|
39 |
1. Trianing Framework: This model is trained on modified [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) Framework.
|
40 |
2. Tokenizer: This model utilizes the tokenizer.model from the Chinese-Alpaca-Plus model. The reason for this choice is that the tokenizer.model in LLama2 is identical to the one used in LLama1. As a result, it is theoretically feasible to entirely reuse the tokenizer from the Chinese-LLaMa project without encountering any issues related to token misalignment.
|
41 |
3. Training Parameters: Due to the need to resize the embeddings, the excess embeddings are randomly initialized. As a consequence, during the initial stages of training, Deepspeed is prone to reducing the loss scale due to "OVERFLOW" issues. Frequent reductions can lead to an overly small scale, causing overflow and eventually crashing the training process. In such situations, it is not advisable to lower the learning rate, warm-up, or other hyperparameters. Instead, the recommended approach is to upscale the training parameters to Pretrain scale. This allows the randomly initialized embeddings to quickly converge to the right path.
|
42 |
-
4. Training Resource: 8\*V100,
|
43 |
5. Initial Loss: 8.2499
|
44 |
6. Train Loss: 1.5674
|
45 |
|
|
|
39 |
1. Trianing Framework: This model is trained on modified [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) Framework.
|
40 |
2. Tokenizer: This model utilizes the tokenizer.model from the Chinese-Alpaca-Plus model. The reason for this choice is that the tokenizer.model in LLama2 is identical to the one used in LLama1. As a result, it is theoretically feasible to entirely reuse the tokenizer from the Chinese-LLaMa project without encountering any issues related to token misalignment.
|
41 |
3. Training Parameters: Due to the need to resize the embeddings, the excess embeddings are randomly initialized. As a consequence, during the initial stages of training, Deepspeed is prone to reducing the loss scale due to "OVERFLOW" issues. Frequent reductions can lead to an overly small scale, causing overflow and eventually crashing the training process. In such situations, it is not advisable to lower the learning rate, warm-up, or other hyperparameters. Instead, the recommended approach is to upscale the training parameters to Pretrain scale. This allows the randomly initialized embeddings to quickly converge to the right path.
|
42 |
+
4. Training Resource: 8\*V100, 40 hours.
|
43 |
5. Initial Loss: 8.2499
|
44 |
6. Train Loss: 1.5674
|
45 |
|