RicardoLee committed
Commit e3db504
1 Parent(s): 4a9a17f

README rectify

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -39,7 +39,7 @@ Some details in training:
 1. Training Framework: This model is trained on a modified [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) framework.
 2. Tokenizer: This model uses the tokenizer.model from the Chinese-Alpaca-Plus model. The reason for this choice is that the tokenizer.model in LLaMA 2 is identical to the one used in LLaMA 1. As a result, it is theoretically feasible to reuse the tokenizer from the Chinese-LLaMA project entirely, without encountering any token-misalignment issues.
 3. Training Parameters: Because the embeddings must be resized, the excess embeddings are randomly initialized. As a consequence, during the early stages of training, Deepspeed tends to reduce the loss scale due to "OVERFLOW" issues. Frequent reductions can shrink the scale too far, causing overflow and eventually crashing the training run. In this situation it is not advisable to lower the learning rate, warm-up, or other hyperparameters. Instead, the recommended approach is to scale the training parameters up to pretraining scale, which lets the randomly initialized embeddings converge quickly onto the right path.
-4. Training Resource: 8\*V100, 21 hours.
+4. Training Resource: 8\*V100, 40 hours.
 5. Initial Loss: 8.2499
 6. Train Loss: 1.5674
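
As an illustration of points 2 and 3 in the diff above (not part of this commit), a minimal sketch of reusing the extended Chinese-Alpaca-Plus tokenizer with a LLaMA-2 base checkpoint and resizing the embeddings in Hugging Face Transformers might look like the following; the checkpoint and tokenizer paths are placeholders, not taken from this repository:

```python
# Minimal sketch, assuming Hugging Face Transformers; paths are placeholders.
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model_path = "path/to/llama-2-base"                  # hypothetical LLaMA-2 checkpoint
tokenizer_path = "path/to/chinese-alpaca-plus-tokenizer"  # hypothetical extended tokenizer

# LLaMA 2 ships the same tokenizer.model as LLaMA 1, so the extended
# Chinese-Alpaca-Plus tokenizer can be loaded directly for a LLaMA-2 model.
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
model = LlamaForCausalLM.from_pretrained(base_model_path)

# The extended vocabulary is larger than the original one, so the embedding
# and LM-head matrices must be resized; the extra rows are randomly
# initialized, which is what point 3 attributes the early "OVERFLOW"
# loss-scale reductions to.
model.resize_token_embeddings(len(tokenizer))
```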
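
The loss-scale behaviour described in point 3 comes from DeepSpeed's dynamic fp16 loss scaling. A sketch of the relevant config section is shown below with illustrative values, not the settings actually used for this model:

```python
# Illustrative DeepSpeed fp16 section (not this model's actual config).
# With dynamic loss scaling ("loss_scale": 0), an "OVERFLOW" step reduces the
# scale; if that happens too often early in training, the scale can bottom
# out and training becomes unstable, which is the failure mode point 3 warns
# about.
ds_fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # starting scale of 2**16
        "loss_scale_window": 1000,  # steps without overflow before scaling up
        "hysteresis": 2,            # overflows tolerated before scaling down
        "min_loss_scale": 1,        # floor for the dynamic scale
    }
}
# This dict would be merged into the full DeepSpeed config passed to
# deepspeed.initialize(...).
```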