RicardoLee commited on
Commit
5808953
·
1 Parent(s): 1ca452f

README rectify

Browse files
Files changed (1) hide show
  1. README.md +1 -0
README.md CHANGED
@@ -30,6 +30,7 @@ The training data is sampled from [BELLE](https://huggingface.co/BelleGroup) pro
30
  5. 训练起始的loss:8.7072
31
  6. 训练终止的loss:1.5674
32
 
 
33
  1. Trianing Framework: This model is trained on modified [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) Framework.
34
  2. Tokenizer: This model utilizes the tokenizer.model from the Chinese-Alpaca-Plus model. The reason for this choice is that the tokenizer.model in LLama2 is identical to the one used in LLama1. As a result, it is theoretically feasible to entirely reuse the tokenizer from the Chinese-LLaMa project without encountering any issues related to token misalignment.
35
  3. Training Parameters: Due to the need to resize the embeddings, the excess embeddings are randomly initialized. As a consequence, during the initial stages of training, Deepspeed is prone to reducing the loss scale due to "OVERFLOW" issues. Frequent reductions can lead to an overly small scale, causing overflow and eventually crashing the training process. In such situations, it is not advisable to lower the learning rate, warm-up, or other hyperparameters. Instead, the recommended approach is to upscale the training parameters to Pretrain scale. This allows the randomly initialized embeddings to quickly converge to the right path.
 
30
  5. 训练起始的loss:8.7072
31
  6. 训练终止的loss:1.5674
32
 
33
+
34
  1. Trianing Framework: This model is trained on modified [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) Framework.
35
  2. Tokenizer: This model utilizes the tokenizer.model from the Chinese-Alpaca-Plus model. The reason for this choice is that the tokenizer.model in LLama2 is identical to the one used in LLama1. As a result, it is theoretically feasible to entirely reuse the tokenizer from the Chinese-LLaMa project without encountering any issues related to token misalignment.
36
  3. Training Parameters: Due to the need to resize the embeddings, the excess embeddings are randomly initialized. As a consequence, during the initial stages of training, Deepspeed is prone to reducing the loss scale due to "OVERFLOW" issues. Frequent reductions can lead to an overly small scale, causing overflow and eventually crashing the training process. In such situations, it is not advisable to lower the learning rate, warm-up, or other hyperparameters. Instead, the recommended approach is to upscale the training parameters to Pretrain scale. This allows the randomly initialized embeddings to quickly converge to the right path.