RicardoLee/Llama2-chat-13B-Chinese-50W


RicardoLee committed on Jul 23, 2023

Commit e3db504 (1 parent: 4a9a17f)

README rectify

Files changed (1)

README.md +1 -1


@@ -39,7 +39,7 @@ Some details in training:
 1. Trianing Framework: This model is trained on modified [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) Framework.
 2. Tokenizer: This model utilizes the tokenizer.model from the Chinese-Alpaca-Plus model. The reason for this choice is that the tokenizer.model in LLama2 is identical to the one used in LLama1. As a result, it is theoretically feasible to entirely reuse the tokenizer from the Chinese-LLaMa project without encountering any issues related to token misalignment.
 3. Training Parameters: Due to the need to resize the embeddings, the excess embeddings are randomly initialized. As a consequence, during the initial stages of training, Deepspeed is prone to reducing the loss scale due to "OVERFLOW" issues. Frequent reductions can lead to an overly small scale, causing overflow and eventually crashing the training process. In such situations, it is not advisable to lower the learning rate, warm-up, or other hyperparameters. Instead, the recommended approach is to upscale the training parameters to Pretrain scale. This allows the randomly initialized embeddings to quickly converge to the right path.
-4. Training Resource: 8\*V100, 21 hours.
+4. Training Resource: 8\*V100, 40 hours.
 5. Initial Loss: 8.2499
 6. Train Loss: 1.5674
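
Note on item 3 of the README text above: the failure mode it describes (repeated "OVERFLOW" steps shrinking the loss scale until training crashes) can be illustrated with a toy dynamic loss scaler. This is a minimal sketch under stated assumptions, not DeepSpeed's implementation: the class `ToyLossScaler`, the helper `steps_until_crash`, and all default values are hypothetical and chosen only for illustration.

```python
class ToyLossScaler:
    """Simplified dynamic loss scaler (illustrative only, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0**16, factor=2.0, window=1000, min_scale=1.0):
        self.scale = init_scale      # current loss scale
        self.factor = factor         # multiplicative step for shrinking/growing
        self.window = window         # overflow-free steps needed before growing
        self.min_scale = min_scale   # floor below which the scale cannot go
        self.good_steps = 0          # consecutive steps without overflow

    def update(self, overflow):
        """Adjust the scale after one training step."""
        if overflow:
            # An overflow at the floor cannot be recovered by shrinking further.
            if self.scale <= self.min_scale:
                raise RuntimeError("loss scale already at minimum")
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            # Grow the scale again after a full window of clean steps.
            if self.good_steps % self.window == 0:
                self.scale *= self.factor


def steps_until_crash(scaler, max_steps=10_000):
    """Report an overflow on every step; return how many updates succeeded."""
    for step in range(max_steps):
        try:
            scaler.update(overflow=True)
        except RuntimeError:
            return step
    return max_steps
```

With the hypothetical defaults used here, the scale starts at 2^16 and halves on each overflow, so 16 consecutive overflowing steps bring it to the floor and the next one aborts. That is why a run whose freshly initialized embeddings keep overflowing early on can exhaust the scale quickly.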