DeepSeek Training Support

#192
by SuperXr - opened

Twinkle update: Support for DeepSeek V4 post-training is now live, including optimized support for Ascend NPU!
🔗 Read more: https://mp.weixin.qq.com/s/5AvzBlZe-BQ5hk_hdoXamw
💻 Cookbook: https://github.com/modelscope/twinkle/blob/main/cookbook/transformers/deepseek_v4_flash.py

Does DeepSeek V4 Flash start with a loss of 16? Is it using randomly initialized weights?
image

We tested two configurations: the DeepSeek architecture loaded with the original DeepSeek weights, and the same architecture with randomly initialized weights. The figure shows the run that used randomly initialized weights.
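
As a rough sanity check on starting loss values, a model that predicts every token with equal probability has a cross-entropy of ln(V) nats, where V is the vocabulary size. The sketch below computes that baseline. The vocabulary size used here is a placeholder assumption for illustration, not a value taken from the DeepSeek V4 Flash config, and `uniform_cross_entropy` is a hypothetical helper, not part of Twinkle.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy in nats of a uniform prediction over vocab_size tokens."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return math.log(vocab_size)


# Placeholder assumption: replace with the vocab_size from the model's config.
ASSUMED_VOCAB_SIZE = 128_000

baseline = uniform_cross_entropy(ASSUMED_VOCAB_SIZE)
print(f"Uniform-prediction baseline: {baseline:.2f} nats")
```

A randomly initialized network can start above this baseline when its initial logits are far from uniform, whereas a run that loads pretrained weights would normally start well below it. Comparing the first logged loss against ln(V) is one quick way to tell which configuration a curve came from.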
