Since you use fp16 activations during training, I'm not sure whether running inference in fp32 gives better results than fp16.
Hello. Yes, we did use fp16 during training. In our internal benchmarks, we found no significant difference in inference accuracy between fp32 and fp16. That said, using fp32 will not degrade accuracy, so if accuracy is particularly important for your downstream task, it may be worth trying fp32 as well.
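If it helps, here is a minimal PyTorch sketch comparing the two precisions at inference time. The `Linear` layer and random input are stand-ins for the actual model and data, which aren't specified in this thread; fp16 inference is assumed to run on a CUDA device, where half precision is well supported.

```python
import torch

# Stand-ins for the actual model and input (assumptions for illustration).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(1, 16, device=device)

model.eval()
with torch.no_grad():
    # fp32 inference: weights and activations stay in float32.
    out_fp32 = model(x)

    if device == "cuda":  # fp16 is best supported on GPU
        # fp16 inference: cast weights and input to half precision.
        out_fp16 = model.half()(x.half())
        diff = (out_fp32 - out_fp16.float()).abs().max()
        print(f"max abs difference fp32 vs fp16: {diff.item():.6f}")
```

In practice you would load your trained checkpoint instead of the toy layer and compare task metrics (not just raw output differences) to decide whether fp32 is worth the extra memory and compute for your use case.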