About the number of training tokens
#1 opened by Enigrand
Hi,
The hybrid model and the transformer it is compared against were trained on 3.5T tokens, as described in your paper, which is not consistent with the naming here.
Is there something I'm missing?
rwaleffe changed discussion status to closed
rwaleffe changed discussion status to open
You aren't missing anything. These models were trained for 3.5T tokens as described in the paper. The 3.5T token count has simply been shortened to "3t" in the naming here.
rwaleffe changed discussion status to closed