ctheodoris/Genecorpus-30M
I'd just start with ModernBERT-large though: easier, and a strong base. Less faffing about. Also big vocab <3
They apply PCA (before the Zipf weighting) and explicitly state that it improved performance.
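Rough sketch of that order of operations (PCA first, Zipf weighting second), in case it helps. This is not their code: the function names are made up, the random matrix stands in for real token embeddings, and the `log(1 + rank)` weight with rows sorted from most to least frequent token is my assumption about what the Zipf step looks like.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project embeddings onto their top principal components (hypothetical helper)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def zipf_weight(embeddings: np.ndarray) -> np.ndarray:
    """Scale each row by log(1 + rank) (assumed weighting, not confirmed by the source).

    Assumes row i belongs to the i-th most frequent token, so frequent
    tokens get small weights and rare tokens get larger ones.
    """
    ranks = np.arange(1, embeddings.shape[0] + 1)
    weights = np.log(1 + ranks)
    return embeddings * weights[:, None]


# Placeholder data: 1000 tokens with 64-dim embeddings.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(1000, 64))

reduced = pca_reduce(token_embeddings, n_components=16)  # PCA first
static_embeddings = zipf_weight(reduced)                 # then Zipf weighting

print(static_embeddings.shape)  # (1000, 16)
```

Doing the weighting after PCA means the principal directions are fit on the unweighted vectors, so the rare-token upweighting does not skew which directions get kept.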
Did you try potion/m2v as a starting point? (Never mind ModernBERT and its much larger vocab.)