What code was this trained on?

#18
by grothetr - opened

The description doesn't provide much detail; "code" is very vague. Was it trained on the O'Reilly Media books? Open-source GitHub projects? The assembly code of Windows 10? Man pages for various programs? The Linux source code? Other official reference manuals, like Xlib's? Private code from Google's internal repos?

Google org

Hi @grothetr, it was trained on 13 trillion tokens of "primarily-English data". The exact composition of the training data, including what kind of code was used for 'gemma-2-27b-it', is not publicly disclosed by Google. However, you can refer to the 'gemma-2-27b-it' Model Data section for general information about the dataset used for model training.
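For reference, here is a minimal sketch of how one might read that Model Data section programmatically with the huggingface_hub library (this assumes the library is installed and that you have accepted the Gemma license for the gated repo):

```python
from huggingface_hub import ModelCard

# Load the model card for gemma-2-27b-it directly from the Hub.
# Gemma is a gated repo, so this may first require `huggingface-cli login`
# and accepting the license terms on the model page.
card = ModelCard.load("google/gemma-2-27b-it")

# The "Model Data" section (training dataset description) lives in the
# card's markdown body; print it to read the published details.
print(card.text)
```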

OK, thank you @Renu11. It would be nice, however, to have some kind of list of data sources, at some level of specificity. I'm surprised that it wasn't trained on any books. Wouldn't it be beneficial to train on some classic literature? Reports on research experiments that proved important theories would also be great: users could then know that the model has read, for example, the entire Millikan oil-drop experiment paper, and that it is taking the reasoning discussed in such studies into account.
