Disclosing training data

#13
by Vipitis - opened

Hey,

I couldn't find any details about the training data. Is it a subset of the-stack, or does it also include additionally crawled data from the web?

It would be helpful to know about potential data contamination for my creative coding benchmark.

Thanks!

I am interested in the same question. For example, the documentation does not say which 80 codes are included in the training data, or in which proportions. Could you please add this information to the documentation?
