grimjim
posted an update Aug 31
I found this paper to be thought-provoking: "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" by Bansal, Hosseini, Agarwal, Tran, and Kazemi.
https://arxiv.org/abs/2408.16737
The direct implication is that smaller models could be used to create cost-effective synthetic datasets. And on that note, in the Gemma terms of use, Google explicitly claims no rights on outputs generated from those models, which means one is free to synthgen from the Gemma line. Meta's Llama 3 license, by contrast, forbids using generated outputs to improve other models. Relevant Mistral, Qwen, and Yi models under the Apache 2.0 license are unrestricted for this purpose.
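
For anyone who wants to try this, here is a minimal sketch of the idea (mine, not the paper's code): draw several samples per prompt from a small Gemma checkpoint with the Hugging Face transformers library. The model name, prompt, and sampling settings are purely illustrative, and the checkpoint requires accepting Gemma's terms on the Hub.

```python
# Minimal sketch: sample several candidate answers per prompt from a small model,
# in the spirit of compute-optimal sampling.
# Assumes transformers is installed and you have access to google/gemma-2-2b-it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Q: A train travels 60 km in 45 minutes. What is its average speed in km/h?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The weaker model's larger sampling budget is the whole point: many cheap samples
# per question instead of a few expensive ones from a bigger model.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=256,
    num_return_sequences=8,
)
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
for i, text in enumerate(candidates):
    print(f"--- candidate {i} ---\n{text}\n")
```

In practice you would filter these candidates (for example, by checking final answers against a reference) before keeping them in a synthetic dataset.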

I was already thinking this when Llama 2 came out! We have to generate low-quality, high-quantity datasets. Since these LMs are essentially sampling answers from the training dataset, it's simply better to have more data. Then Llama 3 came out with 15T tokens... the base model's performance wasn't near the instruction-tuned one, because the low quality hurt performance, but with 15T tokens in the dataset, the model is pretty good after instruction tuning.

So we just need two datasets: a pretraining set (really big but low quality) and a refinement set (small but high quality) to refine the chaotic, high-volume model.
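
A schematic sketch of that two-stage recipe (mine, not from the thread): the tiny GPT-2-style config and the placeholder datasets are only there so the script runs end to end; real corpora would obviously be far larger.

```python
# Stage 1: pretrain on a big noisy corpus. Stage 2: refine on a small curated set.
from datasets import Dataset
from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=128))  # toy-sized model

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Stage 1: large, low-quality pretraining corpus (placeholder text).
noisy = Dataset.from_dict({"text": ["scraped, noisy web text ..."] * 1000}).map(tokenize, batched=True)
# Stage 2: small, high-quality refinement set (placeholder text).
curated = Dataset.from_dict({"text": ["carefully curated instruction/answer pair ..."] * 50}).map(tokenize, batched=True)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def run_stage(dataset, output_dir, epochs):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=8, report_to=[])
    Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()

run_stage(noisy, "stage1_pretrain", epochs=1)   # chaotic, high-volume stage
run_stage(curated, "stage2_refine", epochs=3)   # small, high-quality refinement
```

The refinement stage here is plain fine-tuning; the same shape holds if stage 2 is instruction tuning on a small, carefully built dataset.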

I am not a lawyer, but I remember reading an article about AI-generated work not being eligible for copyright. That means that, technically speaking, all of those outputs are in the public domain and we can use them however we want, no? You might still be violating the terms of service of, for example, OpenAI, and they can terminate your account, but I don't think they can legally sue you for using their AI output to train new models.

Just food for thought.
