Dolly 2 dataset

#9
by KnutJaegersberg - opened

Some datasets, like Alpaca, are for research only. It would be good to have Raven models that can be used for commercial purposes, too.

The Dolly 2 dataset has a clean license, I suppose:
https://github.com/databrickslabs/dolly/tree/master/data

There are more FOSS instruction-tuning datasets, I suppose.

GPT-3 and GPT-4 might give great training data, but they spoil the license and limit the applications of your model.

I +1 the idea! Fine-tuning on their dataset might lead to great results without potentially poisoning the license.

Here is the link to the article:
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

Will add to v10 :)

BlinkDL changed discussion status to closed

I'm confused. gpt4allv2, based on GPT-J, has an Apache 2 license after tuning on OpenAI API output.
Either they made a mistake, or it is no problem at all to fine-tune FOSS models on OpenAI API output.

Don't risk it. You're highly strategic. Play it safe.
