License and Datasets used?

#1
by NilanE - opened

Hi,
What is the license for this model, and what datasets/sources were used to train it?

Thanks.

NilanE changed discussion title from Datasets used? to License and Datasets used?

Hi,
The license should be the same as Japanese Stable LM Instruct Gamma 7B's, namely Apache 2.0. But I'm not very knowledgeable about licenses, so to be safe I'd say it's generally advisable to use this model for personal use only. As for datasets, I used less than 1 GB of web fiction for fine-tuning.

Thanks!

NilanE changed discussion status to closed

Forgot to mention, the reason I'm interested in the datasets is that I'm trying to fine-tune a model specifically for Japanese-to-English web novel translation. I created a very high-quality sentence-aligned parallel dataset of web novel chapters, but its scale (~100 MB) wasn't enough for a good result, even with a Japanese-trained base model. So I'm first fine-tuning on a large corpus of non-parallel Japanese and English web novels, then doing another fine-tune with the parallel dataset on top. I started with classical literature in the public domain (which vastly improved translation quality), but the quantity, quality, and relevance of that data weren't great (globis-university/aozorabunko-clean and ubaada/booksum-complete-cleaned), so I'm trying to integrate web novels as well.
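
For context, the second-stage fine-tune on the parallel data looks roughly like this sketch (the file name, column names, and prompt template here are just illustrative, not my exact setup):

```python
from datasets import load_dataset

# Hypothetical JSONL file of sentence-aligned pairs: {"ja": ..., "en": ...}
raw = load_dataset("json", data_files="parallel_webnovels.jsonl", split="train")

PROMPT = (
    "Translate the following Japanese web novel passage into English.\n\n"
    "### Japanese:\n{ja}\n\n### English:\n"
)

def to_example(row):
    # Build a single training string: translation prompt + target English text.
    return {"text": PROMPT.format(ja=row["ja"]) + row["en"]}

train_ds = raw.map(to_example, remove_columns=raw.column_names)

# train_ds["text"] is then tokenized and fed to a standard causal-LM
# fine-tuning loop (e.g. transformers Trainer, optionally with LoRA/PEFT)
# on top of the model already fine-tuned on the non-parallel JP/EN corpus.
```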

NilanE changed discussion status to open
NilanE changed discussion status to closed
