License and Datasets used?

#1
by NilanE - opened

Hi,
What is the license for this model, and what datasets/sources were used to train it?

Thanks.

NilanE changed discussion title from Datasets used? to License and Datasets used?

Hi,
The license should be the same as Japanese Stable LM Instruct Gamma 7B's, namely Apache 2.0. But I'm not very knowledgeable about licenses, so to be safe I'd say it's generally advisable to use this model for personal use only. As for datasets, I used less than 1 GB of web fiction for fine-tuning.

Thanks!

NilanE changed discussion status to closed

Forgot to mention, the reason I'm interested in the datasets is that I'm trying to fine-tune a model specifically for Japanese-to-English web novel translation. I created a very high-quality sentence-aligned parallel dataset of web novel chapters, but its scale (~100 MB) wasn't enough for a good result, even with a Japanese-trained base model. So I'm first fine-tuning on a large corpus of non-parallel Japanese and English web novels, then doing another fine-tune with the parallel dataset on top. I started with classical literature in the public domain (which vastly improved translation quality), but the quantity, quality, and relevance of that data weren't great (globis-university/aozorabunko-clean and ubaada/booksum-complete-cleaned), so I'm trying to integrate web novels as well.
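
For context, the second-stage fine-tune on the parallel data looks roughly like this sketch (the file name, column names, and prompt template here are just illustrative, not my exact setup):

```python
from datasets import load_dataset

# Hypothetical JSONL file of sentence-aligned pairs: {"ja": ..., "en": ...}
raw = load_dataset("json", data_files="parallel_webnovels.jsonl", split="train")

PROMPT = (
    "Translate the following Japanese web novel passage into English.\n\n"
    "### Japanese:\n{ja}\n\n### English:\n"
)

def to_example(row):
    # Build a single training string: translation prompt + target English text.
    return {"text": PROMPT.format(ja=row["ja"]) + row["en"]}

train_ds = raw.map(to_example, remove_columns=raw.column_names)

# train_ds["text"] is then tokenized and fed to a standard causal-LM
# fine-tuning loop (e.g. transformers Trainer, optionally with LoRA/PEFT)
# on top of the model already fine-tuned on the non-parallel JP/EN corpus.
```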

NilanE changed discussion status to open
NilanE changed discussion status to closed
