training data
#4
Opened by simeneide
Hi, you write in the doc:
Training data
NB-GPT-J-6B was finetuned on NCC, the Norwegian Colossal Corpus, plus other Internet sources like Wikipedia, mC4, and OSCAR.
Are there any news sources in this "other" category, and do you have an approximate amount?
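In case it helps, here is a minimal sketch of how one might check the NCC side of this empirically: stream the public NbAiLab/NCC dataset and tally its per-document source labels. This assumes the dataset exposes a `doc_type` field (adjust the field name if the schema differs), and it only covers NCC, not the extra Wikipedia/mC4/OSCAR data.

```python
# Sketch: tally source labels in the Norwegian Colossal Corpus (NCC).
# Assumption: the NbAiLab/NCC dataset has a "doc_type" field whose values
# identify the source (e.g. newspaper vs. government vs. Wikipedia text).
from collections import Counter
from datasets import load_dataset

# Stream to avoid downloading the full corpus up front.
ncc = load_dataset("NbAiLab/NCC", split="train", streaming=True)

counts = Counter()
for i, example in enumerate(ncc):
    counts[example["doc_type"]] += 1
    if i >= 100_000:  # sample a prefix; drop this cap for exact counts
        break

# Any doc_type containing "news" would point to newspaper material.
for doc_type, n in counts.most_common():
    print(f"{doc_type}: {n}")
```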
versae changed discussion status to closed