Removing the word "exclusively", since it implies the model was trained ONLY on the data described in the section

#7

The team that trained NORA likely doesn't know what Mistral AI used to train Mistral-7B-v0.1 (since Mistral AI hasn't revealed it, to avoid legal action), and as such cannot make claims about whether Mistral was trained exclusively on public web data.
Clarified this in the section so that somebody just skimming the text (or an LLM trying to summarize the page) doesn't miss the fact that this model is Mistral-Unknown + Nora Corpus when looking for details in the "pretraining corpus" section.

It is LIKELY that Mistral was trained on public data like Common Crawl, but it is impossible to claim this was done exclusively: it might very well have been trained on books or other corpora that are not on the open internet, private datasets, etc.

Norwegian Large Language Models org

Changed the sentence to "The model is continually pretrained exclusively on publicly available data" to make this clearer. We state that the model is initialized from Mistral in the very first sentence of the README, as well as in the name of the model itself. Thanks!
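For readers unfamiliar with the distinction being settled here: "continually pretrained" means the weights start from the released Mistral-7B-v0.1 checkpoint and training simply continues on the new corpus with the ordinary next-token objective, so the Mistral-derived weights carry whatever unknown data Mistral AI used, while only the continual-pretraining corpus is under the NORA team's control. Below is a minimal sketch of that setup using Hugging Face `transformers`; the corpus file `nora_corpus.txt` and the output directory are hypothetical placeholders, and this is not the actual NORA training code.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Warm start: load the publicly released Mistral weights instead of a random init.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Hypothetical placeholder for the continual-pretraining corpus.
dataset = load_dataset("text", data_files={"train": "nora_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False selects the causal-LM objective, i.e. plain next-token prediction.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="normistral-continual", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```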

davda54 changed pull request status to closed
