Is this model capable of more than just translation?

#9
by Pinestone - opened

Hi, I'm new to this area, and I noticed that this model is tagged as generative. Does it only handle translation, or can it also generate answers to questions?

I would look at the instruct models for question answering, for instance normistral-7b-warm-instruct. Those are further instruction-tuned from this model, meaning they've been trained on a hand-crafted, higher-quality dataset focused specifically on question answering.

How that performs will be up to you to evaluate, but it does answer questions.
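To make the suggestion concrete, here's a minimal sketch of querying an instruct model for question answering. The prompt format and the helper function are illustrative assumptions, not the model's official chat template, and the generation backend is stubbed out so the sketch runs standalone; the commented lines show how one might plug in the real model via the Hugging Face `transformers` pipeline (assuming the repo id `norallm/normistral-7b-warm-instruct`).

```python
# Hedged sketch: asking an instruction-tuned model a question in Norwegian.
# The "Spørsmål:/Svar:" prompt format here is an illustrative assumption,
# not the model's documented chat template.

def answer_question(question: str, generate) -> str:
    """Wrap a question in a simple instruction prompt and return the reply.

    `generate` is any callable mapping a prompt string to generated text,
    so the heavy model call can be swapped in or stubbed out.
    """
    prompt = f"Spørsmål: {question}\nSvar:"
    reply = generate(prompt)
    # Keep only the text after the answer marker.
    return reply.split("Svar:", 1)[-1].strip()

# Stub generator standing in for the real model in this sketch.
def stub_generate(prompt: str) -> str:
    return prompt + " Oslo er hovedstaden i Norge."

print(answer_question("Hva er hovedstaden i Norge?", stub_generate))
# prints "Oslo er hovedstaden i Norge."

# With the real model (downloads ~7B weights; a GPU is recommended):
# from transformers import pipeline
# pipe = pipeline("text-generation", model="norallm/normistral-7b-warm-instruct")
# answer = answer_question(
#     "Hva er hovedstaden i Norge?",
#     lambda p: pipe(p, max_new_tokens=128)[0]["generated_text"],
# )
```

Whether the answers are good enough for your use case is still something you'd have to evaluate yourself.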

Thank you for the information. However, I am looking for a Norwegian-trained LLM for question-answering. Is the normistral-7b-warm-instruct specifically trained on Norwegian? Looking through the stated datasets (used to train normistral-7b-warm-instruct), they seem to be only in English, which is why I'm wondering.

Per the model card, the final step in building the fine-tuning corpus was:
“Finally, we translated the resulting dataset into Bokmål and Nynorsk using NorMistral-7b-warm.”

See the rest of the model card for the other steps, but essentially it's a collected, cleaned, and enhanced English dataset that has been translated to Norwegian Bokmål and Nynorsk. That means the model is trained to answer questions in Norwegian, although you might occasionally see English-looking sentence structure in its responses. The quality is also conditioned on how well the regular normistral-7b-warm does translation (hopefully quite well).

Oh, I see now. Thank you for the clarification.

Pinestone changed discussion status to closed

I'd like to add, though: IMHO the scratch models, trained on nothing but Norwegian (except for the code-generation dataset), should in theory produce the best Norwegian responses. That is, given enough data, which isn't the case yet.

With time I'd hope we can gather enough purely Norwegian data to train a model that competes with ones trained on English data - maybe when the national library sorts out its copyright issues and can expand the NCC by a lot? On top of that, we'd need to create large enough instruction-tuning datasets to do the instruction training in “proper” Norwegian as well. Go have a look at the NorwAI instruction-trained models (from NTNU): their instruction-training dataset has some shortcomings (it currently doesn't comply with a chat template, for instance), but it is purely in Norwegian, which is pretty cool.
