Training datasets used for Danish, Swedish and Spanish languages

#12
by cgjoshi - opened

Hello,
I am running some evaluation tests on this model for the languages mentioned in the title. The initial results are really promising. I would like to know more. Can someone provide details on the datasets used for training this model for the mentioned languages?

Sentence Transformers org

Hello!

I believe this model was fine-tuned on English data on top of a multilingual base model, which I think is https://huggingface.co/FacebookAI/xlm-roberta-base. In our experience, this is a fairly effective way to get a multilingual embedding model.
Some other models, e.g. those listed at https://huggingface.co/models?library=sentence-transformers&language=da&sort=trending, are trained on non-English datasets as well, often (although not always) reaching even better results.
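
If you want to compare this model against one of those alternatives on Danish, Swedish or Spanish, a rough sanity check is to see how often an English sentence lands closest to its own translation. The snippet below is only a sketch under assumptions: `translation_match_rate` and `load_encoder` are hypothetical helper names made up for this example, the model id is whatever you pass in, and the sentence pairs are yours to supply. The only library calls used are `SentenceTransformer(...)` and `.encode(...)` from sentence-transformers.

```python
import math
from typing import Callable, List, Sequence

Encoder = Callable[[List[str]], List[List[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def translation_match_rate(
    encode: Encoder, english: Sequence[str], translated: Sequence[str]
) -> float:
    """Fraction of English sentences whose most similar candidate among
    `translated` is the sentence at the same index (its own translation)."""
    if not english or len(english) != len(translated):
        raise ValueError("need two non-empty lists of equal length")
    src = encode(list(english))
    tgt = encode(list(translated))
    hits = 0
    for i, vec in enumerate(src):
        best = max(range(len(tgt)), key=lambda j: cosine(vec, tgt[j]))
        hits += int(best == i)
    return hits / len(src)


def load_encoder(model_name: str) -> Encoder:
    """Wrap a sentence-transformers model as a plain encode function.

    Imported lazily so the helpers above work without the library installed.
    `model_name` is an assumption: pass the id of the model you are testing.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda sentences: model.encode(sentences).tolist()
```

You would call it as `translation_match_rate(load_encoder("<model id>"), english_sentences, danish_sentences)` and repeat for each model and language. It is not a substitute for a proper benchmark, but it gives a quick feel for how well the English-only fine-tuning carries over.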

  • Tom Aarsen
