Training datasets used for Danish, Swedish and Spanish languages

#12

by cgjoshi - opened Oct 17

Oct 17

Hello,
I am running some evaluation tests on this model for the languages mentioned in the title. The initial results are really promising. I would like to know more. Can someone provide details on the datasets used for training this model for the mentioned languages?

cgjoshi changed discussion status to closed Oct 17

cgjoshi changed discussion status to open Oct 17

tomaarsen

Sentence Transformers org Oct 17

Hello!

I believe this model was finetuned with English data on top of a multilingual base model, which I think is https://huggingface.co/FacebookAI/xlm-roberta-base. In our experience, this is a rather capable method of getting a multilingual embedding model.
Some other models, e.g. from https://huggingface.co/models?library=sentence-transformers&language=da&sort=trending, do train with non-English datasets for their multilingual models, often (although not always) reaching even better results.
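
If you want to check how well this transfers to your own languages, one rough option is to embed a few English sentences together with their translations and compare cosine similarities per language. The snippet below is only a minimal sketch of that idea, not the evaluation used for this model: `cosine` and `cross_lingual_scores` are hypothetical helper names, and `encode` is assumed to be any callable that maps a list of sentences to a list of vectors.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_lingual_scores(
    encode: Callable[[List[str]], Sequence[Vector]],
    english: List[str],
    translations: Dict[str, List[str]],
) -> Dict[str, float]:
    """Mean cosine similarity between each English sentence and its translation.

    `translations` maps a language code (e.g. "da") to sentences aligned
    one-to-one with `english`. Higher scores suggest better cross-lingual alignment.
    """
    if not english:
        raise ValueError("need at least one English sentence")
    english_vectors = encode(english)
    scores: Dict[str, float] = {}
    for lang, sentences in translations.items():
        if len(sentences) != len(english):
            raise ValueError(f"{lang}: expected {len(english)} sentences, got {len(sentences)}")
        translated_vectors = encode(sentences)
        sims = [cosine(e, t) for e, t in zip(english_vectors, translated_vectors)]
        scores[lang] = sum(sims) / len(sims)
    return scores
```

With sentence-transformers installed, you could pass `SentenceTransformer(model_id).encode` as `encode`, where `model_id` is a placeholder for the model you are evaluating, and call it with something like `["The weather is nice today."]` and `{"da": ["Vejret er dejligt i dag."], "sv": ["Vädret är fint idag."], "es": ["Hoy hace buen tiempo."]}`. A handful of sentence pairs only gives a rough signal, so a proper benchmark for each language is still worth running.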

Tom Aarsen
