Domain adaptation

#4
by leobg - opened

Hey Aaron,

This model is great!

If I wanted to adapt this to a special domain, like German medical texts, what would be my best path forward?

Can I domain-adapt with unlabelled data, i.e. gigabytes' worth of text? Or would I have to generate labels (query, positive, negative)?

Can you give me some pointers?

Hi,
thanks!
It depends on the task. For semantic similarity / semantic search, you could further fine-tune the model with labels. Note that the similarity scores should be between 0 (no similarity) and 1 (identical).
Also, consider adding tokens for domain-specific words before fine-tuning. If you want to stick with sentence-transformers, this could help:
https://github.com/UKPLab/sentence-transformers/issues/744

Also, if you have enough resources, you could train from scratch, or just use a smaller pretrained model:

https://huggingface.co/GerMedBERT/medbert-512

For semantic similarity, you could either use your labeled dataset or fine-tune on STS to prime the model for the task. The first option would be more promising.

All the best
Aaron
