Multilanguage

#6
by DamianS89 - opened

Hey,
is there a ColBERT for other languages? I'm specifically looking for German.

If not: do you have a recommendation on the amount of data needed to fine-tune this model to work better with German?

Best,
Damian

Hey Damian,

Chiming in quickly, Omar might have more insight to share later on!

I've heard of some efforts to train a German ColBERT, but there isn't one yet, and there is no multilingual ColBERT yet even though it'd be really useful -- this is something on a long to-do list!

The amount of data needed seems to be fairly low, at least when initializing from a BERT-like model in the target language rather than from ColBERTv2. My own JaColBERT (ColBERT for Japanese) was initialized from a Japanese BERT and trained on just 10M triplets, all generated from MMarco plus hard-negative mining, and it outperformed all previous Japanese retrieval models, even without the knowledge-distillation techniques from ColBERTv2.
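For anyone unsure what "triplets plus hard-negative mining" means in practice, here is a minimal sketch of the idea, not the actual JaColBERT pipeline. It assumes you already have (query, positive passage) pairs, and it uses a toy token-overlap score as a stand-in for a real retriever such as BM25 or a dense model. The names `overlap_score` and `mine_triplets` are hypothetical helpers made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on word characters (Unicode-aware in Python 3).
    return re.findall(r"\w+", text.lower())


def overlap_score(query, passage):
    # Toy relevance score: number of tokens shared by query and passage.
    # Stand-in for a real retriever; an assumption made for this sketch.
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def mine_triplets(pairs, corpus, negatives_per_query=1):
    # For each (query, positive) pair, rank every other passage by score
    # and keep the top-ranked ones as hard negatives: passages that look
    # relevant to the retriever but are not the labeled positive.
    triplets = []
    for query, positive in pairs:
        candidates = [p for p in corpus if p != positive]
        ranked = sorted(
            candidates, key=lambda p: overlap_score(query, p), reverse=True
        )
        for negative in ranked[:negatives_per_query]:
            triplets.append((query, positive, negative))
    return triplets


corpus = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die Hauptstadt von Frankreich ist Paris.",
    "Der Rhein ist ein Fluss in Europa.",
]
pairs = [("Was ist die Hauptstadt von Deutschland?", corpus[0])]

triplets = mine_triplets(pairs, corpus)
for query, positive, negative in triplets:
    print(query, "|", positive, "|", negative)
```

In this toy corpus the passage about Paris is picked as the negative because it shares the most words with the query while not being the labeled answer. That near-miss quality is what makes a negative "hard". With a real corpus you would swap the scoring function for an actual retriever and usually skip the very top ranks, since they can contain unlabeled positives.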

Commenting for notifications. I'm also highly interested in a German or at least a multilingual ColBERT!

@bclavie would you mind sharing your fine-tuning code for general use?

@bclavie do you mind sharing your fine-tuning code? I'm working on a Vietnamese version, but I've run into some problems T.T

Oh hey, sorry I missed @DamianS89's message!

My current fine-tuning code is a bit of a mess, but it's re-implemented (and being improved, the current version is a bit wonky) in RAGatouille. When the code's fully cleaned up and improved it will also be part of the library rather than standalone!

@bclavie thank you for taking the time to reply to us. I have been working in NLP for 1 year and in deep learning for 2 years, but in my country (Vietnam) it is very hard to find a mentor. Would you mind if we kept in contact, or could you give me some recommendations for building my base skills? Thanks a lot.

@quan2206 I have managed to train a German ColBERT here: https://huggingface.co/domci/ColBERTv2-mmarco-de-0.1/tree/main/checkpoints/colbert-90000
I am limited on GPUs so I currently have a hard time actually testing it. Maybe it helps you?

Any version(s) for Arabic and French?
Thank you

@domci thank you for your support. I finally finished my ColBERT model for Vietnamese, but the accuracy is usually not good (~45%), and my dataset is limited. I also have a problem with high latency. Do you have any experience with that?
