About class_langs

#1
by AlirezaFa - opened

I want to use this for a multilingual retrieval application, but I don't know the language of the users' queries. The text_features function expects a langs parameter that specifies the language of each text; if no langs is passed, it defaults to eng_Latn. What happens when we don't know the language of a text and it is not eng_Latn? Do we get a big drop in performance?

From my experiments, text embeddings in different languages are quite similar (which is good and expected). Because of that, if you specify the wrong language, you will likely still get correct results. Quantitatively, recall drops by about 10% when the language is incorrect.
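If you want to check the effect on your own data, a rough sketch along these lines might help. It only shows how recall@k could be compared between two sets of query embeddings. The helper `recall_at_k` and the toy arrays are made up for illustration and are not part of the model's API; in practice the two query arrays would be whatever text_features returns with the correct langs and with the eng_Latn default.

```python
import numpy as np

def recall_at_k(query_emb, doc_emb, relevant_idx, k=1):
    """Fraction of queries whose relevant document is in the top-k by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T
    # Indices of the k most similar documents for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.asarray(relevant_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy stand-ins (assumption): replace with real embeddings from the model.
docs = np.eye(3)
queries_correct_langs = np.array([[1.0, 0.1, 0.0],
                                  [0.0, 1.0, 0.1],
                                  [0.1, 0.0, 1.0]])
queries_default_langs = np.array([[1.0, 0.1, 0.0],
                                  [0.0, 1.0, 0.1],
                                  [1.0, 0.0, 0.9]])
relevant = [0, 1, 2]  # index of the matching document for each query

print("correct langs:", recall_at_k(queries_correct_langs, docs, relevant))
print("default langs:", recall_at_k(queries_default_langs, docs, relevant))
```

Running the same comparison on a held-out set of your real queries should tell you whether the drop is acceptable for your application.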
