General free-text feature encoder

#156
by thomas-pocreau - opened

Hi,
I assume I'm not the only one who has used this benchmark with a goal in mind: finding the best open-weight alternative to the GCP (OpenAI/Mistral) embedding APIs. Those APIs are really useful as general free-text feature encoders within a machine learning model. On GCP, you just specify whether you want an embedding optimized for classification, regression, ... and the resulting embedding will be useful as input for several machine learning tasks. You pay once and can reuse the embedding for several goals.

Looking at the leaderboard makes it really easy to identify the best models, and deploying an embedding API is a breeze with Text Embeddings Inference. gte-Qwen2-1.5B-instruct seems like a nice tradeoff for my specific use case. Now, the only thing that remains is to find the best instruct for classification!

And here is the catch: looking at the instructs used in the benchmark, I was not able to find a useful one.
To my surprise, for classification tasks, the eval seems to rely on the instruct to actually prompt the embedding model into doing the classification. A logistic regression or k-nearest-neighbors classifier then simply finds the part of the embedding space that captures the meaning of the classes defined in the prompt.
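To make the distinction concrete, here is a minimal sketch of the two kinds of instruct. Both instruction strings are hypothetical, written for illustration and not copied from the benchmark, and the `Instruct: ...` / `Query: ...` template is an assumption about how instruct-style embedding models such as gte-Qwen2-1.5B-instruct expect their input.

```python
# Hypothetical instruction strings, for illustration only.
# The first one names the classes, the second one names no task at all.
TASK_SPECIFIC = "Classify the sentiment of the given review as positive or negative"
TASK_AGNOSTIC = "Represent the given text for use as features in a downstream model"


def build_input(instruction: str, text: str) -> str:
    # Assumed template: the instruction on one line, the text on the next.
    return f"Instruct: {instruction}\nQuery: {text}"


review = "The battery died after two days."

# With the task-specific instruct, the class names are part of the model input.
print(build_input(TASK_SPECIFIC, review))

# With the task-agnostic instruct, the same embedding could serve any task.
print(build_input(TASK_AGNOSTIC, review))
```

Only the second form matches the "pay once, reuse several times" use case, since the first one bakes a single task into the embedding.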

This means that the benchmark is actually evaluating each model's capacity to do classification, not its ability to encode text that can then be used as input by a classifier, as documented by OpenAI here.

I believe it would make sense to add a new task focused on general free-text feature encoding: something that relies on instructs that don't make any assumption about the underlying classification or regression task handled by a RandomForest or XGBoost.
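A rough sketch of what such a task could look like: embed each text once with a task-agnostic instruct, then reuse the same features for several unrelated label sets. Everything here is a stand-in: `toy_embed` is a hypothetical placeholder for a real embedding API call, the instruction string is made up, and a nearest-centroid classifier replaces RandomForest or XGBoost only to keep the example dependency-free.

```python
import math

GENERIC_INSTRUCTION = "Represent the given text as general-purpose features"  # hypothetical


def toy_embed(text, dim=32):
    """Stand-in for a real embedding call: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroids(vectors, labels):
    """Average the vectors of each label (nearest-centroid 'training')."""
    grouped = {}
    for vec, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid has the largest dot product with vec."""
    return max(
        centroids,
        key=lambda label: sum(a * b for a, b in zip(centroids[label], vec)),
    )


texts = [
    "great phone with a bright screen",
    "terrible phone and a weak battery",
    "great pizza with fresh basil",
    "terrible pizza and a soggy crust",
]

# Embed once, with an instruct that names no task and no classes.
features = [toy_embed(f"Instruct: {GENERIC_INSTRUCTION}\nQuery: {t}") for t in texts]

# Reuse the same features for two unrelated label sets.
sentiment = fit_centroids(features, ["pos", "neg", "pos", "neg"])
topic = fit_centroids(features, ["tech", "tech", "food", "food"])

print(predict(sentiment, features[0]), predict(topic, features[0]))
```

The point of the design is that the embedding step never sees the labels or the task, so the score reflects how reusable the features are and not how well the model follows a classification prompt.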

Regards,
Thomas

This discussion was also posted on GitHub: https://github.com/embeddings-benchmark/mteb/discussions/2211

thomas-pocreau changed discussion title from General free-text feature encoder #2211 to General free-text feature encoder
Massive Text Embedding Benchmark org

Let us pick this discussion up over on GitHub :)

thomas-pocreau changed discussion status to closed
