KnutJaegersberg posted an update Jan 2
Microsoft: Improving Text Embeddings with Large Language Models

- uses an LLM instead of complex pipelines to create the training data
- directly generates data for numerous text embedding tasks
- fine-tunes standard models with a contrastive loss, achieving strong performance (a minimal sketch of such a loss follows this list)
- critical thought: isn't this kind of benchmark hacking? If the benchmarks were comprehensive enough to capture the complete idea of embedding, it might be a good approach, but often they oversimplify, I find.
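
For readers who just want the gist of the fine-tuning objective, here is a minimal PyTorch sketch of an in-batch-negative contrastive (InfoNCE-style) loss of the kind used for embedding fine-tuning; the temperature and dimensions are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative contrastive loss for embedding fine-tuning.

    Row i of passage_emb is the positive passage for row i of query_emb;
    every other row in the batch acts as a negative. The temperature is an
    illustrative default, not the paper's exact value.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Cosine-similarity logits between every query and every passage in the batch.
    logits = q @ p.T / temperature
    # The matching passage for query i sits on the diagonal.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
queries = torch.randn(8, 768)
passages = torch.randn(8, 768)
print(info_nce_loss(queries, passages).item())
```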

Feel free to share your thoughts, even if they, like mine, don't beat the benchmarks ;P


https://arxiv.org/abs/2401.00368

Linking the HF paper page as well: http://huggingface.co/papers/2401.00368

The fact that they used only synthetic data is huge IMO - it makes this almost an unsupervised training setup

I guess we'll see more and more techniques like that based on foundational LLMs!

In fact, to add more context, the authors mentioned that they will release some more content in the upcoming revision of the paper, which is nice because it would imply that anyone could run a faithful reproduction of their synthetic data generation process. See the reply from the authors at https://huggingface.co/papers/2401.00368#65978d195f689f3f0b2caeb9.
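
To make concrete what reproducing the data generation step could look like, here is a rough sketch of prompting an LLM for synthetic (query, positive, hard-negative) triplets. The task names, prompt wording, and the `call_llm` helper are hypothetical placeholders, not the authors' actual taxonomy or prompts:

```python
import json

# Hypothetical task taxonomy; the paper covers many more embedding tasks.
TASK_TYPES = [
    "short-long retrieval",
    "long-short retrieval",
    "semantic textual similarity",
]

# Illustrative prompt template, not the authors' actual wording.
PROMPT_TEMPLATE = """You are generating training data for a text embedding model.
Task type: {task_type}
Return a JSON object with the fields "user_query", "positive_document" and
"hard_negative_document". The hard negative should be topically related to
the query but must not actually answer it."""

def generate_example(call_llm, task_type: str) -> dict:
    """Ask an LLM (via a caller-supplied `call_llm(prompt) -> str` function)
    for one synthetic training triplet and parse its JSON reply."""
    raw = call_llm(PROMPT_TEMPLATE.format(task_type=task_type))
    return json.loads(raw)
```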

Also worth mentioning that @andersonbcdefg ran both stages.

(I'm unsure whether the reproduction of the second stage is faithful to the original, so I asked them at https://twitter.com/alvarobartt/status/1742839431881490717; in any case, I think we may need to wait for the authors to share the full details of the prompting strategies used for the generation.)