Set 1024 as default dim, update usage snippets, store prompts in config

#1
by tomaarsen - opened

Hello @infgrad !

Pull Request overview

  • Update default dimension to 1024
  • Update usage snippets (separate into 2, reformatting slightly)
  • Add prompts to Sentence Transformers config so they're easier to use

Preface

First of all, congratulations on the release! We've already refreshed MTEB and your new models score excellently!

[Image: MTEB leaderboard scores for the new stella models]

I'd love to learn more about:

  1. your training approach. If it's a novel loss, then I might want to add it to Sentence Transformers directly.
  2. your training data. Is it synthetic data? Or existing datasets? Or something new? A combination of these?

Also, very impressive to get these models (and infgrad/stella_en_1.5B_v5 especially) out so quickly. Alibaba-NLP/gte-Qwen2-1.5B-instruct is only 2 weeks old!

Details

From my open source work, I recognize that the large majority of users will always stick with the default options of a piece of software, including models. As a result, your choice of default output dimensionality has a big impact on what dimensionality people will actually use. I think that 8192 is simply too large to compute any downstream task efficiently. You mention yourself that 1024 loses only 0.001 compared to 8192, so I think you should use that as the default. By the way, I quite like that advanced users can adapt this model to their liking!
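For illustration, a minimal sketch of what the default head would produce after this change (the model id is assumed to be this PR's repo, and trust_remote_code=True is assumed to be needed for the custom modeling code):

```python
from sentence_transformers import SentenceTransformer

# Load the model with its default Dense head
model = SentenceTransformer("infgrad/stella_en_400M_v5", trust_remote_code=True)

embeddings = model.encode(["The weather is lovely today."])
print(embeddings.shape)  # (1, 1024) with the proposed 1024-dim default
```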

As for the consequences that it has for MTEB: I'm open to providing 2 or 3 MTEB scores per model for different output dimensionalities (e.g. 256, 1024, 8192). We've done this for OpenAI's text-embedding-3-large as well as nomic-embed-text-v1.5 before.

I've also simplified the usage snippets so they're more "copy-pasteable" than before. The Sentence Transformers snippet should now work out of the box, and the Transformers one works after cloning the model and updating the model_dir. I think this should also help with adoption, just like the better default dim. Another change is that for Sentence Transformers, I recommend that users update the "path" in modules.json from e.g. 2_Dense_1024 to e.g. 2_Dense_256, as sketched below. That way, people don't have to move any files around.
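Concretely, that edit is a one-line change in modules.json. The entries below follow the standard Sentence Transformers layout and are shown for illustration; the exact idx/type values in this repo may differ:

```json
[
  { "idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer" },
  { "idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling" },
  { "idx": 2, "name": "2", "path": "2_Dense_1024", "type": "sentence_transformers.models.Dense" }
]
```

Pointing "path" at 2_Dense_256 instead loads the 256-dim head from its own directory, so nothing needs to be copied or renamed.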

Lastly, I've added the s2p and s2s prompts to config_sentence_transformers.json. Big thanks for proposing 2 concrete prompts for people to use; I think that'll be very helpful. The Sentence Transformers encode method accepts a prompt option for providing a prompt string directly, or a prompt_name option for referencing a prompt stored in config_sentence_transformers.json. I've used the latter in the usage snippet.
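For reference, prompts stored in config_sentence_transformers.json live under a "prompts" key. A sketch of the shape (the key names "s2p" and "s2s" and the placeholder texts are assumptions here, not the exact strings added in this PR):

```json
{
  "prompts": {
    "s2p": "<s2p prompt text>",
    "s2s": "<s2s prompt text>"
  }
}
```

With that in place, encode can look a prompt up by name and prepend it to each input:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("infgrad/stella_en_400M_v5", trust_remote_code=True)

# prompt_name reads the stored prompt from config_sentence_transformers.json
# and prepends it to each input before encoding ("s2p" is an assumed key name)
query_embeddings = model.encode(["What is Hugging Face?"], prompt_name="s2p")

# Equivalent, passing the prompt string directly:
# query_embeddings = model.encode(["What is Hugging Face?"], prompt="<s2p prompt text>")
```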


Please let me know if you'd like me to change anything in this PR or if you only want to merge part of it. If you like these changes, then I can also implement them for https://huggingface.co/infgrad/stella_en_1.5B_v5.

  • Tom Aarsen
tomaarsen changed pull request status to open
infgrad changed pull request status to merged
