New SOTA! Request to refresh the results

#62
opened by mixed-nlp

Thanks for the great work!

We submitted a new SOTA model, mixedbread-ai/mxbai-mistral-7b-reprev-v1, on English MTEB. Could you help refresh this space?

Thanks!

Massive Text Embedding Benchmark org

Hello!

I've refreshed the leaderboard! Congratulations.
I'd love to help integrate your model with Sentence Transformers; would you be interested in that?
Furthermore, I'm certainly curious about more details on your model & how it was trained.

cc @Muennighoff I've noticed the leaderboard does not report the Model Size for this model. How should we resolve this?

- Tom Aarsen

Hey @tomaarsen ,

thank you! We would love to integrate the model into Sentence Transformers and are happy to help. We'd also love to help modernise everything, e.g. downloading only safetensors, ignoring ONNX, updating the training scripts, etc. Maybe we can have a chat about how we can contribute (aamir at mixedbread.ai) :)

Regarding training, sorry for the sparse information on that. We used the AnglE loss proposed by @SeanLee97 on a mixture of synthetic and retrieval data (heavily filtered and cleaned). We ran a lot of checks to ensure there is no data contamination, and also evaluated on benchmarks other than MTEB. We are in the process of publishing more details. Currently we are experimenting with many different data mixtures, models (e.g. Phi-2, M2-BERT, Linformer) and training methods. We aim to share them with the research community, which is why we also called it a research preview. More to come soon!
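
For context on the loss family (this is not the exact objective used for this model): AnglE builds on a CoSENT-style pairwise ranking over similarity scores and additionally applies it to an angle difference computed in complex space. A minimal sketch of just that ranking backbone could look like this:

```python
import torch
import torch.nn.functional as F

def cosent_style_loss(emb_a, emb_b, labels, scale: float = 20.0):
    """Pairwise ranking over scaled cosine similarities: any pair labelled as more
    similar must end up with a higher similarity than any pair labelled as less
    similar. AnglE applies the same ranking to an angle-difference term as well."""
    sims = F.cosine_similarity(emb_a, emb_b, dim=-1) * scale   # (N,)
    diffs = sims[:, None] - sims[None, :]                      # sims[i] - sims[j]
    # only penalize (i, j) where pair i is labelled *less* similar than pair j
    relevant = labels[:, None] < labels[None, :]
    diffs = diffs - (~relevant).float() * 1e12                 # mask the rest out
    diffs = torch.cat([torch.zeros(1, device=diffs.device), diffs.flatten()])
    return torch.logsumexp(diffs, dim=0)

# toy usage: 4 text pairs with graded similarity labels
a, b = torch.randn(4, 16), torch.randn(4, 16)
print(cosent_style_loss(a, b, torch.tensor([1.0, 0.8, 0.3, 0.0])))
```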

Hope this helps.

Aamir.

Massive Text Embedding Benchmark org

Hm, we may need to look for safetensors files as well and sum their sizes if there are multiple shards.

Massive Text Embedding Benchmark org

@Muennighoff I read that via huggingface_hub we can use model_info to extract the model size. I can try to invest some time to implement this.
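
A minimal sketch of that idea, assuming the size can be approximated by summing the safetensors shards that model_info reports when files_metadata=True (falling back to .bin files):

```python
from huggingface_hub import HfApi

def approximate_model_size_gb(repo_id: str) -> float:
    """Sum the on-disk size of all safetensors shards (fallback: .bin files)."""
    # files_metadata=True populates per-file sizes on info.siblings
    info = HfApi().model_info(repo_id, files_metadata=True)
    weights = [s for s in info.siblings if s.rfilename.endswith(".safetensors")]
    if not weights:  # older repos may only ship pytorch_model*.bin
        weights = [s for s in info.siblings if s.rfilename.endswith(".bin")]
    return sum((s.size or 0) for s in weights) / 1e9

print(approximate_model_size_gb("mixedbread-ai/mxbai-mistral-7b-reprev-v1"))
```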

"We'd also love to help modernise everything, e.g. downloading only safetensors, ignoring ONNX, updating the training scripts, etc."

Downloading only safetensors & ignoring onnx is already implemented in the repo, but I've yet to push the release.
I'm certainly interested in some help with the rest, though!
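
For reference, the underlying mechanism in huggingface_hub is snapshot_download's ignore_patterns. A rough sketch (not the actual Sentence Transformers code; blindly skipping *.bin assumes safetensors weights exist):

```python
from huggingface_hub import snapshot_download

# Download only what is needed for PyTorch inference; skip ONNX exports and
# duplicate .bin weights when safetensors are available.
path = snapshot_download(
    "mixedbread-ai/mxbai-mistral-7b-reprev-v1",
    ignore_patterns=["*.onnx", "onnx/*", "*.bin"],
)
print(path)
```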

I am considering some modernisations of my own:

  • Improved training via the transformers Trainer: multi-GPU support, gradient accumulation, gradient checkpointing, improved callbacks, bf16, etc.
  • Easier model loading in lower precision.
  • Revising how models are saved & loaded (i.e. fewer configuration files and a single weight file instead of several).
  • Prompt templates.
  • AnglE loss
  • Easier combination of multiple losses (e.g. InfoNCE/MultipleNegativesRankingLoss + Cosine + AnglE); see the sketch below.
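
A rough sketch of the "multiple losses" idea with the current fit(train_objectives=...) API and toy data; an AnglE-style loss is not in the library yet, so it is only noted in a comment:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in model

# Toy data: (anchor, positive) pairs for in-batch negatives, scored pairs for cosine.
mnrl_data = [
    InputExample(texts=["what is bread?", "Bread is a staple food."]),
    InputExample(texts=["capital of France", "Paris is the capital of France."]),
]
cos_data = [
    InputExample(texts=["A man is eating food.", "A man eats a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "A woman plays violin."], label=0.1),
]

train_objectives = [
    (DataLoader(mnrl_data, shuffle=True, batch_size=2),
     losses.MultipleNegativesRankingLoss(model)),          # InfoNCE-style
    (DataLoader(cos_data, shuffle=True, batch_size=2),
     losses.CosineSimilarityLoss(model)),                  # cosine regression
    # an AnglE-style loss would slot in here as a third objective once available
]

# fit() draws a batch from each objective on every training step
model.fit(train_objectives=train_objectives, epochs=1, warmup_steps=10)
```
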
- Tom Aarsen

Sounds extremely good! For a lot of training-related things, I think Tevatron is really great. We will discuss it in the team and help with integrating this into Sentence Transformers. Really amazing what you are doing!

Massive Text Embedding Benchmark org

Sure that'd be amazing!
