New SOTA! Request to refresh the results

#62
opened by mixed-nlp

Thanks for the great work!

We submitted a new SOTA model, mixedbread-ai/mxbai-mistral-7b-reprev-v1, on English MTEB. Could you help refresh this space?

Thanks!

Massive Text Embedding Benchmark org

Hello!

I've refreshed the leaderboard! Congratulations.
I'd love to help integrate your model with Sentence Transformers; would you be interested in that?
Furthermore, I'm certainly curious about more details on your model & how it was trained.

cc @Muennighoff I've noticed the leaderboard does not report the Model Size for this model. How should we resolve this?

- Tom Aarsen

Hey @tomaarsen ,

thank you! We would love to integrate the model into Sentence Transformers and are happy to help. We'd also love to help modernise everything, e.g. downloading only safetensors, ignoring ONNX, updating the training scripts, etc. Maybe we can have a chat about how we can contribute (aamir at mixedbread.ai) :)

Regarding training, sorry for the sparse information on that. We used the AnglE loss proposed by @SeanLee97 on a mixture of synthetic and retrieval data (heavily filtered and cleaned). We ran a lot of checks to ensure there is no data contamination, and also evaluated on benchmarks other than MTEB. We are in the process of publishing more details. Currently we are experimenting with many different data mixtures, models (e.g. Phi-2, M2-BERT, Linformer) and training methods. We aim to share them with the research community, which is why we also called it a research preview. More to come soon!
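
For context on the loss family (this is not the exact objective used for this model): AnglE builds on a CoSENT-style pairwise ranking over similarity scores and additionally applies it to an angle difference computed in complex space. A minimal sketch of just that ranking backbone could look like this:

```python
import torch
import torch.nn.functional as F

def cosent_style_loss(emb_a, emb_b, labels, scale: float = 20.0):
    """Pairwise ranking over scaled cosine similarities: any pair labelled as more
    similar must end up with a higher similarity than any pair labelled as less
    similar. AnglE applies the same ranking to an angle-difference term as well."""
    sims = F.cosine_similarity(emb_a, emb_b, dim=-1) * scale   # (N,)
    diffs = sims[:, None] - sims[None, :]                      # sims[i] - sims[j]
    # only penalize (i, j) where pair i is labelled *less* similar than pair j
    relevant = labels[:, None] < labels[None, :]
    diffs = diffs - (~relevant).float() * 1e12                 # mask the rest out
    diffs = torch.cat([torch.zeros(1, device=diffs.device), diffs.flatten()])
    return torch.logsumexp(diffs, dim=0)

# toy usage: 4 text pairs with graded similarity labels
a, b = torch.randn(4, 16), torch.randn(4, 16)
print(cosent_style_loss(a, b, torch.tensor([1.0, 0.8, 0.3, 0.0])))
```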

Hope this helps.

Aamir.

Massive Text Embedding Benchmark org

Hm, we may need to look for safetensors files as well and sum their sizes if there are multiple shards.

Massive Text Embedding Benchmark org

@Muennighoff I read that via huggingface_hub we can use model_info to extract the model size. I can try to invest some time to implement this.
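
A minimal sketch of that idea, assuming the size can be approximated by summing the safetensors shards that model_info reports when files_metadata=True (falling back to .bin files):

```python
from huggingface_hub import HfApi

def approximate_model_size_gb(repo_id: str) -> float:
    """Sum the on-disk size of all safetensors shards (fallback: .bin files)."""
    # files_metadata=True populates per-file sizes on info.siblings
    info = HfApi().model_info(repo_id, files_metadata=True)
    weights = [s for s in info.siblings if s.rfilename.endswith(".safetensors")]
    if not weights:  # older repos may only ship pytorch_model*.bin
        weights = [s for s in info.siblings if s.rfilename.endswith(".bin")]
    return sum((s.size or 0) for s in weights) / 1e9

print(approximate_model_size_gb("mixedbread-ai/mxbai-mistral-7b-reprev-v1"))
```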

"We'd also love to help modernise everything, e.g. downloading only safetensors, ignoring ONNX, updating the training scripts, etc."

Downloading only safetensors & ignoring onnx is already implemented in the repo, but I've yet to push the release.
I'm certainly interested in some help with the rest, though!
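
For reference, the underlying mechanism in huggingface_hub is snapshot_download's ignore_patterns. A rough sketch (not the actual Sentence Transformers code; blindly skipping *.bin assumes safetensors weights exist):

```python
from huggingface_hub import snapshot_download

# Download only what is needed for PyTorch inference; skip ONNX exports and
# duplicate .bin weights when safetensors are available.
path = snapshot_download(
    "mixedbread-ai/mxbai-mistral-7b-reprev-v1",
    ignore_patterns=["*.onnx", "onnx/*", "*.bin"],
)
print(path)
```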

I am considering some modernisations of my own:

  • Improved training via the transformers Trainer: multi-GPU support, gradient accumulation, gradient checkpointing, improved callbacks, bf16, etc.
  • Easier model loading in lower precision.
  • Revising how models are saved & loaded (i.e. fewer configuration files and a single weight file instead of several).
  • Prompt templates.
  • AnglE loss
  • Easier combination of multiple losses (e.g. InfoNCE/MultipleNegativesRankingLoss + Cosine + AnglE); see the sketch below.
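
A rough sketch of the "multiple losses" idea with the current fit(train_objectives=...) API and toy data; an AnglE-style loss is not in the library yet, so it is only noted in a comment:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in model

# Toy data: (anchor, positive) pairs for in-batch negatives, scored pairs for cosine.
mnrl_data = [
    InputExample(texts=["what is bread?", "Bread is a staple food."]),
    InputExample(texts=["capital of France", "Paris is the capital of France."]),
]
cos_data = [
    InputExample(texts=["A man is eating food.", "A man eats a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "A woman plays violin."], label=0.1),
]

train_objectives = [
    (DataLoader(mnrl_data, shuffle=True, batch_size=2),
     losses.MultipleNegativesRankingLoss(model)),          # InfoNCE-style
    (DataLoader(cos_data, shuffle=True, batch_size=2),
     losses.CosineSimilarityLoss(model)),                  # cosine regression
    # an AnglE-style loss would slot in here as a third objective once available
]

# fit() draws a batch from each objective on every training step
model.fit(train_objectives=train_objectives, epochs=1, warmup_steps=10)
```
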
- Tom Aarsen

Sounds extremely good! For a lot of training-related things, I think Tevatron is really great. We will discuss it in the team and help with integrating this into Sentence Transformers. Really amazing what you are doing!

Massive Text Embedding Benchmark org

Sure that'd be amazing!
