Introduce a custom Sentence Transformer module for smooth multi-modality

#1
by tomaarsen - opened

Hello @infgrad !

Preface

Congratulations on this release! In your spare time, too; quite impressive. Looking forward to seeing people experiment with this.
I'm also curious how you used the excellent https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu to train a strong embedding model - perhaps it can be reproduced with the brand new https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 for multilinguality as well?

Pull Request overview

  • Introduce a custom Sentence Transformer module, based on the common Transformer module.
  • Add a Normalize module as well, so embeddings are always normalized (useful e.g. for cosine similarity)
  • Add "padding_side": "right" to the sentence_bert_config.json. FYI: this file contains the "defaults" for the new MultiModalTransformer class, so we can just add our desired defaults there.
  • Update README example
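For reference, the resulting sentence_bert_config.json would look roughly like this (the exact values here are illustrative, not copied from the model repository):

```json
{
  "max_seq_length": 1024,
  "do_lower_case": false,
  "padding_side": "right"
}
```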

Details

This model is a textbook example of why I recently added the possibility for custom Sentence Transformer Modules (docs). In short, we can extend any of the existing modules (or make completely new ones) and override e.g. tokenize, forward, or __init__. As you can see here, this allows you to add multimodality without requiring any extra work from the end user.
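To make the idea concrete, here is a minimal, library-free sketch of that pattern: a base module with tokenize/forward hooks, and a subclass that overrides tokenize to route image inputs separately from text. The class and method names mirror the Sentence Transformers convention, but this is illustrative pseudocode rather than the actual library API.

```python
# Illustrative sketch of extending a Transformer-style module for multimodality.
# Not the real Sentence Transformers classes; names and input conventions are
# assumptions for demonstration.

class Transformer:
    def tokenize(self, texts):
        # Base behaviour: plain text tokenization (stubbed with hashed tokens).
        return {"input_ids": [[hash(w) % 1000 for w in t.split()] for t in texts]}

    def forward(self, features):
        # Base behaviour: pass features through unchanged.
        return features


class MultiModalTransformer(Transformer):
    def tokenize(self, inputs):
        # Route dicts with an "image" key to an image path; everything else
        # goes through the inherited text tokenization.
        texts, images = [], []
        for item in inputs:
            if isinstance(item, dict) and "image" in item:
                images.append(item["image"])
            else:
                texts.append(item)
        features = super().tokenize(texts) if texts else {}
        if images:
            # Placeholder for real image preprocessing.
            features["pixel_values"] = images
        return features
```

The key point is that the end user never sees this routing: they call the model exactly as before, and the overridden tokenize decides per input how to handle it.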

My module is simply your custom tokenize and forward put into a module, with two small changes: the forward call is now in charge of ensuring that pixel_values has the correct dtype, and we use max_seq_length as the tokenizer max length instead of a hardcoded 1024.
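Those two changes can be sketched in isolation like this (an independent toy illustration, using a plain Python type as a stand-in for a torch dtype; the real module would call something like tensor.to(dtype) on actual tensors):

```python
# Toy sketch of the two changes: configurable max length and dtype casting.

class MultiModalTransformer:
    def __init__(self, max_seq_length=1024, model_dtype=float):
        # The tokenizer max length follows max_seq_length instead of a
        # hardcoded 1024.
        self.max_seq_length = max_seq_length
        # Stand-in for the model's dtype (e.g. torch.float16 in practice).
        self.model_dtype = model_dtype

    def tokenize(self, token_id_batches):
        # Truncate each sequence to the configured maximum length.
        return {"input_ids": [ids[: self.max_seq_length] for ids in token_id_batches]}

    def forward(self, features):
        # Cast pixel_values to the model's dtype before the forward pass.
        if "pixel_values" in features:
            features["pixel_values"] = [
                [self.model_dtype(v) for v in row] for row in features["pixel_values"]
            ]
        return features
```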

All that remains is explaining to the user what kinds of inputs your tokenize method expects, i.e. what they can pass in to get correct outputs.
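One natural place for that explanation is the tokenize docstring itself. A hypothetical sketch (the input convention shown is an assumption for illustration, not this model's actual contract):

```python
class MultiModalTransformer:
    def tokenize(self, inputs):
        """Tokenize a mixed batch of inputs.

        Each element of ``inputs`` may be either:
          * a plain ``str``: treated as a text input, or
          * a ``dict`` with an ``"image"`` key: treated as an image input.
        """
        texts = [x for x in inputs if isinstance(x, str)]
        images = [x["image"] for x in inputs if isinstance(x, dict) and "image" in x]
        return {"texts": texts, "images": images}
```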

  • Tom Aarsen
tomaarsen changed pull request status to open

Hi, thank you! I cannot merge this yet; could you please make some changes to README.md?

Definitely, I resolved the merge conflict now.

infgrad changed pull request status to merged

Hi @tomaarsen, thank you for your PR.

Ha ha ha, I remember your PR for the Stella 1.5 model, which was concise and useful. Thank you very much!

As you know, my model is distilled from other models, so what I need is high-quality unsupervised text from rich sources. If fineweb-2 has better quality and is sampled from different fields, I think the results will be better.

I have tried four different distillation losses and other settings for the jasper model; this will be written up in my report.

> Hi @tomaarsen, thank you for your PR.
>
> Ha ha ha, I remember your PR for the Stella 1.5 model, which was concise and useful. Thank you very much!

Gladly, I always try and help improve the user experience for promising models!

Makes sense to use fineweb-edu as a high-quality source of unsupervised data. Nice work! I'm looking forward to your report to learn about the distillation losses that you've tried - I've only ever used one or two.

  • Tom Aarsen
