Introduce a custom Sentence Transformer module for smooth multi-modality

#1
by tomaarsen - opened

Hello @infgrad !

Preface

Congratulations on this release! In your spare time, too; quite impressive. Looking forward to seeing people experiment with this.
I'm also curious how you used the excellent https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu to train a strong embedding model - perhaps it can be reproduced with the brand new https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 for multilinguality as well?

Pull Request overview

  • Introduce a custom Sentence Transformer module, based on the common Transformer module.
  • Add a Normalize module as well, so embeddings are always normalized (useful e.g. for cosine similarity)
  • Add "padding_side": "right" to the sentence_bert_config.json. FYI: this file contains the "defaults" for the new MultiModalTransformer class, so we can just add our desired defaults there.
  • Update README example
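For reference, the resulting sentence_bert_config.json would look roughly like this (the exact values here are illustrative, not copied from the model repository):

```json
{
  "max_seq_length": 1024,
  "do_lower_case": false,
  "padding_side": "right"
}
```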

Details

This model is a textbook example of why I recently added the possibility for custom Sentence Transformer Modules (docs). In short, we can extend any of the existing modules (or make completely new ones) and override e.g. tokenize, forward, or __init__. As you can see here, this allows you to add multimodality without requiring any extra work from the end user.
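To make the idea concrete, here is a minimal, library-free sketch of that pattern: a base module with tokenize/forward hooks, and a subclass that overrides tokenize to route image inputs separately from text. The class and method names mirror the Sentence Transformers convention, but this is illustrative pseudocode rather than the actual library API.

```python
# Illustrative sketch of extending a Transformer-style module for multimodality.
# Not the real Sentence Transformers classes; names and input conventions are
# assumptions for demonstration.

class Transformer:
    def tokenize(self, texts):
        # Base behaviour: plain text tokenization (stubbed with hashed tokens).
        return {"input_ids": [[hash(w) % 1000 for w in t.split()] for t in texts]}

    def forward(self, features):
        # Base behaviour: pass features through unchanged.
        return features


class MultiModalTransformer(Transformer):
    def tokenize(self, inputs):
        # Route dicts with an "image" key to an image path; everything else
        # goes through the inherited text tokenization.
        texts, images = [], []
        for item in inputs:
            if isinstance(item, dict) and "image" in item:
                images.append(item["image"])
            else:
                texts.append(item)
        features = super().tokenize(texts) if texts else {}
        if images:
            # Placeholder for real image preprocessing.
            features["pixel_values"] = images
        return features
```

The key point is that the end user never sees this routing: they call the model exactly as before, and the overridden tokenize decides per input how to handle it.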

My module is simply your custom tokenize and forward put into a module, with two small changes: the forward call is now in charge of ensuring that pixel_values has the correct dtype, and we use max_seq_length as the tokenizer max length instead of a hardcoded 1024.
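Those two changes can be sketched in isolation like this (an independent toy illustration, using a plain Python type as a stand-in for a torch dtype; the real module would call something like tensor.to(dtype) on actual tensors):

```python
# Toy sketch of the two changes: configurable max length and dtype casting.

class MultiModalTransformer:
    def __init__(self, max_seq_length=1024, model_dtype=float):
        # The tokenizer max length follows max_seq_length instead of a
        # hardcoded 1024.
        self.max_seq_length = max_seq_length
        # Stand-in for the model's dtype (e.g. torch.float16 in practice).
        self.model_dtype = model_dtype

    def tokenize(self, token_id_batches):
        # Truncate each sequence to the configured maximum length.
        return {"input_ids": [ids[: self.max_seq_length] for ids in token_id_batches]}

    def forward(self, features):
        # Cast pixel_values to the model's dtype before the forward pass.
        if "pixel_values" in features:
            features["pixel_values"] = [
                [self.model_dtype(v) for v in row] for row in features["pixel_values"]
            ]
        return features
```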

All that remains is explaining to the user what kinds of inputs your tokenize method expects, i.e. what they can pass in to get correct outputs.
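One natural place for that explanation is the tokenize docstring itself. A hypothetical sketch (the input convention shown is an assumption for illustration, not this model's actual contract):

```python
class MultiModalTransformer:
    def tokenize(self, inputs):
        """Tokenize a mixed batch of inputs.

        Each element of ``inputs`` may be either:
          * a plain ``str``: treated as a text input, or
          * a ``dict`` with an ``"image"`` key: treated as an image input.
        """
        texts = [x for x in inputs if isinstance(x, str)]
        images = [x["image"] for x in inputs if isinstance(x, dict) and "image" in x]
        return {"texts": texts, "images": images}
```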

  • Tom Aarsen
tomaarsen changed pull request status to open

Hi, thank you! I cannot merge this yet; could you please make some changes to README.md?

Definitely, I resolved the merge conflict now.

infgrad changed pull request status to merged

Hi @tomaarsen, thank you for your PR.

Ha ha ha, I remember your PR for the Stella 1.5 model, which was concise and useful. Thank you very much!

As you know, my model is distilled from other models, so what I need is high-quality unsupervised text from rich sources. If fineweb-2 has better quality and is sampled from different fields, I think the results will be better.

I have tried four different distillation losses and other settings for the jasper model; this will be written up in my report.

> Hi @tomaarsen, thank you for your PR.
>
> Ha ha ha, I remember your PR for the Stella 1.5 model, which was concise and useful. Thank you very much!

Gladly, I always try and help improve the user experience for promising models!

Makes sense to use fineweb-edu as a high-quality source of unsupervised data. Nice work! I'm looking forward to your report to learn about the distillation losses that you've tried - I've only ever used one or two.

  • Tom Aarsen
