Code for training / finetuning the sparse encoders

by oneryalcin - opened Jul 22

Jul 22

Forst of all many thanks for both v1 and v2 models, we are using v1 and happy with the retrieval quality in general. I'll be evaluating v2 models as soon as I can. My question is about if you have any plans to release documentation around pretraining or fine tuning the sparse encoder models. I'd like to adapt the model to our domains (to help with out of vocabulary words) and increase recall.

I'd appreciate if you could share a github repo or any blog that explains how to fine tune or pretrain these models. Many thanks again.

zhichao-geng

opensearch-project org Jul 26

Thanks for your interest on our project! We're condensing our training techniques and plan to release a paper about these details. After that we'll release the code on github. But the concrete repo to place the code is not decided yet.

oneryalcin

Jul 26

Many thanks I'll be looking forward to read the paper and test the code :)

freethenation

Aug 2

This is great work! Looking forward to the paper! Any ideas when you might release it?

zhichao-geng

opensearch-project org Aug 5

@freethenation We have finished the draft version, and now we're working on improving the structure and writing. After we finilize the paper, it still needs to go through some internal review before the paper and code can be public released. I guess we still need a few months to finish these

macavaney

Sep 17

Any updates on this?

zhichao-geng

opensearch-project org Sep 18

Any updates on this?

We're under the internal review to make them public

macavaney

Sep 18

Thanks!

zhichao-geng

opensearch-project org Nov 8

Hi @oneryalcin @freethenation @macavaney , the paper is public now: https://arxiv.org/abs/2411.04403! The code & data is still under a dedicated review process.

richard-ac

Nov 8

Super excited! Thanks for pinging this thread. Going to read your paper now! Update this thread when data & code are available too?

zhichao-geng

opensearch-project org Nov 8

Super excited! Thanks for pinging this thread. Going to read your paper now! Update this thread when data & code are available too?

Yes will give update here : )

macavaney

Nov 8

Awesome, thanks!!

oneryalcin

Nov 8

•

edited Nov 8

many thanks @zhichao-geng . I'll right dive into the paper now.

Edit: sorry couldn't help adding a podcast on this paper. I'm listening it on my way and just wanted to share:
https://notebooklm.google.com/notebook/4a37f025-66c4-40dd-b340-239f6f3ea59a/audio

zhichao-geng

opensearch-project org Nov 11

we have public the code of fine-tuning/evaluating the model(repo link). It can also be used to train a sparse model from scratch. You can reproduce the results if following the process of generating training data described in the paper.

We also aim to release the training data generated by us, but not sure whether this comply with the licenses of all used datasets and it's still under review.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment