Tom Aarsen
tomaarsen's activity
Oooh, Dataset.take should be very convenient. No more .select(range(...))!
I'm concerned about the low training speed (10x slower). Do we know anything about the inference latency as well? I think that's key to figure out whether this is viable or not.
Thanks for writing out this list! I try my best to keep up, but even I missed some of these
I quite enjoy the speed of these, well done.
Nice job! What are your findings so far? Can you reasonably handle the lengths that they claim?
1️⃣ A new loss function: CachedGISTEmbedLoss
This loss function is a combination of CachedMultipleNegativesRankingLoss and the GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.
2️⃣ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the Sentence Transformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.
3️⃣ Additionally, you can now specify truncate_dim in evaluators to get the performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up e.g. clustering, retrieval, etc.)
4️⃣ CrossEncoder improvements
The CrossEncoder now supports push_to_hub to upload trained reranker models to Hugging Face. Additionally, CrossEncoders now support trust_remote_code to load models with custom modelling code.
5️⃣ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes are necessary to your code, the device is automatically detected!
Check out the release notes for all of the details: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.7.0
I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!
Awesome! I reckon this'll make it a lot easier to quickly share, save & load some annotation work.
Very glad to see more uses of embedding quantization, great job.
The Recurrent Gemma is very intriguing to me. I'm looking forward to reading more about the RNN-based models when I have some more spare time.
Very exciting! I see you've already created a demo for it here: https://huggingface.co/spaces/urchade/gliner_multiv2.1
Looking forward to your blogpost! It's always exciting to see solid non-generative models.
Embedding quantization converts float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!
Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!
Learn more about it in our blogpost in collaboration with mixedbread.ai: https://huggingface.co/blog/embedding-quantization
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval
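The memory savings are easy to verify yourself: int8 stores one byte per dimension, while binary packs eight dimensions into each byte. A quick back-of-the-envelope check for a 1024-dimensional embedding:

```python
dims = 1024
float32_bytes = dims * 4   # 4 bytes per float32 value
int8_bytes = dims * 1      # 1 byte per dimension
binary_bytes = dims // 8   # 1 bit per dimension, packed into bytes

print(float32_bytes // int8_bytes)    # 4  -> 4x smaller
print(float32_bytes // binary_bytes)  # 32 -> 32x smaller
```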
Here are a few resources to get you started with them:
- All Sentence Transformer models: https://huggingface.co/models?library=sentence-transformers&sort=trending
- Sentence Transformer documentation: https://sbert.net/
- Massive Text Embedding Benchmark (MTEB) Leaderboard: mteb/leaderboard
The embedding space is extremely active right now, so if you're using an embedding model for your retrieval, semantic similarity, reranking, classification, clustering, etc., then be sure to keep an eye on the trending Sentence Transformer models & new models on MTEB.
Also, I'm curious if you've ever used Sentence Transformers via a third party library, like a RAG framework or vector database. I'm quite interested in more integrations to bring everyone free, efficient & powerful embedding models!
It seems that the Space has moved to: https://huggingface.co/spaces/DeepMount00/universal_ner_ita
And the model is now public: https://huggingface.co/DeepMount00/universal_ner_ita
Since then, a little-known research paper introduced GLiNER, a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B despite being almost two orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0.
Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU and still computes labels efficiently and with strong quality. I'm very impressed by these models.
There are two models right now:
* base (English): urchade/gliner_base
* multi (multilingual): urchade/gliner_multi
And my demo to experiment with the base model can be found here: https://huggingface.co/spaces/tomaarsen/gliner_base
I made a demo for the base model! It works like a charm: https://huggingface.co/spaces/tomaarsen/gliner_base
I've had the same idea before as well! I think this should work as well, but I haven't had time to do the research myself. Perhaps @SeanLee97 is interested in trying this out?
1. Matryoshka Loss function - you can now train & perform inference on Matryoshka Embedding models. See also our blogpost: https://huggingface.co/blog/matryoshka
2. CoSENTLoss & AnglELoss: state-of-the-art loss functions. These are quite interesting: they outperform CosineSimilarityLoss on nearly all benchmarks as a drop-in replacement! See also the docs: https://sbert.net/docs/package_reference/losses.html#cosentloss
3. Prompt templates: Many popular models such as intfloat/multilingual-e5-large and BAAI/bge-large-en-v1.5 prefix their texts with prompts, so this adds configuration options to automatically include prompts using model.encode(..., prompt_name="query"), which will include the prompt with the name "query". More info in the docs: https://sbert.net/examples/applications/computing-embeddings/README.html#prompt-templates
4. Instructor support: Support for the INSTRUCTOR line of models, such as hkunlp/instructor-large. Learn how to use them here: https://sbert.net/docs/pretrained_models.html#instructor-models
5. Removed NLTK & sentencepiece dependencies: Should allow for a smaller installation & a slightly faster import!
6. Updated documentation: a new Loss Overview section: https://sbert.net/docs/training/loss_overview.html and more detailed loss functions: https://sbert.net/docs/package_reference/losses.html
And much more! See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.4.0
Some more very exciting updates are still on the horizon!
I've been working hard to get my HF Inbox down, but now my emails have started overflowing
Awesome! Very promising
I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.
I've just uploaded v2.3.1 as well! It includes a niche bug fix for some local models. See more details here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.3.1
Details:
- Uploading Models to the Hub with save_to_hub.
- Downloading Models from the Hub now downloads only necessary files.
- Custom Models (such as jinaai/jina-embeddings-v2-base-de) can now be loaded with trust_remote_code=True.
- Models can now be loaded at specific revisions (e.g. commit hashes or git branches).
- Various device fixes; models will now always operate on the device that you specify.
- A new "Cached" variant of the powerful Multiple Negatives Ranking Loss allows common hardware to reach performance previously only accessible on multi-GPU clusters.
- Computation time of Community Detection was decreased significantly (7x speedup at 500k sentences 🤯)
- Removed the now unnecessary "torchvision" dependency for a smaller installation.
Check out the full changelog here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.3.0
I'll be working on much more changes in the near future, so expect more exciting updates. If you encounter any issues, or have any questions or feature requests, don't hesitate to open an issue on the repository: https://github.com/UKPLab/sentence-transformers/issues