Tom Aarsen

tomaarsen

AI & ML interests

NLP: text embeddings, named entity recognition, few-shot text classification

tomaarsen's activity

replied to albertvillanova's post 2 days ago

Oooh, Dataset.take should be very convenient. No more .select(range(...)) 🚀

replied to Sentdex's post 2 days ago

I'm concerned about the low training speed (10x slower). Do we know anything about the inference latency as well? I think that's key to figure out whether this is viable or not.

replied to fdaudens's post 9 days ago

Thanks for writing out this list! I try my best to keep up, but even I missed some of these.

replied to bwang0911's post 15 days ago

I quite enjoy the speed of these, well done.

replied to beomi's post 16 days ago

Nice job! What are your findings so far? Can you reasonably handle the lengths that they claim?

posted an update 16 days ago
🚀 Sentence Transformers v2.7.0 is out! Featuring a new loss function, easier Matryoshka model inference & evaluation, CrossEncoder improvements & Intel Gaudi2 Accelerator support. Details:

1๏ธโƒฃ A new loss function: CachedGISTEmbedLoss
This loss function is a combination of CachedMultipleNegativesRankingLoss and the GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.

2๏ธโƒฃ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the Sentence Transformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.

3๏ธโƒฃ Additionally, you can now specify truncate_dim in evaluators to get the performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up e.g. clustering, retrieval, etc.)

4๏ธโƒฃ CrossEncoder improvements
The CrossEncoder now supports 'push_to_hub' to upload trained reranker models to Hugging Face. Additionally, CrossEncoders now support trust_remote_code to load models with custom modelling code.

5๏ธโƒฃ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes are necessary to your code, the device is automatically detected!

Check out the release notes for all of the details: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.7.0

I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!
replied to jamarks's post 18 days ago

Awesome! I reckon this'll make it a lot easier to quickly share, save & load some annotation work.

replied to louisbrulenaudet's post 24 days ago

Very glad to see more uses of embedding quantization, great job.

replied to trisfromgoogle's post 24 days ago

The Recurrent Gemma is very intriguing to me. I'm looking forward to reading more about the RNN-based models when I have some more spare time.

replied to urchade's post 25 days ago
replied to MoritzLaurer's post about 1 month ago

Looking forward to your blogpost! It's always exciting to see solid non-generative models.

posted an update about 1 month ago
๐Ÿ… Quantized Embeddings are here! Unlike model quantization, embedding quantization is a post-processing step for embeddings that converts e.g. float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!

Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!

Learn more about it in our blogpost in collaboration with mixedbread.ai: https://huggingface.co/blog/embedding-quantization
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval
posted an update about 2 months ago
🎉 Today, the 5000th Sentence Transformer model was uploaded to Hugging Face! Embedding models are extremely versatile, so it's no wonder that they're still being trained.

Here are a few resources to get you started with them:
- All Sentence Transformer models: https://huggingface.co/models?library=sentence-transformers&sort=trending
- Sentence Transformer documentation: https://sbert.net/
- Massive Text Embedding Benchmark (MTEB) Leaderboard: mteb/leaderboard

The embedding space is extremely active right now, so if you're using an embedding model for retrieval, semantic similarity, reranking, classification, clustering, etc., be sure to keep an eye on the trending Sentence Transformer models & new models on MTEB.

Also, I'm curious if you've ever used Sentence Transformers via a third party library, like a RAG framework or vector database. I'm quite interested in more integrations to bring everyone free, efficient & powerful embedding models!
replied to giux78's post about 2 months ago
posted an update about 2 months ago
I remember very well that about two years ago, 0-shot named entity recognition (i.e. where you can choose any labels on the fly) was completely infeasible. Fast forward a year, and Universal-NER/UniNER-7B-all surprised me by showing that 0-shot NER is possible! However, I had a bunch of concerns that prevented me from ever adopting it myself. For example, the model was 7B parameters, only worked with 1 custom label at a time, and it had a cc-by-nc-4.0 license.

Since then, a little-known research paper introduced GLiNER, a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B, despite being almost 2 orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0.

Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU and still computes labels very efficiently, with great performance. I'm very impressed by the models.

There are two models right now:
* base (english): urchade/gliner_base
* multi (multilingual): urchade/gliner_multi

And my demo to experiment with the base model can be found here: https://huggingface.co/spaces/tomaarsen/gliner_base
replied to urchade's post about 2 months ago
replied to their post about 2 months ago

I've had the same idea before! I think this should work too, but I haven't had time to do the research myself. Perhaps @SeanLee97 is interested in trying this out?

posted an update 2 months ago
🤗 Sentence Transformers v2.4.0 for embedding models is now out! It introduces a lot of powerful features, such as:

1. Matryoshka Loss function - you can now train & perform inference on 🪆 Matryoshka Embedding models. See also our blogpost: https://huggingface.co/blog/matryoshka

2. CoSENTLoss & AnglELoss: State-of-the-art loss functions. These are quite interesting: they outperform CosineSimilarityLoss on nearly all benchmarks as a drop-in replacement! See also the docs: https://sbert.net/docs/package_reference/losses.html#cosentloss

3. Prompt templates: Many popular models such as intfloat/multilingual-e5-large and BAAI/bge-large-en-v1.5 prefix their texts with prompts. This release adds configuration options to include prompts automatically: model.encode(..., prompt_name="query") applies the prompt named "query". More info in the docs: https://sbert.net/examples/applications/computing-embeddings/README.html#prompt-templates

4. Instructor support: Support for the INSTRUCTOR line of models, such as hkunlp/instructor-large. Learn how to use them here: https://sbert.net/docs/pretrained_models.html#instructor-models

5. Removed NLTK & sentencepiece dependencies: Should allow for a smaller installation & a slightly faster import!

6. Updated documentation: a new Loss Overview section: https://sbert.net/docs/training/loss_overview.html and more detailed loss functions: https://sbert.net/docs/package_reference/losses.html

And much more! See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.4.0

Some more very exciting updates are still on the horizon!
ยท
replied to Ali-C137's post 3 months ago

I've been working hard to get my HF Inbox down, but now my emails have started overflowing 🙃

replied to stas's post 3 months ago
replied to lbourdois's post 3 months ago

I did not expect so many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.

replied to their post 3 months ago
posted an update 3 months ago
Sentence Transformers v2.3.0 has been released! It includes several bug fixes, enhanced model loading (including custom models & no more unnecessary file downloads), improved performance, a powerful new loss function, and much more!

Details:
⬆ Uploading Models to the Hub with save_to_hub.
⬇ Downloading Models from the Hub now downloads only necessary files.
⚙ Custom Models (such as jinaai/jina-embeddings-v2-base-de) can now be loaded with trust_remote_code=True.
🔍 Models can now be loaded at specific revisions (e.g. commit hashes or git branches).
🖥️ Various device fixes; models will now always operate on the device that you specify.
📉 A new "Cached" variant of the powerful Multiple Negatives Ranking Loss allows common hardware to reach performance previously only accessible on multi-GPU clusters.
🏎 Computation time of Community Detection was decreased significantly (7x speedup at 500k sentences 🤯).
🪶 Removed the now unnecessary "torchvision" dependency for a smaller installation.

Check out the full changelog here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.3.0

I'll be working on many more changes in the near future, so expect more exciting updates. If you encounter any issues, or have any questions or feature requests, don't hesitate to open an issue on the repository: https://github.com/UKPLab/sentence-transformers/issues