Tom Aarsen

tomaarsen

https://linkedin.com/in/tomaarsen

AI & ML interests

NLP: text embeddings, named entity recognition, few-shot text classification

Articles

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

🪆 Introduction to Matryoshka Embedding Models

SetFitABSA: Few-Shot Aspect Based Sentiment Analysis using SetFit

🕳️ Attention Sinks in LLMs for endless fluency

Organizations

Posts 8

Post

990

NuMind has just released 3 new state-of-the-art GLiNER models for Named Entity Recognition/Information Extraction. These GLiNER models let you specify any label you want, and they'll find the spans in the text that correspond to it. They've been shown to work quite well on unusual domains, e.g. celestial entities in my picture.

There are 3 models released:
- numind/NuNER_Zero: The primary model, SOTA & can detect really long entities.
- numind/NuNER_Zero-span: Slightly better performance than NuNER Zero, but can't detect entities longer than 12 tokens.
- numind/NuNER_Zero-4k: Slightly worse than NuNER Zero, but has a context length of 4k tokens.

Some more details about these models in general:
- They are *really* small, orders of magnitude smaller than LLMs, which don't reach this level of performance.
- Because they're small - they're fast: <1s per sentence on free GPUs.
- They have an MIT license: free commercial usage.

Try out the demo here: numind/NuZero
Or check out all of the models here: numind/nunerzero-zero-shot-ner-662b59803b9b438ff56e49e2

If I ever need to extract information from any text, I'll be using these. Great work @Serega6678!
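For reference, the label-driven extraction described in this post could look roughly like the sketch below. It assumes the `gliner` Python package and its `GLiNER.from_pretrained` / `predict_entities` calls; the model id is copied from the post as written and should be checked against the Hub. `extract_entities` and `group_by_label` are hypothetical helper names, the lower-casing of labels is an assumption, and the sample predictions are hand-written to illustrate the assumed output format rather than real model output.

```python
from collections import defaultdict


def extract_entities(text, labels, model_id="numind/NuNER_Zero", threshold=0.5):
    """Run zero-shot NER with a GLiNER-style model (downloads weights on first use)."""
    # Imported lazily so the helper below works without the `gliner` package.
    from gliner import GLiNER

    model = GLiNER.from_pretrained(model_id)
    # Assumption: these models expect lower-cased labels.
    labels = [label.lower() for label in labels]
    return model.predict_entities(text, labels, threshold=threshold)


def group_by_label(entities):
    """Group predicted spans by label, in order of appearance in the text."""
    grouped = defaultdict(list)
    for entity in sorted(entities, key=lambda e: e["start"]):
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written predictions (not real model output) for the sentence
# "Betelgeuse and Rigel are stars in Orion."
sample = [
    {"start": 15, "end": 20, "text": "Rigel", "label": "star", "score": 0.9},
    {"start": 34, "end": 39, "text": "Orion", "label": "constellation", "score": 0.9},
    {"start": 0, "end": 10, "text": "Betelgeuse", "label": "star", "score": 0.9},
]

print(group_by_label(sample))
```

With the package installed, something like `group_by_label(extract_entities(text, ["star", "constellation"]))` would run the model itself; the first call downloads the weights.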

Post

2412

I've just stumbled upon some excellent work on (🇫🇷 French) retrieval models by @antoinelouis. Kudos to him!

- French Embedding Models: https://huggingface.co/collections/antoinelouis/dense-single-vector-bi-encoders-651523c0c75a3d4c44fc864d
- French Reranker Models: https://huggingface.co/collections/antoinelouis/cross-encoder-rerankers-651523f16efa656d1788a239
- French Multi-vector Models: https://huggingface.co/collections/antoinelouis/dense-multi-vector-bi-encoders-6589a8ee6b17c06872e9f075
- Multilingual Models: https://huggingface.co/collections/antoinelouis/modular-retrievers-65d53d0db64b1d644aea620c

A lot of these models use the MS MARCO Hard Negatives dataset, which I'm currently reformatting to be more easily usable. Notably, the reformatted data should work out of the box, without any pre-processing, for training embedding models in the upcoming Sentence Transformers v3.

Collections 9

Spaces 1

GLiNER-medium-v2.1, zero-shot NER

Models 69

tomaarsen/xlm-roberta-base-multilingual-en-ar-fr-de-es-tr-it

Sentence Similarity • Updated 12 days ago • 79 • 2

tomaarsen/distilbert-base-uncased-wikipedia-sections-triplet

Sentence Similarity • Updated 12 days ago

tomaarsen/bert-base-uncased-multi-task

Sentence Similarity • Updated 12 days ago

tomaarsen/stsb-distilbert-base-mnrl-cl-multi

Sentence Similarity • Updated 12 days ago

tomaarsen/distilroberta-base-paraphrases-multi

Sentence Similarity • Updated 12 days ago

tomaarsen/stsb-distilbert-base-mnrl

Sentence Similarity • Updated 12 days ago

tomaarsen/stsb-distilbert-base-ocl

Sentence Similarity • Updated 12 days ago

tomaarsen/distilbert-base-uncased-sts

Sentence Similarity • Updated 12 days ago

tomaarsen/all-mpnet-base-v2-sts

Sentence Similarity • Updated 12 days ago

tomaarsen/distilroberta-base-nli-v3

Sentence Similarity • Updated 12 days ago

Datasets 8

tomaarsen/quantized-wikipedia-indices

tomaarsen/ner-orgs

Viewer • Updated Nov 22, 2023 • 19 • 4

tomaarsen/setfit-absa-semeval-laptops

Viewer • Updated Nov 16, 2023 • 154 • 1

tomaarsen/setfit-absa-semeval-restaurants

Viewer • Updated Nov 16, 2023 • 237

tomaarsen/MultiCoNER

Viewer • Updated Oct 1, 2023 • 6 • 1

tomaarsen/conll2002

Viewer • Updated Sep 23, 2023

tomaarsen/conllpp

Viewer • Updated Jun 1, 2023 • 2

tomaarsen/conll2003

Viewer • Updated May 8, 2023 • 16