Tom Aarsen

tomaarsen

https://linkedin.com/in/tomaarsen

AI & ML interests

NLP: text embeddings, information retrieval, named entity recognition, few-shot text classification

Recent Activity

liked a model about 7 hours ago

nomic-ai/modernbert-embed-base

upvoted an article about 11 hours ago

Fine-tune ModernBERT for text classification using synthetic data

liked a model about 12 hours ago

ChocoLlama/Llama-3-ChocoLlama-8B-instruct

Articles

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

Mar 22 • 69

🪆 Introduction to Matryoshka Embedding Models

Feb 23 • 62

SetFitABSA: Few-Shot Aspect Based Sentiment Analysis using SetFit

Dec 6, 2023 • 6

🕳️ Attention Sinks in LLMs for endless fluency

Oct 9, 2023 • 7

tomaarsen's activity

upvoted an article about 11 hours ago

Fine-tune ModernBERT for text classification using synthetic data

Article • about 14 hours ago • 1

upvoted a collection 3 days ago

Granite 3.1 Language Models

Collection

A series of language models with 128K context length trained by IBM licensed under Apache 2.0 license. • 8 items • Updated 13 days ago • 41

upvoted 2 papers 11 days ago

Spectrum: Targeted Training on Signal to Noise Ratio

Paper • 2406.06623 • Published Jun 7 • 12

Qwen2.5 Technical Report

Paper • 2412.15115 • Published 11 days ago • 331

upvoted an article 11 days ago

Use Models from the Hugging Face Hub in LM Studio

Article • Nov 28 • 127

upvoted a paper 11 days ago

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 13 days ago • 113

upvoted a collection 11 days ago

ModernBERT

Collection

Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 11 days ago • 105

upvoted an article 25 days ago

Building a Local Vector Database Index with Annoy and Sentence Transformers

Article • 25 days ago • 3

upvoted an article 26 days ago

🐺🐦‍⬛ LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs

Article • 26 days ago • 73

upvoted an article 27 days ago

Accelerating Embedding & Reranking Models on AMD Using Infinity

Article • 28 days ago • 4

upvoted an article 28 days ago

EuroLLM-9B

Article • 29 days ago • 104

upvoted a paper about 1 month ago

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

Paper • 2411.12946 • Published Nov 20 • 20

upvoted a collection about 1 month ago

Models for dataset curation

Collection

9 items • Updated 25 days ago • 17

upvoted an article about 1 month ago

Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK

Article • Nov 21 • 34

upvoted a paper about 1 month ago

Drowning in Documents: Consequences of Scaling Reranker Inference

Paper • 2411.11767 • Published Nov 18 • 17

upvoted an article about 1 month ago

Halo: Open Source Health Tracking with Wearables

Article • Nov 19 • 99

upvoted an article about 2 months ago

Releasing the largest multilingual open pretraining dataset

Article • Nov 13 • 98

upvoted a collection about 2 months ago

Training with Prompts

Collection

See the Training with Prompts documentation for more details: https://sbert.net/examples/training/prompts/README.html • 5 items • Updated Nov 7 • 3

upvoted an article about 2 months ago

Releasing Common Corpus: the largest public domain dataset for training LLMs

Article • Mar 20 • 18

upvoted a collection 2 months ago

Model2Vec base models

Collection

These are the Minishlab Model2Vec base models. Load them and use them with model2vec (https://github.com/MinishLab/model2vec) or sentence-transformers • 7 items • Updated 16 days ago • 8

Articles

Finally, a Replacement for BERT: Introducing ModernBERT

Welcome Gemma 2 - Google's new open LLM

Training and Finetuning Embedding Models with Sentence Transformers v3

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon
