Article wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR?? By catherinearnett • 4 days ago • 28
Article Building a Custom Arabic Semantic Search Model with Arabic Matryoshka Embeddings for RAG Using Sentence Transformers By Omartificial-Intelligence-Space • 7 days ago • 4
embedic Collection Say hello to Embedić, a group of new text embedding models finetuned for the Serbian language! • 3 items • Updated 23 days ago • 5
jina-embeddings-v3 Collection Multilingual multi-task general text embedding model • 6 items • Updated 13 days ago • 12
jina-embeddings-v3: Multilingual Embeddings With Task LoRA Paper • 2409.10173 • Published 16 days ago • 21
Article Fine-tuning a token classification model for legal data using Argilla and AutoTrain By bikashpatra • 25 days ago • 11
NanoBEIR 🍺 Collection A collection of smaller versions of BEIR datasets with 50 queries and up to 10K documents each. • 13 items • Updated 21 days ago • 3
Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text Paper • 2409.02078 • Published 28 days ago • 8
WebInstruct 🌐 Embeddings 🧱 Models Collection A collection of SoTA embedding models fine-tuned on the WebInstruct dataset to learn to pair instructions with their responses • 3 items • Updated 27 days ago • 11
Article ArabicWeb24: Creating a High Quality Arabic Web-only Pre-training Dataset By MayFarhat • Aug 8 • 9
BRAG-v0.1 Collection BRAG is a series of SLMs (Small Language Models) specifically trained for RAG tasks. We release models with size 1.5b, 7b and 8b. • 4 items • Updated Aug 4 • 13
Maverick Coreference Resolution Collection Efficient and Accurate Coreference Resolution models. • 3 items • Updated Jul 31 • 8
Article Mixedbread 🤝 deepset: Announcing our New German/English Embedding Model By shadeMe • Jul 19 • 15
BM25S: Orders of magnitude faster lexical search via eager sparse scoring Paper • 2407.03618 • Published Jul 4 • 10
Article ColPali: Efficient Document Retrieval with Vision Language Models 👀 By manu • Jul 5 • 109
Sentence Encoders Collection Collection of models and datasets for the sentence encoder task • 4 items • Updated Jul 5 • 6
TrOCR Medieval HTR Collection This is a collection of models trained to recognize medieval scripts. • 10 items • Updated Jul 8 • 4
Arabic NLI & Semantic Similarity Datasets Collection The Arabic Version of SNLI and MultiNLI datasets, originally used for Natural Language Inference (NLI), may be used for finetuning embedding models. • 6 items • Updated Jun 18 • 4
Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • Jul 9 • 35
Article Open-source embeddings and LLMs outperform Gemini and OpenAI for Web Navigation while being faster and cheaper By dhuynh95 • Jun 21 • 6
Article Training and Finetuning Embedding Models with Sentence Transformers v3 May 28 • 148
Graph-enhanced RAG Collection using knowledge graphs in RAG for grounding LLM results • 22 items • Updated 2 days ago • 7
Arabic Matryoshka Embedding Models Collection A collection of advanced Arabic Matryoshka Embedding Models designed for efficient and high-performance Arabic NLP, available publicly on Hugging Face • 9 items • Updated Aug 2 • 8
GPL BEIR Datasets Collection Generative Pseudo Labeling training datasets for all domains in BEIR. • 15 items • Updated Apr 28 • 1
🦢SWIM-IR Dataset Collection 29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs. • 4 items • Updated Apr 28 • 7
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval Paper • 2311.05800 • Published Nov 10, 2023 • 3
miniMiracle dense retrievers Collection Low-footprint multilingual retrievers, all-minilm-* equivalent. • 4 items • Updated Aug 27 • 5
Article Introducing the Hugging Face Embedding Container for Amazon SageMaker Jun 7 • 13
Article Introducing NPC-Playground, a 3D playground to interact with LLM-powered NPCs Jun 5 • 17
Nomic Embed Vision Collection Vision Encoders aligned to Nomic Embed Text making Nomic Embed multimodal! • 2 items • Updated Jun 5 • 5
Hugging Face community’s Wikimedia datasets Collection Wikimedia datasets created by the Hugging Face community, not Wikimedia. Sorted by Wikimedia project. • 17 items • Updated Jun 7 • 9
Arabic NoRobots DPO Datasets Collection Our synthetic DPO datasets for Arabic NoRobots. • 4 items • Updated May 29 • 4
Article How to Fine-Tune Custom Embedding Models Using AutoTrain By abhishek • May 30 • 10
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations Paper • 2405.18392 • Published May 28 • 12
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated Jun 21 • 21