Vatolin Alexey

vatolinalex

Recent Activity

liked a model 7 days ago

EuroBERT/EuroBERT-210m

reacted to tomaarsen's post with ❤️ 7 days ago

An assembly of 18 European companies, labs, and universities have banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper with all details, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* https://huggingface.co/EuroBERT/EuroBERT-210m
* https://huggingface.co/EuroBERT/EuroBERT-610m
* https://huggingface.co/EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
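
The post's remark about bi-directional (non-causal) attention can be illustrated with a toy sketch. This is not EuroBERT's code: the function name `attention_weights` and the uniform 3x3 score matrix are assumptions made only for illustration. It shows how a causal mask limits each position to earlier tokens, while an encoder-style mask lets every position attend to the whole sequence.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix, optionally causally masked.

    Hypothetical helper for illustration; not taken from any library.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Causal (decoder-style): position i sees only positions 0..i.
        # Bidirectional (encoder-style): position i sees every position.
        visible = [j for j in range(n) if (j <= i or not causal)]
        exps = {j: math.exp(scores[i][j]) for j in visible}
        total = sum(exps.values())
        # Masked positions receive zero attention weight.
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # uniform toy scores for 3 tokens
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
```

With the causal mask, the first token can attend only to itself; with the bidirectional mask, it spreads its attention over all three tokens, which is what lets an encoder use context from both sides of a token.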

vatolinalex's activity

upvoted a paper 7 days ago

Training Sparse Mixture Of Experts Text Embedding Models

Paper • 2502.07972 • Published Feb 11 • 5

upvoted a paper 21 days ago

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Paper • 2502.15007 • Published 25 days ago • 163

upvoted a paper 24 days ago

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Paper • 2502.14499 • Published 25 days ago • 180

upvoted 2 collections 3 months ago

rusBEIR-datasets

Collection of datasets used in rusBEIR • 57 items • Updated 10 days ago • 4

Russian Q&A datasets

Datasets collected from scraping Russian question answering websites • 4 items • Updated Mar 15, 2024 • 1

upvoted a paper 7 months ago

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

Paper • 2408.12503 • Published Aug 22, 2024 • 24