The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs Paper • 2503.20000 • Published 2 days ago • 1
BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction Paper • 2503.19658 • Published 2 days ago • 1
SkyLadder: Better and Faster Pretraining via Context Window Scheduling Paper • 2503.15450 • Published 8 days ago • 11
InsectSet459: an open dataset of insect sounds for bioacoustic machine learning Paper • 2503.15074 • Published 8 days ago • 1
Brazilian legal datasets ⚖️ Collection A collection of data extracted from the courts of Brazil (and other legal websites) • 31 items • Updated 8 days ago • 2
Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru Paper • 2503.07587 • Published 17 days ago • 10
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published 17 days ago • 95
JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments Paper • 2503.08379 • Published 16 days ago • 2
EuroBERT: Scaling Multilingual Encoders for European Languages Paper • 2503.05500 • Published 20 days ago • 75
Article HuggingFace, IISc partner to supercharge model building on India's diverse languages 29 days ago • 17
rank1 Collection rank1 is the first test-time compute reasoning model in IR • 15 items • Updated 28 days ago • 3
OWLS: Scaling Laws for Speech Recognition and Translation Collection 🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16 kHz sampling rate. • 7 items • Updated 17 days ago • 4
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models Paper • 2502.15964 • Published Feb 21 • 1
"Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts Paper • 2502.16839 • Published Feb 24 • 1