Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing Paper • 2404.12253 • Published Apr 18 • 51
How Far Can We Go with Practical Function-Level Program Repair? Paper • 2404.12833 • Published Apr 19 • 6
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published Apr 29 • 67
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models Paper • 2405.01535 • Published May 2 • 106
From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries Paper • 2406.12824 • Published 11 days ago • 20
Tokenization Falling Short: The Curse of Tokenization Paper • 2406.11687 • Published 12 days ago • 13
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content Paper • 2406.11811 • Published 12 days ago • 14
GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks Paper • 2406.12925 • Published 15 days ago • 17
HARE: HumAn pRiors, a key to small language model Efficiency Paper • 2406.11410 • Published 13 days ago • 37
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Paper • 2406.12624 • Published 11 days ago • 34
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs Paper • 2406.15319 • Published 8 days ago • 53
Octo-planner: On-device Language Model for Planner-Action Agents Paper • 2406.18082 • Published 4 days ago • 42
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Paper • 2406.18495 • Published 3 days ago • 10