stereoplegic's Collections

KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
Paper • 2310.15511 • Published • 4

HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Paper • 2310.14566 • Published • 25

SmartPlay: A Benchmark for LLMs as Intelligent Agents
Paper • 2310.01557 • Published • 12

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
Paper • 2310.03214 • Published • 18

TiC-CLIP: Continual Training of CLIP Models
Paper • 2310.16226 • Published • 8

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
Paper • 2310.11248 • Published • 3

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 4

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Paper • 2308.14508 • Published • 2

JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Paper • 2310.17631 • Published • 33

L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Paper • 2307.11088 • Published • 4

Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation
Paper • 2308.01240 • Published • 2

ALERT: Adapting Language Models to Reasoning Tasks
Paper • 2212.08286 • Published • 2

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
Paper • 2309.06495 • Published • 1

RAGAS: Automated Evaluation of Retrieval Augmented Generation
Paper • 2309.15217 • Published • 3

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Paper • 2310.11440 • Published • 15

Benchmarking Large Language Models in Retrieval-Augmented Generation
Paper • 2309.01431 • Published • 1

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Paper • 2306.05087 • Published • 6

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models
Paper • 2306.04757 • Published • 6

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Paper • 2306.04528 • Published • 3

PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance
Paper • 2306.05443 • Published • 3

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
Paper • 2308.01861 • Published • 1

Out of the BLEU: how should we assess quality of the Code Generation models?
Paper • 2208.03133 • Published • 2

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
Paper • 2309.01940 • Published • 1

COPEN: Probing Conceptual Knowledge in Pre-trained Language Models
Paper • 2211.04079 • Published • 1

Benchmarking Language Models for Code Syntax Understanding
Paper • 2210.14473 • Published • 1

BigIssue: A Realistic Bug Localization Benchmark
Paper • 2207.10739 • Published • 1

CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Paper • 2305.11626 • Published • 1

AutoMLBench: A Comprehensive Experimental Evaluation of Automated Machine Learning Frameworks
Paper • 2204.08358 • Published • 1

Continual evaluation for lifelong learning: Identifying the stability gap
Paper • 2205.13452 • Published • 1

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Paper • 2311.07463 • Published • 13

RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
Paper • 2308.10633 • Published • 1

Fake Alignment: Are LLMs Really Aligned Well?
Paper • 2311.05915 • Published • 2

ToolTalk: Evaluating Tool-Usage in a Conversational Setting
Paper • 2311.10775 • Published • 7

MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
Paper • 2310.03128 • Published • 1

RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit
Paper • 2306.05212 • Published • 1

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper • 2311.12022 • Published • 25

GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 185

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks
Paper • 2311.09835 • Published • 9

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11

CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
Paper • 2310.06266 • Published • 1

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Paper • 2401.13311 • Published • 10

Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text
Paper • 2308.02357 • Published • 1

Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models
Paper • 2305.15074 • Published • 1

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Paper • 2402.14261 • Published • 10

The FinBen: An Holistic Financial Benchmark for Large Language Models
Paper • 2402.12659 • Published • 17

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?
Paper • 2402.11597 • Published • 1

ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Paper • 2401.00741 • Published