Models
Datasets
Spaces
Posts
Docs
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2305.13711

Tailor-made LLM evaluations: custom evaluations for your LLM

Collection of articles and resources focusing on automatic evaluation for LLM's and their role as unbiased judges in assessing other LLMs' outputs

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Paper • 2306.05685 • Published Jun 9, 2023 • 29
Generative Judge for Evaluating Alignment

Paper • 2310.05470 • Published Oct 9, 2023 • 1
Humans or LLMs as the Judge? A Study on Judgement Biases

Paper • 2402.10669 • Published Feb 16
JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Paper • 2310.17631 • Published Oct 26, 2023 • 32

Papers: LLM as a Judge

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Paper • 2310.17631 • Published Oct 26, 2023 • 32
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Paper • 2306.05685 • Published Jun 9, 2023 • 29
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Paper • 2303.16634 • Published Mar 29, 2023 • 3
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Paper • 2310.08491 • Published Oct 12, 2023 • 53

LLM Hallucination Detection Papers

Collection of LLM hallucination and evaluation papers that I've been exploring and implementing. Some of them have my comments and annotated doodles.

Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation

Paper • 2208.05309 • Published Aug 10, 2022 • 1
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models

Paper • 2305.13711 • Published May 23, 2023 • 2
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Paper • 2302.09664 • Published Feb 19, 2023 • 3
BARTScore: Evaluating Generated Text as Text Generation

Paper • 2106.11520 • Published Jun 22, 2021 • 1

Evals & Monitoring

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Paper • 2303.16634 • Published Mar 29, 2023 • 3
miracl/miracl-corpus

Viewer • Updated Jan 5, 2023 • 77.2M • 5.06k • 44
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Paper • 2306.05685 • Published Jun 9, 2023 • 29
How is ChatGPT's behavior changing over time?

Paper • 2307.09009 • Published Jul 18, 2023 • 23

Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs.

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Paper • 2310.17631 • Published Oct 26, 2023 • 32
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Paper • 2310.08491 • Published Oct 12, 2023 • 53
Generative Judge for Evaluating Alignment

Paper • 2310.05470 • Published Oct 9, 2023 • 1
Calibrating LLM-Based Evaluator

Paper • 2309.13308 • Published Sep 23, 2023 • 11

Try out at twllm.com !

about 23 hours ago

yentinglin/Llama-3-Taiwan-70B-Instruct

Text Generation • Updated Aug 23 • 728 • 64
yentinglin/Llama-3-Taiwan-8B-Instruct

Text Generation • Updated Jul 1 • 32.4k • 54
Running

16

🥇

Open Tw Llm Leaderboard
Runtime error

71

💻

Tw Llama Demo

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs