data_only_hallucination_leaderboard


aryopg committed on Jan 7

Commit a911aee • 1 Parent(s): b7d562b

first draft: add tasks background info

Files changed (1)

src/display/about.py +31 -0


@@ -10,6 +10,37 @@ The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation
 """
 LLM_BENCHMARKS_TEXT = f"""
+# Context
+As large language models (LLMs) get better at generating believable text, addressing their hallucinations becomes increasingly important. With numerous LLMs released every week, it can be challenging to identify the leading models, particularly in terms of their reliability against hallucination. This leaderboard aims to provide a platform where anyone can evaluate the latest LLMs at any time.
+# How it works
+📈 We evaluate the models on 11 hallucination benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank">  Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
+- <a href="https://aclanthology.org/P19-1612/" target="_blank"> NQ Open </a> .
+- <a href="https://aclanthology.org/P17-1147/" target="_blank"> TriviaQA </a> .
+- <a href="https://aclanthology.org/2022.acl-long.229/" target="_blank"> TruthfulQA MC1 </a> - a benchmark to measure whether a language model is truthful in generating answers to questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. **MC1 denotes that there is a single correct label**.
+- <a href="https://aclanthology.org/2022.acl-long.229/" target="_blank"> TruthfulQA MC2 </a> - a benchmark to measure whether a language model is truthful in generating answers to questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. **MC2 denotes that there can be multiple correct labels**.
+- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval QA </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. **QA denotes the question answering task**.
+- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval Summ </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. **Summ denotes the summarisation task**.
+- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval Dial </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. **Dial denotes the knowledge-grounded dialogue task**.
+- <a href="https://aclanthology.org/2020.acl-main.173/" target="_blank"> XSum </a> - a dataset of BBC news articles paired with their single-sentence summaries, used to evaluate abstractive summarisation by language models.
+- <a href="https://arxiv.org/abs/1704.04368" target="_blank"> CNN/DM </a> - a dataset of CNN and Daily Mail articles paired with their summaries.
+- <a href="https://github.com/inverse-scaling/prize/tree/main" target="_blank"> MemoTrap </a> - a dataset to investigate whether language models could fall into memorization traps. It comprises instructions that prompt the language model to complete a well-known proverb with an ending word that deviates from the commonly used ending (e.g., Write a quote that ends in the word “early”: Better late than ).
+- <a href="https://arxiv.org/abs/2311.07911v1" target="_blank"> IFEval </a> - a dataset to evaluate the instruction-following ability of large language models. It contains 500+ prompts with instructions such as "write an article with more than 800 words" or "wrap your response with double quotation marks".
+For all these evaluations, a higher score is a better score.
+# Details and logs
+You can find details on the inputs/outputs for the models in the `details` of each model, which you can access by clicking the 📄 emoji after the model name.
+# Reproducibility
+Hyperparameters: XXX
+Device(s): XXX
+Metrics: XXX
+"""
+FAQ_TEXT = """
+---------------------------
+# FAQ
 XXX
 """