import streamlit as st
import requests
from collections import defaultdict
import pandas as pd
import plotly.graph_objects as go
header = """Toloka compared and ranked LLM output in multiple categories, using Guanaco 13B as the baseline.
We used human evaluation to rate model responses to real prompts."""
description = """The Toloka LLM leaderboard provides a human evaluation framework. We ask [Toloka](https://toloka.ai/) domain experts to assess the models' responses. The responses are generated by open-source LLMs from a dataset of real-world user prompts, which are categorized as per the [InstructGPT paper](https://arxiv.org/abs/2203.02155). Annotators then evaluate the responses in the manner of [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/), except that we use [Guanaco 13B](https://huggingface.co/timdettmers/guanaco-13b) as the baseline instead of text-davinci-003, because Guanaco 13B is the closest counterpart to the now-deprecated text-davinci-003 in AlpacaEval.

The metrics on the leaderboard represent the win rate of each model against Guanaco 13B across various prompt categories. The "Total" category aggregates all prompts; it is not simply the average of the per-category metrics.
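
To illustrate why "Total" can differ from the average of the category metrics, here is a minimal sketch. The `verdicts` records and the `win_rate` helper are hypothetical and made up for this example; they are not the leaderboard data or the actual pipeline.

```python
from collections import defaultdict

# Hypothetical verdicts: (category, True if experts preferred the tested model over the baseline).
verdicts = [
    ("generation", True), ("generation", True), ("generation", True), ("generation", False),
    ("rewrite", False),
]

def win_rate(outcomes):
    # Share of prompts on which the tested model beat the baseline.
    return sum(outcomes) / len(outcomes)

by_category = defaultdict(list)
for category, won in verdicts:
    by_category[category].append(won)

category_rates = {category: win_rate(outcomes) for category, outcomes in by_category.items()}
total = win_rate([won for _, won in verdicts])  # pooled over all prompts
mean_of_categories = sum(category_rates.values()) / len(category_rates)

print(category_rates)      # {'generation': 0.75, 'rewrite': 0.0}
print(total)               # 0.6
print(mean_of_categories)  # 0.375
```

Because categories contain different numbers of prompts, the pooled value (0.6) is weighted toward the larger category, while the plain mean (0.375) treats every category equally.
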
### 📊 The evaluation method
#### 🖊 Stage 1: Prompt collection
We collected our own dataset of organic prompts for LLM evaluation.
The alternative is to use open-source prompts, but they are not reliable enough for high-quality evaluation. Using open-source datasets can be restrictive for several reasons:
1. Many open-source prompts are too generic and do not reflect the needs of a business looking to implement an LLM.
2. The range of tasks the open-source prompts cover might be broad but the distribution is skewed towards certain tasks that are not necessarily the most relevant for business applications.
3. It is virtually impossible to guarantee that the dataset was not leaked and that the open-source prompts were not included in the training data of existing LLMs.

To mitigate these issues, we collected organic prompts sent to ChatGPT (some were submitted by Toloka employees and some we found on the internet, but all of them came from real conversations with ChatGPT). These prompts are the key to accurate evaluation: **we can be certain that the prompts represent real-world use cases, and they were not used in any LLM training sets.** We store the dataset securely and reserve it solely for use in this particular evaluation.

After collecting the prompts, we manually classified them by category and got the following distribution:"""
# * Brainstorming: 15.48%
# * Chat: 1.59%
# * Classification: 0.2%
# * Closed QA: 3.77%
# * Extraction: 0.6%
# * Generation: 38.29%
# * Open QA: 32.94%
# * Rewrite: 5.16%
# * Summarization: 1.98%
fig = go.Figure(
    data=[
        go.Bar(
            x=["Generation", "Open QA", "Brainstorming", "Rewrite", "Closed QA", "Summarization", "Chat", "Extraction", "Classification"],
            y=[38.29, 32.94, 15.48, 5.16, 3.77, 1.98, 1.59, 0.6, 0.2],
        )
    ],
)
fig.update_layout(yaxis_title="% of prompts")
description2 = """We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as [HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval).
#### 🧠 Stage 2: Human evaluation
Human evaluation of the model responses was conducted by [Toloka’s domain experts](https://toloka.ai/blog/ai-tutors/).

Our experts were given a prompt and responses to this prompt from two different models: the reference model (Guanaco 13B) and the model under evaluation. In a side-by-side comparison, experts selected the best output according to the [harmlessness, truthfulness, and helpfulness principles](https://arxiv.org/pdf/2203.02155.pdf).

In other words, each model was compared to the same baseline model rather than to every other competitor. We then calculated the percentage of prompts where humans preferred the tested model’s output over the baseline model’s output (this is called the model’s win rate). The leaderboard shows the win rate in each category, as well as the overall win rate across all prompts, for each of the tested models.

Most importantly, we ensured the accuracy of human judgments by using advanced quality control techniques:
- Annotator onboarding with rigorous qualification tests to certify experts and check their performance on evaluation tasks.
- Overlap of 3 with Dawid-Skene aggregation of the results (each prompt was evaluated by 3 experts and aggregated to achieve a single verdict).
- Monitoring individual accuracy by comparing each expert’s results with the majority vote; those who fell below the accuracy threshold were removed from the evaluation project.
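
As a rough illustration of the last two points, the sketch below aggregates three labels per prompt and then scores each expert against the aggregated verdict. It uses a plain majority vote as a simpler stand-in for Dawid-Skene, which additionally estimates how reliable each annotator is. The labels, expert IDs, and `ACCURACY_THRESHOLD` value are hypothetical and are not taken from the actual evaluation project.

```python
from collections import Counter, defaultdict

# Hypothetical labels: (prompt_id, expert_id, preferred side), three experts per prompt.
labels = [
    ("p1", "a", "model"), ("p1", "b", "model"), ("p1", "c", "baseline"),
    ("p2", "a", "baseline"), ("p2", "b", "baseline"), ("p2", "c", "baseline"),
    ("p3", "a", "model"), ("p3", "b", "model"), ("p3", "c", "baseline"),
]

votes = defaultdict(list)
for prompt_id, _, choice in labels:
    votes[prompt_id].append(choice)

# Majority vote: with an overlap of 3 and two options there is always a strict majority.
verdicts = {prompt_id: Counter(choices).most_common(1)[0][0] for prompt_id, choices in votes.items()}

# Per-expert accuracy: share of an expert's labels that match the majority verdict.
hits = defaultdict(list)
for prompt_id, expert_id, choice in labels:
    hits[expert_id].append(choice == verdicts[prompt_id])
accuracy = {expert_id: sum(h) / len(h) for expert_id, h in hits.items()}

ACCURACY_THRESHOLD = 0.5  # assumed value, for illustration only
flagged = [expert_id for expert_id, acc in accuracy.items() if acc < ACCURACY_THRESHOLD]

print(verdicts)  # {'p1': 'model', 'p2': 'baseline', 'p3': 'model'}
print(flagged)   # ['c']
```
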
#### 👉 Ready to compare LLMs?
Find your AI application’s use case categories on our leaderboard and see how the models stack up. It never hurts to check more leaderboards ([Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?tab=evaluation), [LMSYS](https://leaderboard.lmsys.org/), or others) for the big picture before you pick a model and start experimenting.
If you’re interested in comparing more LLMs using our experts, or you need reliable evaluation of your model, we have the tools you need.
Reach out to our team to learn how Toloka can help you achieve the quality insights you’re looking for.
"""
pretty_category_names = {
"all": "Total",
"brainstorming": "Brainstorming",
"closed_qa": "Closed QA",
"generation": "Generation",
"open_qa": "Open QA",
"rewrite": "Rewrite",
}
pretty_model_names = {
"gpt-4": "GPT-4",
"WizardLM/WizardLM-13B-V1.2": "WizardLM 13B V1.2",
"meta-llama/Llama-2-70b-chat-hf": "LLaMA 2 70B Chat",
"gpt-3.5-turbo": "GPT-3.5 Turbo",
"lmsys/vicuna-33b-v1.3": "Vicuna 33B V1.3",
"timdettmers/guanaco-13b": "Guanaco 13B",
}
reference_model_name = "timdettmers/guanaco-13b"
leaderboard_results = requests.get("https://llmleaderboard.blob.core.windows.net/llmleaderboard/evaluation_resuls.json").json()
# Sort before deriving the column names so that headers and row values are built in the same order.
categories = sorted(leaderboard_results.keys())
pretty_categories = [pretty_category_names[category] for category in categories if category in pretty_category_names]
models = set()
model_ratings = defaultdict(dict)
for category in categories:
    for entry in leaderboard_results[category]:
        model = entry['model']
        models.add(model)
        model_ratings[model][category] = entry['rating']
table = []
for model in models:
    row = [model]
    for category in categories:
        if category not in pretty_category_names:
            continue
        if category not in model_ratings[model]:
            row.append(0.0)
        else:
            row.append(model_ratings[model][category] * 100)
    table.append(row)
table = pd.DataFrame(table, columns=['Model'] + pretty_categories)
table = table.sort_values(by=['Total'], ascending=False)
table = table.head(5)
# Add row with reference model
row = [reference_model_name] + [50.0] * len(pretty_categories)
table = pd.concat([table, pd.DataFrame([pd.Series(row, index=table.columns)])], ignore_index=True)
table = table.sort_values(by=['Total'], ascending=False)
table.index = ["🥇 1", "🥈 2", "🥉 3"] + list(range(4, len(table) + 1))
for category in pretty_category_names.values():
    table[category] = table[category].map('{:,.2f}%'.format)
avg_token_counts = requests.get("https://llmleaderboard.blob.core.windows.net/llmleaderboard/token_count.json").json()
table['Avg. Response Length (in characters)'] = [int(avg_token_counts[model]) if model != reference_model_name else int(avg_token_counts["TheBloke/guanaco-13B-HF"]) for model in table['Model']]
table['HF Hub Link'] = [f"https://huggingface.co/{model}" if "/" in model else "" for model in table["Model"]]
table["Model"] = [pretty_model_names[model] if model in pretty_model_names else model for model in table["Model"]]
st.set_page_config(layout="wide")
st.title('🏆 Toloka LLM Leaderboard for Real-World Evaluation')
st.markdown(header)
st.dataframe(
    table,
    column_config={
        "HF Hub Link": st.column_config.LinkColumn(
            "HF Hub Link",
            help="HF Hub Link",
        )
    }
)
st.markdown(description)
st.plotly_chart(fig, theme="streamlit")
st.markdown(description2)
st.link_button('🚀 Evaluate my model', url='https://toloka.ai/talk-to-us/')
prompt_examples = """
### 🔍 Prompt Examples
| Prompt | Model | Output |
| --- | --- | --- |
| What does 'burning the midnight oil' mean? | GPT-4 | "Burning the midnight oil" is an idiomatic expression that means working late into the night. It originates from the past when people used oil lamps for light to work after the sun went down. So, if someone is "burning the midnight oil," they are up late working, often on a project or task that requires extra time. |
| | WizardLM | "Burning the midnight oil" is an idiom that means working late into the night, often until midnight or later, to complete tasks or meet deadlines. It is often used to describe an intense period of focused, dedicated work, often to the point of exhaustion. The phrase suggests that the person is spending so much time and energy on their work that they are essentially "burning" through their hours of rest and sleep, much like a candle burns through its wax. |

Here WizardLM misunderstood the prompt:

| Prompt | Model | Output |
| --- | --- | --- |
| What is 5x5? | GPT-4 | 5x5 is 25. |
| | WizardLM | 5x5 refers to a 5 by 5 grid or a 5-dimensional space. In mathematics, a 5x5 grid is a square grid with five rows and five columns. Each cell in the grid is identified by a pair of coordinates, (i, j), where i ranges from 1 to 5 for the row and j ranges from 1 to 5 for the column. In a 5x5 grid, there are 25 cells in total. Each cell can be filled with a specific value or content, such as a number, a letter, or a symbol. For example, a 5x5 grid can be used to create a puzzle where you have to fill in the missing numbers or letters in a specific pattern. In higher mathematics, a 5-dimensional space is a more abstract concept used to represent complex geometric structures. It's often used in fields like physics and computer science to model complex systems and relationships between different variables. In everyday life, a 5x5 grid can be used to represent a calendar with five columns for days of the week and five rows for different time slots in a day. |
"""
st.markdown(prompt_examples)