Spaces:
Sleeping
Sleeping
import streamlit as st | |
st.markdown(""" | |
1. Evals Documentation: https://github.com/AaronCWacker/evals/tree/main/docs | |
2. Evals Evals: https://github.com/AaronCWacker/evals/tree/main/evals | |
3. Evals Examples: https://github.com/AaronCWacker/evals/tree/main/examples | |
4. Evals Scripts: https://github.com/AaronCWacker/evals/tree/main/scripts | |
# evals registry data for RLHF task improvement: https://github.com/openai/evals/tree/0e6b55576f0ff4a2a48dec1d7e4ae1e7c6cb8d61/evals/registry/data | |
# Thank you for contributing an eval! ♥️ | |
🚨 Please make sure your PR follows these guidelines, __failure to follow | |
the guidelines below will result in the PR being closed automatically__. | |
Note that even if the criteria are met, that does not guarantee the PR | |
will be merged nor GPT-4 access granted. 🚨 | |
__PLEASE READ THIS__: | |
In order for a PR to be merged, it must fail on GPT-4. We are aware that | |
right now, users do not have access, so you will not be able to tell if | |
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep | |
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, | |
we will likely reject since GPT-4 is already capable of completing the | |
task. | |
We plan to roll out a way for users submitting evals to see the eval | |
performance on GPT-4 soon. Stay tuned! Until then, you will not be able | |
to see the eval performance on GPT-4. We encourage partial PR's with | |
~5-10 example that we can then run the evals on and share the results | |
with you so you know how your eval does with GPT-4 before writing all | |
100 examples. | |
## Eval details 📑 | |
### Eval name | |
which-is-heavier | |
### Eval description | |
This evaluation tests the physical reasoning of GPT by asking which of | |
two quantities is heavier, where the quantities are assigned explicit | |
weights (e.g., "5 kilograms" or "2 pounds") and there is a clear answer | |
(e.g., Q: "Is 5 pounds of tissue paper heavier than 3 pounds of | |
granite?" A: "Yes"). The catch is that, in each example, the heavier | |
quantity is always associated with an item that is generally thought of | |
as being light (e.g., feathers, tissue paper, cotton balls) while the | |
lighter quantity is always associated with an item that is generally | |
thought of as being heavy (e.g., granite, cast iron, plutonium). Humans | |
can easily achieve 100% on this task, but they have to cognitively | |
ignore what comprises the quantities and focus on just their weights. | |
### What makes this a useful eval? | |
ChatGPT 3.5 does a decent job of comparing weights (i.e., "Is 5 pounds | |
greater than 3 pounds?" "Yes") as well as understanding common | |
attributions of "light" and "heavy" to everyday objects like feathers | |
and anvils, but when combining these two into a single comparison, it | |
appears that the model's bias towards these colloquial attributes makes | |
it difficult for the model to perform well. | |
What's interesting is that the errors are not even logically consistent. | |
Consider the following example (taken from the ChatGPT dialogue UX/UI | |
using GPT-4): | |
<img width="715" alt="Screen Shot 2023-03-21 at 4 45 44 PM" | |
src="https://user-images.githubusercontent.com/11773823/226827152-f78c0fb4-1136-433c-9d55-65eb6c6c5407.png"> | |
According to GPT-4: | |
- 15 pounds of hydrogen is heavier than 10 pounds of plutonium | |
- 20 pounds of hydrogen is **not** heavier than 10 pounds of plutonium | |
- 30 pounds of hydrogen is heavier than 10 pounds of plutonium | |
GPT-4 appears to do better than GPT-3.5 in informal testing, but still | |
fails on simple cases and gives inconsistent results. | |
This task exposes how adversarial red herrings can potentially hamper | |
performance in quantitative/physical reasoning tasks such as weight | |
comparison. | |
The other nice aspect of this task is that one can generate these | |
examples programatically. I can submit a dataset of thousands rather | |
than just a few hundred if preferred. | |
## Criteria for a good eval ✅ | |
Below are some of the criteria we look for in a good eval. In general, | |
we are seeking cases where the model does not do a good job despite | |
being capable of generating a good response (note that there are some | |
things large language models cannot do, so those would not make good | |
evals). | |
Your eval should be: | |
- [x] Thematically consistent: The eval should be thematically | |
consistent. We'd like to see a number of prompts all demonstrating some | |
particular failure mode. For example, we can create an eval on cases | |
where the model fails to reason about the physical world. | |
- [x] Contains failures where a human can do the task, but either GPT-4 | |
or GPT-3.5-Turbo could not. | |
- [x] Includes good signal around what is the right behavior. This means | |
either a correct answer for `Basic` evals or the `Fact` Model-graded | |
eval, or an exhaustive rubric for evaluating answers for the `Criteria` | |
Model-graded eval. | |
- [x] Include at least 100 high quality examples (it is okay to only | |
contribute 5-10 meaningful examples and have us test them with GPT-4 | |
before adding all 100) | |
If there is anything else that makes your eval worth including, please | |
document it below. | |
### Unique eval value | |
> Insert what makes your eval high quality that was not mentioned above. | |
(Not required) | |
## Eval structure 🏗️ | |
Your eval should | |
- [x] Check that your data is in `evals/registry/data/{name}` | |
- [x] Check that your yaml is registered at | |
`evals/registry/evals/{name}.yaml` | |
- [x] Ensure you have the right to use the data you submit via this eval | |
(For now, we will only be approving evals that use one of the existing | |
eval classes. You may still write custom eval classes for your own | |
cases, and we may consider merging them in the future.) | |
## Final checklist 👀 | |
### Submission agreement | |
By contributing to Evals, you are agreeing to make your evaluation logic | |
and data under the same MIT license as this repository. You must have | |
adequate rights to upload any data used in an Eval. OpenAI reserves the | |
right to use this data in future service improvements to our product. | |
Contributions to OpenAI Evals will be subject to our usual Usage | |
Policies (https://platform.openai.com/docs/usage-policies). | |
- [x] I agree that my submission will be made available under an MIT | |
license and complies with OpenAI's usage policies. | |
### Email address validation | |
If your submission is accepted, we will be granting GPT-4 access to a | |
limited number of contributors. Access will be given to the email | |
address associated with the merged pull request. | |
- [x] I acknowledge that GPT-4 access will only be granted, if | |
applicable, to the email address used for my merged pull request. | |
### Limited availability acknowledgement | |
We know that you might be excited to contribute to OpenAI's mission, | |
help improve our models, and gain access to GPT-4. However, due to the | |
requirements mentioned above and high volume of submissions, we will not | |
be able to accept all submissions and thus not grant everyone who opens | |
a PR GPT-4 access. We know this is disappointing, but we hope to set the | |
right expectation before you open this PR. | |
- [x] I understand that opening a PR, even if it meets the requirements | |
above, does not guarantee the PR will be merged nor GPT-4 access | |
granted. | |
### Submit eval | |
- [x] I have filled out all required fields in the evals PR form | |
- [x] (Ignore if not submitting code) I have run `pip install | |
pre-commit; pre-commit install` and have verified that `black`, `isort`, | |
and `autoflake` are running when I commit and push | |
Failure to fill out all required fields will result in the PR being | |
closed. | |
### Eval JSON data | |
Since we are using Git LFS, we are asking eval submitters to add in as | |
many Eval Samples (at least 5) from their contribution here: | |
<details> | |
<summary>View evals in JSON</summary> | |
### Eval | |
```jsonl | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 2 kilograms of eggshells | |
heavier than 1 kilogram of gold bars? Answer Yes or No"}], "ideal": | |
"Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 9 pounds of soap foam | |
heavier than 5 pounds of iron chains? Answer Yes or No"}], "ideal": | |
"Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 15 pounds of confetti | |
heavier than 10 pounds of rebar? Answer Yes or No"}], "ideal": "Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 1 pound of steel beams | |
heavier than 2 pounds of dust particles? Answer Yes or No"}], "ideal": | |
"No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 10 pounds of cast iron | |
heavier than 20 pounds of flakes? Answer Yes or No"}], "ideal": "No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 10 kilograms of granite | |
slabs heavier than 20 kilograms of balloons? Answer Yes or No"}], | |
"ideal": "No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 10 kilograms of bricks | |
heavier than 15 kilograms of dust particles? Answer Yes or No"}], | |
"ideal": "No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 1 pound of bronze heavier | |
than 2 pounds of snowflakes? Answer Yes or No"}], "ideal": "No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 1 pound of cast iron | |
heavier than 3 pounds of spider silk? Answer Yes or No"}], "ideal": | |
"No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 12 pounds of hydrogen | |
heavier than 10 pounds of palladium? Answer Yes or No"}], "ideal": | |
"Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 6 kilograms of feathers | |
heavier than 5 kilograms of gold bars? Answer Yes or No"}], "ideal": | |
"Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 7 kilograms of eggshells | |
heavier than 5 kilograms of bronze? Answer Yes or No"}], "ideal": "Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 10 kilograms of lead | |
heavier than 20 kilograms of floating seeds? Answer Yes or No"}], | |
"ideal": "No"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 30 pounds of feathers | |
heavier than 10 pounds of rebar? Answer Yes or No"}], "ideal": "Yes"} | |
{"input": [{"role": "system", "content": "You are a helpful | |
assistant."}, {"role": "user", "content": "Is 2 pounds of pencil | |
shavings heavier than 1 pound of iron chains? Answer Yes or No"}], | |
"ideal": "Yes"} | |
``` | |
</details> | |
""") |