TITLE = """<h1 align="center" id="space-title">STEM Leaderboard</h1>"""
INTRODUCTION_TEXT = """
<p align="center">
📃 <a href="https://arxiv.org/abs/2402.17205" target="_blank">[Paper]</a> • 💻 <a href="https://github.com/stemdataset/STEM" target="_blank">[Github]</a> • 🤗 <a href="https://huggingface.co/datasets/stemdataset/STEM" target="_blank">[Dataset]</a> • 🏆 <a href="https://huggingface.co/spaces/stemdataset/stem-leaderboard" target="_blank">[Leaderboard]</a> • 📽 <a href="https://github.com/stemdataset/STEM/blob/main/assets/STEM-Slides.pdf" target="_blank">[Slides]</a> • 📋 <a href="https://github.com/stemdataset/STEM/blob/main/assets/poster.pdf" target="_blank">[Poster]</a>
</p>
## Overview
This dataset is proposed in the ICLR 2024 paper [Measuring Vision-Language STEM Skills of Neural Models](https://arxiv.org/abs/2402.17205). We introduce a new challenge that tests the STEM skills of neural models. Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, ours requires understanding multimodal vision-language information about STEM. It is one of the largest and most comprehensive datasets for this challenge, covering 448 skills and 1,073,146 questions spanning all STEM subjects. Whereas existing datasets often focus on examining expert-level ability, ours targets fundamental skills, with questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that recent model advances help master only a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, not to mention near expert-level performance. To understand and improve performance on our dataset, we train the models on a training split of the dataset. Although we observe improved performance, the models still fall short of average elementary students. Solving STEM problems will require novel algorithmic innovations from the community.
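The dataset itself is hosted on the Hugging Face Hub (see Resources below). The snippet below is a minimal loading sketch, assuming the standard `datasets` library; split and column names are not shown here and should be checked against the dataset card.
```python
# Minimal sketch: load the STEM dataset from the Hugging Face Hub.
# Split and feature names are not assumed here; inspect the returned object
# (or the dataset card at https://huggingface.co/datasets/stemdataset/STEM).
from datasets import load_dataset

stem = load_dataset("stemdataset/STEM")
print(stem)  # shows the available splits and their features
```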
## Resources
- **Code:** https://github.com/stemdataset/STEM
- **Paper:** https://arxiv.org/abs/2402.17205
- **Dataset:** https://huggingface.co/datasets/stemdataset/STEM
- **Leaderboard:** https://huggingface.co/spaces/stemdataset/stem-leaderboard
## How to Submit
Follow our code to format your predictions on the test set. The reported score is accuracy. In our paper, we also provide exam scores from millions of elementary students on the dataset.
Each row of the submission file should contain the predicted `answer_idx` for one question of the test set, and the order of the predictions must follow the order of the questions in the dataset. Below is an example of the submission format.
```text
2
0
0
1
...
```
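As a rough illustration, a file in this format can be written as follows. This is a minimal sketch, assuming your predictions are already collected in a Python list in test-set order; the variable and file names (`predictions`, `submission.txt`) are placeholders rather than part of any official tooling.
```python
# Minimal sketch: write one predicted answer_idx per line, in test-set order.
# `predictions` and "submission.txt" are placeholder names.
predictions = [2, 0, 0, 1]  # replace with your model's predicted answer_idx values

with open("submission.txt", "w") as f:
    for answer_idx in predictions:
        f.write(f"{answer_idx}\n")
```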
## Citation
Copy the following snippet to cite these results:
```bibtex
@inproceedings{shen2024measuring,
  title={Measuring Vision-Language STEM Skills of Neural Models},
  author={Shen, Jianhao and Yuan, Ye and Mirzoyan, Srbuhi and Zhang, Ming and Wang, Chenguang},
  booktitle={ICLR},
  year={2024}
}
```
## Leaderboard
"""
def format_error(msg):
    """Return the message wrapped in red, centered HTML (error display)."""
    return f"<p style='color: red; font-size: 20px; text-align: center;'>{msg}</p>"

def format_warning(msg):
    """Return the message wrapped in orange, centered HTML (warning display)."""
    return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"

def format_log(msg):
    """Return the message wrapped in green, centered HTML (success/log display)."""
    return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"

def model_hyperlink(link, model_name):
    """Return the model name as a dotted-underline hyperlink for the leaderboard table."""
    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline; text-decoration-style: dotted;">{model_name}</a>'
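
# These helpers return HTML strings meant to be rendered by the Space's UI;
# how the leaderboard wires them into its interface is not shown in this file.
# The lines below are only an illustrative sketch of their output.
if __name__ == "__main__":
    print(format_log("Submission received."))
    print(format_error("Could not parse the submission file."))
    print(model_hyperlink("https://huggingface.co/datasets/stemdataset/STEM", "STEM"))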