Where is the test dataset?

#2
by zhiminy - opened

(screenshots of the GitHub data folder and the Hugging Face dataset page attached)
Checking https://github.com/rohan598/ConTextual/blob/main/data/contextual_all.csv and https://huggingface.co/datasets/ucla-contextual/contextual_all respectively, I only found the full dataset, not a separate test dataset. So where is the test dataset?
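
For reference, this is roughly how I checked the Hugging Face dataset (a minimal sketch; the dataset id is the one linked above, and the printed split names are whatever the Hub assigns):

```python
# Minimal sketch of the check, using the dataset id from the URL above.
# Requires the `datasets` library (pip install datasets).
from datasets import load_dataset

ds = load_dataset("ucla-contextual/contextual_all")
print(ds)                            # DatasetDict; lists the split names shown in the Hub viewer
print(len(ds[list(ds.keys())[0]]))   # number of rows in the single listed split
```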

The ambiguity in the description raises questions about which specific dataset—train, test, or full—the leaderboard employs to present its evaluation results. To ensure clarity and facilitate accurate interpretation, it would be helpful to explicitly state the dataset used for evaluation in the leaderboard's documentation.

ucla-contextual org

Hi @zhiminy ,
Apologies for the confusion.
We have two leaderboards: val and test.

For the val leaderboard, please use contextual_val.csv

For the test leaderboard, please use contextual_all.csv

Note: This is only an evaluation benchmark, so there are no training samples. The "train" split in the screenshot is just the platform's default naming convention (we will look into how to change it).

The val leaderboard gives you a quick idea of how well your model might perform on the overall dataset and how well it understands these contextual tasks on text-rich images.

The test leaderboard is the final evaluation of your model's performance on all the samples in this dataset.

To prevent over-engineering on the benchmark, we release only part of the (image, instruction, response) triplets (100 out of 506) for validation, while keeping the remainder hidden.
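
For a quick sanity check, something like the following should show the expected sizes (a minimal sketch; it assumes contextual_val.csv sits in the same data/ folder of the GitHub repo as the contextual_all.csv linked above):

```python
# Minimal sketch, not official code: load both CSVs straight from GitHub.
# Assumes contextual_val.csv lives next to contextual_all.csv in data/.
import pandas as pd

BASE = "https://raw.githubusercontent.com/rohan598/ConTextual/main/data"

val = pd.read_csv(f"{BASE}/contextual_val.csv")    # use for the val leaderboard
full = pd.read_csv(f"{BASE}/contextual_all.csv")   # use for the test leaderboard

# Per the reply above, expect 100 validation samples out of 506 total.
print(len(val), len(full))
```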

Thanks for your explanation! Considering that "all" actually refers to "test," it would be beneficial to standardize the terminology to avoid any potential confusion among users.

ucla-contextual org

Thanks for spotting this and for the suggestion. We have updated all ConTextual resources for consistency. If you still find something misaligned, feel free to reopen this issue!

rohan598 changed discussion status to closed
