BigScience Workshop org
No description provided.
BigScience Workshop org

So you think we should remove them for now, @stellaathena?

BigScience Workshop org

Yes, that was the conclusion we reached on today’s Eval WG call.

BigScience Workshop org

I don't think we should remove them.
I added a disclaimer above the results stating that they are not final, as your working group is working on visualizations and different ways to represent the data. As far as I followed your working group's call, that was an acceptable solution. So I'd suggest replacing the table in the PR that adds a better visualization of the evaluation results 😊

BigScience Workshop org

@Muennighoff We’ve tried really hard to be polite, but since that’s not working I’ll try being blunt instead: these evaluation results should have never been released. They are untrustworthy, unverified, and actively misleading. They have already caused substantial confusion, and will continue to do so. The evaluation WG in no way supports them, and their release is a violation of BigScience’s guiding principles.

Additionally, the disclaimer you added (“WARNING: These are intermediate results”) is false. The problem is not that these results were done on intermediate checkpoints. A more appropriate disclaimer would be:

WARNING: these evaluation results were carried out by people unfamiliar with the evaluation code. Some of them are known to be incorrect, and the rest are largely invalidated. They were released without the approval or consent of the Evaluation WG. The Evaluation WG disowns them and wishes that they had never been released in the first place.

BigScience Workshop org
edited Jul 19, 2022

Hey @stellaathena! I don't think @Muennighoff meant any harm at all, as he wasn't there at the end of the meeting. I'm okay with removing them and letting you guys handle the evaluation. I think we should keep the original dump though (I think some of the ongoing work is being done on that) and the human eval evaluation done by @loubnabnl on a separate codebase. Does that work for you?

Nit: They did run on the final checkpoint.

BigScience Workshop org

I spoke with @TimeRobber one-on-one and we agreed to go ahead and remove the evaluation results. I'm not sure who has the permissions to merge this PR, but please do so ASAP.

BigScience Workshop org

Still think we should keep human eval and training/validation loss/perplexity. If you can update the PR I can merge it.

stellaathena changed pull request status to closed
