open-llm-leaderboard/open_llm_leaderboard · flagged udkai/Garrulus (academic purposes)

fblgit

Jan 10

edited Jan 10

The author states that the contamination was intentional and was done for a research paper that is being produced.

https://huggingface.co/udkai/Garrulus/discussions/2

I am helping him flag it accordingly.
/cc @clefourrier

clefourrier

Open LLM Leaderboard org Jan 11

Took a look at the discussion you linked, and it indeed looks like it was contaminated on WinoGrande - flagging.

clefourrier changed discussion status to closed Jan 11

clefourrier

Open LLM Leaderboard org Jan 11

Thanks for the issue!

hromi

Jan 11

@clefourrier can You please specify which tool You use to operationalize contamination?

(so that I can do it locally, thus avoiding taking HF space with models which You will subsequently "flag")

Another question: if one optimizes a model with data related to metric A, but results also go up on three other metrics B, C, and D, is it still contamination?

For example, in the case of my model, I honestly obtained ARC > 0.73 on Your leaderboard with 2 epochs of "direct preference optimization" on this data: https://huggingface.co/datasets/hromi/winograd_dpo/raw/main/winograd_dpo_modified.json which has nothing to do with ARC.

fblgit

Jan 11

edited Jan 11

Which base model did you use to just pass this through DPO for 2 epochs?

https://huggingface.co/spaces/Yeyito/llm_contamination_detector

clefourrier

Open LLM Leaderboard org Jan 12

Hi @hromi ,
Contamination detection code is here: https://github.com/swj0419/detect-pretrain-code-contamination and @SaylorTwift is working on a space with the precise setup we're using (but it's been delayed a bit by some upcoming features). In general, if you indicate having fine-tuned on one of our eval sets, then your model is contaminated (obviously). We've also had issues with some datasets (mostly in the math domain) that were rephrases of our eval datasets: for example, the test set would contain "Is the answer 3" and the rephrase used in a fine-tune would be "Is the answer three", which is also a form of contamination.

If you fine-tune on an unrelated dataset, and your scores get better on our eval tasks, well, good for you, you found an interesting case of transfer learning :) (and it's not contamination).
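To make the rephrase case above concrete, here is a minimal sketch of a normalized n-gram overlap check. It is not the leaderboard's detection method (that code lives in the repository linked above); the function names, the small number-word table, and the choice of trigrams are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny lookup table: just enough to show the idea.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and map number words to digits."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(NUMBER_WORDS.get(tok, tok) for tok in tokens)


def ngram_set(text: str, n: int = 3) -> set:
    """Return the set of token n-grams of the normalized text."""
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(train_example: str, test_example: str, n: int = 3) -> float:
    """Fraction of the test example's n-grams that also occur in the train example."""
    test_grams = ngram_set(test_example, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngram_set(train_example, n)) / len(test_grams)


if __name__ == "__main__":
    # The rephrase from the post: identical after normalization.
    print(overlap("Is the answer three", "Is the answer 3"))
    # An unrelated sentence shares no trigram with the test item.
    print(overlap("The sky is blue today", "Is the answer 3"))
```

A surface check like this only catches near-verbatim rephrases. It says nothing about paraphrases that change the wording more substantially, which is why it should be read as an illustration of the problem and not as a detector.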

clefourrier

Open LLM Leaderboard org Jan 12

@fblgit Sorry, I did not understand your question

SimSim93

Jan 14

edited Jan 14

@clefourrier
How does one create a flag?

There is a new model using this model for a merge:

https://huggingface.co/dfurman/GarrulusMarcoro-7B-v0.1/discussions

(Sorry, I have no clue how this works, and neither does the owner of the new model.)

fblgit

Jan 14

also have this one udkai/Turdus contaminated.

clefourrier

Open LLM Leaderboard org Jan 15

Hi @SimSim93 and @fblgit ,
The best would be to open a specific discussion on the open llm leaderboard per model you want to flag, which starts with [FLAG] where you explain which model you want to flag and why (that way, we point to the corresponding discussion when creating the manual flag, to make it more readable for users).

hromi

Jan 15

> also have this one udkai/Turdus contaminated.

Please read the model card, notably the section with the table, and think twice.

Also, I normally avoid so-called ad hominem arguments, but in this case I am obliged to state the following:

" @fblgit You seem to be quite prolific in nominating other people's models for flagging. But wouldn't it be more beneficial for the community if You would, in the first place,

honestly inform the community that fblgit/UNA-TheBeagle-7b-v1 is based on neural-chat, which was trained on TigerResearch/tigerbot-gsm-8k-en, that is, the GSM8K dataset?

Instead, You close discussions, making further discussion impossible, when informed about that fact.

That's not really in a scientific or community spirit, IMHO, and can be considered very repulsive behaviour, especially for newcomers who come to HF with a lot of goodwill."

fblgit

Jan 15

I'm not sure what you are talking about. I already clarified with you that it is based on the old model and not the one you are mentioning, and this model doesn't exhibit a high GSM8K score either.

In the case of this flag, I only helped the user, and I did so in good faith, as many others are curious about the findings regarding the boosts on the surrounding evaluations.

SimSim93

Jan 15

> Hi @SimSim93 and @fblgit ,
> The best would be to open a specific discussion on the open llm leaderboard per model you want to flag, which starts with [FLAG] where you explain which model you want to flag and why (that way, we point to the corresponding discussion when creating the manual flag, to make it more readable for users).

Okay, I did that:

https://huggingface.co/dfurman/GarrulusMarcoro-7B-v0.1/discussions/1

clefourrier

Open LLM Leaderboard org Jan 15

Hi @SimSim93 , thanks, I'll have a look!
Next time please open it on the open llm leaderboard discussions :)