FLAG - `newsbang/Homer-v0.5-Qwen2.5-7B` MATH contamination
Hi @clefourrier @alozowski and HuggingFace Team,
Paper & Info: https://gair-nlp.github.io/benbench/
Tool: https://github.com/GAIR-NLP/benbench
Besides the ~30% improvement on MATH while everything else decreased... we ran contamination tests, and the results highlight increased contamination of the MATH test set in the model.
5gram-Homer-v0.5-orgn-MATH-test.jsonl: 38.77333333333333
5gram-Homer-v0.5-orgn-MATH-train.jsonl: 47.16666666666667
vs
5gram-Qwen2.5-7B-Instruct-orgn-MATH-test.jsonl: 37.52666666666667
5gram-Qwen2.5-7B-Instruct-orgn-MATH-train.jsonl: 46.36666666666667
For comparison, we tested one of our own models, where we know there is no contamination from our side and which shows a clear <8% improvement on MATH and everything else:
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-test.jsonl: 37.42666666666667
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-train.jsonl: 46.053333333333335
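To make the gap concrete, here is a small plain-Python sketch computing the candidate-minus-base deltas (numbers rounded from the runs above; model keys are abbreviations, not the actual file names):

```python
# 5-gram accuracies reported above (percent, rounded to 2 decimals)
scores = {
    "Homer-v0.5":          {"MATH-test": 38.77, "MATH-train": 47.17},
    "Qwen2.5-7B-Instruct": {"MATH-test": 37.53, "MATH-train": 46.37},
    "UNA-cybertron-v4":    {"MATH-test": 37.43, "MATH-train": 46.05},
}

base = scores["Qwen2.5-7B-Instruct"]
for name in ("Homer-v0.5", "UNA-cybertron-v4"):
    for split in ("MATH-test", "MATH-train"):
        delta = scores[name][split] - base[split]
        # A fine-tune gaining n-gram accuracy on the *test* split over
        # its own base is the contamination signal; a clean fine-tune
        # stays flat or drops slightly.
        print(f"{name:20s} {split}: {delta:+.2f}")
```

Homer gains about +1.24 points on the test split over its base, while the clean fine-tune is slightly below base on both splits.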
Maybe the author @newsbang can explain how the MATH test data ended up in his training run.
Hi @fblgit! Thanks for the issue!
Let's wait for the author's response for a week :)
Can you share how you ran your contamination tests so @alozowski can take a look and repro? I think it's the first flag of the v2 haha
It may be the first contamination flag of the v2, but today we have much better tooling to actually dive into it in a simpler manner.
The setup was straightforward:
Clone https://github.com/GAIR-NLP/benbench and run pip install -r requirements.txt (you must have torch, transformers, etc.)
Then run mkdir -p src/outputs/ngram src/outputs/ppl
Change dir to the src/ folder and modify ngram_acc.sh:
#!/bin/bash
# Usage: ./ngram_acc.sh <model_path> [device]
MODEL_PATH=$1
MODEL_NAME=$(basename "$MODEL_PATH")  # last path component
EVAL=${EVAL:-math}                    # benchmark to probe
DEVICE=${2:-"cuda:1"}                 # default GPU if none given
echo "$MODEL_PATH"
echo "$MODEL_NAME"
echo "$EVAL"
echo "$DEVICE"
python ngram_acc.py --dataset_name "$EVAL" \
    --model_path "$MODEL_PATH" \
    --model_name "$MODEL_NAME" \
    --device "$DEVICE" \
    --n 5 \
    --model_type base
I ran it from a locally downloaded folder, but after looking at the code it may also work pulling directly from the Hub. Then run ./ngram_acc.sh /data/models/model_to_scan cuda:5
You repeat the same with the base model; it can run on another GPU in another process without interference: ./ngram_acc.sh /data/models/model_base_scan cuda:3
The tool prints the result at the end. You can perform the same steps on my latest models and you will see no higher contamination, though they substantially increase performance on MATH and other abilities.
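For context, the score ngram_acc.py reports is roughly "how often the model exactly reproduces the next n tokens of a benchmark sample". Here is a simplified, self-contained sketch of that scoring step; this is not benbench's actual implementation (which also handles tokenization and a loose, edit-distance variant), and the toy token tuples below are made up:

```python
def ngram_accuracy(completions, references):
    """Percent of prompts where the model's greedy n-token completion
    exactly matches the benchmark's ground-truth continuation.
    An unusually high score on a *test* split suggests the model has
    memorized it, i.e. contamination."""
    assert len(completions) == len(references)
    hits = sum(c == r for c, r in zip(completions, references))
    return 100.0 * hits / len(references)

# Toy example: 2 of 4 sampled 5-gram continuations reproduced verbatim
preds = [("the", "answer", "is", "42", "."),
         ("k", "=", "5", "-", "1"),
         ("x", "^", "2", "+", "1"),
         ("hence", "b", "=", "3", ".")]
refs  = [("the", "answer", "is", "42", "."),
         ("k", "=", "5", "-", "1"),
         ("x", "^", "2", "-", "1"),
         ("thus", "b", "=", "3", ".")]
print(ngram_accuracy(preds, refs))  # 50.0
```

The numbers in the results above are this percentage averaged over many sampled positions, which is why even clean models score ~37% on test (n-grams are often predictable from context alone) and the signal is the *delta* against the base model.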
Personally, I went through some of the n-gram samples, and IMHO this model went through a DPO run on a paraphrased MATH dataset that included a portion of the leaked test set.
@clefourrier @fblgit I'm sorry I only saw this issue yesterday. I checked our dataset and suspect that there might be data leakage in OpenMathInstruct-2. For details, see https://huggingface.co/newsbang/Homer-v0.5-Qwen2.5-7B/discussions/1. However, I'm also unsure if [benbench](https://github.com/GAIR-NLP/benbench) is reasonable. I will further check this dataset.
I matched OpenMathInstruct-2 with MATH-Hard test split and found that some samples indeed have very high similarity.
For example
If we express $3x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, then what is $k$? (OpenMathInstruct-2)
vs
If we express $-2x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, then what is $k$? (MATH-Hard test split)
Let $a,$ $b,$ $c$ be real numbers such that $a^2 + b^2 + c^2 = 9.$ Find the minimum value of\n\\[ab + ac + bc.\\] (OpenMathInstruct-2)
vs
Let $a,$ $b,$ and $c$ be real numbers such that $a^2 + b^2 + c^2 = 1.$ Find the minimum value of\n\\[ab + ac + bc.\\] (MATH-Hard test split)
There could be hundreds or even thousands of similar examples.
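A quick way to surface such pairs is a character-level similarity scan over the two question sets. A minimal sketch using Python's stdlib difflib (the threshold and the exact matching strategy are my assumptions, not necessarily how the pairs above were found):

```python
from difflib import SequenceMatcher

def near_duplicates(train_qs, test_qs, threshold=0.9):
    """Return (train, test, ratio) triples whose similarity ratio
    exceeds threshold -- a crude paraphrase/leak detector.
    O(len(train) * len(test)); for real dataset sizes, prefilter
    candidates with n-gram hashing before this pairwise pass."""
    hits = []
    for t in test_qs:
        for s in train_qs:
            ratio = SequenceMatcher(None, s, t).ratio()
            if ratio >= threshold:
                hits.append((s, t, round(ratio, 3)))
    return hits

# The first pair quoted above: only the leading coefficient differs
train = ["If we express $3x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, "
         "then what is $k$?"]
test  = ["If we express $-2x^2 + 4x + 5$ in the form $a(x - h)^2 + k$, "
         "then what is $k$?"]
print(near_duplicates(train, test))  # one hit, ratio well above 0.9
```

Such numerically-perturbed near-duplicates are exactly what n-gram accuracy picks up, since most of the token sequence survives the paraphrase.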
............
I will continue to clean my dataset.
Thanks all.
That makes sense! We'll add the FLAG tag to your model and redirect to this discussion so people are aware of this before using your model.
Okay, thank you very much, and I'm sorry for any inconvenience.
@alozowski thanks for actioning this.
@clefourrier this one wrapped up without 100 threads of polemics...
@newsbang thanks for your transparency and the right call on preserving community and leaderboard integrity. Ping me if you ever need anything.
But @clefourrier, what about the dataset? What are the plans to address this and avoid future contamination? Does the publisher, NVIDIA, know what's happening?
Hi @fblgit!
Feel free to open an issue on their repository! It's very likely they are not aware of this issue if it was accidental.
Re plans to address future contamination: we have been experimenting with contamination detection and have so far not found a method that is systematically reliable, so we're still exploring. We're really hoping to implement something more robust for the v3.