FLAG - `newsbang/Homer-v0.5-Qwen2.5-7B` MATH contamination
Hi @clefourrier @alozowski and HuggingFace Team,
Paper & Info: https://gair-nlp.github.io/benbench/
Tool: https://github.com/GAIR-NLP/benbench
Besides the 30% improvement on MATH and the decrease on everything else, we ran contamination tests, and the results highlight increased contamination from the MATH test set in the model.
5gram-Homer-v0.5-orgn-MATH-test.jsonl: 38.77333333333333
5gram-Homer-v0.5-orgn-MATH-train.jsonl: 47.16666666666667
vs
5gram-Qwen2.5-7B-Instruct-orgn-MATH-test.jsonl: 37.52666666666667
5gram-Qwen2.5-7B-Instruct-orgn-MATH-train.jsonl: 46.36666666666667
We also tested one of our own models, where we know there is no contamination from our side and which shows a clear <8% improvement on MATH and on everything else:
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-test.jsonl: 37.42666666666667
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-train.jsonl: 46.053333333333335
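To make the comparison easier to read, here is a minimal sketch that computes each model's gap against the Qwen2.5-7B-Instruct numbers. The scores are copied from this post; the dictionary layout and the `delta_vs_baseline` helper name are only illustrative.

```python
# 5-gram scores as reported in this thread (test and train splits of MATH).
SCORES = {
    "Homer-v0.5": {"test": 38.77333333333333, "train": 47.16666666666667},
    "Qwen2.5-7B-Instruct": {"test": 37.52666666666667, "train": 46.36666666666667},
    "UNA-cybertron-v4-qw7B-MGS": {"test": 37.42666666666667, "train": 46.053333333333335},
}


def delta_vs_baseline(scores, baseline="Qwen2.5-7B-Instruct"):
    """Return per-split score differences of every other model against the baseline."""
    base = scores[baseline]
    return {
        model: {split: round(value - base[split], 2) for split, value in splits.items()}
        for model, splits in scores.items()
        if model != baseline
    }


for model, diffs in delta_vs_baseline(SCORES).items():
    print(model, diffs)
```

With these numbers, Homer-v0.5 comes out roughly 1.25 points above the baseline on the test split and 0.8 on train, while UNA-cybertron-v4 sits slightly below the baseline on both.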
Maybe the author @newsbang can explain how the MATH test data ended up in the training run.
Hi @fblgit! Thanks for the issue!
Let's wait for the author's response for a week :)
Can you share how you ran your contamination tests so @alozowski can take a look and repro? I think it's the first flag of the v2 haha
It may be the first contamination flag of the v2, but today we have much better tooling to actually dive into it in a simpler manner.
The setup was straightforward:
Get https://github.com/GAIR-NLP/benbench and run `pip install -r requirements.txt` (you must have torch, transformers, etc.).
Then run `mkdir -p src/outputs/ngram src/outputs/ppl`
Change directory to the `src/` folder and modify `ngram_acc.sh`:
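#!/bin/bash
# Usage: ./ngram_acc.sh <model_path> [device]
MODEL_PATH="$1"
# Use the last path component as the model name
MODEL_NAME=$(echo "$MODEL_PATH" | awk -F/ '{print $NF}')
# Benchmark to scan; override by setting EVAL in the environment
EVAL="${EVAL:-math}"
# Device defaults to cuda:1 when no second argument is given
DEVICE="${2:-cuda:1}"
echo "$MODEL_PATH"
echo "$MODEL_NAME"
echo "$EVAL"
echo "$DEVICE"
python ngram_acc.py --dataset_name "$EVAL" \
--model_path "$MODEL_PATH" \
--model_name "$MODEL_NAME" \
--device "$DEVICE" \
--n 5 \
--model_type base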
I ran it from a locally downloaded folder, but after looking at the code it may also work when pulling from the hub directly. Then run `./ngram_acc.sh /data/models/model_to_scan cuda:5`
Repeat the same with the base model; it can run on another GPU in a separate process without affecting the first one, e.g. `./ngram_acc.sh /data/models/model_base_scan cuda:3`
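If you prefer to drive both scans from one place, here is a minimal sketch using Python's `subprocess`. The helper names `build_scan_command` and `launch_scans` are hypothetical, the paths and device ids are the placeholder ones from this post, and it assumes `ngram_acc.sh` is executable and that you start from the repo root.

```python
import subprocess


def build_scan_command(model_path, device, script="./ngram_acc.sh"):
    """Build the argv list for one scan: script, model path, then device."""
    return [script, model_path, device]


def launch_scans(jobs, cwd="src"):
    """Start one process per (model_path, device) pair, then wait for all of them.

    Returns the exit codes in the same order as `jobs`.
    """
    procs = [subprocess.Popen(build_scan_command(path, dev), cwd=cwd) for path, dev in jobs]
    return [p.wait() for p in procs]


# Placeholder paths and devices taken from this post.
jobs = [
    ("/data/models/model_to_scan", "cuda:5"),
    ("/data/models/model_base_scan", "cuda:3"),
]

# Dry run: only print the commands that would be started.
for path, dev in jobs:
    print(" ".join(build_scan_command(path, dev)))

# To actually start both scans in parallel: launch_scans(jobs)
```

Each scan gets its own process and its own GPU, so the two runs do not share any state.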
The tool prints its output at the end. You can perform the same steps on my latest models and you will see no higher contamination, even though they substantially increase performance on MATH and other abilities.
Personally, I went through some of the n-gram samples, and IMHO this model went through a DPO on a paraphrased MATH dataset that included a portion of leaked test items.
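For readers wondering what the 5gram numbers in this thread measure, the rough idea of an n-gram accuracy check is: give the model a prefix of a benchmark sample, let it continue for n tokens, and count how often the continuation matches the original text exactly. The sketch below is a simplified illustration, not the benbench implementation: the whitespace tokenization, the `predict_next` callable, and the choice of start positions are all assumptions.

```python
def ngram_accuracy(tokens, predict_next, n=5, starts=None):
    """Fraction of start positions where the predicted n tokens equal the reference n tokens.

    tokens: reference token list for one sample.
    predict_next: hypothetical callable (prefix_tokens, n) -> list of n predicted tokens.
    starts: positions at which to cut the prefix; defaults to every valid position.
    """
    if starts is None:
        starts = range(1, len(tokens) - n + 1)
    # Keep only positions that leave a non-empty prefix and a full n-gram after it.
    starts = [s for s in starts if 0 < s <= len(tokens) - n]
    if not starts:
        return 0.0
    hits = sum(predict_next(tokens[:s], n) == tokens[s:s + n] for s in starts)
    return hits / len(starts)


# Toy "model" that has memorized one sentence and repeats it back.
MEMORIZED = "what is the sum of the first ten positive integers".split()


def memorizing_model(prefix, n):
    # Continue from the memorized text only if the prefix matches it exactly.
    if MEMORIZED[:len(prefix)] == prefix:
        return MEMORIZED[len(prefix):len(prefix) + n]
    return ["<unk>"] * n


seen = ngram_accuracy(MEMORIZED, memorizing_model)
unseen = ngram_accuracy("find the area of a circle with radius three".split(), memorizing_model)
print(seen, unseen)
```

In this toy setup the memorized sentence scores far higher than the unseen one, which is the kind of gap a contamination check looks for when comparing a fine-tuned model against its base.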