FLAG - `newsbang/Homer-v0.5-Qwen2.5-7B` MATH contamination

#1022

by fblgit - opened about 11 hours ago

Discussion

fblgit

about 11 hours ago

Hi @clefourrier @alozowski and HuggingFace Team,

Paper & Info: https://gair-nlp.github.io/benbench/
Tool: https://github.com/GAIR-NLP/benbench

Besides the 30% improvement on MATH and a decrease on everything else, we ran contamination tests, and the results show increased contamination from the MATH test set in the model.

5gram-Homer-v0.5-orgn-MATH-test.jsonl: 38.77333333333333
5gram-Homer-v0.5-orgn-MATH-train.jsonl: 47.16666666666667

5gram-Qwen2.5-7B-Instruct-orgn-MATH-test.jsonl: 37.52666666666667
5gram-Qwen2.5-7B-Instruct-orgn-MATH-train.jsonl: 46.36666666666667

For comparison, we also tested one of our own models, where we know there is no contamination from our side and which shows a clear <8% improvement on MATH AND EVERYTHING ELSE:

5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-test.jsonl: 37.42666666666667
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-train.jsonl: 46.053333333333335
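
To make the comparison easier to read, here is a minimal sketch that subtracts the Qwen2.5-7B-Instruct scores from each fine-tune's scores. The `delta` helper is hypothetical and not part of benbench; the numbers are copied from the results above.

```shell
#!/bin/bash
# delta: hypothetical helper (not part of benbench) that prints
# model_score - reference_score with two decimals and an explicit sign.
delta() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.2f\n", a - b }'
}

# Scores copied from the post; the reference is Qwen2.5-7B-Instruct.
echo "Homer-v0.5 MATH-test:     $(delta 38.77333333333333 37.52666666666667)"
echo "Homer-v0.5 MATH-train:    $(delta 47.16666666666667 46.36666666666667)"
echo "UNA-cybertron MATH-test:  $(delta 37.42666666666667 37.52666666666667)"
echo "UNA-cybertron MATH-train: $(delta 46.053333333333335 46.36666666666667)"
```

A positive value means the fine-tune scores higher than the reference on that split; a negative value means it scores lower.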

Maybe the author @newsbang can explain how the MATH test data ended up in their training run.

clefourrier

Open LLM Leaderboard org about 4 hours ago

Hi @fblgit ! Thanks for the issue!
Let's wait for the author's response for a week :)
Can you share how you ran your contamination tests so @alozowski can take a look and repro? I think it's the first flag of the v2 haha

fblgit

about 3 hours ago

It may be the first contamination flag of the v2, but today we have much better tooling to dig into it more easily.

The setup was straightforward:
https://github.com/GAIR-NLP/benbench
pip install -r requirements.txt
(you must have torch, transformers, etc.)
then run mkdir -p src/outputs/ngram src/outputs/ppl

Change directory to the src/ folder and modify ngram_acc.sh:

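#!/bin/bash
# Usage: [EVAL=math] ./ngram_acc.sh MODEL_PATH [DEVICE]
MODEL_PATH=$1
# Model name is the last component of the model path
MODEL_NAME=$(echo "$MODEL_PATH" | awk -F/ '{print $NF}')
# Dataset defaults to math; device defaults to cuda:1
EVAL=${EVAL:-math}
DEVICE=${2:-"cuda:1"}
echo "$MODEL_PATH"
echo "$MODEL_NAME"
echo "$EVAL"
echo "$DEVICE"
python ngram_acc.py --dataset_name "$EVAL" \
    --model_path "$MODEL_PATH" \
    --model_name "$MODEL_NAME" \
    --device "$DEVICE" \
    --n 5 \
    --model_type base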

I ran it from a locally downloaded folder, but after looking at the code, it may also work by pulling from the hub directly. Then run ./ngram_acc.sh /data/models/model_to_scan cuda:5
Repeat the same with the base model; it can run on another GPU in a separate process without interfering, e.g. ./ngram_acc.sh /data/models/model_base_scan cuda:3
The tool prints its result at the end. You can perform the same steps on my latest models and you will see no higher contamination, even though they substantially improve performance on MATH and other abilities.
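
If you want to launch both scans from one place, something like the sketch below could work. It is only an illustration: `run_scans`, the `DRY_RUN` switch, and the `scan-*.log` file names are my own assumptions, not part of benbench, and it expects to be run from the src/ folder next to the modified ngram_acc.sh.

```shell
#!/bin/bash
# run_scans: hypothetical wrapper (not part of benbench).
# Each argument is a "MODEL_PATH DEVICE" pair; one scan is started per pair,
# in the background, so scans on different GPUs run in parallel.
# With DRY_RUN=1 the commands are only printed, nothing is executed.
run_scans() {
    local pair model device
    for pair in "$@"; do
        model=${pair%% *}    # text before the first space
        device=${pair##* }   # text after the last space
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "./ngram_acc.sh $model $device"
        else
            # Log file name is an assumption made for this sketch
            ./ngram_acc.sh "$model" "$device" > "scan-$(basename "$model").log" 2>&1 &
        fi
    done
    wait   # block until every background scan has finished
}
```

For example, `DRY_RUN=1 run_scans "/data/models/model_to_scan cuda:5" "/data/models/model_base_scan cuda:3"` prints the two commands from the post; dropping `DRY_RUN=1` would actually start them.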

Personally, I went through some of the n-gram samples, and IMHO this model went through DPO on a paraphrased MATH dataset that includes a portion of leaked test items.
