contamination results v1.0 vs v1.2 on GSM8K

#1
by fblgit - opened

A contamination test between v1.0 and v1.2 on GSM8K indicates a 29% increase in contamination:

[['finetune', 'saltlux/luxia-21.4b-alignment-v1.2', 'saltlux/luxia-21.4b-alignment-v1.0']]
|| EVALUATING saltlux/luxia-21.4b-alignment-v1.2 ||
|| TESTING gsm8k ||
Running on local URL:  http://127.0.0.1:7861
--------
all data size: 1319
Loading checkpoint shards: 100%|[00:00<00:00, 92365.21it/s]
('result < 0.1, %: ', 0.29)

This means that between v1.0 and v1.2, contamination on the GSM8K test set increased by 29%. That would account for most of the GSM8K improvement between these versions: it comes from contamination.
To evaluate the contamination, we used the widely known contamination test.
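For readers trying to interpret the last log line: the label `result < 0.1, %` reads like the share of test items whose per-item score fell below 0.1. The scoring itself is not shown in this thread, so the sketch below is only an illustration of that kind of statistic, not the tool's actual code. The function names `fraction_below` and `implied_count` and the toy scores are hypothetical.

```python
def fraction_below(scores, threshold=0.1):
    """Return the share of scores strictly below `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s < threshold) / len(scores)


def implied_count(fraction, total):
    """Convert a reported fraction back into an approximate item count."""
    return round(fraction * total)


# Toy per-item scores (made up for illustration, not real results).
toy_scores = [0.05, 0.5, 0.02, 0.9]
print(fraction_below(toy_scores))

# Figures taken from the log above: 0.29 of 1319 GSM8K test items.
print(implied_count(0.29, 1319))
```

Under this reading, a reported 0.29 on 1319 items corresponds to roughly 383 flagged test questions. Since 0.29 is itself a rounded figure, the true count could differ by a few items.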

And just to be clear, the same tests were run on the other evaluations (ARC, Winogrande, etc.), and they showed 0.0 contamination between v1.0 and v1.2.

deleted

Average HF Scores

  • 79.2: Mixtral 141b 8x22b Instruct
  • 77.9: Llama 3 70b Instruct
  • 78.1: Luxia 21.4b v1.2

Something doesn't add up. Mixtral 8x22b and Llama 3 70b are absurdly powerful, but HF scores have become nonsensical. There are even Mistral 7bs hitting 77 that perform horribly at everything, including solving simple logic problems, answering basic questions, writing stories without blatant contradictions, performing grammar checks, adhering to poem instructions (e.g. sonnet or limerick), and so on. I've given up trying to make sense of it.

There are a few issues, @Phil337:

  1. I have the impression that some "mongers" calling themselves "AI labs" spend more of their time hijacking the board with toxic crap than actually inventing or training anything of real value.
  2. Known problems and limitations of a rusty lm-eval, chat templates, overly abused contamination, etc.
  3. Mistral and Meta are trying to create higher intelligence and those are foundational models that we still need to learn how to squeeze.
