saltlux/luxia-21.4b-alignment-v1.2 · contamination results v1.0 vs v1.2 on GSM8K

May 30


A contamination test between v1.0 and v1.2 on GSM8K indicates a 29% increase in contamination:

[['finetune', 'saltlux/luxia-21.4b-alignment-v1.2', 'saltlux/luxia-21.4b-alignment-v1.0']]
|| EVALUATING saltlux/luxia-21.4b-alignment-v1.2 ||
|| TESTING gsm8k ||
Running on local URL:  http://127.0.0.1:7861
--------
all data size: 1319
Loading checkpoint shards: 100%|[00:00<00:00, 92365.21it/s]
('result < 0.1, %: ', 0.29)

This means that between v1.0 and v1.2 there is a 29% increase in contamination on the GSM8K test set, which would account for nearly the majority of the GSM8K improvement between these versions.
To evaluate the contamination, we used the widely known contamination test.
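For anyone unsure how to read the last line of the log, here is a minimal sketch of one plausible interpretation. It assumes (this is my reading, not something the tool's output confirms) that `0.29` is the fraction of the 1319 GSM8K examples whose per-example score fell below the 0.1 threshold. The helper `fraction_below` and the toy scores are hypothetical and only illustrate the arithmetic; they are not the contamination test itself.

```python
def fraction_below(scores, threshold=0.1):
    """Return the share of per-example scores strictly below `threshold`.

    Hypothetical helper: mirrors the assumed meaning of the log line
    ('result < 0.1, %: ', 0.29), not the actual tool's code.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s < threshold) / len(scores)


# Toy scores invented for illustration only (not real GSM8K outputs).
toy_scores = [0.02, 0.5, 0.08, 0.3, 0.9, 0.04, 0.7, 0.6, 0.25, 0.15]
toy_fraction = fraction_below(toy_scores)  # 3 of the 10 toy scores are below 0.1

# Under the same assumption, the reported numbers imply roughly this many
# flagged examples: 0.29 of the 1319 examples in the log.
implied_flagged = round(0.29 * 1319)
print(toy_fraction, implied_flagged)
```

If that reading is right, the result corresponds to a few hundred of the 1319 test questions being flagged, not to a 29-point change in the GSM8K score itself.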

And just to be clear, the same tests were run on the other evaluations (ARC, Winogrande, etc.) and the result was 0.0 contamination between v1.0 and v1.2.

deleted

May 30

Average HF Scores

79.2: Mixtral 141b 8x22b Instruct
77.9: Llama 3 70b Instruct
78.1: Luxia 21.4b v1.2

Something doesn't add up. Mixtral 8x22b and Llama 3 70b are absurdly powerful. But HF scores have become nonsensical. There's even Mistral 7bs hitting 77 that perform horribly at everything, including solving simple logic problems, answering basic questions, writing stories without blatant contradictions, performing grammar checks, adhering to poem instructions (e.g. sonnet or limerick), and so on. I've given up trying to make sense of it.

fblgit

May 30

There are a few issues @Phil337

I have the impression that some "mongers" calling themselves "AI labs" are spending more of their time hijacking a board with toxic crap than actually inventing or training anything that has real value.
Then there are the known problems and limitations of the rusty lm-eval, chat templates, overly abused contamination, etc.
Mistral and Meta are trying to create higher intelligence and those are foundational models that we still need to learn how to squeeze.